Dataset | Vocabulary size (# words) | Training set (# words) | Validation set (# words) | Test set (# words) |
---|---|---|---|---|
Penn TreeBank | 10k | 900k | 70k | 80k |
WikiText | 33k | 2M | 210k | 240k |
CGN | 100k | 10M | 550k | / |
Subtitles | 100k | 45M | 310k | / |
Hyperparameter | Penn TreeBank | WikiText | CGN | Subtitles |
---|---|---|---|---|
# LSTM layers | 1 | 1 | 2 | 2 |
# LSTM units | 512 | 512 | 512 | 512 |
# steps unrolling | 35 | 35 | 35 | 35 |
Initialization scale (uniform) | -0.05 to +0.05 | -0.05 to +0.05 | -0.05 to +0.05 | -0.05 to +0.05 |
Dropout (%) | 50 | 50 | 50 | 50 |
Gradient norm clipping threshold | 5 | 5 | 5 | 5 |
Optimizer | Stochastic gradient descent | Stochastic gradient descent | Stochastic gradient descent | Stochastic gradient descent |
Initial learning rate | 1 | 1 | 1 | 1 |
Learning rate decay | 0.8 | 0.8 | 0.8 (discourse) / 0.6 (sentence) | 0.8 |
Start learning rate decay after x epochs | 6 | 2 | / | / |
Softmax | Full | Full | Sampled | Sampled |
Early stopping (stop after x times without improvement) | / | / | 3 | 2 |
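To make the hyperparameter table concrete, the sketch below wires up the Penn TreeBank column as a word-level LSTM language model. It is illustrative only: it uses PyTorch rather than whatever toolkit the experiments were actually run with, and all identifiers (`WordLSTM`, `train_epoch`, the batch iterator) are hypothetical. For the CGN and Subtitles configurations, the full-softmax output layer would be replaced by a sampled-softmax approximation and the fixed decay schedule by early stopping.

```python
# Illustrative sketch of the Penn TreeBank configuration from the table above.
import torch
import torch.nn as nn

PTB_VOCAB = 10_000      # vocabulary size (see dataset table)
HIDDEN = 512            # "# LSTM units"
NUM_LAYERS = 1          # "# LSTM layers" (Penn TreeBank column)
UNROLL_STEPS = 35       # "# steps unrolling" (truncated BPTT length)
DROPOUT = 0.5           # "Dropout (%)" = 50
INIT_SCALE = 0.05       # uniform initialization in [-0.05, +0.05]
CLIP_NORM = 5.0         # gradient norm clipping threshold

class WordLSTM(nn.Module):
    def __init__(self):
        super().__init__()
        self.drop = nn.Dropout(DROPOUT)
        self.embed = nn.Embedding(PTB_VOCAB, HIDDEN)
        self.lstm = nn.LSTM(HIDDEN, HIDDEN, num_layers=NUM_LAYERS,
                            batch_first=True)
        self.out = nn.Linear(HIDDEN, PTB_VOCAB)   # full softmax over 10k words
        # Uniform weight initialization in [-0.05, +0.05], as in the table.
        for p in self.parameters():
            nn.init.uniform_(p, -INIT_SCALE, INIT_SCALE)

    def forward(self, tokens, state=None):
        x = self.drop(self.embed(tokens))          # (batch, UNROLL_STEPS, 512)
        h, state = self.lstm(x, state)
        return self.out(self.drop(h)), state       # logits over the vocabulary

model = WordLSTM()
criterion = nn.CrossEntropyLoss()
# SGD with initial learning rate 1; multiplied by 0.8 per epoch once decay starts.
optimizer = torch.optim.SGD(model.parameters(), lr=1.0)

def train_epoch(batches, epoch, decay_start=6, decay=0.8):
    """One epoch of truncated BPTT. `batches` is a hypothetical iterator that
    yields (input, target) index tensors of shape (batch, UNROLL_STEPS)."""
    if epoch > decay_start:                        # start decay after 6 epochs
        for group in optimizer.param_groups:
            group["lr"] *= decay
    state = None
    for inputs, targets in batches:
        logits, state = model(inputs, state)
        state = tuple(s.detach() for s in state)   # carry state, cut gradients
        loss = criterion(logits.reshape(-1, PTB_VOCAB), targets.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        # Clip the global gradient norm at 5 before the SGD step.
        torch.nn.utils.clip_grad_norm_(model.parameters(), CLIP_NORM)
        optimizer.step()
```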