Dataset | Vocabulary size (# words) | Training set (# words) | Validation set (# words) | Test set (# words) |
---|---|---|---|---|
Penn TreeBank | 10k | 900k | 70k | 80k |
WikiText | 33k | 2M | 210k | 240k |
CGN | 100k | 10M | 550k | / |
Subtitles | 100k | 45M | 310k | / |
Hyperparameter | Penn TreeBank | WikiText | CGN | Subtitles |
---|---|---|---|---|
# LSTM layers | 1 | 1 | 2 | 2 |
# LSTM units | 512 | 512 | 512 | 512 |
# steps unrolling | 35 | 35 | 35 | 35 |
Initialization scale (uniform) | -0.05 to +0.05 | -0.05 to +0.05 | -0.05 to +0.05 | -0.05 to +0.05 |
Dropout (%) | 50 | 50 | 50 | 50 |
Gradient norm clipping threshold | 5 | 5 | 5 | 5 |
Optimizer | Stochastic gradient descent | Stochastic gradient descent | Stochastic gradient descent | Stochastic gradient descent |
Initial learning rate | 1 | 1 | 1 | 1 |
Learning rate decay | 0.8 | 0.8 | 0.8 (discourse) / 0.6 (sentence) | 0.8 |
Start learning rate decay after x epochs | 6 | 2 | / | / |
Softmax | Full | Full | Sampled | Sampled |
Early stopping (stop after x times without improvement) | / | / | 3 | 2 |
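To make the hyperparameter table concrete, the sketch below wires up the Penn TreeBank column as a word-level LSTM language model. It is illustrative only: it uses PyTorch rather than whatever toolkit the experiments were actually run with, and all identifiers (`WordLSTM`, `train_epoch`, the batch iterator) are hypothetical. For the CGN and Subtitles configurations, the full-softmax output layer would be replaced by a sampled-softmax approximation and the fixed decay schedule by early stopping.

```python
# Illustrative sketch of the Penn TreeBank configuration from the table above.
import torch
import torch.nn as nn

PTB_VOCAB = 10_000      # vocabulary size (see dataset table)
HIDDEN = 512            # "# LSTM units"
NUM_LAYERS = 1          # "# LSTM layers" (Penn TreeBank column)
UNROLL_STEPS = 35       # "# steps unrolling" (truncated BPTT length)
DROPOUT = 0.5           # "Dropout (%)" = 50
INIT_SCALE = 0.05       # uniform initialization in [-0.05, +0.05]
CLIP_NORM = 5.0         # gradient norm clipping threshold

class WordLSTM(nn.Module):
    def __init__(self):
        super().__init__()
        self.drop = nn.Dropout(DROPOUT)
        self.embed = nn.Embedding(PTB_VOCAB, HIDDEN)
        self.lstm = nn.LSTM(HIDDEN, HIDDEN, num_layers=NUM_LAYERS,
                            batch_first=True)
        self.out = nn.Linear(HIDDEN, PTB_VOCAB)   # full softmax over 10k words
        # Uniform weight initialization in [-0.05, +0.05], as in the table.
        for p in self.parameters():
            nn.init.uniform_(p, -INIT_SCALE, INIT_SCALE)

    def forward(self, tokens, state=None):
        x = self.drop(self.embed(tokens))          # (batch, UNROLL_STEPS, 512)
        h, state = self.lstm(x, state)
        return self.out(self.drop(h)), state       # logits over the vocabulary

model = WordLSTM()
criterion = nn.CrossEntropyLoss()
# SGD with initial learning rate 1; multiplied by 0.8 per epoch once decay starts.
optimizer = torch.optim.SGD(model.parameters(), lr=1.0)

def train_epoch(batches, epoch, decay_start=6, decay=0.8):
    """One epoch of truncated BPTT. `batches` is a hypothetical iterator that
    yields (input, target) index tensors of shape (batch, UNROLL_STEPS)."""
    if epoch > decay_start:                        # start decay after 6 epochs
        for group in optimizer.param_groups:
            group["lr"] *= decay
    state = None
    for inputs, targets in batches:
        logits, state = model(inputs, state)
        state = tuple(s.detach() for s in state)   # carry state, cut gradients
        loss = criterion(logits.reshape(-1, PTB_VOCAB), targets.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        # Clip the global gradient norm at 5 before the SGD step.
        torch.nn.utils.clip_grad_norm_(model.parameters(), CLIP_NORM)
        optimizer.step()
```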