ngram_build Train n-gram language model

ngram_build [input file0] [input file1] ... -o [output file] [-p ifile] [-order int] [-smooth int] [-input_format string] [-otype string] [-sparse ] [-dense ] [-backoff int] [-floor double] [-freqsmooth int] [-trace ] [-save_compressed ] [-oov_mode string] [-oov_marker string] [-prev_tag string] [-prev_prev_tag string] [-last_tag string] [-default_tags ]

ngram_build offers basic ngram language model estimation.

Input data format

Two input formats are supported. In sentence_per_line format, the program will deal with start and end of sentence (if required) by using special vocabulary items specified by -prev_tag, -prev_prev_tag and -last_tag. For example, the input sentence:

the cat sat on the mat
would be treated as
... prev_prev_tag prev_prev_tag prev_tag the cat sat on the mat last_tag
where prev_prev_tag is the argument to -prev_prev_tag, and so on. A default set of tag names is also available (see -default_tags). This input format is only useful for sliding-window applications (e.g. language modelling for speech recognition).

The second input format is ngram_per_line, which is useful either for non-sliding-window applications or where the user requires a different treatment of start/end of sentence from that provided above. In this format the input file contains one complete ngram per line. For the same example as above (to build a trigram model) this would be:

prev_prev_tag prev_tag the
prev_tag the cat
the cat sat
cat sat on
sat on the
on the mat
the mat last_tag
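The relationship between the two input formats can be sketched in a few lines of Python. This is only an illustration, not the program's own code: the function name sentence_to_ngrams is hypothetical, and the padding rule (order - 1 start tags, with prev_tag nearest the sentence, plus a single last_tag) is an assumption read off the trigram example above.

```python
def sentence_to_ngrams(words, order, prev_tag, prev_prev_tag, last_tag):
    """Return the ngrams a sliding window of width `order` sees over one sentence."""
    if order < 1:
        raise ValueError("order must be at least 1")
    # Assumed padding: (order - 1) start tags, the one closest to the
    # sentence being prev_tag and any earlier ones prev_prev_tag.
    n_start = order - 1
    start = [prev_prev_tag] * max(n_start - 1, 0)
    if n_start >= 1:
        start.append(prev_tag)
    # A single end-of-sentence tag closes the sentence.
    padded = start + list(words) + [last_tag]
    # Slide a window of `order` tokens across the padded sentence.
    return [tuple(padded[i:i + order]) for i in range(len(padded) - order + 1)]


ngrams = sentence_to_ngrams(
    "the cat sat on the mat".split(),
    order=3,
    prev_tag="prev_tag",
    prev_prev_tag="prev_prev_tag",
    last_tag="last_tag",
)
# Print one ngram per line, i.e. the ngram_per_line form of the sentence.
for ngram in ngrams:
    print(" ".join(ngram))
```

For the example sentence this prints seven trigrams, from "prev_prev_tag prev_tag the" to "the mat last_tag".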



The internal representation of the model becomes important for higher values of N where, if V is the vocabulary size, \(V^N\) becomes very large. In such cases we cannot explicitly hold probabilities for all possible ngrams, and a sparse representation must be used (i.e. only non-zero probabilities are stored).
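To give a feel for the numbers, the following Python sketch compares the size of a dense table with the number of entries a sparse store would need. The helper names and the use of a Counter are illustrative assumptions and say nothing about how ngram_build lays out its models internally.

```python
from collections import Counter


def dense_table_size(vocab_size, order):
    """Number of cells a dense table needs: one per possible ngram (V^N)."""
    return vocab_size ** order


def sparse_counts(ngrams):
    """Sparse alternative: keep an entry only for ngrams actually observed."""
    return Counter(ngrams)


tokens = "the cat sat on the mat".split()
trigrams = [tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)]

dense_cells = dense_table_size(len(set(tokens)), 3)  # 5 word types, N = 3
sparse_entries = len(sparse_counts(trigrams))        # distinct trigrams seen

print(dense_cells, sparse_entries)  # prints: 125 4

# With a more realistic vocabulary the dense table is out of reach.
print(dense_table_size(20000, 3))   # prints: 8000000000000
```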

Getting more robust probability estimates

Two common techniques for obtaining better estimates for low- and zero-frequency ngrams are provided: smoothing and backing off.

Testing an ngram model

Use the



-w ifile  filename containing word list (required)

-p ifile  filename containing predictee word list (default is to use the word list given by -w)

-order int  order of the model: 1=unigram, 2=bigram, etc. (default 2)

-smooth int  Good-Turing smooth the grammar up to the given frequency

-input_format string  format of input data, one of sentence_per_line (default), sentence_per_file or ngram_per_line

-otype string  format of output file, one of cstr_ascii, cstr_bin or htk_ascii

-sparse  build ngram in sparse representation

-dense  build ngram in dense representation (default)

-backoff int  build backoff ngram (requires -smooth)

-floor double  frequency floor value used with some ngrams

-freqsmooth int  build frequency backed-off smoothed ngram (requires -smooth)

-trace  give verbose output about the build process

-save_compressed  save ngram in gzipped format

-oov_mode string  what to do about out-of-vocabulary words, one of skip_ngram, skip_sentence (default), skip_file or use_oov_marker

-oov_marker string  special word for OOV words (default !OOV); use in conjunction with '-oov_mode use_oov_marker'

Pseudo-words:

-prev_tag string  tag before sentence start

-prev_prev_tag string  all words before 'prev_tag'

-last_tag string  tag after sentence end

-default_tags  use default tags of !ENTER, !EXIT and !EXIT respectively
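A typical invocation can be assembled from the options in the synopsis. The Python sketch below only builds and prints the command line; it does not run the program. The helper name ngram_build_command and the file names train.txt, model.ngram and wordlist.txt are hypothetical, and the particular combination of flags is just one plausible example.

```python
import shlex


def ngram_build_command(inputs, output, wordlist, order=2, extra_flags=()):
    """Assemble an ngram_build argument list from flags named in the synopsis."""
    cmd = ["ngram_build", *inputs, "-o", output, "-w", wordlist,
           "-order", str(order)]
    cmd.extend(extra_flags)
    return cmd


# Hypothetical file names; a sparse trigram model with the default tags.
cmd = ngram_build_command(
    ["train.txt"],
    "model.ngram",
    "wordlist.txt",
    order=3,
    extra_flags=["-input_format", "sentence_per_line", "-default_tags", "-sparse"],
)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run on a system where ngram_build is installed.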