ngram_build [input file0] [input file1] ... -o [output file] [-p ifile] [-order int] [-smooth int] [-input_format string] [-otype string] [-sparse ] [-dense ] [-backoff int] [-floor double] [-freqsmooth int] [-trace ] [-save_compressed ] [-oov_mode string] [-oov_marker string] [-prev_tag string] [-prev_prev_tag string] [-last_tag string] [-default_tags ]
ngram_build offers basic ngram language model estimation.
Input data format
Two input formats are supported. In sentence_per_line format, the program will deal with start and end of sentence (if required) by using special vocabulary items specified by -prev_tag, -prev_prev_tag and -last_tag. For example, the input sentence:
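|the cat sat on the mat|
is treated as:
|... prev_prev_tag prev_prev_tag prev_tag the cat sat on the mat last_tag|
The trigrams taken from this padded sentence are listed below, one per line, which is also how the same data would be laid out as ngram_per_line input:
|prev_prev_tag prev_tag the|
|prev_tag the cat|
|the cat sat|
|cat sat on|
|sat on the|
|on the mat|
|the mat last_tag|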
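The padding and sliding-window extraction in the example above can be sketched in Python. This is only an illustration, not the code ngram_build runs: the helper names `pad_sentence` and `ngrams` are hypothetical, and the sketch assumes an order of at least 2, with exactly order-1 context words added on the left and a single end tag on the right.

```python
def pad_sentence(words, order, prev_tag, prev_prev_tag, last_tag):
    """Pad a tokenised sentence so its first word has order-1 words of context.

    Hypothetical helper: assumes order >= 2, which is all the example covers.
    """
    if order < 2:
        raise ValueError("this sketch only covers order >= 2")
    # order-2 copies of prev_prev_tag, followed by a single prev_tag
    left_context = [prev_prev_tag] * (order - 2) + [prev_tag]
    return left_context + list(words) + [last_tag]


def ngrams(tokens, order):
    """Slide a window of length `order` over the token list."""
    return [tuple(tokens[i:i + order]) for i in range(len(tokens) - order + 1)]


sentence = "the cat sat on the mat".split()
padded = pad_sentence(sentence, 3, "prev_tag", "prev_prev_tag", "last_tag")
trigrams = ngrams(padded, 3)

for gram in trigrams:
    print(" ".join(gram))
```

Each printed line is one trigram, in the same order as the listing in the example.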
The internal representation of the model becomes important for higher values of N where, if V is the vocabulary size, \(V^N\) becomes very large. In such cases, we cannot explicitly hold probabilities for all possible ngrams, and a sparse representation must be used (i.e. only non-zero probabilities are stored).
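To make the size argument concrete, the following Python sketch compares the number of entries a dense table would need with the number a sparse table holds. It is a toy illustration under assumptions: the function names are hypothetical, the vocabulary size of 60000 is an arbitrary example figure, and the sketch stores counts rather than probabilities. It says nothing about how ngram_build lays out its own tables.

```python
from collections import Counter


def dense_size(vocab_size, order):
    """Number of entries in a dense table: one per possible ngram, V**N."""
    return vocab_size ** order


def sparse_counts(observed_ngrams):
    """Sparse table: only ngrams that actually occur get an entry."""
    return Counter(observed_ngrams)


vocab = ["the", "cat", "sat", "on", "mat"]
observed = [
    ("the", "cat", "sat"),
    ("cat", "sat", "on"),
    ("sat", "on", "the"),
    ("on", "the", "mat"),
]
counts = sparse_counts(observed)

print(dense_size(len(vocab), 3))   # entries a dense trigram table would need
print(len(counts))                 # entries the sparse table actually holds
print(dense_size(60000, 3))        # dense trigram table for a 60000-word vocabulary
```

Even for this five-word vocabulary the dense trigram table has 125 cells against 4 observed trigrams, and the gap grows as \(V^N\).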
Getting more robust probability estimates
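As a rough illustration of Good-Turing smoothing applied up to a given frequency, the sketch below uses the textbook adjusted count r* = (r + 1) N_{r+1} / N_r, where N_r is the number of distinct ngrams seen exactly r times. Whether ngram_build uses exactly this formula is not stated here, so treat it as an assumption. The function name `good_turing_adjust` and the parameter `max_freq` are hypothetical.

```python
from collections import Counter


def good_turing_adjust(counts, max_freq):
    """Return textbook Good-Turing adjusted counts for frequencies <= max_freq.

    Counts above max_freq, or counts r for which no ngram occurs r + 1 times,
    are left unchanged (a simplification made for this sketch).
    """
    freq_of_freq = Counter(counts.values())  # N_r: how many ngrams occur r times
    adjusted = {}
    for ngram, r in counts.items():
        n_r = freq_of_freq[r]
        n_r_plus_1 = freq_of_freq.get(r + 1, 0)
        if r <= max_freq and n_r_plus_1 > 0:
            adjusted[ngram] = (r + 1) * n_r_plus_1 / n_r
        else:
            adjusted[ngram] = float(r)
    return adjusted


# Toy counts: three ngrams seen once, two seen twice, one seen three times.
toy_counts = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 3}
adjusted = good_turing_adjust(toy_counts, max_freq=2)
for ngram in sorted(adjusted):
    print(ngram, adjusted[ngram])
```

In this toy data the adjusted counts for low frequencies come out lower than the raw counts, which is what frees probability mass for ngrams that were never seen.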
Testing an ngram model
-w ifile  filename containing word list (required)
-p ifile  filename containing predictee word list (default is to use the word list given by -w)
-order int  order, 1=unigram, 2=bigram etc. (default 2)
-smooth int  Good-Turing smooth the grammar up to the given frequency
-input_format string  format of input data (default sentence_per_line); may also be sentence_per_file or ngram_per_line
-otype string  format of output file, one of cstr_ascii, cstr_bin or htk_ascii
-sparse  build ngram in sparse representation
-dense  build ngram in dense representation (default)
-backoff int  build backoff ngram (requires -smooth)
-floor double  frequency floor value used with some ngrams
-freqsmooth int  build frequency backed off smoothed ngram (requires -smooth)
-trace  give verbose output about the build process
-save_compressed  save ngram in gzipped format
-oov_mode string  what to do about out-of-vocabulary words, one of skip_ngram, skip_sentence (default), skip_file, or use_oov_marker
-oov_marker string  special word for oov words (default !OOV) (use in conjunction with '-oov_mode use_oov_marker')
Pseudo-words:
-prev_tag string  tag before sentence start
-prev_prev_tag string  all words before 'prev_tag'
-last_tag string  after sentence end
-default_tags  use default tags of !ENTER, !EXIT and !EXIT respectively
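The four out-of-vocabulary modes named above (skip_ngram, skip_sentence, skip_file, use_oov_marker) can be sketched in Python as follows. The behaviour of each mode is inferred from its name alone and is an assumption, not a description of what ngram_build does internally. The function `apply_oov_mode` is hypothetical, and sentence padding with the start and end tags is left out for brevity.

```python
def apply_oov_mode(sentences, vocab, order, mode="skip_sentence", oov_marker="!OOV"):
    """Return the ngrams kept from one file, given as a list of tokenised sentences.

    Assumed semantics (inferred from the option names):
      skip_ngram     drop only the ngrams that contain an unknown word
      skip_sentence  drop every sentence that contains an unknown word
      skip_file      drop the whole file if any sentence has an unknown word
      use_oov_marker replace each unknown word with oov_marker and keep everything
    """
    modes = ("skip_ngram", "skip_sentence", "skip_file", "use_oov_marker")
    if mode not in modes:
        raise ValueError("unknown oov mode: " + mode)

    kept = []
    for words in sentences:
        has_oov = any(w not in vocab for w in words)
        if has_oov and mode == "skip_file":
            return []  # discard everything collected from this file
        if has_oov and mode == "skip_sentence":
            continue
        if mode == "use_oov_marker":
            words = [w if w in vocab else oov_marker for w in words]
        grams = [tuple(words[i:i + order]) for i in range(len(words) - order + 1)]
        if mode == "skip_ngram":
            grams = [g for g in grams if all(w in vocab for w in g)]
        kept.extend(grams)
    return kept


vocab = {"the", "cat", "sat"}
sentences = [["the", "cat", "sat"], ["the", "cat", "dog"]]  # "dog" is out of vocabulary

for mode in ("skip_ngram", "skip_sentence", "skip_file", "use_oov_marker"):
    print(mode, apply_oov_mode(sentences, vocab, order=2, mode=mode))
```

On this two-sentence toy file the modes keep 3, 2, 0 and 4 bigrams respectively, which shows how much training data each choice can discard.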