Edinburgh Speech Tools Library

wagon CART building program

Table of Contents
Synopsis
OPTIONS
Building Trees

Synopsis

wagon [options] [-desc ifile] [-data ifile] [-stop int {50}] [-test ifile] [-frs float {10}] [-dlist] [-dtree] [-output ofile] [-o ofile] [-distmatrix ifile] [-quiet] [-verbose] [-predictee string] [-ignore string] [-stepwise] [-swlimit float {0.0}] [-swopt string] [-balance float] [-held_out int] [-heap int {210000}] [-noprune]

wagon is used to build CART trees from feature data. Its basic features include:

both decision trees and decision lists are supported
predictees can be discrete or continuous
input features may be discrete or continuous
many options for controlling tree building
- fixed stop value
- balancing
- held-out data and pruning
- stepwise use of input features
- choice of optimization criteria: correct/entropy (for classification) and rmse/correlation (for regression)

A detailed description of building CART models can be found in the CART model overview section.

OPTIONS

-desc
ifile Field description file
-data
ifile Datafile, one vector per line
-stop
int {50} Minimum number of examples for leaf nodes
-test
ifile Datafile to test tree on
-frs
float {10} Float range split, number of partitions to split a float feature range into
-dlist
Build a decision list (rather than tree)
-dtree
Build a decision tree (rather than list) (default)
-output
ofile
-o
ofile File to save output tree in
-distmatrix
ifile A distance matrix for clustering
-quiet
No questions printed during building
-verbose
Lots of information printed during build
-predictee
string Name of field to predict (default is first field)
-ignore
string Filename or bracket list of fields to ignore
-stepwise
Incrementally find best features
-swlimit
float {0.0} Percentage necessary improvement for stepwise
-swopt
string Parameter to optimize for stepwise. For classification the options are correct or entropy; for regression the options are rmse or correlation. correct and correlation are the defaults.
-balance
float For derived stop size: if the size of the dataset at a node, divided by balance, is greater than stop, that value is used as stop. If balance is 0 (default), stop is always used as is.
-held_out
int Percent to hold out for pruning
-heap
int {210000} Set size of Lisp heap. This should not normally need to be changed from its default; it is only needed with *very* large description files (> 1M)
-noprune
No (same class) pruning required
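
The -balance rule can be sketched as a small function. This is a reading of the option text above, not code taken from wagon itself; the function name effective_stop and its parameter names are hypothetical, and wagon may round the derived value differently.

```python
def effective_stop(node_size, stop=50, balance=0.0):
    """Return the stop value to use at a node holding node_size examples.

    Assumption: follows the wording of the -balance option text only.
    """
    if balance == 0:
        # Default: the fixed -stop value is always used as is.
        return stop
    derived = node_size / balance
    # The derived value replaces stop only when it is larger.
    return derived if derived > stop else stop
```

With -stop 50 and -balance 10, a node holding 1000 examples would get a derived stop of 100, while a node holding 200 examples would keep the fixed stop of 50.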

Building Trees

To build a decision tree (or list) Wagon requires data and a description of it. A data file consists of a set of samples, one per line, each consisting of the same set of features. Features may be categorical or continuous. By default the first feature is the predictee and the others are used as predictors. A typical data file will look like this

0.399 pau sh 0 0 0 1 1 0 0 0 0 0 0
0.082 sh iy pau onset 0 1 0 0 1 1 0 0 1
0.074 iy hh sh coda 1 0 1 0 1 1 0 0 1
0.048 hh ae iy onset 0 1 0 1 1 1 0 1 1
0.062 ae d hh coda 1 0 0 1 1 1 0 1 1
0.020 d y ae coda 2 0 1 1 1 1 0 1 1
0.082 y ax d onset 0 1 0 1 1 1 1 1 1
0.082 ax r y coda 1 0 0 1 1 1 1 1 1
0.036 r d ax coda 2 0 1 1 1 1 1 1 1
...

The data may come from any source, such as the festival script dumpfeats, which allows such files to be created easily from utterance files.

In addition to a data file, a description file is also required that gives a name and a type to each of the features in the datafile. For the above example it would look like

((segment_duration float)
 ( name aa ae ah ao aw ax ay b ch d dh dx eh el em en er ey f g hh ih iy jh k l m n nx ng ow oy p r s sh t th uh uw v w y z zh pau )
 ( n.name 0 aa ae ah ao aw ax ay b ch d dh dx eh el em en er ey f g hh ih iy jh k l m n nx ng ow oy p r s sh t th uh uw v w y z zh pau )
 ( p.name 0 aa ae ah ao aw ax ay b ch d dh dx eh el em en er ey f g hh ih iy jh k l m n nx ng ow oy p r s sh t th uh uw v w y z zh pau )
 (position_type 0 onset coda)
 (pos_in_syl float)
 (syl_initial 0 1)
 (syl_final 0 1)
 (R:Sylstructure.parent.R:Syllable.p.syl_break float)
 (R:Sylstructure.parent.syl_break float)
 (R:Sylstructure.parent.R:Syllable.n.syl_break float)
 (R:Sylstructure.parent.R:Syllable.p.stress 0 1)
 (R:Sylstructure.parent.stress 0 1)
 (R:Sylstructure.parent.R:Syllable.n.stress 0 1) )

The feature names are arbitrary, but because they appear in the generated trees, it is most useful for them to be features and/or pathnames if the trees are to be used in prediction for an utterance.
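
Since every sample must match the description field for field, it can help to check a data file before training. The following is a minimal sketch of such a check, not part of the speech tools: parse_description and check_sample are hypothetical helpers, the shortened description in DESC is made up for illustration, and the sketch only understands the two kinds of entry shown above (a field marked float, or a field followed by its list of allowed values).

```python
import re


def parse_description(text):
    """Parse a description into a list of (field_name, values) pairs."""
    tokens = re.findall(r"[()]|[^\s()]+", text)
    fields, depth, current = [], 0, None
    for tok in tokens:
        if tok == "(":
            depth += 1
            if depth == 2:          # start of one field entry
                current = []
        elif tok == ")":
            if depth == 2:          # end of one field entry
                fields.append((current[0], current[1:]))
                current = None
            depth -= 1
        elif depth == 2:
            current.append(tok)
    return fields


def check_sample(fields, line):
    """Return a list of problems found in one data line (empty if none)."""
    values = line.split()
    if len(values) != len(fields):
        return [f"expected {len(fields)} fields, got {len(values)}"]
    problems = []
    for (name, allowed), value in zip(fields, values):
        if allowed == ["float"]:
            # Continuous feature: the value must parse as a number.
            try:
                float(value)
            except ValueError:
                problems.append(f"{name}: {value!r} is not a number")
        elif value not in allowed:
            # Categorical feature: the value must be one of those listed.
            problems.append(f"{name}: {value!r} is not a declared value")
    return problems


# A shortened, made-up description used only to exercise the helpers.
DESC = """((segment_duration float)
 (name ae d hh iy pau sh)
 (position_type 0 onset coda)
 (syl_initial 0 1))"""
```

Calling check_sample(parse_description(DESC), line) on each line of a data file returns an empty list for lines that agree with the description and a list of messages otherwise.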

Wagon can be used to build a tree with such files with the command

wagon -data feats.data -desc fest.desc -stop 10 -output feats.tree

A test data set may also be given, which must match the given data description. If specified, the built tree will be tested on the test set and the results will be presented on completion; without a test set the results are given with respect to the training data. However, in the stepwise case the test set is used in the multi-level training process, so it cannot be considered true test data, and more reasonable results should be found by applying the generated tree to truly held-out data (via the program wagon_test).
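
When wagon is driven from a script, the command line above can be assembled programmatically. The sketch below uses only options listed in the OPTIONS section; the helper names wagon_command and run_wagon, and the file name feats.test, are hypothetical, and the sketch assumes a wagon binary is on the PATH.

```python
import shutil
import subprocess


def wagon_command(data, desc, output, stop=50, test=None):
    """Build the argument list for a wagon training run."""
    cmd = ["wagon", "-data", data, "-desc", desc,
           "-stop", str(stop), "-output", output]
    if test is not None:
        # Optional test set; it must match the same description file.
        cmd += ["-test", test]
    return cmd


def run_wagon(cmd):
    """Run wagon if it is installed and return its standard output."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError("wagon was not found on PATH")
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout
```

For example, run_wagon(wagon_command("feats.data", "fest.desc", "feats.tree", stop=10, test="feats.test")) would repeat the training command shown earlier with a test set added.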
