A second method for labeling is also available. Here we train full acoustic HMM models on the recorded data: we build a database-specific speech recognition engine and use that engine to label the data. As this method can work from recorded prompts plus orthography (and a method to produce phone strings from that orthography), it works well when you have no synthesizer to bootstrap from. However, such training requires that the database contain a suitable number of examples of triphones. Here we have an advantage. The requirement for speech synthesis data, that it has a good distribution of phonemes, is the same as that required for acoustic modeling, so a good speech synthesis database should produce a good acoustic model for labeling. Although there is no neat definition of what "good" is, we can say that you probably need at least 400 utterances and at least 15,000 segments. 400 sentences all starting with "The time is now, ..." probably won't do.
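As a rough illustration of the guideline above, the following sketch counts utterances, segments and triphones for a database held in memory. The function name, the dictionary layout and the toy phone strings are assumptions made for this example; they are not part of FestVox or SphinxTrain, and the thresholds are only the rough figures quoted above.

```python
from collections import Counter

# Rough guideline figures quoted in the text; not hard limits.
MIN_UTTERANCES = 400
MIN_SEGMENTS = 15000


def coverage_summary(utterances):
    """Summarize phone coverage for a dict of utterance id -> list of phone names."""
    triphones = Counter()
    n_segments = 0
    for phones in utterances.values():
        n_segments += len(phones)
        # Count each phone together with its left and right neighbours.
        for left, mid, right in zip(phones, phones[1:], phones[2:]):
            triphones[(left, mid, right)] += 1
    return {
        "utterances": len(utterances),
        "segments": n_segments,
        "distinct_triphones": len(triphones),
        "singleton_triphones": sum(1 for c in triphones.values() if c == 1),
        "meets_guideline": len(utterances) >= MIN_UTTERANCES
        and n_segments >= MIN_SEGMENTS,
    }


# Hypothetical toy data, far too small for real training.
toy = {
    "utt_0001": ["pau", "dh", "ax", "t", "ay", "m", "pau"],
    "utt_0002": ["pau", "dh", "ax", "k", "ae", "t", "pau"],
}
summary = coverage_summary(toy)
print(summary)
```

A large share of triphones seen only once suggests the prompts repeat the same contexts, which is the "The time is now, ..." problem in numeric form.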
Other large database synthesis techniques use the same basic techniques not just to label the database but also to define the units to be selected. [Donovan95] and others label their data with an acoustic model built with Baum-Welch training and use the defined HMM states (typically 3-5 per phoneme) as the units for selection. [Tokuda9?] actually use the state models themselves to generate the units, but again use the same basic techniques for labeling.
For training we use Carnegie Mellon University's SphinxTrain and Sphinx speech recognition system. There are other accessible training systems out there, HTK being the most famous, but SphinxTrain is the one we are most familiar with, and we have some control over its updates so can better ensure it remains appropriate for our synthesis labeling task. Just as voice building is complex, so is acoustic model building. SphinxTrain has been used reliably to label hundreds of databases in many different languages, but making it utterly robust against unseen data is very hard. So although we have tried to minimize the chance of things going wrong (in non-obvious ways), we will not be surprised if there are some problems when you try this process on a new database.
SphinxTrain (and Sphinx) have a number of restrictions which we need to keep in mind when labeling a set of prompts. These are code limitations, and may be fixed in future versions of SphinxTrain/Sphinx. For the most part they are not actually serious restrictions, just minor problems that the setup scripts need to work around. The scripts cater for these limitations, which will mostly go unseen by the user, unless of course something goes wrong.
Specifically, Sphinx folds case on all phoneme names, so the scripts ensure that phone names are distinct irrespective of upper and lower case. This is done by prepending "CAP" to upper case phone names. Secondly, there can be at most 255 phones. This is only likely to be a problem when SphinxTrain phones are made more elaborate than simple phones, so mostly it won't be a problem. The third noted problem is a limitation on the length and complexity of utterances. The transcript file has a line length limit, as does the lexicon. For "nice" utterances this is never a problem, but for some of our databases, especially those with paragraph-length utterances, the training and/or the labeling itself fails.
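The first two restrictions can be sketched as follows. This is only an illustration of the idea, not the code the FestVox scripts use: the function name is hypothetical, and the rule that any name containing an upper case letter gets the "CAP" prefix is an assumption (the text only says that upper case phone names are prefixed).

```python
MAX_PHONES = 255  # limit on the number of phones quoted in the text
PREFIX = "CAP"


def sphinx_safe_phone_names(phones):
    """Map phone names to names that stay distinct after case folding."""
    mapping = {}
    for phone in phones:
        # Assumption: any name containing an upper case letter gets the prefix.
        safe = PREFIX + phone if any(ch.isupper() for ch in phone) else phone
        mapping[phone] = safe
    folded = [name.lower() for name in mapping.values()]
    if len(set(folded)) != len(folded):
        raise ValueError("phone names still collide after case folding")
    if len(mapping) > MAX_PHONES:
        raise ValueError("too many phones: %d > %d" % (len(mapping), MAX_PHONES))
    return mapping


mapping = sphinx_safe_phone_names(["a", "A", "sh", "S", "pau"])
print(mapping)

# Reverse lookup on the case-folded safe name, for mapping labels back.
reverse = {safe.lower(): phone for phone, safe in mapping.items()}
```

Keeping the reverse table keyed on the folded name means labels can be mapped back to the voice's own phone set whatever case the recognizer reports.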
Sphinx2 is a real-time speech recognition system made available under a free software license. It is available from http://cmusphinx.org. The source is available from http://sourceforge.net/projects/cmusphinx/. For these tests we used version sphinx2-0.4.tar.gz. SphinxTrain is a set of programs and scripts that allow the building of acoustic models for Sphinx2 (and Sphinx3). You can download SphinxTrain from http://sourceforge.net/projects/cmusphinx/. Note that Sphinx2 must be compiled and installed, while SphinxTrain can run in place. On many systems, steps like these should give you working versions.
tar zxvf sphinx2-0.4.tar.gz
cd sphinx2
./configure --prefix=$SPHINX2DIR
tar zxvf SphinxTrain-0.9.1-beta.tar.gz
Now that we have Sphinx2 and SphinxTrain installed we can prepare our FestVox voice for training. Before starting the training process you must create utterance files for each of the prompts. This can be done with the conventional festival script.
festival -b festvox/build_clunits.scm '(build_prompts "etc/txt.done.data")'
This generates label files in prompt-lab/ and waveform files in prompt-wav/, which technically are not needed for this labeling process. Utterances are saved in prompt-utt/. At first it was thought that the prompt file etc/txt.done.data would be sufficient, but the synthesis process is likely to resolve pronunciations in context, through post-lexical rules etc., which would make naive conversion of the words in the prompt list to phone lists wrong in general. The transcription for SphinxTrain is therefore generated from the utterances themselves, which ensures that the resulting labels can be trivially mapped back after labeling. Thus the word names generated in this process are somewhat arbitrary, though often human readable. The word names are the words themselves plus a number (to ensure uniqueness in pronunciations). Only "nice" words are printed as is, i.e. those containing only alphabetic characters; others are mapped to the word "w" with an appropriate number following. Thus hyphenated, quoted, etc. words will not cause a problem for the SphinxTrain code.
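The word-naming scheme described above can be sketched as follows. This is a guess at the general idea rather than the actual FestVox code: the function name, the per-word numbering and the reading of "nice" as ASCII letters only are all assumptions made for this example.

```python
def name_word_instances(instances):
    """Assign a SphinxTrain-safe name to each (word, phones) pair.

    Identical (word, phones) pairs share a name; the same word with a
    different pronunciation gets a new number.
    """
    names = {}      # (word, phones) -> assigned name
    counters = {}   # base name -> last number used
    result = []
    for word, phones in instances:
        key = (word, tuple(phones))
        if key not in names:
            # Assumption: "nice" means ASCII letters only.
            base = word if word.isascii() and word.isalpha() else "w"
            counters[base] = counters.get(base, 0) + 1
            names[key] = "%s%d" % (base, counters[base])
        result.append(names[key])
    return result


# Hypothetical word instances with their phone strings.
words = [
    ("the", ["dh", "ax"]),
    ("the", ["dh", "iy"]),
    ("well-known", ["w", "eh", "l", "n", "ow", "n"]),
    ("the", ["dh", "ax"]),
    ('"quoted"', ["k", "w", "ow", "t", "ih", "d"]),
]
word_names = name_word_instances(words)
print(word_names)
```

Because each name is tied to one pronunciation, a lexicon built from these names never needs alternative pronunciations, which is what keeps the mapping back to the utterances trivial.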
After the prompt utterances are generated we can set up the SphinxTrain directory st/. All processing is done, and all output files are written, within that directory until the final conversion of the labels back into the voice's own phone set, when they are put in lab/. Note that this process takes a long time: at least several hours, and possibly several days if you have a particularly slow machine or a particularly large database. It may also require around half a gigabyte of space.
The script ./bin/sphinxtrain does the work of converting the FestVox database into a form suitable for SphinxTrain. In all there are 6 steps: setup, building files, converting waveforms, the training itself, alignment, and conversion of label files back into FestVox format. The training stage itself consists of 11 parts and takes by far the most time.
This script requires the environment variables SPHINXTRAINDIR and SPHINX2DIR to be set to point to compiled versions of SphinxTrain and Sphinx2 respectively, as shown above.
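Since a run can take hours, it may be worth checking these prerequisites before starting. The following is a minimal sketch of such a check; the function name and the exact free-space threshold are assumptions for this example, and it is not part of the FestVox scripts.

```python
import os
import shutil

REQUIRED_VARS = ("SPHINXTRAINDIR", "SPHINX2DIR")
MIN_FREE_BYTES = 512 * 1024 * 1024  # roughly half a gigabyte


def preflight(voice_dir=".", environ=None):
    """Return a list of human-readable problems; an empty list means ready."""
    environ = os.environ if environ is None else environ
    problems = []
    for var in REQUIRED_VARS:
        path = environ.get(var)
        if not path:
            problems.append("%s is not set" % var)
        elif not os.path.isdir(path):
            problems.append("%s does not point to a directory: %s" % (var, path))
    # Free space on the file system holding the voice directory.
    free = shutil.disk_usage(voice_dir).free
    if free < MIN_FREE_BYTES:
        problems.append("only %d MB free in %s" % (free // (1024 * 1024), voice_dir))
    return problems


for problem in preflight():
    print("warning:", problem)
```

The check only confirms that the variables name existing directories; it cannot tell whether the builds inside them actually work.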
The training database name will be taken from your etc/voice.defs, if you don't have one of those use
The next stage is to convert the database prompt list into a transcription file suitable for SphinxTrain, and to construct a lexicon, phone file, etc. All of the generated files will be put in st/etc/. Note that because of various limitations in Sphinx2 and SphinxTrain, the lexicon (.dic) and transcription (.transcription) will not have what you might think are sensible values. The word names are taken from the utterances if they consist of only upper and lower case characters. A number is added to make them unique. Also, if another word exists with the same pronunciation but a different spelling, it may be assigned a different name from what you expect. The word names in the SphinxTrain files are only there to help debugging, and really refer to specific instances of words in the utterances (to ensure the pronunciations are preserved with respect to homograph disambiguation and post-lexical rules). If people complain about these being confusing I will simply name all words "w" followed by a number.
The next stage is to generate the MFCCs for SphinxTrain. Unfortunately these must be in a different format from the MFCCs used in FestVox. Also, SphinxTrain only supports raw (headerless) files and NIST-headered files, so we copy the waveform files in wav/ into the st/wav/ directory, converting them to NIST headers.
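The FestVox script does this conversion for you. Purely to illustrate what a NIST (SPHERE) header is, the following sketch rewrites a 16-bit PCM RIFF waveform with a 1024-byte NIST header. The function names are hypothetical and only a minimal set of header fields is written, so treat it as a sketch rather than a replacement for the real conversion.

```python
import io
import wave

HEADER_SIZE = 1024


def riff_to_nist(wav_bytes):
    """Return NIST SPHERE bytes for 16-bit PCM RIFF/WAVE input bytes."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        channels = wav.getnchannels()
        rate = wav.getframerate()
        count = wav.getnframes()
        samples = wav.readframes(count)
    fields = [
        "NIST_1A",
        "   1024",
        "channel_count -i %d" % channels,
        "sample_rate -i %d" % rate,
        "sample_count -i %d" % count,
        "sample_n_bytes -i 2",
        "sample_byte_format -s2 01",  # 01 = low byte first, as in RIFF
        "sample_coding -s3 pcm",
        "end_head",
    ]
    header = ("\n".join(fields) + "\n").encode("ascii")
    # The header is padded to a fixed 1024 bytes before the samples.
    return header.ljust(HEADER_SIZE, b" ") + samples


def make_silent_wav(n_samples=160, rate=16000):
    """Build a short silent mono RIFF file in memory for the demonstration."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * n_samples)
    return buf.getvalue()


nist = riff_to_nist(make_silent_wav())
print(len(nist))
```

The sample data is copied unchanged; only the header differs between the two formats in this case.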
Now we can start the training itself. This consists of eleven stages, each of which is run automatically.
Module 0 checks the basic files for training. There should be no errors at this stage.
Module 1 builds the vector quantization parameters.
Module 2 builds context-independent phone models. This runs Baum-Welch over the data, building context-independent HMM phone models. It runs for several passes until convergence (somewhere between 4 and 15 passes). There may be errors on some files (especially long or badly transcribed ones), but a small number of errors here (with the identified file being "ignored") should be OK.
Module 3 makes the untied model definition.
Module 4 builds context dependent models.
Module 5a builds trees for asking questions for tied-states.
Module 5b builds trees. One for each state in each HMM. This part takes the longest time.
Module 6 prunes trees.
Module 7 retrains context-dependent models with tied states.
Module 8 performs deleted interpolation.
Module 9 converts the generated models to Sphinx2 format.
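The "several passes until convergence" behavior of Module 2 can be pictured with a generic loop such as the one below. It is not SphinxTrain's code and SphinxTrain's actual stopping criterion may differ; run_pass is a hypothetical callable standing in for one Baum-Welch re-estimation pass, and the tolerance and likelihood values are invented for the example.

```python
def train_until_converged(run_pass, max_passes=15, tolerance=1e-3):
    """Call run_pass() repeatedly until its log-likelihood stops improving.

    run_pass is a hypothetical callable that performs one re-estimation
    pass and returns the total log-likelihood of the training data.
    """
    history = []
    for _ in range(max_passes):
        history.append(run_pass())
        if len(history) >= 2:
            previous, current = history[-2], history[-1]
            # Stop when the relative change between passes is small.
            if abs(current - previous) <= tolerance * abs(previous):
                break
    return history


# Invented log-likelihoods standing in for real training passes.
fake_scores = iter([-1000.0, -900.0, -880.0, -879.9, -879.89])
history = train_until_converged(lambda: next(fake_scores))
print(len(history), "passes")
```

With these invented scores the loop stops once the change between two passes drops below a tenth of a percent.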
Some utterances may fail to be labeled at this point, either because they are too long or because their orthography does not match the acoustics. There is no simple solution for this at present. For some you will simply not get a label file, and you can either label the utterance by hand, exclude it from the data, or split it into smaller files. Other times Sphinx2 will crash, and you'll need to remove the offending utterances from the st/etc/*.align and st/etc/*.ctl files and run the script ./bin/sphinx_lab by hand.
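Removing utterances from those files can be done in any text editor, but for more than a handful a small filter may help. The sketch below assumes, without having checked the files FestVox generates, that a control line is just the utterance id (possibly with a directory prefix) and that a transcript line ends with the id in parentheses. The function name and the example ids are hypothetical, and you should keep a backup of the original files.

```python
def drop_utterances(lines, bad_ids):
    """Return the lines that do not refer to any utterance id in bad_ids."""
    bad = set(bad_ids)
    kept = []
    for line in lines:
        text = line.strip()
        # Assumed control-file format: the id, possibly after a directory.
        if text.rsplit("/", 1)[-1] in bad:
            continue
        # Assumed transcript format: the line ends with "(id)".
        if text.endswith(")") and text[text.rfind("(") + 1:-1] in bad:
            continue
        kept.append(line)
    return kept


# Hypothetical file contents.
ctl_lines = ["utt_0001", "utt_0002", "utt_0003"]
align_lines = [
    "<s> the1 time1 </s> (utt_0001)",
    "<s> w1 cat1 </s> (utt_0002)",
]
print(drop_utterances(ctl_lines, ["utt_0002"]))
print(drop_utterances(align_lines, ["utt_0002"]))
```

Applying the same id list to both files keeps them consistent with each other.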
festival -b festvox/build_clunits.scm '(build_utts "etc/txt.done.data")'