We currently autolabel the recorded prompts using the alignment technique based on [malfrere97] discussed above for diphone labeling. Although this technique works, it is not as robust for general speech as it is for carefully articulated nonsense words. Specifically, it does not allow for alternative pronunciations, which are common.
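The idea behind this style of alignment can be sketched as follows: synthesize the prompt (so phone boundaries are known exactly), dynamic-time-warp the synthetic speech against the recording, and map the known boundaries through the warping path. The sketch below is illustrative only, with toy one-dimensional features standing in for real cepstral frames; it is not the actual Festvox implementation.

```python
import numpy as np

def dtw_path(synth, recorded):
    """Return the DTW warping path between two feature-frame sequences."""
    n, m = len(synth), len(recorded)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(synth[i - 1] - recorded[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    # Backtrack from the end to recover the frame-to-frame mapping.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def map_boundary(b, path):
    """Map a phone-boundary frame index in the synthetic prompt onto the
    recording: return the recorded frame paired with that synthetic frame."""
    for si, ri in path:
        if si >= b:
            return ri
    return path[-1][1]

# Toy example: two "phone" changes; the synthetic prompt runs slightly
# faster than the recording, so boundaries land later in recorded frames.
synth = np.array([[0.0], [0.0], [1.0], [1.0], [2.0], [2.0]])
recorded = np.array([[0.0], [0.0], [0.0], [1.0], [1.0], [2.0], [2.0], [2.0]])
path = dtw_path(synth, recorded)
```

Because the synthesizer knows exactly where each phone starts in its own output, pushing those frame indices through `map_boundary` yields phone labels for the recording without any recognizer at all, which is both the strength and the fragility of the approach.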
Ideally we should use a more general speech recognition system. In fact, for labeling a unit selection database the best results would come from training a recognizer using Baum-Welch on the database itself until convergence, as this would give the most consistent labeling. We have started initial experiments using the open source CMU Sphinx recognition system and are likely to provide scripts to do this in later releases.
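To illustrate what "training with Baum-Welch until convergence" means, the toy sketch below re-estimates the parameters of a small discrete HMM by forward-backward (EM) iteration, stopping when the data likelihood stops improving. Real trainers such as SphinxTrain work with continuous-density HMMs over cepstral features across many utterances; the model, observation sequence, and variable names here are illustrative assumptions only.

```python
import numpy as np

def baum_welch_step(obs, A, B, pi):
    """One EM iteration: forward-backward pass, then re-estimate A, B, pi."""
    T, N = len(obs), A.shape[0]
    # Forward and backward probabilities (unscaled; fine for toy lengths).
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    likelihood = alpha[-1].sum()
    gamma = alpha * beta / likelihood              # expected state occupancy
    xi = np.zeros((T - 1, N, N))                   # expected transitions
    for t in range(T - 1):
        xi[t] = (alpha[t][:, None] * A
                 * B[:, obs[t + 1]] * beta[t + 1]) / likelihood
    new_A = xi.sum(0) / gamma[:-1].sum(0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        new_B[:, k] = gamma[obs == k].sum(0) / gamma.sum(0)
    return new_A, new_B, gamma[0], likelihood

# Toy model: 2 states, 2 discrete output symbols, one short utterance.
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.6, 0.4], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
obs = np.array([0, 0, 1, 1, 1, 0])

# Iterate until convergence; EM guarantees the likelihood never decreases.
history = []
for _ in range(100):
    A, B, pi, like = baum_welch_step(obs, A, B, pi)
    if history and like - history[-1] < 1e-9:
        break
    history.append(like)
```

The point of running this to convergence on the database itself is consistency: the models end up describing how this particular speaker actually realizes each phone, rather than how an external corpus does.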
In the meantime, the greatest constraint is that the predicted phone list must be the same as (or very similar to) what was actually spoken. This can be achieved by a knowledgeable speaker, and by customizing the front end of the synthesizer so that it produces more natural segments.
Two observations are worth mentioning here. First, if the synthesizer makes a mistake in pronunciation and the human speaker does not, we have found that things may work out anyway. For example, the word "hwy" appeared in the prompts, which the synthesizer pronounced as the letters "H W Y" while the speaker said "highway". The forced alignment, however, caused the phones in the letter pronunciation "H W Y" to be aligned with the phones in "highway", so at selection time appropriate, though misnamed, phones were chosen. However, you should not depend on such compound errors.
The second observation is that Festival includes prediction of vowel reduction. We are beginning to feel that such prediction is unnecessary in a limited domain synthesizer, or in unit selection in general. This vowel variation can be captured by the clustering techniques themselves, which allows a more reasonable back-off selection strategy than making hard decisions up front.