Building Synthetic Voices

Alan W Black and Kevin A. Lenzo

Language Technologies Institute, Carnegie Mellon University and Cepstral, LLC
Thursday, 2nd January 2003

Full distribution: festvox-2.0-release.tar.gz; document alone: bsv.ps.gz, bsv.pdf
Updates and news about FestVox will be posted at http://festvox.org

NOTE: this document is incomplete



Copyright © 1999-2003 by Alan W Black & Kevin A. Lenzo


Table of Contents
I. Speech Synthesis
1. Overview of Speech Synthesis
1.1. History
1.2. Uses of Speech Synthesis
1.3. General Anatomy of a Synthesizer
2. Speech Science
3. A Practical Speech Synthesis System
3.1. Basic Use
3.2. Utterance structure
3.3. Modules
3.4. Utterance access
3.5. Utterance building
3.6. Extracting features from utterances
II. Building Synthetic Voices
4. Basic Requirements
4.1. Hardware/software requirements
4.2. Voice in a new language
4.3. Voice in an existing language
4.4. Selecting a speaker
4.5. Who owns a voice
4.6. Recording under Unix
4.7. Extracting pitchmarks from waveforms
5. Limited domain synthesis
5.1. Designing the prompts
5.2. Customizing the synthesizer front end
5.3. Autolabeling issues
5.4. Unit size and type
5.5. Using limited domain synthesizers
5.6. Telling the time
5.7. Making it better
6. Text analysis
6.1. Non-standard words analysis
6.2. Token to word rules
6.3. Number pronunciation
6.4. Homograph disambiguation
6.5. TTS modes
6.6. Mark-up modes
7. Lexicons
7.1. Word pronunciations
7.2. Lexicons and addenda
7.3. Out of vocabulary words
7.4. Building letter-to-sound rules by hand
7.5. Building letter-to-sound rules automatically
7.6. Post-lexical rules
7.7. Building lexicons for new languages
8. Building prosodic models
8.1. Phrasing
8.2. Accent/Boundary Assignment
8.3. F0 Generation
8.4. Duration
8.5. Prosody Research
8.6. Prosody Walkthrough
9. Corpus development
10. Waveform Synthesis
11. Diphone databases
11.1. Diphone introduction
11.2. Defining a diphone list
11.3. Recording the diphones
11.4. Labeling the diphones
11.5. Extracting the pitchmarks
11.6. Building LPC parameters
11.7. Defining a diphone voice
11.8. Checking and correcting diphones
11.9. Diphone check list
12. Unit selection databases
12.1. Cluster unit selection
12.2. Building a Unit Selection Cluster Voice
12.3. Diphones from general databases
13. Labeling Speech
13.1. Labeling with Dynamic Time Warping
13.2. Labeling with Full Acoustic Models
13.3. Prosodic Labeling
14. Evaluation and Improvements
14.1. Evaluation
14.2. Does it work at all?
14.3. Formal Evaluation Tests
14.4. Debugging voices
III. Interfacing and Integration
15. Markup
16. Concept-to-speech
17. Deployment
IV. Recipes
18. A Japanese Diphone Voice
19. US/UK English Diphone Synthesizer
20. ldom full example
21. Non-English ldom example
V. Concluding Remarks
22. Concluding remarks and future
23. Festival Details
24. Festival's Scheme Programming Language
24.1. Overview
24.2. Data Types
24.3. Functions
24.3.1. Core functions
24.3.2. List functions
24.3.3. Arithmetic functions
24.3.4. I/O functions
24.3.5. String functions
24.3.6. System functions
24.3.7. Utterance Functions
24.3.8. Synthesis Functions
24.4. Debugging and Help
24.5. Adding new C++ functions to Scheme
24.6. Regular Expressions
24.7. Some Examples
25. Edinburgh Speech Tools
26. Machine Learning
27. Resources
27.1. Festival resources
27.2. General speech resources
28. Tools Installation
29. English phone lists
29.1. US phoneset
29.2. UK phoneset