Building Synthetic Voices

Alan W Black and Kevin A. Lenzo

Language Technologies Institute, Carnegie Mellon University and Cepstral, LLC
Thursday, 2nd January 2003

Full distribution: festvox-2.0-release.tar.gz; document alone: bsv.ps.gz, bsv.pdf
Updates and news about FestVox will be posted at http://festvox.org

NOTE: this document is incomplete



Copyright © 1999-2003 by Alan W Black & Kevin A. Lenzo


Table of Contents
I. Speech Synthesis
1. Overview of Speech Synthesis
1.1. History
1.2. Uses of Speech Synthesis
1.3. General Anatomy of a Synthesizer
2. Speech Science
3. A Practical Speech Synthesis System
3.1. Basic Use
3.2. Utterance structure
3.3. Modules
3.4. Utterance access
3.5. Utterance building
3.6. Extracting features from utterances
II. Building Synthetic Voices
4. Basic Requirements
4.1. Hardware/software requirements
4.2. Voice in a new language
4.3. Voice in an existing language
4.4. Selecting a speaker
4.5. Who owns a voice
4.6. Recording under Unix
4.7. Extracting pitchmarks from waveforms
5. Limited domain synthesis
5.1. Designing the prompts
5.2. Customizing the synthesizer front end
5.3. Autolabeling issues
5.4. Unit size and type
5.5. Using limited domain synthesizers
5.6. Telling the time
5.7. Making it better
6. Text analysis
6.1. Non-standard words analysis
6.2. Token to word rules
6.3. Number pronunciation
6.4. Homograph disambiguation
6.5. TTS modes
6.6. Mark-up modes
7. Lexicons
7.1. Word pronunciations
7.2. Lexicons and addenda
7.3. Out of vocabulary words
7.4. Building letter-to-sound rules by hand
7.5. Building letter-to-sound rules automatically
7.6. Post-lexical rules
7.7. Building lexicons for new languages
8. Building prosodic models
8.1. Phrasing
8.2. Accent/Boundary Assignment
8.3. F0 Generation
8.4. Duration
8.5. Prosody Research
8.6. Prosody Walkthrough
9. Corpus development
10. Waveform Synthesis
11. Diphone databases
11.1. Diphone introduction
11.2. Defining a diphone list
11.3. Recording the diphones
11.4. Labeling the diphones
11.5. Extracting the pitchmarks
11.6. Building LPC parameters
11.7. Defining a diphone voice
11.8. Checking and correcting diphones
11.9. Diphone check list
12. Unit selection databases
12.1. Cluster unit selection
12.2. Building a Unit Selection Cluster Voice
12.3. Diphones from general databases
13. Labeling Speech
13.1. Labeling with Dynamic Time Warping
13.2. Labeling with Full Acoustic Models
13.3. Prosodic Labeling
14. Evaluation and Improvements
14.1. Evaluation
14.2. Does it work at all?
14.3. Formal Evaluation Tests
14.4. Debugging voices
III. Interfacing and Integration
15. Markup
16. Concept-to-speech
17. Deployment
IV. Recipes
18. A Japanese Diphone Voice
19. US/UK English Diphone Synthesizer
20. ldom full example
21. Non-English ldom example
V. Concluding Remarks
22. Concluding remarks and future
23. Festival Details
24. Festival's Scheme Programming Language
24.1. Overview
24.2. Data Types
24.3. Functions
24.3.1. Core functions
24.3.2. List functions
24.3.3. Arithmetic functions
24.3.4. I/O functions
24.3.5. String functions
24.3.6. System functions
24.3.7. Utterance Functions
24.3.8. Synthesis Functions
24.4. Debugging and Help
24.5. Adding new C++ functions to Scheme
24.6. Regular Expressions
24.7. Some Examples
25. Edinburgh Speech Tools
26. Machine Learning
27. Resources
27.1. Festival resources
27.2. General speech resources
28. Tools Installation
29. English phone lists
29.1. US phoneset
29.2. UK phoneset