I. Speech Synthesis

This book is about spoken language output -- how to build a synthetic voice that can express a range of concepts in a given domain, whether the system has a small, easily described domain, or must handle an unlimited set of possible inputs.

The Festival Speech Synthesis System, and more specifically the FestVox voice building tools, provide the framework for the discussion. All of these tools, as well as the other applications and databases used in the examples here, are available as open source and/or free software.

This version is still a very early draft; it moves abruptly between high-level discussion and low-level code details without properly introducing the material. As it stands, this version is most useful to people with some speech processing experience who want to build voices in the Festival Speech Synthesis System and have the patience to work through these notes. These sections will become more general introductions in later versions.

Over the last several years, speech technology, as well as the computational resources needed to run the necessary (or popular) algorithms, has advanced considerably. We can build systems that interact through speech: they can listen to what was said using speech recognition, compute or do something, and then speak back, using spoken language generation.

Language is a fundamental part of everyday life. Whether we are using speech, sign language, or perhaps a coding system that conveys meaning through touch, we use language to express our thoughts, intentions, reactions, and experiences -- often scarcely considering that we are even speaking, but rather feeling as if we were exchanging ideas directly with our conversation partner.

The complexity of the underlying systems involved in speech rarely enters our minds as we go about our business, and it seems as if this fosters an unspoken attitude: "It's easy to talk, it should be easy to make a machine that talks." By the time you finish this book, you should be aware of the components of the process, the relative effort and expertise involved in each component, and where the trade-offs lie in speech synthesis today. Our hope is that you will take away, even with a light reading, an appreciation of the fundamental aspects of speech communication, and how one might, with this modality, make machines more useful.

The book is organized into five major parts: Overview and Use of Speech Output, Building Synthetic Voices, Interfacing and Integration, detailed recipes for building voices, and our concluding remarks critiquing the state of the art and discussing a number of interesting issues that remain open in speech science and synthesis technology.

Part I is a speech science and technology overview, primarily as it relates to speech-interactive systems, and how to use speech synthesis. We will be using the Festival Speech Synthesis System throughout the book, and it is presented here. If you are familiar with speech synthesis, you might want to skip to the discussion of Festival in Chapter 3; if you're familiar with Festival, you may wish to go directly to Part II.

In Chapter 1, we start with an introduction to speech in general, the role of spoken language generation, and, in particular, the basic issues in speech synthesis: text analysis, prosody, and audio waveform generation. Then, in Chapter 2, we discuss speech science and technology in greater detail, introducing the basic tools, concepts, and terms involved. Chapter 3 is on using the Festival Speech Synthesis System -- as a system, its basic use, and its modules and architecture. The utterance structure we use is also discussed there.
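As a small preview of the material in Chapter 3, the following is a minimal sketch of an interactive Festival session. It assumes a standard Festival installation with the kal_diphone voice available; voice names and installation paths may differ on your system.

    $ festival
    festival> (voice_kal_diphone)        ; select a diphone voice (assumed installed)
    festival> (SayText "Hello world.")   ; synthesize and play a short utterance
    festival> (quit)

For batch use, `festival --tts myfile.txt' reads a plain text file and speaks it directly.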

Part II is on Building Synthetic Voices; here, we start looking at the process of building a voice, and illustrate it with demonstrations and walk-throughs using the FestVox voice building tools. Chapter 5 covers limited domain synthesis, in which high quality can be achieved at the expense of flexibility, if you know in advance what the system is going to say. In Chapter 6, we then go into general synthesis, which includes Text-to-Speech (TTS); the system should be able to speak, no matter what it is given, with some expected level of quality -- even if given only a string of text. It is in this chapter that we expose most of the components of a general purpose synthesizer, such as the text processing module, prosodic models that predict tune and timing, lexicons and pronunciation modeling, and issues with data collection and recording in order to create voices.

Chapter 10 is on various techniques for audio waveform generation, once the system knows the final parameters of the utterance. Evaluation and tuning of voices is discussed in Chapter 14.

In Part III, we get to Interfacing and Integration issues. Chapter 15 covers mark-up languages, or mark-up components of standards, such as SABLE, JSML, and VoiceXML, and how these can be used and implemented for synthesis. Then, in Chapter 16 we go into some detail on Concept-to-Speech systems, and so touch on language generation and the benefits of coupling the language generation and synthesis components in terms of spoken output quality. Chapter 17 is a discussion of deployment issues, such as client/server issues, the memory and disk footprint, runtime constraints, scaling up and pruning systems down to meet the available resources, and signal compression.

Although step by step examples are discussed in line throughout the various chapters, Part IV offers complete, detailed walk-throughs of the actual steps involved in a number of larger examples. Chapter 18 gives a complete example of building a Japanese diphone synthesizer, Chapter 19 covers UK and US English diphone synthesizers, and Chapter 20 covers the design and execution of building a limited domain synthesizer for ...

We close with Part V, a discussion of the state of the art and the future of spoken language output, followed by a set of appendices.