11.2. Defining a diphone list

Since diphones need to be cleanly articulated, various techniques have been proposed to elicit them from subjects. One technique is to use target words embedded in carrier sentences to ensure that the diphones are pronounced with acceptable duration and prosody (i.e. consistently). We have typically used nonsense words that iterate through all possible combinations; the advantage of this is that you don't need to search for natural examples that contain the desired diphone, the list can be more easily checked, and the presentation is less prone to pronunciation errors than if real words were presented. The words look unnatural, but collecting all diphones is not a particularly natural thing to do anyway. See [isard86] or [stella83] for some more discussion on the use of nonsense words for collecting diphones.

For best results, we believe the words should be pronounced with consistent vocal effort, with as little prosodic variation as possible; in fact, pronouncing them in a monotone is ideal. Our nonsense words consist of a simple carrier form with the diphones (where appropriate) being taken from a middle syllable. Except where schwa and syllabic consonants are involved, that syllable should normally be a full stressed one.

Some example code is given in src/diphone/darpaschema.scm. The basic idea is to define classes of diphones, for example: vowel-consonant, consonant-vowel, vowel-vowel and consonant-consonant. Then define carrier contexts for these and list the cases. Here we use Festival's Scheme interpreter to generate the list, though any scripting language is suitable. Our intention is that the diphone will come from a middle syllable of the nonsense word, so that it is fully articulated and the articulatory effects at the start and end of the word are minimized.

For example, to generate all vowel-vowel diphones, we define a carrier

(set! vv-carrier '((pau t aa t) (t aa pau)))

And we define a simple function that will enumerate all vowel-vowel transitions

(define (list-vvs)
  (apply
   append
   (mapcar
    (lambda (v1)
      (mapcar 
       (lambda (v2) 
         (list
          (string-append v1 "-" v2)
          (append (car vv-carrier) (list v1 v2) (car (cdr vv-carrier)))))
       vowels))
    vowels)))

For those of you who aren't used to reading Lisp, this simply lists all possible combinations; in a potentially more readable format (in an imaginary language) it is

for v1 in vowels
   for v2 in vowels
     print pau t aa t $v1 $v2 t aa pau

The actual Lisp code returns a list of diphone names and phone strings. To be more efficient, the DARPAbet example produces consonant-vowel and vowel-consonant diphones in the same nonsense word, which reduces the number of words to be spoken quite significantly. Your voice talent will appreciate this.
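To give a flavor of how this pairing can be generated, here is a minimal sketch in the same style as list-vvs; the carrier, the function name, and the consonants variable are illustrative assumptions, not the actual contents of darpaschema.scm (compare the uk_0001 entry shown later).

(set! cv-carrier '((pau t aa) (pau)))

(define (list-cv-vcs)
  "Enumerate consonant-vowel and vowel-consonant diphones together.
Each nonsense word contains c v c v, so both the C-V and the V-C
diphone can be taken from fully articulated medial positions."
  (apply
   append
   (mapcar
    (lambda (c)
      (mapcar
       (lambda (v)
         (list
          ;; two diphones from one word, e.g. ("b-aa" "aa-b")
          (list (string-append c "-" v) (string-append v "-" c))
          ;; e.g. (pau t aa b aa b aa pau)
          (append (car cv-carrier) (list c v c v) (car (cdr cv-carrier)))))
       vowels))
    consonants)))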

Although the idea seems simple to begin with (simply listing all contexts and pairs), there are other constraints. Some consonants can only appear in the onset of a syllable (before the vowel), and others are restricted to the coda; one way such restrictions might be encoded is sketched below.
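One possibility is to keep positional classes and filter the consonant list before enumeration. The sketch below is only an illustration of the idea, again assuming a consonants list analogous to vowels; the class names and contents here are made up, and the actual schema simply lists the valid cases explicitly.

;; Positional classes -- the contents here are illustrative only.
(set! onset-only '(hh w y))   ;; e.g. /hh/ never closes an English syllable
(set! coda-only '(ng))        ;; e.g. /ng/ never opens one

(define (filter-out bad lst)
  "Return LST without any members of BAD."
  (apply
   append
   (mapcar
    (lambda (x) (if (member_string x bad) nil (list x)))
    lst)))

;; Consonants eligible for consonant-vowel diphones (onset position)
;; and for vowel-consonant diphones (coda position).
(set! cv-consonants (filter-out coda-only consonants))
(set! vc-consonants (filter-out onset-only consonants))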

While one can collect all the diphones without considering where they fall in a syllable, it often makes sense to collect diphones in different syllabic contexts. Consonant clusters are the obvious next set to consider; thus the example DARPAbet schema includes simple consonant clusters, with explicit syllable boundaries marked (the boundary-marked consonant-consonant enumeration is sketched below). We also include syllabic consonants, though these may be harder to pronounce in all contexts. You can add other phenomena too, but this is at the cost of not only making the list longer (and making it take longer to record), but also making it harder to produce. You must consider how easy it is for your voice talent to pronounce them, and how consistent they can be about it. For example, not all American speakers produce flaps (/dx/) in all of the same contexts (and some may produce them even when you ask them not to), and it's quite difficult for some to pronounce them, which can lead to production/transcription mismatches.
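A minimal sketch of enumerating consonant-consonant diphones across an explicit syllable boundary might look as follows; the carrier and function name are illustrative, reconstructed from the uk_0748 entries shown later, with the "-" token in the phone string marking the boundary.

(set! cc-carrier '((pau t aa) (aa t aa pau)))

(define (list-ccs)
  "Enumerate consonant-consonant diphones across an explicit
syllable boundary, written as - in the phone string."
  (apply
   append
   (mapcar
    (lambda (c1)
      (mapcar
       (lambda (c2)
         (list
          (string-append c1 "-" c2)
          ;; e.g. (pau t aa p - r aa t aa pau) for p-r
          (append (car cc-carrier) (list c1 "-" c2) (car (cdr cc-carrier)))))
       consonants))
    consonants)))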

A second and related problem is language interference, which can cause phoneme crossover. Because of the prevalence of English, especially in electronic text, how many "foreign" phones should be considered for addition? For example, should /w/ be included for German speakers (maybe), /t-i/ for Japanese (probably), or both /b/ and /v/ for Spanish speakers ("B de burro / V de vaca")? This problem is made difficult by the fact that the people you are recording will often be fluent or nearly fluent in English, and hence already have reasonable ability in phones that are not in their native language. If you are unfamiliar with the phone set and constraints on a language, it pays off considerably to either ask someone (like a linguist!) who knows the language analytically (not just by intuition), to check the literature, or to do some research.

To the degree that they are expected to appear, regardless of their status in the target language per se, foreign phones should be considered for the inventory. Remember that in most languages, nowadays, making no attempt to accommodate foreign phones is considered ignorant at least and possibly even arrogant.

Ultimately, when more complex forms are needed, extending the "diphone" set becomes prohibitive and has diminishing returns. Obviously there are phonetic differences between onset and coda positions, co-articulatory effects which extend over more than one phone, stress differences, intonational accent differences, and phrase-positional differences, to name but a few. Explicitly enumerating all of these, or even deciding the relative importance of each, is a difficult research question, and arguably shouldn't be done in an abstract, linguistically generated fashion from a strict interpretation of the language. Identifying these potential differences and finding an inventory which takes into account the actual distinctions a speaker makes is far more productive, and is a fundamental part of many new research directions in concatenative speech synthesis. (See the discussion in the introduction above.)

However you choose to construct the diphone list, and whatever examples you choose to include, the tools and scripts included with this document require that it be in a particular format.

Each line should contain a file id, a prompt, and a diphone name (or list of names if more than one diphone is being extracted from that file). The file id is used in the filename for the waveform, label file, and any other parameter files associated with the nonsense word. We usually make this distinct for the particular speaker we are going to record, e.g. their initials and possibly the language they are speaking.

The prompt is presented to the speaker at recording time, and here it contains a string of the phones in the nonsense word from which the diphones will be extracted. For example the following is taken from the DARPAbet-generated list

( uk_0001 "pau t aa b aa b aa pau" ("b-aa" "aa-b") )
( uk_0002 "pau t aa p aa p aa pau" ("p-aa" "aa-p") )
( uk_0003 "pau t aa d aa d aa pau" ("d-aa" "aa-d") )
( uk_0004 "pau t aa t aa t aa pau" ("t-aa" "aa-t") )
( uk_0005 "pau t aa g aa g aa pau" ("g-aa" "aa-g") )
( uk_0006 "pau t aa k aa k aa pau" ("k-aa" "aa-k") )
...
( uk_0601 "pau t aa t ey aa t aa pau" ("ey-aa") )
( uk_0602 "pau t aa t ey ae t aa pau" ("ey-ae") )
( uk_0603 "pau t aa t ey ah t aa pau" ("ey-ah") )
( uk_0604 "pau t aa t ey ao t aa pau" ("ey-ao") )
...
( uk_0748 "pau t aa p - r aa t aa pau" ("p-r") )
( uk_0749 "pau t aa p - w aa t aa pau" ("p-w") )
( uk_0750 "pau t aa p - y aa t aa pau" ("p-y") )
( uk_0751 "pau t aa p - m aa t aa pau" ("p-m") )
...

Note that the explicit syllable boundary marking ("-") for the consonant-consonant diphones is used to distinguish them from the consonant cluster examples that appear later in the list.

11.2.1. Synthesizing prompts

To help keep pronunciation consistent, we suggest synthesizing prompts and playing them to your voice talent at collection time. This helps the speaker in two ways: if they mimic the prompt they are more likely to keep a fixed prosodic style, and it also reduces the number of errors where the speaker vocalizes the wrong diphone. Of course, for new languages where a set of diphones doesn't already exist, producing prompts is not easy; however, giving approximations with diphone sets from other languages may work. The problem then is that, in producing prompts from a different phone set, the speaker is likely to mimic the prompts, and hence the diphone set will probably seem to have a foreign pronunciation, especially for vowels. Furthermore, mimicking the synthesizer too closely can remove some of the speaker's natural voice quality, which is under their (possibly subconscious) control to some degree.

Even when synthesizing prompts from an existing diphone set, you must be aware that that diphone set may contain errors or that certain examples will not be synthesized appropriately (e.g. consonant clusters). Because of this, it is still worthwhile monitoring the speaker to ensure they say things correctly.

The basic code for generating the prompts is in src/diphone/diphlist.scm, and a specific example for the DARPA phone set for American English is in src/diphone/us_schema.scm. The prompts can be generated from the diphone list as described above (or at the same time). The example code produces the prompts and phone label files which can be used by the aligning tool described below.

Before synthesizing, the function Diphone_Prompt_Setup is called, if it has been defined. You should define this to set up the appropriate voices in Festival, as well as any other initialization you might need -- for example, setting the fundamental frequency (F0) for the prompts that are to be delivered in a monotone (disregarding so-called microprosody, which is another matter). This value is set through the variable FP_F0 and should be near the middle of the range for the speaker, or at least somewhere comfortable to deliver. For the DARPAbet diphone list for KAL, we have:

(define (Diphone_Prompt_Setup)
 "(Diphone_Prompt_Setup)
Called before synthesizing the prompt waveforms.  Uses the kal_dphone
voice for prompting and sets F0."
 (voice_kal_diphone)  ;; US male voice
 (set! FP_F0 90)      ;; lower F0 than ked
 )

If the function Diphone_Prompt_Word is defined, it will be called after the basic prompt-word utterance has been created, and before the actual waveform synthesis. This may be used to map phones to other phones, set durations, or do whatever else you feel is appropriate for your speaker/diphone set. For the KAL set, we redefined the syllabic consonants to their full consonant forms in the prompts, since the ked diphone database doesn't actually include syllabics. Also, in the example below, instead of using fixed (100ms) durations we make the phones use a constant scaling factor (here, 1.2) times their average duration.

(define (Diphone_Prompt_Word utt)
  "(Diphone_Prompt_Word utt)
Specify specific modifications of the utterance before synthesis
specific to this particular phone set."
  ;; No syllabics in kal so flip them to non-syllabic form
  (mapcar
   (lambda (s)
     (let ((n (item.name s)))
       (cond
        ((string-equal n "el")
         (item.set_name s "l"))
        ((string-equal n "em")
         (item.set_name s "m"))
        ((string-equal n "en")
         (item.set_name s "n")))))
   (utt.relation.items utt 'Segment))
  (set! phoneme_durations kd_durs)
  (Parameter.set 'Duration_Stretch '1.2)
  (Duration_Averages utt))

By convention, the prompt waveforms are saved in prompt-wav/, and their labels in prompt-lab/. Once the diphone list has been defined, the prompts may be generated using the following command:

festival festvox/us_schema.scm festvox/diphlist.scm
festival> (diphone-gen-schema "us" "etc/usdiph.list")

If you already have a diphone list schema generated in the file etc/usdiph.list, you can do the following

festival festvox/us_schema.scm festvox/diphlist.scm
festival> (diphone-gen-waves "prompt-wav" "prompt-lab" "etc/usdiph.list")

Another useful example of the setup functions is to generate prompts for a language for which no synthesizer exists yet -- to "bootstrap" from one language to another. A simple mapping can be given between the target phoneset and an existing synthesizer's phone set. We don't know if this will be sufficient to actually use as prompts, but we have found that it is suitable to use these prompts for automatic alignment.

The example here uses the voice_kal_diphone speaker, a US English speaker, to produce prompts for a Japanese phone set; this code is in src/diphones/ja_schema.scm.

The function Diphone_Prompt_Setup calls the kal (US) voice, sets a suitable F0 value, and sets the option diph_do_db_boundaries to nil. When set, this option allows the diphone boundaries to be dumped into the prompt label files, but this doesn't work when cross-language prompting is done, as the actual phones don't match the desired ones.

(define (Diphone_Prompt_Setup)
 "(Diphone_Prompt_Setup)
Called before synthesizing the prompt waveforms.  Cross language prompts
from US male (for gaijin male)."
 (voice_kal_diphone)  ;; US male voice
 (set! FP_F0 90)
 (set! diph_do_db_boundaries nil) ;; cross-lang confuses this
 )

At synthesis time, each Japanese phone must be mapped to an equivalent US phone (or phones). This is done through a simple table set in nhg2radio_map, which gives the closest phone or phones for each Japanese phone (those unlisted remain the same).

Our mapping table looks like this

(set! nhg2radio_map
      '((a aa)
        (i iy)
        (o ow)
        (u uw)
        (e eh)
        (ts t s)
        (N n)
        (h hh)
        (Qk k)
        (Qg g)
        (Qd d)
        (Qt t)
        (Qts t s)
        (Qch t ch)
        (Qj jh)
        (j jh)
        (Qs s)
        (Qsh sh)
        (Qz z)
        (Qp p)
        (Qb b)
        (Qky k y)
        (Qshy sh y)
        (Qchy ch y)
        (Qpy p y)
        (ky k y)
        (gy g y)
        (jy jh y)
        (chy ch y)
        (shy sh y)
        (hy hh y)
        (py p y)
        (by b y)
        (my m y)
        (ny n y)
        (ry r y)))

Phones that are not explicitly mentioned map to themselves (e.g. most of the consonants).

Finally we define Diphone_Prompt_Word to actually do the mapping. Where the mapping involves more than one US phone, we add an extra segment to the Segment relation (defined in the Festival manual) and split the duration equally between the two phones. The basic function looks like

(define (Diphone_Prompt_Word utt)
  "(Diphone_Prompt_Word utt)
Specify specific modifications of the utterance before synthesis
specific to this particular phone set."
  (mapcar
   (lambda (s)
     (let ((newn (cdr (assoc_string (item.name s) nhg2radio_map))))
       (cond
        ((cdr newn)  ;; it maps to two phones
         ;; insert the second phone after this one and give each
         ;; half of the original duration
         (let ((newi (item.insert s (list (car (cdr newn))) 'after)))
           (item.set_feat newi "end" (item.feat s "end"))
           (item.set_feat s "end"
                          (/ (+ (item.feat s "segment_start")
                                (item.feat s "end"))
                             2))
           (item.set_name s (car newn))))
        (newn  ;; it maps to a single phone
         (item.set_name s (car newn)))
        (t
         ;; not in the map: leave as is
         ))))
   (utt.relation.items utt 'Segment))
  utt)

The label file produced from this will have the original desired-language phones, while the acoustic waveform will actually consist of phones from the prompting language. Although this may seem like cheating, we have found it to work for Korean and Japanese prompted from English, and it is likely to work for many other language pairs. For autolabeling, since the nonsense word's phone names are pre-defined, alignment just needs to find the best matching path; as long as the phones are acoustically distinct from those around them, this alignment method is likely to work.
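By analogy with the US English commands above, the Japanese prompts could then be generated in the same way; the list file name here is only a plausible example, not one fixed by the scripts.

festival festvox/ja_schema.scm festvox/diphlist.scm
festival> (diphone-gen-schema "ja" "etc/jadiph.list")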