NOTE: use of this diphone synthesizer is deprecated and it
will probably be removed from future versions. All of its functionality
has been replaced by the UniSyn synthesizer. It is not
compiled by default; if required, add
ALSO_INCLUDE += diphone
to your `festival/config/config' file.
A basic diphone synthesizer offers a method for making speech from segments, durations and intonation targets. This module was mostly written by Alistair Conkie but the base diphone format is compatible with previous CSTR diphone synthesizers.
The synthesizer offers residual excited LPC based synthesis (hunt89) and PSOLA (TM) (moulines90) (PSOLA is not available for distribution).
A diphone database consists of a dictionary file, a set of waveform files, and a set of pitch mark files. These files are in the same format as those of the previous CSTR (Osprey) synthesizer.
The dictionary file consists of one entry per line. Each entry consists of five fields: a diphone name of the form P1-P2, a filename (without extension), a floating point start position in the file in milliseconds, a mid position in milliseconds (the change in phone), and an end position in milliseconds. Lines starting with a semi-colon and blank lines are ignored. The list may be in any order.
For example, a partial list of diphones may look like this:
ch-l  r021  412.035  463.009  518.23
jh-l  d747  305.841  382.301  446.018
h-l   d748  356.814  403.54   437.522
#-@   d404  233.628  297.345  331.327
@-#   d001  836.814  938.761  1002.48
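The dictionary format described above can be read with a few lines of code. The following is a minimal sketch in Python, not part of Festival: the names `DiphoneEntry' and `parse_diphone_dictionary' are hypothetical, and it assumes that fields are separated by whitespace, which the format description does not state explicitly.

```python
from typing import NamedTuple


class DiphoneEntry(NamedTuple):
    """One dictionary line (hypothetical container, not a Festival type)."""
    name: str        # diphone name of the form P1-P2
    filename: str    # file name without extension
    start_ms: float  # start position in the file
    mid_ms: float    # position of the change in phone
    end_ms: float    # end position in the file


def parse_diphone_dictionary(text: str) -> dict:
    """Map each diphone name to its entry.

    Assumes whitespace-separated fields. Blank lines and lines whose
    first non-blank character is a semi-colon are skipped.
    """
    entries = {}
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith(";"):
            continue
        fields = stripped.split()
        if len(fields) != 5:
            raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
        start, mid, end = (float(value) for value in fields[2:])
        entries[fields[0]] = DiphoneEntry(fields[0], fields[1], start, mid, end)
    return entries
```

Because the list may be in any order, the sketch stores entries in a mapping keyed by diphone name rather than relying on position.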
Waveform files may be in any form, headered or unheadered, as long as every file is of the same type and the format is supported by the speech tools wave reading functions. These may be standard linear PCM waveform files in the case of PSOLA, or LPC coefficients and residual when using the residual LPC synthesizer. See section 21.2 LPC databases.
Pitch mark files consist of a simple list of positions in milliseconds (with decimal places), in order, one per line, of each pitch mark in the file. For high quality diphone synthesis these should be derived from laryngograph data. During unvoiced sections pitch marks should be artificially created at reasonable intervals (e.g. 10 ms). In the current format there is no way to distinguish the "real" pitch marks from the "unvoiced" pitch marks.
It is normal to hold a diphone database in a directory with a number of sub-directories, namely `dic/' containing the dictionary file, `wave/' for the waveform files, typically of whole nonsense words (sometimes this directory is called `vox/' for historical reasons), and `pm/' for the pitch mark files. The filename in the dictionary entry should be the same for the waveform file and the pitch mark file (with different extensions).
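As an illustration of creating artificial pitch marks, here is a small Python sketch. It is not a Festival or speech tools routine: the function names are hypothetical, and the rule that any gap longer than 20 ms counts as unvoiced is an assumption made only for this example. The 10 ms spacing follows the suggestion above.

```python
def fill_unvoiced_gaps(real_marks_ms, end_ms, interval_ms=10.0, max_gap_ms=20.0):
    """Return the real pitch marks plus artificial ones in long gaps.

    Assumption: a gap longer than max_gap_ms is unvoiced, and is filled
    with marks every interval_ms. Real and artificial marks are merged,
    since the file format cannot tell them apart.
    """
    marks = []
    previous = 0.0
    # The end of the file acts as a final boundary so that a trailing
    # unvoiced stretch is filled as well.
    for target in sorted(real_marks_ms) + [end_ms]:
        while target - previous > max_gap_ms:
            previous += interval_ms
            marks.append(previous)
        previous = target
        marks.append(target)
    return marks[:-1]  # drop the end-of-file boundary itself


def format_pitch_marks(marks_ms):
    """Write one position in milliseconds per line, three decimal places."""
    return "".join(f"{mark:.3f}\n" for mark in marks_ms)
```

For example, real marks at 5, 50 and 58 ms in a 60 ms file gain artificial marks at 15, 25 and 35 ms.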
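A quick consistency check of this layout can be sketched in Python as follows. The function name is hypothetical, and the extensions `.wav' and `.pm' are assumptions for illustration, since the text only requires that the two files share a name and differ in extension.

```python
from pathlib import Path


def missing_database_files(root, filenames,
                           wave_dir="wave", wave_ext=".wav",
                           pm_dir="pm", pm_ext=".pm"):
    """List waveform and pitch mark files that the dictionary names
    but which are absent on disk.

    `filenames` are the extension-less names from the dictionary.
    Directory names follow the layout described above; the extensions
    are assumed defaults and should be adjusted to the database.
    """
    root = Path(root)
    missing = []
    for stem in sorted(set(filenames)):
        for directory, extension in ((wave_dir, wave_ext), (pm_dir, pm_ext)):
            path = root / directory / (stem + extension)
            if not path.is_file():
                missing.append(path)
    return missing
```

Passing `wave_dir="vox"' would cover databases that use the historical directory name.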
The standard method for diphone resynthesis in the released system is residual excited LPC (hunt89). The actual method of resynthesis isn't important to the database format, but if residual LPC synthesis is to be used then it is necessary to make the LPC coefficient files and their corresponding residuals.
Previous versions of the system used a "host of hacky little scripts" to do this, but now that the Edinburgh Speech Tools supports LPC analysis we can provide a walk through for generating these.
We assume that the waveform files of nonsense words are in a directory called `wave/'. The LPC coefficients and residuals will be, in this example, stored in `lpc16k/' with extensions `.lpc' and `.res' respectively.
Before starting it is worth considering power normalization. We have
found this important on all of the databases we have collected so far.
The ch_wave program, part of the speech tools, with the optional argument
-scaleN 0.4 may be used if a more complex method is not available.
The following shell command generates the files:
for i in wave/*.wav
do
   fname=`basename $i .wav`
   echo $i
   lpc_analysis -reflection -shift 0.01 -order 18 -o lpc16k/$fname.lpc \
       -r lpc16k/$fname.res -otype htk -rtype nist $i
done
A common rule of thumb is that the LPC order should be the sample rate in Hz divided by one thousand, plus 2, that is (sample rate / 1000) + 2. This may or may not be appropriate, and if you are particularly worried about the database size it is worth experimenting.
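The arithmetic of this rule of thumb can be written out as a short Python sketch. The function name is hypothetical, and reading the `lpc16k/' directory name as implying a 16 kHz sample rate is an assumption.

```python
def rule_of_thumb_lpc_order(sample_rate_hz: int) -> int:
    """Sample rate divided by one thousand, plus 2 (a heuristic only)."""
    return sample_rate_hz // 1000 + 2


# Assuming 16 kHz waveforms, this gives the order used in the command above.
order_16k = rule_of_thumb_lpc_order(16000)  # 18
order_8k = rule_of_thumb_lpc_order(8000)    # 10
```

A lower order gives smaller coefficient files, which is why experimenting may pay off when database size matters.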
The program `lpc_analysis', found in `speech_tools/bin', can be used to generate the LPC coefficients and residual. Note that these should be reflection coefficients so that they may be quantised (as they are in group files).
The coefficients and residual files produced by different LPC analysis
programs may start at different offsets. For example, Entropic's ESPS
functions generate LPC coefficients that are offset by one frame shift
(e.g. 0.01 seconds). Our own `lpc_analysis' routine has no offset.
The Diphone_Init parameter list allows these offsets to be
specified. When the LPC files are generated with the above command, the
description parameters should include
(lpc_frame_offset 0) (lpc_res_offset 0.0)
When the files are generated using ESPS routines, the description should instead include
(lpc_frame_offset 1) (lpc_res_offset 0.01)
The defaults actually follow the ESPS form: if they are
not explicitly mentioned, the frame offset is 1 and
lpc_res_offset is equal to the frame shift.
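The defaulting behaviour described above can be modelled in a few lines of Python. This is only a sketch of the rule as stated, not Festival's implementation: representing the description as a Python dict and the function name are both assumptions made for illustration.

```python
def effective_lpc_offsets(description: dict, frame_shift_s: float):
    """Return (frame offset, residual offset) after applying defaults.

    `description` is a hypothetical dict standing in for the parameter
    list. Missing values fall back to the ESPS form: a frame offset of 1
    and a residual offset equal to the frame shift.
    """
    frame_offset = description.get("lpc_frame_offset", 1)
    res_offset = description.get("lpc_res_offset", frame_shift_s)
    return frame_offset, res_offset


# Files made with lpc_analysis need both offsets set to zero explicitly.
speech_tools = effective_lpc_offsets(
    {"lpc_frame_offset": 0, "lpc_res_offset": 0.0}, frame_shift_s=0.01)

# Leaving the parameters out yields the ESPS form.
esps = effective_lpc_offsets({}, frame_shift_s=0.01)
```

The practical consequence is that a database built with `lpc_analysis' must state the zero offsets, since omitting them silently selects the ESPS values.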
Note that the biggest problem we had in implementing the residual excited LPC resynthesizer was getting the right part of the residual to line up with the right LPC coefficients describing the pitch mark. Errors here degrade the synthesized waveform noticeably, but not severely, which makes it difficult to determine whether the cause is an offset problem or some other bug.
Although we have started investigating if extracting pitch synchronous LPC parameters rather than fixed shift parameters gives better performance, we haven't finished this work. `lpc_analysis' supports pitch synchronous analysis but the raw "ungrouped" access method does not yet. At present the LPC parameters are extracted at a particular pitch mark by interpolating over the closest LPC parameters. The "group" files hold these interpolated parameters pitch synchronously.
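To make the idea of interpolating over the closest LPC parameters concrete, here is a Python sketch. It is an illustration under stated assumptions, not the method Festival uses: it assumes linear interpolation between the two nearest fixed-shift frames, that frame i sits at time i times the frame shift (no offset), and the function name is hypothetical.

```python
def lpc_at_pitch_mark(frames, frame_shift_ms, pitch_mark_ms):
    """Estimate the coefficient vector at a pitch mark.

    `frames` is a list of equal-length coefficient lists taken at a
    fixed shift. Assumes frame i is located at i * frame_shift_ms and
    interpolates linearly between the two frames around the pitch mark.
    """
    # Position of the pitch mark in fractional frames, clamped to the
    # range covered by the analysis.
    position = pitch_mark_ms / frame_shift_ms
    position = min(max(position, 0.0), float(len(frames) - 1))
    lower = int(position)
    upper = min(lower + 1, len(frames) - 1)
    weight = position - lower
    return [(1.0 - weight) * a + weight * b
            for a, b in zip(frames[lower], frames[upper])]
```

A pitch mark that falls exactly on a frame returns that frame unchanged, and marks beyond the last frame reuse the last frame.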
The American English voice `kd' was created using the speech tools `lpc_analysis' program and its set up should be looked at if you are going to copy it. The British English voice `rb' was constructed using ESPS routines.
Databases may be accessed directly but this is usually too inefficient for any purpose except debugging. It is expected that group files will be built which contain a binary representation of the database. A group file is a compact efficient representation of the diphone database. Group files are byte order independent, so may be shared between machines of different byte orders and word sizes. Certain information in a group file may be changed at load time so a database name, access strategy etc. may be changed from what was set originally in the group file.
A group file contains the basic parameters, the diphone index, the signal (original waveform or LPC residual), LPC coefficients, and the pitch marks. It is all you need for a run-time synthesizer. Various compression mechanisms are supported to allow smaller databases if desired. A full English LPC plus residual database at 8k ulaw is about 3 megabytes, while a full 16 bit version at 16k is about 8 megabytes.
Group files are created with the
Diphone.group command, which
takes a database name and an output filename as arguments. Making
group files can take some time, especially if they are large. The
group_type parameter specifies
the encoding used for signal files. This can significantly reduce the size
of the group file.
Group files may be partially loaded (see access strategies) at run time for quicker start up and to minimise run-time memory requirements.
The basic method for describing a database is through the
command. This function takes a single argument, a list of
pairs of parameter name and value. The parameters are
pcm, but for distributed voices this is always
Examples of both general set up, making group files and general use are in
Note that in group files pitch marks (and LPC coefficients) are
always fully loaded (cf.
direct), as they are typically
smaller. Only signals (waveform files or residuals) are potentially
The appropriate diphone is selected based on the name of the phone identified in the segment stream. However, for better diphone synthesis it is useful to augment the diphone database with other diphones in addition to the ones that come directly from the phoneme set, for example to distinguish dark and light l's, or consonants in clusters from the same consonants in isolation. There are two methods for deriving such a modified name from the basic name.
When the diphone module is called the hook
is applied. That is a function or list of functions which will be
applied to the utterance. Its main purpose is to allow the conversion
of the basic name into an augmented one. For example converting a basic
l into a dark l, denoted by
ll. The functions given in
diphone_module_hooks may set the feature
diphone_phone_name which if set will be used rather than the
name of the segment.
For example suppose we wish to use a dark l (
ll) rather than
a normal l for all l's that appear in the coda of a syllable.
First we would define a function which identifies this condition
and adds the additional feature that records
the name change. The following function would do so:
(define (fix_dark_ls utt)
"(fix_dark_ls UTT)
Identify ls in coda position and relabel them as ll."
  (mapcar
   (lambda (seg)
     (if (and (string-equal "l" (item.name seg))
              (string-equal "+" (item.feat seg "p.ph_vc"))
              (item.relation.prev seg "SylStructure"))
         (item.set_feat seg "diphone_phone_name" "ll")))
   (utt.relation.items utt 'Segment))
  utt)
Then when we wish to use this for a particular voice we need to add
(set! diphone_module_hooks (list fix_dark_ls))
in the voice selection function.
For a more complex example including consonant cluster identification
see the American English voice `ked' in
ked_diphone_fix_phone_name carries out a number of
The second method for changing a name is during actual look up of a
diphone in the database. The list of alternates is given by the
Diphone_Init function. These are used when the specified diphone
can't be found. For example, we often allow a dark l to be mapped
to a plain l, as sometimes the dark l diphone doesn't actually
exist in the database.
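The fallback behaviour can be sketched in Python as follows. This is an illustration only: the function name, the use of a single dict of alternates, and the order in which substitutions are tried are all assumptions, not a description of how Diphone_Init stores or applies its list of alternates.

```python
from itertools import product


def lookup_diphone(index, left, right, alternates):
    """Return the name of a diphone present in `index`, or None.

    The exact name left-right is tried first. If it is missing, each
    phone may be replaced by its alternate (for example ll by l).
    `alternates` is a hypothetical mapping from phone name to fallback.
    """
    left_options = [left] + ([alternates[left]] if left in alternates else [])
    right_options = [right] + ([alternates[right]] if right in alternates else [])
    # product() yields the unmodified pair first, so an exact match
    # always wins over any substitution.
    for first, second in product(left_options, right_options):
        name = f"{first}-{second}"
        if name in index:
            return name
    return None
```

With `alternates = {"ll": "l"}', a request for `a-ll' falls back to `a-l' only when `a-ll' itself is absent from the database.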