Chapter 8. Signal Processing

The speech tools signal processing library provides a set of standard signal processing tools designed specifically for speech analysis. The library includes:

  • Windowing (creating frames from a continuous waveform)

  • Linear prediction and associated operations

  • Cepstral analysis, both via lpc and DFT.

  • Filterbank analysis

  • Frequency warping including mel-scaling

  • Pitch tracking

  • Energy and Power analysis

  • Spectrogram Generation

  • Fourier Transforms

  • Pitchmarking (of laryngograph signals)


Design Issues

The signal processing library is designed specifically for speech applications and hence all functions are written with that end goal in mind. The design of the library has centered around building a set of commonly used easy to configure analysis routines.


We have tried to make the functions as fast as possible. Signal processing can often be time critical, and so it will always be the case that if the code for a particular signal processing algroithm is written in a single function loop it will run faster than by using libraries.

However, the signal processing routines in the EST library are in general very fast, and the fact that they use classes such as EST_Track and EST_FVector does not make them slower than they would be if float * etc was used.


The library makes heavy use of a small number of classes, specifically EST_Wave EST_Track and EST_FVector. These classes are basically arrays and matrices, but take care of issues such as memory managment, error handling and file i/o. Using these classes in the library helps facilitate clean and simple algorithm writing and use. It is strongly recommended that you gain familiarity with these classes before using this part of the library.

At present, the issue of complex numbers in signal processing is somewhat fudged, in that a vector of complex numbers is represented by a vector of real parts and a vector of imaginary parts, rather than as a single vector of complex numbers.

Common Processing model

In speech, a large number of algorithms follow the same basic model, in which a waveform is analysed by an algorithm and a Track, containing a series of time aligned vectors is produced. Regardless of the type of signal processing, the basic model is as follows:

  1. Start with a waveform and a series of analysis positions, which can be a fixed distance apart of specified by some other means.

  2. For each analysis position, define a small portion of the waveform around that position, Multiply this by a windowing function to produce a vector of speech samples.

  3. Pass this to a frame based signal processing routine which in outputs values in another vector.

  4. Add this vector to a position in an EST_Track which correponds to the analysis time position.

Given this model, the signal processing library breaks down into a number of different types of function:
Utterance based functions

Functions which operate on an entire waveform or track. These break down into:

Analysis Functions

which take a waveform and produce a track

Synthesis Functions

which take a track and produce a waveform

Filter Functions

which take a waveform and produce a waveform

Conversion Functions

which take a track and produce a track

Frames based functions

Functions which operate on a single frame of speech or vector coefficients.

Windowing functions

which create a windowed frame of speech from a portion of a waveform.

Nearly all functions in the signal processing library belong to one of the above listed types. Quite often functions are presented on both the utterance and frame level. For example, there is a function called sig2lpc which takes a single frame of windowed speech and produces a set of linear prediction coefficients. There is also a function called sig2coef which performs linear prediction on a whole waveforn, returning the answer in a Track. sig2coef uses the common processing model, and calls sig2lpc as the algorithm in the loop.

Partly for historical reasons some functions, e.g. pda are only available in the utterance based form.

When writing signal processing code for this library, it is often the case that all that needs to be written is the frame based algorithm, as other algorithms can do the frame shifting and windowing operations.

Track Allocation, Frames, Channels and sub-tracks

The signal processing library makes extensive use of the advanced features of the track class, specifically the ability to access single frames and channels.

Given a standard multi-channel track, it is possible to make a FVector point to any single frame or channel - this is done by an internal pointer mechanism in EST_FVector. Furthermore, a track can be made to point to a selected number of channels or frames in a main track.

For example, imagine we have a function that calculates the covariance matrix for a multi-dimensional track of data. But the data we actually have contains energy, cepstra and delta cepstra. It is non-sensical to calculate convariance on all of this, we just want the cepstra. To do this we use the sub-track facility to set a temporary track to just the cepstral coefficients and pass this into the covariance function. The temporary track has smart pointers into the original track and hence no data is copied.

Without this facility, either you would have to do a copy (expensive) or else tell the covariance function which part of the track to use (hacky). Extensive documentation describing this process is found in the section called Frame based signal processing, the section called Access multiple frames or channels. in Chapter 5 and the section called Access single frames or single channels. in Chapter 5.