Department of Linguistics


Hybrid Concatenation and Rule-based Synthesis

In the late 1980s and early 1990s, colleagues and I developed two speech synthesis systems that utilised diphone synthesis. One was based on auditorily modelled channel synthesis, whilst the other was based on parallel formant synthesis. Both systems utilised the same segmented and labelled database, which was used in the extraction of the two sets of diphones.

The procedure for the extraction of the diphone parameters was, obviously, quite different for the two systems.

Parameter extraction for the channel system involved the simple frame-by-frame determination of the intensity of the signal as it passed through each of 18 Bark-scaled channel filters.
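This kind of channel analysis can be sketched as follows. The frame length, hop, FFT size and Hz-to-Bark formula below are illustrative assumptions and not the original system's settings; a real channel vocoder would typically use a dedicated filter bank rather than summing FFT power within bands.

```python
import numpy as np

def bark(f_hz):
    """One common Hz-to-Bark conversion formula (an assumption here)."""
    return 13.0 * np.arctan(0.00076 * f_hz) + 3.5 * np.arctan((f_hz / 7500.0) ** 2)

def bark_channel_intensities(signal, sr=16000, n_channels=18,
                             frame_len=400, hop=160):
    """Frame-by-frame intensity (dB) in n_channels Bark-spaced bands,
    approximated by summing FFT power within each band."""
    n_fft = 512
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    # Equal-width band edges on the Bark scale, up to the Nyquist frequency.
    edges_bark = np.linspace(0.0, bark(sr / 2.0), n_channels + 1)
    band = np.clip(np.digitize(bark(freqs), edges_bark) - 1, 0, n_channels - 1)

    window = np.hanning(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        spec = np.fft.rfft(signal[start:start + frame_len] * window, n_fft)
        power = np.abs(spec) ** 2
        chan = np.array([power[band == c].sum() for c in range(n_channels)])
        frames.append(10.0 * np.log10(chan + 1e-12))   # dB per channel
    return np.array(frames)                            # shape: (n_frames, 18)
```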

The extraction of the formant parameters was much more complex and was achieved in four steps. First, a specialised formant tracker extracted the formant tracks. This tracker was tuned to the specific speaker being analysed and relied on the segmentation and labelling information in the database and on knowledge of expected formant values in the various contexts. Further, unlike many formant trackers, it was not permitted to fail during stops and voiceless fricatives but was required to provide formant estimates so as to maintain pole continuity. The second step proceeded more or less simultaneously with the first and utilised both acoustic algorithms and phonetic expectations (based on the labelled segments) to determine whether each segment was voiced, voiceless or of mixed excitation. These two steps provided prerequisite information for the final two steps, which were applied simultaneously and extracted the formant bandwidths and intensities.

The relationship between measured formant intensities and bandwidths on the one hand, and synthesiser filter gains and bandwidths on the other, is very complex. Simply applying the extracted parameters directly to the synthesiser filters most often results in spectra that poorly match the original natural spectra. Formant gains and bandwidths interact strongly with each other: increasing a formant's bandwidth decreases its peak intensity, so the two need to be analysed together. Further, there is also an interaction between source characteristics and the gain required to produce a given peak intensity in the resultant speech. Finally, the architectures of individual formant synthesisers vary in such a way that a given set of gain and bandwidth parameters may produce a good spectral profile on one synthesiser but a poorer match with natural speech on another.
To overcome these problems it was decided to extract the intensity and bandwidth parameters utilising an analysis-by-synthesis approach. Utilising a computer model of the target formant synthesiser and, for each segment, the appropriate source (or source mix), the bandwidths and gains were modified until a best match (minimum Euclidean distance) with the natural spectral shape was obtained.
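The analysis-by-synthesis loop can be sketched as a simple grid search. The single-resonator magnitude model and the parameter ranges below are illustrative assumptions; the actual system used a full computer model of the target synthesiser, not this stand-in.

```python
import numpy as np

def formant_response(freqs, fc, bw, gain_db):
    """Magnitude response (dB) of one resonance: a crude stand-in for a
    single branch of a parallel formant synthesiser (assumed model)."""
    # Peak at fc; bw approximates the half-power bandwidth.
    mag = 1.0 / np.sqrt(1.0 + ((freqs - fc) / (bw / 2.0)) ** 2)
    return gain_db + 20.0 * np.log10(mag)

def fit_gain_bandwidth(freqs, target_db, fc,
                       gains=np.arange(-20.0, 41.0, 1.0),
                       bws=np.arange(40.0, 401.0, 10.0)):
    """Grid-search the (gain, bandwidth) pair whose synthetic spectrum is
    closest (Euclidean distance) to the target spectral shape."""
    best, best_err = None, np.inf
    for g in gains:
        for bw in bws:
            err = np.linalg.norm(formant_response(freqs, fc, bw, g) - target_db)
            if err < best_err:
                best, best_err = (g, bw), err
    return best, best_err
```

Because gain and bandwidth are searched jointly, their interaction (a wider bandwidth lowering the peak) is handled implicitly rather than corrected after the fact.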

These approaches produced excellent and very natural-sounding speech when applied over an entire sentence (i.e. when used as a formant or channel vocoder). When diphones were extracted and then concatenated to produce sentences, however, intelligibility and naturalness deteriorated badly. The major challenge in diphone systems is to smooth the parameter tracks at diphone boundaries so as to produce continuous speech, without smoothing out important dynamic and fast-moving cues to the extent that intelligibility is affected. Careful examination of the parameter-frame time-warping and parameter-track smoothing processes indicated that the step with the greatest effect on intelligibility and naturalness is the one that attempts to smooth the gain parameters. The most likely reason is that, in both systems, three types of intensity information have been retained in the diphone parameters. These are:
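Boundary smoothing of a single parameter track can be sketched as a linear cross-fade over a few frames at the join. This is a minimal illustration under assumed settings; the actual systems' time-warping and smoothing procedures were more elaborate.

```python
import numpy as np

def join_tracks(left, right, overlap=4):
    """Concatenate two per-frame parameter tracks, linearly cross-fading
    the last `overlap` frames of `left` with the first `overlap` frames of
    `right` to remove the step discontinuity at the diphone boundary."""
    left, right = np.asarray(left, float), np.asarray(right, float)
    w = np.linspace(0.0, 1.0, overlap)                  # fade-in weights
    blend = (1.0 - w) * left[-overlap:] + w * right[:overlap]
    return np.concatenate([left[:-overlap], blend, right[overlap:]])
```

Widening the overlap smooths the join more thoroughly but, as noted above, risks smoothing away fast-moving cues; for the gain parameters in particular this is where intelligibility suffers most.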

  1. relative gain information: This information is essential, when combined with bandwidth and formant centre frequency, to properly specify the total spectral shape and thus to obtain the closest match between natural and synthetic spectral shapes.
  2. segmental intensity profile: This parameter, which should be applied globally across all formants or channels, controls, for example, the shape of the energy peaks at syllable peaks, or the rising amplitude profile of stop bursts (differences in rise times at different frequencies should be accounted for in parameter 1.)
  3. prosodic intensity profile: This covers prosodic intensity contours, including the effects of intensity declination and stress, as well as overall differences in vocal loudness and emotional effects.

When diphone cut points are selected it is difficult to synchronise the frequency and intensity targets of a segment. For example, the target centre of a vowel often does not line up exactly with its intensity peak. In Australian English /i:/, for instance, the intensity peak most often occurs in the onglide which precedes the stable formant target. As a result, some diphones are cut at the intensity peak, some before it and some after it. Concatenating diphones mismatched for the position of the intensity peak often results in more than one intensity peak, or alternatively in phonemes with missing intensity peaks, and the resulting speech can be quite unnatural or even sound “rough”. Although all sentences were normalised to the same average level, differing prosodic effects on diphone intensities can also result in mismatches which, upon concatenation, produce unnatural intensity contours. These problems affect both the formant and the channel synthesis systems, as the cut points are identical in the two systems: they were selected by hand on a spectrographic display and were therefore (at least in the vowels) largely the result of operator perception of steady-state formant targets.
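One way to flag such mismatches automatically is to count the intensity peaks in a concatenated track: a vowel built from peak-mismatched diphone halves shows two peaks (or none) where natural speech has one. The prominence-based counter below is a hypothetical illustration, not part of the original systems.

```python
import numpy as np

def count_intensity_peaks(track, min_prominence=3.0):
    """Count local maxima in an intensity track (dB) that rise at least
    `min_prominence` dB above the deeper of the two flanking troughs."""
    track = np.asarray(track, float)
    peaks = 0
    for i in range(1, len(track) - 1):
        if track[i] > track[i - 1] and track[i] >= track[i + 1]:
            left = track[:i].min()          # deepest level before the peak
            right = track[i + 1:].min()     # deepest level after the peak
            if track[i] - max(left, right) >= min_prominence:
                peaks += 1
    return peaks
```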

This project started with the existing diphone sets and extracted the relative intensity information from the parameter tracks, removing the other two types of intensity information. At the same time, the average intensities and time-domain intensity profiles were analysed, stored, and used as the basis of rules for the production of appropriate segmental intensity profiles. The next step in this process will be the development of prosodic intensity rules. This last step is, however, of lower priority, as the main aim of this project is the improvement of segmental quality and intelligibility. Whilst segmental quality and intelligibility are significantly influenced by good prosodic modelling, it is usually assumed that well-modelled prosodic pitch has the greatest effect on the perception of naturalness.
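The separation of relative gain from the global intensity profile might be sketched as follows. The mean-across-channels estimate and the moving-average smoothing are illustrative assumptions, and the prosodic component (the third type of intensity information) is not modelled here.

```python
import numpy as np

def split_intensity(tracks_db, win=9):
    """Split per-channel intensity tracks (frames x channels, in dB) into
    (a) a smoothed per-frame global intensity profile and (b) the residual
    per-channel relative gains, i.e. the spectral-shape information that
    would be retained in the diphone parameters."""
    tracks_db = np.asarray(tracks_db, float)
    profile = tracks_db.mean(axis=1)                     # per-frame overall level
    kernel = np.ones(win) / win
    profile = np.convolve(profile, kernel, mode="same")  # smooth: segmental shape
    relative = tracks_db - profile[:, None]              # spectral shape only
    return profile, relative
```

Note that `mode="same"` zero-pads at the edges, so the first and last few frames of the profile are biased low; a real implementation would need a better edge treatment.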

Once the three aspects of intensity specification have been adequately modeled it is proposed that a study be carried out on the perceptual effects of modifying each of these three types of intensity information. This part of the project will examine effects on both intelligibility and speech quality or naturalness. Separate specification of the gain information in this way will also make possible planned future work on the effects of intensity and different intensity contours on the perception of vocal emotion.

Project Status

After significant (funded) progress between 1997 and 2000, work on this project slowed considerably for a few years whilst concentration shifted to the completion of the vowel space and vocoder perception projects.

One of the main barriers to the progress of this project was its use of an incomplete diphone database. That database covered all legal within-syllable diphones quite well, but omitted numerous diphones that occur only across syllable and word boundaries. Further, it provided only fragmentary coverage of perceptually significant allophones. Additionally, the database had been collected in the early 1990s and the speaker's voice had changed significantly since then, meaning that additional recorded material would not be a good match to the originally recorded speech. In any case, the whole database had significant design flaws. In 2003 a completely new diphone database was recorded, and work progresses slowly on the processing of that database. In 2006 additional fast and normal versions of the same database, with the same speaker, were also recorded.

When processing of the normal speech rate diphone database has been completed, perceptual experiments will be carried out on speech generated by the system, both for the purpose of evaluating the system and for the purpose of developing better algorithms for the specification of intensity parameters.

It is proposed that in 2009 a small selected subset of the database will also be recorded, at both normal and slow rates, whilst simultaneously measuring tongue-tip, tongue-body, bottom-lip and jaw movements using electromagnetic articulography (EMA). Another partial dataset featuring oral/nasal airflow (an indirect measure of velum opening) will also be collected. The purpose of these articulatory measurements is to examine the feasibility of incorporating articulatory constraints into the rule-based components of the system.

A longer term goal is to record a similar database from a female speaker.

Relevant Papers

Mannell R.H. (2002), "Modelling of the segmental and prosodic aspects of speech intensity in synthetic speech", Proceedings of the Ninth Australian International Conference on Speech Science and Technology, Melbourne, December 2002.

Mannell R.H. (1998), "Formant diphone parameter extraction utilising a labelled single speaker database", Proceedings of the International Conference on Spoken Language Processing 1998, Sydney, Australia, 30 November - 3 December 1998.