Department of Linguistics

TTS: An Overview of Concatenative Approaches to Speech Synthesis

Robert Mannell

Synthesis-by-Concatenation

Most modern TTS systems use some type of synthesis-by-concatenation method. Concatenation techniques take small units of speech, stored either as waveform data or as acoustically parameterised data, and join sequences of these units together to produce either waveforms or time-varying acoustic parameters. In the latter case, the time-varying acoustic parameters must then be converted into a waveform by passing them through a speech synthesiser. Concatenation systems are concerned with the selection of appropriate units and with the algorithms that join those units together. TTS system designers need to make decisions about the type of acoustic parameter(s) to be used and the size of the concatenative units.
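As a rough illustration of the joining step for waveform units, the sketch below concatenates units with a short linear crossfade at each join. The text above does not specify a joining algorithm, so the crossfade, the function name `crossfade_concatenate` and the `overlap` parameter are all assumptions made for the example; real systems typically use more careful, pitch-aware smoothing.

```python
def crossfade_concatenate(units, overlap):
    """Join waveform units (lists of samples), blending `overlap` samples at each join.

    Illustrative only: a linear crossfade is an assumed joining method,
    not one described in the text.
    """
    if not units:
        return []
    output = list(units[0])
    for unit in units[1:]:
        # Never overlap more samples than either side actually has.
        n = min(overlap, len(output), len(unit))
        for i in range(n):
            # Linear ramp: the weight shifts from the old unit to the new one.
            w = (i + 1) / (n + 1)
            idx = len(output) - n + i
            output[idx] = (1.0 - w) * output[idx] + w * unit[i]
        # Append the part of the new unit that lies beyond the overlap.
        output.extend(unit[n:])
    return output


# Two toy "units": four samples of 1.0 followed by four samples of 0.0.
joined = crossfade_concatenate([[1.0] * 4, [0.0] * 4], overlap=2)
print(len(joined))  # 4 + 4 - 2 overlapping samples
```

Setting `overlap=0` reduces this to plain end-to-end concatenation, which makes audible discontinuities at the joins more likely.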

  • For information on how MU-Talk's concatenation branch works, click here.

Australian English Diphones

Concatenation synthesis is achieved by joining together units of one or more sizes. These units can be diphones, triphones, demisyllables, syllables (popular for Chinese), words, and even longer units. Some synthesisers are based on units of varying lengths, with the longest appropriate sequences being selected from a large speech database. These units may be stored in many different parametric forms (LPC, waveforms, formant parameters, etc.).
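The idea of selecting the longest appropriate sequences from a database can be sketched as a greedy longest-match search. This is only a simplified stand-in: practical variable-length systems generally weigh candidate units with cost functions rather than matching greedily. The function name `select_units`, the `max_len` parameter and the ASCII phoneme labels are hypothetical.

```python
def select_units(phonemes, inventory, max_len=4):
    """Cover a phoneme sequence with the longest units found in `inventory`.

    `inventory` is a set of phoneme tuples standing in for a speech database.
    Greedy matching is an assumption made for illustration.
    """
    units = []
    i = 0
    while i < len(phonemes):
        # Try the longest candidate first, then progressively shorter ones.
        for length in range(min(max_len, len(phonemes) - i), 0, -1):
            candidate = tuple(phonemes[i:i + length])
            if candidate in inventory:
                units.append(candidate)
                i += length
                break
        else:
            # No unit of any length starts at this position.
            raise KeyError(f"no unit covers phoneme {phonemes[i]!r} at position {i}")
    return units


# Toy inventory with single phonemes and two longer units.
inventory = {("h",), ("e",), ("l",), ("ou",), ("h", "e"), ("l", "ou")}
print(select_units(["h", "e", "l", "ou"], inventory))
```

Because longer units contain fewer joins, preferring them tends to reduce the number of places where concatenation artefacts can arise.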

A diphone extends from some point (not necessarily the temporal centre) of one phoneme to some point within the adjacent phoneme. The diphone cut-points are generally defined in such a way that the transition between the two phonemes occurs entirely within the diphone. Typically, a diphone extends from the target of the first phoneme to the target of the second phoneme.
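To make the unit definition concrete, the sketch below converts a phoneme sequence into the sequence of diphones needed to synthesise it. Padding the utterance with a silence symbol (`#`) at each end is an assumption added for the example, as are the function name `to_diphones` and the ASCII phoneme labels.

```python
def to_diphones(phonemes, boundary="#"):
    """Return the diphone labels spanning each pair of adjacent phonemes.

    The utterance is padded with `boundary` (assumed here to mean silence)
    so that the first and last phonemes also receive transitions.
    """
    padded = [boundary, *phonemes, boundary]
    # Each diphone runs from one phoneme into the next.
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


print(to_diphones(["k", "ae", "t"]))
# ['#-k', 'k-ae', 'ae-t', 't-#']
```

A sequence of N phonemes therefore requires N + 1 diphones under this padding convention.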

When attempting to determine an "optimal" diphone set for Australian English:

  • You need to have a clear understanding of the phonotactic constraints in Australian English (i.e. which phonemes can be adjacent within a syllable), but you should also understand that sequences of phonemes that are illegal within a syllable are very often, but not always, possible across syllable and word boundaries.
  • You should also consider whether all vowel-vowel sequences are possible across syllable and word boundaries or whether you can constrain the number of vowel-vowel diphones that you will need by assuming that in some (or all?) cases a reduced approximant (/r/, /w/, /j/) is inserted.
  • You should be especially aware of which phonemes can never occur in syllable-initial or syllable-final position, as this information will constrain the number of phoneme-phoneme sequences that can occur across syllable and word boundaries.
  • What vowel and consonant allophones will need to be taken into account when determining the diphone set? You should consider dark versus clear /l/, nasalised versus non-nasalised vowels, aspirated versus unaspirated voiceless stops, voiced versus de-voiced approximants, etc.
  • You should be aware of the effects of syllable and word boundaries on the selection of allophones that occur on either side of such boundaries (such effects are known as "juncture" effects).
  • What prosodic contexts need to be considered when determining a full set of diphones?
  • You will need to define what you mean by optimal when determining this diphone set. For example, do time and labour constraints prohibit the recording and processing of a very large diphone set?

Similar principles apply to the development of diphone databases for other dialects and languages.
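The way positional constraints limit the inventory can be sketched by enumerating candidate diphones. The example below assumes a deliberately tiny, hypothetical phoneme set and treats a cross-boundary pair as possible whenever its first phoneme may end a syllable and its second may begin one; within-syllable pairs are supplied separately. It ignores allophones, juncture and prosodic context, all of which would enlarge a real inventory. The function name `candidate_diphones` and all labels are illustrative.

```python
from itertools import product


def candidate_diphones(phonemes, never_final, never_initial, within_syllable):
    """Enumerate phoneme pairs that may need a recorded diphone.

    A pair (a, b) is allowed across a boundary if `a` can be syllable-final
    and `b` can be syllable-initial. `within_syllable` lists pairs that are
    legal inside a syllable. All inputs are toy data for illustration.
    """
    cross_boundary = {
        (a, b)
        for a, b in product(phonemes, repeat=2)
        if a not in never_final and b not in never_initial
    }
    return cross_boundary | set(within_syllable)


# Toy set: "h" is never syllable-final and "ng" never syllable-initial in English.
phonemes = ["h", "ng", "t", "a"]
diphones = candidate_diphones(
    phonemes,
    never_final={"h"},
    never_initial={"ng"},
    within_syllable={("h", "a"), ("a", "ng")},
)
print(len(diphones), "of", len(phonemes) ** 2, "possible pairs")
```

Even in this toy case the constraints remove several of the 16 possible pairs, which suggests how phonotactic knowledge can reduce recording and processing effort for a full phoneme set.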
