Department of Linguistics

Text-to-Speech (TTS) Systems

Robert Mannell

A Text-to-Speech (TTS) system converts, as its name suggests, text into speech. The text can be human-generated or machine-generated, and it can be plain or annotated. If the text is annotated, the annotation will most likely contain additional linguistic information (e.g. lexical, syntactic or semantic) and might also provide information on the target speaker (e.g. a male or female voice). The annotation might be synthesiser-specific or it might comply with a standard (such as the W3C's SSML). This topic examines some of the issues in, and some of the solutions to, the generation of speech from input text.
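To make the idea of annotated input more concrete, the following is a minimal sketch of how plain text might be wrapped in SSML-style annotation that carries speaker information (a `voice` element with a `gender` attribute) and a simple emphasis hint. The function name `annotate`, its parameters and the `en` language tag are assumptions made for illustration only; a real synthesiser may support a different subset of SSML.

```python
import xml.etree.ElementTree as ET

SSML_NAMESPACE = "http://www.w3.org/2001/10/synthesis"


def annotate(text, gender="female", emphasised=None):
    """Wrap plain text in a minimal SSML document (hypothetical helper).

    gender     -- target speaker hint placed on the <voice> element
    emphasised -- optional substring of text to wrap in <emphasis>
    """
    speak = ET.Element("speak", {
        "version": "1.0",
        "xmlns": SSML_NAMESPACE,
        "xml:lang": "en",  # assumed language tag, chosen for illustration
    })
    voice = ET.SubElement(speak, "voice", {"gender": gender})

    if emphasised and emphasised in text:
        # Split around the first occurrence and mark it up.
        before, after = text.split(emphasised, 1)
        voice.text = before
        emphasis = ET.SubElement(voice, "emphasis")
        emphasis.text = emphasised
        emphasis.tail = after
    else:
        voice.text = text

    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(annotate("Please read this carefully.",
                   gender="male", emphasised="carefully"))
```

The same plain sentence can therefore be passed to a synthesiser either unannotated or with markup of this kind, depending on what the input module accepts.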

A typical TTS system includes a selection of the following modules:-

  1. TTS Input: Plain Text and Other Input Formats
  2. Text Preprocessors
  3. Grammatical and Semantic Preprocessors
  4. Grapheme-to-Phoneme Conversion
  5. Prosody
  6. Context-Sensitive Rules (CSR)
  7. Connected Speech
  8. Synthesis-by-Rule (SBR)
  9. Synthesis-by-Concatenation
  10. TTS Output: Speech Synthesisers
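
The modular organisation listed above can be pictured as a chain of stages, each of which takes the utterance representation produced so far and adds to it. The sketch below is a toy illustration only, not the design of any particular system: the stage functions, the abbreviation table, the one-entry lexicon with its rough phoneme symbols, and the question-intonation rule are all hypothetical placeholders, and only three of the listed modules are represented.

```python
# Toy data, invented for illustration.
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street"}
LEXICON = {"doctor": ["d", "O", "k", "t", "@"]}  # rough, illustrative symbols


def preprocess_text(utt):
    """Text preprocessor: expand abbreviations and normalise words."""
    expanded = [ABBREVIATIONS.get(w, w) for w in utt["text"].split()]
    words = [w.lower().strip(".,!?") for w in expanded]
    return {**utt, "words": words}


def graphemes_to_phonemes(utt):
    """Grapheme-to-phoneme conversion: lexicon lookup with a crude
    letter-by-letter fallback for words not in the lexicon."""
    phonemes = [LEXICON.get(w, list(w)) for w in utt["words"]]
    return {**utt, "phonemes": phonemes}


def assign_prosody(utt):
    """Prosody: a deliberately simplistic rule based on final punctuation."""
    contour = "rise" if utt["text"].rstrip().endswith("?") else "fall"
    return {**utt, "contour": contour}


PIPELINE = [preprocess_text, graphemes_to_phonemes, assign_prosody]


def run_pipeline(text, stages=PIPELINE):
    """Pass the utterance through each stage in order."""
    utt = {"text": text}
    for stage in stages:
        utt = stage(utt)
    return utt


if __name__ == "__main__":
    print(run_pipeline("Dr. Smith is here?"))
```

Because each stage only reads and extends a shared representation, a system can include or omit individual modules, which is consistent with a typical system using a selection of the modules rather than all of them.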

Other TTS topics:-

  1. MU-Talk TTS System (the TTS modules listed above will be described mainly by examining their implementation in MU-Talk)
  2. The Festival TTS System
  3. Multilingual TTS
  4. Paper on Speech Synthesis Research (by Sproat, Ostendorf, and Hunt, 1999)