Department of Linguistics

TTS: Interfaces with Text Generation Systems

Robert Mannell

Interfacing Text Generation and TTS systems

One of the problems that must be overcome when interfacing text-to-speech (TTS) and Text Generation (TGen) systems is communication, both between the systems themselves and between TTS and TGen researchers. This might seem strange, given that both fields deal with text and both use linguistic and computational approaches to solve their problems. The relationship should be simple: the output of a TGen system, text, is the input to a TTS system. Why not just plug the output end of a TGen system into the input end of a TTS system?

The problem arises because the output of a TGen system is often not what is really wanted by a TTS system. A TTS system can get raw text from many sources and can generally provide a reasonable, if often not especially natural, spoken output from that text.

So what exactly do TTS researchers want from a TGen system?

They want more than just text. They want the rich semantic and syntactic information that they assume lies just beneath the surface of a TGen system in some explicit and useable form.

The TTS researcher would like the TGen system to provide some of the information that the pre-processor and other linguistic modules of a TTS system attempt to derive from the text. They would particularly like grammatical and syntactic information that would help them:

  1. disambiguate homographs
  2. determine points of semantic focus
  3. identify syllables which should take a pitch accent
  4. determine prosodic phrase structure

Point 1 requires syntactic information for homographs of the type "'record" vs. "re'cord" (noun vs. verb) and semantic information for homographs of the type "bow" /b@u/ vs. "bow" /bau/, where the part of speech alone does not settle the pronunciation.
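The two kinds of homograph can be illustrated with a small lookup sketch. It assumes a hypothetical lexicon (`HOMOGRAPHS`) and a hypothetical function (`pronounce`), and it assumes the TGen system can pass a part-of-speech label or a sense label along with each word. The labels "weapon" and "bend" are invented for the example, and the pronunciations are written in the same notation as the text above.

```python
# Hypothetical homograph lexicon: each entry names the kind of information
# needed ("pos" or "sense") and maps its values to a pronunciation.
HOMOGRAPHS = {
    "record": ("pos", {"noun": "'record", "verb": "re'cord"}),
    "bow": ("sense", {"weapon": "b@u", "bend": "bau"}),
}


def pronounce(word, pos=None, sense=None):
    """Return a pronunciation for `word`, or None if it cannot be disambiguated."""
    entry = HOMOGRAPHS.get(word.lower())
    if entry is None:
        # Not a known homograph: leave it to the normal letter-to-sound rules.
        return word
    feature, variants = entry
    value = pos if feature == "pos" else sense
    # None signals that the required syntactic or semantic label was not supplied.
    return variants.get(value)
```

In this sketch, `pronounce("record", pos="verb")` selects "re'cord", whereas `pronounce("bow", pos="noun")` returns `None` because a part-of-speech label does not carry the information needed to choose between the two pronunciations of "bow".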

Points 2 to 4 are essential steps in the specification of natural prosody and particularly of natural intonation. TTS systems generally attempt to determine this information by analysing the text. Text analysis is not an easy task and the text analysis modules of TTS systems are often not especially sophisticated, so there are often errors in the analysis of the text. Most often this results in monotonous prosody, but sometimes it results in prosodic patterns that are quite wrong and that decrease the intelligibility of the output speech.

MU-TALK currently has a primitive mechanism for determining semantic focus. Words are identified as function or content words with the help of an internal dictionary. A FIFO (first-in, first-out) buffer contains the last 30 content words. If a content word still exists in the buffer when that word occurs again, the new occurrence of the word is assumed to be "given" information and that word's ability to take a pitch accent is inhibited. Otherwise, the word is considered to be "new" information and it takes a pitch accent (and so helps to define an intonational phrase). This often results in too many accents, in improper placement of accents and sometimes in the failure to place a semantically appropriate pitch accent.
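The buffer mechanism described above can be sketched in a few lines of Python. This is a minimal illustration and not MU-TALK's actual code: the small `FUNCTION_WORDS` set stands in for the internal dictionary, the name `mark_accents` is hypothetical, and the sketch assumes that every content word, whether "given" or "new", is pushed into the buffer.

```python
from collections import deque

# Hypothetical stand-in for the internal dictionary of function words.
FUNCTION_WORDS = {"the", "a", "an", "of", "in", "on", "to", "and", "is", "was", "it", "that"}


def mark_accents(words, buffer_size=30):
    """Label each word as 'function', 'new' (accentable) or 'given' (accent inhibited)."""
    # deque with maxlen behaves as a FIFO: the oldest content word drops out first.
    recent = deque(maxlen=buffer_size)
    labels = []
    for word in words:
        key = word.lower()
        if key in FUNCTION_WORDS:
            labels.append((word, "function"))
            continue
        # A content word still in the buffer is treated as given information.
        status = "given" if key in recent else "new"
        labels.append((word, status))
        recent.append(key)  # assumption: every content word occurrence enters the buffer
    return labels
```

For the word sequence "the dog saw a cat and the dog ran", the second "dog" is labelled "given" and every other content word is labelled "new". Because the test is purely string identity within a fixed window, the sketch also shows why the approach over-accents: a synonym or pronoun reference to an earlier word is always treated as new.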

What can a TGen system provide to assist in solving these problems?

It is usually unreasonable to expect a TGen system to provide intonational phrase boundaries or, even more ambitiously, to specify a particular intonation contour or sequence of pitch accents. Most TGen systems do not have any explicit prosodic knowledge. Many of the models that TGen systems use to generate text have no way of relating grammar or meaning to intonational patterns (but note that the Systemic Functional TGen system under development in Christian Matthiessen's laboratory could potentially supply such information). Grammatical information within a TGen system is often poorly matched to the needs of a TTS system, which requires information for the disambiguation of homographs or for the determination of phrase boundaries.

Which system needs to be enhanced to facilitate the communication between a TGen and a TTS system?

Ideally both systems require some modification so that they are able to "meet half way". There may be information that a TGen system could potentially provide with some modifications that would be useful to a TTS system. Further, the TTS system will need modification to handle new types of input information. The supplied information may assist in the solution of some problems rather than completely solve them, so the TTS system may require enhanced parsers, semantic modules, etc. to make full use of the richer source of information.
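One way to picture "meeting half way" is an input format in which each word may carry optional annotations, with the TTS system falling back on its own text analysis wherever an annotation is missing. The sketch below is purely illustrative: the `AnnotatedWord` type, its field names and the `needs_text_analysis` helper are all hypothetical and do not describe any existing TGen or TTS interface.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AnnotatedWord:
    """A word as a TGen system might hand it to a TTS system (hypothetical format)."""
    text: str
    pos: Optional[str] = None       # part of speech, if the TGen system can supply it
    is_new: Optional[bool] = None   # new vs. given information, if known


def needs_text_analysis(words: List[AnnotatedWord]) -> List[str]:
    """Return the words for which the TTS system must still run its own analysis."""
    return [w.text for w in words if w.pos is None or w.is_new is None]
```

Making every annotation optional reflects the point that the supplied information may only partly solve a problem: a fully annotated word can bypass the TTS text analysis, while a partly annotated one still needs it.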