Department of Linguistics

Markup Languages and Text-to-Speech

Robert Mannell

Currently, speech synthesizers are controlled by a multitude of proprietary tag sets. These tag sets vary substantially across synthesizers, and this variation inhibits the adoption of speech synthesis technology by developers.

If you examine the documentation for earlier versions of the Festival TTS System you will discover a rich and detailed scripting language that allows a user to finely control the synthesiser. Such fine control is essential for researchers using the speech synthesis package as a research tool, but it is not an appropriate interface in any environment where no assumptions can be made about which synthesiser will be used. The Festival interface was not strictly a "proprietary tag set", as it uses a public domain scripting language, but its use of that language is nevertheless sufficiently specific to Festival's architecture to make it unattractive to other developers of TTS systems. The main reason for this lack of appeal is that developers cannot assume the interface will be supported in all the environments where they would like to deploy their technology, such as on the Web. In other words, it is not a standard mark-up language (or tag set, to use SGML terminology). More recently, the Festival project has adopted the Sable speech markup language as an interface to its system.

The problem with raw text input to a TTS system is that, in the absence of a sophisticated NLP front-end (parser and text-understanding system), the TTS system is often unable to supply appropriate prosody to the output speech. Further, there is no way that a TTS system can predict appropriate voices or vocal affect (emotion) for a particular raw text. This problem can be addressed by inserting tags or codes into the text; the TTS system interprets these tags and responds with appropriate prosody or voice quality.
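As a rough illustration of this approach, the following fragment uses elements from the W3C SSML vocabulary to request a particular voice, prosody and pausing. The attribute values chosen here are illustrative only, and actual support for each element varies between synthesisers:

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-AU">
  <voice gender="female">
    <!-- slow the speaking rate and raise the pitch for this sentence -->
    <prosody rate="slow" pitch="high">
      Welcome to the Department of Linguistics.
    </prosody>
    <!-- insert a half-second pause -->
    <break time="500ms"/>
    <!-- emphasise a single word -->
    Please <emphasis level="strong">listen</emphasis> carefully.
  </voice>
</speak>
```

A synthesiser that supports SSML can render this markup directly; one that does not will typically ignore unknown tags and fall back on its default prosody.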

There are currently a few developing proposals for a standard speech markup tag system. They include:-

  1. W3C (WWW Consortium) proposals for the inclusion of voice/speech markup in future versions of HTML (particularly read their page on SSML)
  2. Sable, a synthesis markup language, which has since been superseded by SSML
  3. Java Speech Markup Language (JSML) for text input to Java Speech API speech synthesizers. Also see the W3C document on JSML
  4. VoiceXML, a dialog markup language; see the 2009 W3C VoiceXML draft.
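For comparison with the W3C approach, a roughly equivalent fragment in Sable markup might look like the following. Element names follow the Sable specification as supported by Festival; the speaker name and percentage values are illustrative assumptions, not values guaranteed by any particular synthesiser:

```xml
<SABLE>
  <SPEAKER NAME="male1">
    <!-- slow the rate and raise the base pitch for this sentence -->
    <RATE SPEED="-20%">
      <PITCH BASE="+10%">
        Welcome to the Department of Linguistics.
      </PITCH>
    </RATE>
    <!-- insert a major prosodic break -->
    <BREAK LEVEL="large"/>
    Please <EMPH>listen</EMPH> carefully.
  </SPEAKER>
</SABLE>
```

Note that Sable expresses prosodic changes relative to the synthesiser's defaults, which reflects its origin in the speech synthesis community rather than the SGML/HTML community.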

Have a look at this talk on the uses of speech mark-up languages in cascading style sheets. One of the proposed uses of speech mark-up languages is non-visual (i.e. auditory) web pages, for example for vision-impaired users.

Look at the W3C Voice Browser Activity Pages which examine the future of voice markup languages. Searching for "speech" on the W3C web site will also reveal numerous papers on speech mark-up.

It should be noted, when comparing the above languages, that Sable was developed by the speech synthesis community specifically to provide a standard markup language for interfacing to TTS systems. The other standards originate in the SGML community and treat speech as simply another interface to be rendered by a mark-up language in a way analogous to HTML (or indeed even incorporated into evolving HTML standards). Close reading of the proposed standards should make it clear that they do not all have the same goals.


You may like to look at the following reference, which discusses a proprietary markup language ("symbolic input") for the prosodic control of a specific TTS system:-

Kohler K.J. (1997) "Parametric control of prosodic variables by symbolic input in TTS synthesis", in van Santen J.P.H., Sproat R.W., Olive J.P., and Hirschberg J. (eds.) Progress in Speech Synthesis, Springer, New York.