The Proper Relation between SABLE and Aural Cascaded Style Sheets

Richard Sproat

What are Aural Cascaded Style Sheets?

Cascaded style sheets provide a mechanism for dissociating the document structure made explicit in an HTML or XML document from the way in which that document is rendered in a particular medium. Traditionally, browsers such as Netscape have interpreted HTML tags such as H1 in a fixed fashion, e.g. by using a bold font of a certain size. Cascaded style sheets let one separate the markup of a piece of text as a level-one header, using the tag H1, from the way that it is rendered, by allowing one to write a separate set of specifications - the style sheet - that defines how text enclosed in this tag should appear.
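As a visual analogue, a style-sheet rule for level-one headers might look like the following (an illustrative sketch, not taken from any particular browser's defaults):

```
H1 { font-size: 24pt; font-weight: bold }
```

With this rule in place, the HTML fragment <H1>Introduction</H1> carries only the structural information that "Introduction" is a level-one header; the decision to use a large bold font lives entirely in the style sheet.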

HTML documents may be rendered in a variety of media, including auditorily; this is the typical way in which a web page is presented to a visually impaired user. When rendering a document auditorily one can specify how the rendering is to be done using Aural Cascaded Style Sheets. Thus one might wish to indicate that H1 elements are rendered using a particular voice.

As an example of an Aural style specification, consider Figure 1, from Appendix D of the CSS2 Specification:


@media speech {
       H1, H2, H3, 
       H4, H5, H6    { voice-family: paul, male; stress: 20; richness: 90 }
       H1            { pitch: x-low; pitch-range: 90 }
       H2            { pitch: x-low; pitch-range: 80 }
       H3            { pitch: low; pitch-range: 70 }
       H4            { pitch: medium; pitch-range: 60 }
       H5            { pitch: medium; pitch-range: 50 }
       H6            { pitch: medium; pitch-range: 40 }
       LI, DT, DD    { pitch: medium; richness: 60 }
       DT            { stress: 80 }
       PRE, CODE, TT { pitch: medium; pitch-range: 0; stress: 0; richness: 80 }
       EM            { pitch: medium; pitch-range: 60; stress: 60; richness: 50 }
       STRONG        { pitch: medium; pitch-range: 60; stress: 90; richness: 90 }
       DFN           { pitch: high; pitch-range: 60; stress: 60 }
       S, STRIKE     { richness: 0 }
       I             { pitch: medium; pitch-range: 60; stress: 60; richness: 50 }
       B             { pitch: medium; pitch-range: 60; stress: 90; richness: 90 }
       U             { richness: 0 }
       A:link        { voice-family: harry, male }
       A:visited     { voice-family: betty, female }
       A:active      { voice-family: betty, female; pitch-range: 80; pitch: x-high }
}

Figure 1: An ACSS Specification

The sample specification in Figure 1 assigns, for instance, a male voice to all six levels of header, progressively narrower pitch ranges from H1 down to H6, distinctive voices for hyperlinks in their various states, and heavy stress for STRONG and B elements.

See the section on aural style sheets in the CSS2 Specification for definitions of the various constructs used.

What is SABLE?

SABLE is an XML/SGML-based markup language for text-to-speech synthesis. A more thorough description can be found in the paper from which this description is excerpted; the current set of specifications for SABLE can be found at sable.html.

SABLE is being developed because there is an ever-increasing demand for text-to-speech (TTS) synthesis technology in applications including e-mail reading, information access over the web, tutorial and language-teaching applications, and assistive technology for users with various handicaps. Because of incompatibilities in the control sequences of different TTS systems, an application developed with a particular TTS system A cannot be ported, without a fair amount of additional work, to a new TTS system B, for the simple reason that the tag set used to control system A is completely different from the one used to control system B. The large variety of tag sets used by TTS systems is thus an obstacle to the expanded use of this technology, since developers are often unwilling to expend the effort of porting their applications to a new TTS system, even if the new system in question is of demonstrably higher quality than the one they are currently using. SABLE attempts to address this problem by proposing a standard markup scheme intended to be synthesizer-independent.

More specifically, SABLE is being developed with the goal of providing a single, synthesizer-independent markup standard that application developers can target.

SABLE is based in part on two previous proposals: the Spoken Text Markup Language (STML) and the Java Speech Markup Language (JSML).

SABLE, like its predecessors, supports two kinds of markup. The first - termed text description in STML, and structural elements in JSML - marks properties of the text structure that are relevant for rendering a document in speech. In the current version of SABLE, text description is handled by the DIV tag, whose attribute TYPE may be set to such values as sentence, paragraph or even stanza; and by SAYAS, which marks the function of the contained region (e.g. as a date, an e-mail address, a mathematical expression, etc.), and thereby gives hints on how to pronounce it. The second kind of markup - STML's speaker directives or JSML's production elements - controls various aspects of how the speech is to be produced. Into this latter category fall tags such as EMPH (marks levels of emphasis), PITCH (sets intonational properties), RATE (sets speech rate), and PRON (provides pronunciations as phonemic strings).
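The two kinds of markup can be seen together in a small fragment like the following (an illustrative sketch using only the tags just named, with attribute values in the style of the Sable V.02 sample in Figure 2):

```
<DIV TYPE=sentence>
The meeting is on
<SAYAS MODE=date MODETYPE=dmy> 18/11/1960 </SAYAS> ,
<EMPH LEVEL=2.0> not </EMPH> the week after.
</DIV>
```

Here DIV and SAYAS describe the text - a sentence containing a date - while EMPH directs the synthesizer to produce the word "not" with extra emphasis.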

In both its generality and its coverage, SABLE has many advantages over existing markups such as Microsoft's SAPI or Apple's Speech Manager control set, not least that it is not tied to any single vendor's synthesizer.

A Sample of SABLE Text

As an illustration of SABLE markup, consider the text in Figure 2. This text is available as part of a demonstration of the SABLE markup language using the Bell Labs Multilingual Text-to-Speech System. See the Sable V.02 specification for definitions of the various tags:
<SABLE>
Welcome to the demonstration of the Sable markup language using 
the Bell Labs TTS system.
<SPEAKER gender=female age=younger>
This system allows you to play around with a subset of the
functionality of Sable. For example, you can switch languages between
</SPEAKER>
<LANGUAGE ID=FRA>
français
</LANGUAGE>
<LANGUAGE ID=ESL-MEXICAN>
español
</LANGUAGE>
<LANGUAGE ID=ESL-CASTILIAN>
español
</LANGUAGE>
<LANGUAGE ID=ITA>
italiano
</LANGUAGE>
<LANGUAGE ID=RON>
<SPEAKER AGE=middle GENDER=female>

</SPEAKER>
</LANGUAGE>
<LANGUAGE ID=DEU>
Deutsch
</LANGUAGE>
<LANGUAGE ID=ZHO CODE=BIG5>

</LANGUAGE>
You can also <EMPH LEVEL=2.0> emphasize </EMPH> words, or
put a break 
<BREAK LEVEL=large MSEC=200 TYPE="?"> between them.
<PITCH RANGE=HIGHEST>
You can set properties of the pitch range, <RATE SPEED=fastest> or the 
speech rate </RATE> .
</PITCH>
As we saw above, <SPEAKER age=child> you can set the speaker
</SPEAKER> .
You can insert an audio file, in this case an example of our Russian
TTS system:
<AUDIO SRC=russian6.wav>
And you can override the text by a call to the engine tag:
<ENGINE ID=BLTTS DATA="This is the Bell Labs TTS System.">
You won't hear this.
</ENGINE>
You can set the <PRON SUB="pronounciation"> pronunciation
</PRON> of a 
word, though we do not currently support IPA.

Finally you can control aspects of how some kinds of strings are said
using the "say as" tag. For example, a date in English:

<SAYAS MODE=date MODETYPE=dmy>
18/11/1960
</SAYAS>
or in French:
<LANGUAGE ID=fra>
<SAYAS MODE=date MODETYPE=dmy>
18.11.1960
</SAYAS>
</LANGUAGE>

Just be careful to keep some whitespace between tags and surrounding
material. The parser is not very smart.  See the file Read me dot text
for further information.

</SABLE>

Figure 2: A Sample SABLE Document

Why SABLE is Necessary in Addition to ACSS

There have been suggestions that Aural Cascaded Style Sheets render a special TTS markup language like SABLE redundant. For example, Dave Raggett, in an April 1998 presentation on Voice Browsers, has a slide entitled "Why SABLE is inappropriate!" His conclusion is:
Rather than adopting SABLE, it makes more sense to work on improving CSS to enable high quality speech synthesis for any HTML document.

We agree that ACSS needs to be improved towards this end, but this does not obviate the need for SABLE. The fact is that in a great many applications of TTS, the input text is not an HTML/XML document. To take just one example, the majority of e-mail is still in plain text, so an e-mail reader would not typically start with an HTML document.

Now suppose one wanted to mark up this text automatically in order to better control a synthesizer's reading of it: for e-mail, one might want to detect and set off in some special way the date in the message header, for example. One could presumably mark up the text in HTML and then provide an Aural Cascaded Style Sheet giving the aural rendition of the HTML tags. But such an approach would be rather roundabout; one would prefer to assign controls to the regions of interest directly. In other words, one would like to be able to mark up a text as follows:

<PITCH RANGE=HIGHEST> Nov 13, 1998 </PITCH>
This is precisely what SABLE allows, but ACSS doesn't.
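An e-mail reader could apply this sort of direct markup with a few lines of code. The sketch below is hypothetical - SABLE does not mandate any particular detection strategy, and the function name and sample message are made up - but it shows the idea of wrapping the Date header of a plain-text message in a PITCH tag:

```python
import re

def mark_up_email_date(message: str) -> str:
    """Wrap the value of the Date: header in a SABLE PITCH tag,
    so that a synthesizer reads it in a high pitch range."""
    return re.sub(
        r"^Date: (.+)$",
        r"Date: <PITCH RANGE=HIGHEST> \1 </PITCH>",
        message,
        flags=re.MULTILINE,
    )

msg = "From: someone@example.com\nDate: Nov 13, 1998\n\nHello."
print(mark_up_email_date(msg))
```

The plain text goes in, and SABLE-annotated text comes out, with no intermediate HTML document and no style sheet required.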

One might add that even for HTML documents, an audio browser might want finer, or at least different, control than what ACSS allows. The ACSS model links every speech property to a property of the document structure: ACSS properties are associated with particular structural elements of the text, or with features, such as font changes, that belong primarily to the visual modality. This may be appropriate in many cases, but it is not clear that it always is. In particular, it may not be desirable for every piece of text that appears in bold font in the visual modality to be rendered in the same aural style: the two modalities are different enough that a uniform bold face may be appropriate across a visual presentation while the corresponding uniform aural treatment is not.

The Proper Relation between ACSS and SABLE

For universal access, HTML and other web-based documents clearly need to allow for aural style specifications, and ACSS is clearly a good start towards that goal. On the other hand, SABLE is also needed for the reasons outlined above. What then is the proper relation between ACSS and SABLE?

The answer is straightforward: ACSS should be used to provide aural style sheets for web documents; the ACSS specifications should be implemented by a voice browser in a synthesizer-independent fashion using SABLE; the resulting text could then be read to the user using his or her favorite TTS system. The architecture would thus be as in Figure 3.


Figure 3 shows a simple flowchart of three boxes, one feeding into the next. The first box is labeled 'HTML Document + ACSS'; the second, 'Audio Browser: converts HTML text into SABLE using ACSS'; the third, 'TTS System: interprets SABLE text provided by the Audio Browser'.

Figure 3: A Model for Using ACSS and SABLE


Conversion Tools

The model of the relation between ACSS and SABLE discussed in the preceding section and diagrammed in Figure 3 presumes the existence of automatic conversion tools that specify how ACSS specifications translate into SABLE specifications. It is our intention that such tools will be provided in the future for at least that subset of ACSS that is practically implementable in current TTS systems. (For example, ACSS includes a voice quality attribute of "richness"; this is probably not practical to implement in any general way in current TTS systems, since it is far from clear what acoustic properties it corresponds to.)
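To make the shape of such a conversion tool concrete, here is a sketch of a converter for a small subset of ACSS (pitch, pitch-range, and stress). The function name, the keyword mapping, and the particular numeric translations are all hypothetical choices, not part of any specification:

```python
# ACSS keyword pitches mapped onto SABLE PITCH BASE values (hypothetical).
PITCH_MAP = {"x-low": "lowest", "low": "low", "medium": "medium",
             "high": "high", "x-high": "highest"}

def acss_to_sable(properties: dict, text: str) -> str:
    """Wrap `text` in SABLE tags implementing a subset of ACSS properties."""
    result = text
    # stress (0-100 scale, 50 = normal) becomes EMPH LEVEL (1.0 = normal).
    if "stress" in properties:
        level = properties["stress"] / 50.0
        result = f"<EMPH LEVEL={level}> {result} </EMPH>"
    # pitch and pitch-range both fold into a single PITCH tag.
    if "pitch" in properties or "pitch-range" in properties:
        attrs = []
        if "pitch" in properties:
            attrs.append(f"BASE={PITCH_MAP[properties['pitch']]}")
        if "pitch-range" in properties:
            # 0-100 scale, 50 = normal: express as % above/below normal.
            pct = (properties["pitch-range"] - 50) * 2
            attrs.append(f"RANGE={pct:+d}%")
        result = f"<PITCH {' '.join(attrs)}> {result} </PITCH>"
    return result

print(acss_to_sable({"pitch": "x-low", "pitch-range": 90, "stress": 20},
                    "Introduction"))
```

Applied to the H1 specification of Figure 1, this produces nested PITCH and EMPH tags of the kind discussed in the next paragraph.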

The model of the interaction would presumably be as follows. Consider the implementation of an H1 tag using the ACSS from Figure 1. Say the audio browser encounters the text

<H1> Introduction </H1>
in an HTML document using this ACSS. The specifications for all tags of the header (H) family are that the voice is "male" (ignoring for the present discussion the name "paul", which is not likely to be portable among systems anyway) and that the stress is "20" (we likewise ignore richness). For H1, the pitch is "x-low" and the pitch-range is "90" (relatively wide). Putting this all together, we have a male voice, speaking in the low part of its range, with a wide pitch range and lower-than-normal ("20") stress. Note that the "90" (and the "20") are on a 0-100 scale on which "50" is considered normal; "90" can thus be interpreted as 80% above normal. In SABLE terms this might translate into:
<PITCH BASE=lowest RANGE=80%> 
<EMPH LEVEL=0.5> 
Introduction 
</EMPH>
</PITCH>
Note that we translate stress=20 using the EMPH tag with a low setting for the LEVEL attribute, while the single PITCH tag implements the two properties pitch=x-low (BASE=lowest) and pitch-range=90 (RANGE=80%, that is, 80% above the current setting, assuming the current setting is the normal range). (There is some uncertainty here: the descriptions of pitch-range and stress in ACSS suggest that they roughly relate to the same thing, namely pitch range. Pitch-range "specifies variation in average pitch", and stress "specifies the height of `local peaks' in the intonation contour of a voice": these two properties are closely linked. This is a bit of unclarity that must be addressed in future versions of ACSS.) In short, the audio browser should use the ACSS specification to convert the H1 element into the equivalent SABLE elements shown above.
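The arithmetic behind the "80%" figure is just a linear rescaling of the 0-100 ACSS scale around its "normal" midpoint of 50. A one-line helper (with a hypothetical name) makes it explicit:

```python
def acss_scale_to_percent(value: float) -> float:
    """Map an ACSS 0-100 value (50 = normal) to a percentage
    above (+) or below (-) the normal setting."""
    return (value - 50) / 50 * 100

print(acss_scale_to_percent(90))   # pitch-range: 90 -> 80.0 (% above normal)
print(acss_scale_to_percent(20))   # stress: 20 -> -60.0 (% below normal)
```

On this reading, pitch-range=90 corresponds to RANGE=80% in the SABLE fragment above, and any value below 50 would come out as a negative percentage.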

References

Taylor, P. and Isard, A., 1997, "SSML: A speech synthesis markup language", Speech Communication, 21, 123-133.


Richard Sproat, November 1998.