
Department of Linguistics

Text-To-Speech Rule and Dictionary Development

Robert Mannell

Original Article: Mannell, R.H. and Clark, J.E. (1987), "Text-to-speech rule and dictionary development", Speech Communication 6, North-Holland, pp. 317-324.


Abstract

This paper describes the development and evaluation of the grapheme-to-phoneme sub-system of a complete real-time synthesis system under development at Macquarie University. It has been developed around a lexicon knowledge base which contains the 4000-5000 most common English words and which has been augmented by a suffix stripper and a set of grapheme-to-phoneme rules. Evaluation and development of this system have been facilitated by weighted statistics which reflect the frequency of occurrence of each word in the LOB and Brown corpora of English. These statistics are derived from a test word database which includes all acceptable Australian pronunciations (as defined by the Macquarie Dictionary) of each word, as well as their LOB and Brown frequency counts. The pronunciation derived by the system is compared to the Macquarie Dictionary pronunciations and given a score proportional to its frequency in the two corpora. These scores make it possible to decide which alterations to the rules or lexicon will have the greatest effect on total system accuracy in ordinary running text (as reflected by the corpus frequencies).

Introduction

The grapheme-to-phoneme (GTP) transcription system described in this paper has been designed as a practical top end of a complete text-to-speech (TTS) system developed at the Speech, Hearing and Language Research Centre (SHLRC) at Macquarie University. A general outline of the SHLRC text-to-speech system is shown in figure 1. The intention has been from the outset to produce a computationally efficient knowledge base which could readily be implemented as firmware in a dedicated real-time hardware configuration. It has not been the intention to produce a theoretically sophisticated morphological or phonological analysis of the grapheme-to-phoneme conversion problem. Rather, the whole sub-system is designed around a lexicon which contains the most common English words together with their pronunciations and potential grammatical categories, and which could be used as the basis of any syntactic analysis that may be necessary in order to predict (for example) appropriate prosody. It seems likely that as computer hardware (eg. laser discs, faster processors) and software (eg. more efficient indexing routines) develop, it will become easier to design very large databases which can be accessed in times comparable to those required for the operation of a grapheme-to-phoneme rule system, thus rendering such a rule system largely redundant, except for nonce words.

The GTP module described in this paper consists of three major routines. The first is a lexicon or dictionary which contains the most common words and the most important exceptions to the GTP rules. This is augmented by a second routine, a suffix stripper, which identifies and temporarily removes suffixes from the input word and searches for the resulting root word in the lexicon. The lexicon can therefore cover not only the words it contains but also derivatives formed from those words by adding one (or more) of the most common suffixes. Words that are not covered by the lexicon and suffixes are then submitted to a third routine, a GTP rule routine, which produces broad phonetic script using rules supplied in the rule base. The remainder of this paper describes the development of these three routines (ie. lexicon, suffix stripper, and GTP rules).

Figure 1: An overview of the SHLRC text-to-speech system.

Grapheme-to-phoneme Rules

Laver and Clark (1982) converted an existing set of GTP rules (Elovitz et al 1976) into a system capable of producing Australian English. These rules have been undergoing continuous refinement at SHLRC ever since.

The rules are written in an easily accessible rule language formatted in a way which is quite intelligible to linguists. In order to link the rules to the total system they must first be passed through a pre-compiler which converts them into FORTRAN subroutines. The rules are accessed sequentially until a rule which satisfies the current part of the input string is found. The procedure defined by that rule is then performed, a pointer is advanced to the next unprocessed part of the input string, and so on until the whole string has been converted.

Some typical rules are as follows :-

#(CH)[C<1>] = /k/ eg. Christmas
[C<1>]E(CH) = /k/ eg. technical
(CH) = /tʃ/ default "CH" rule
(C)[V<f>] = /s/ eg. ceiling
(COM)[SUFFIX] = /kɐm/ eg. coming
(CC)[V<1-n>] = /ks/ eg. accept
(C) = /k/ default "C" rule

where :-

  1. (   ) contain the string to be transcribed
  2. All other characters to the left of the "=" represent the orthographic context of the characters in the (   )
  3. [   ] surround codes, not orthographic characters, for example
    • C = consonant
    • V = vowel
    • SUFFIX = one of a list of suffixes
  4. <   > modifies the code in [   ], for example
    • [C<1>] = one consonant
    • [V<1-n>] = any number of vowels
    • [V<f>] = a front vowel
  5. Letters not surrounded by any of the above are other orthographic characters which define the context.
  6. # = word break
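
The sequential, first-match strategy described above can be illustrated with a short sketch. The following Python fragment is purely illustrative: it hard-codes a few of the "C"/"CH" rules above as predicates rather than using the actual rule language (which is pre-compiled into FORTRAN subroutines), and all function names and the orthographic approximations of codes such as [C<1>] and [V<f>] are assumptions made for this example.

# Minimal sketch (not the SHLRC rule compiler): sequential rule matching for
# a handful of "C" rules. Context tests here are hard-coded Python predicates
# rather than the [C<1>], [V<f>] codes of the real rule language.

VOWELS = set("AEIOUY")
FRONT_VOWELS = set("EIY")          # crude orthographic stand-in for [V<f>]

def match_ch_word_initial(word, i):
    # #(CH)[C<1>]  ->  /k/   eg. Christmas
    return (i == 0 and word[i:i+2] == "CH"
            and i + 2 < len(word) and word[i+2] not in VOWELS)

def match_c_front_vowel(word, i):
    # (C)[V<f>]  ->  /s/     eg. ceiling
    return word[i] == "C" and i + 1 < len(word) and word[i+1] in FRONT_VOWELS

# Rules are tried in order; each entry is (predicate, letters consumed, phonemes).
RULES = [
    (match_ch_word_initial, 2, "k"),
    (match_c_front_vowel,   1, "s"),
    (lambda w, i: w[i:i+2] == "CH", 2, "tʃ"),   # default "CH" rule
    (lambda w, i: w[i] == "C",      1, "k"),    # default "C" rule
]

def transcribe_c(word):
    """Walk the word left to right, applying the first rule that matches."""
    word = word.upper()
    out, i = [], 0
    while i < len(word):
        for predicate, consumed, phonemes in RULES:
            if predicate(word, i):
                out.append(phonemes)
                i += consumed
                break
        else:
            out.append(word[i].lower())   # no rule here: pass the letter through
            i += 1
    return "".join(out)

print(transcribe_c("ceiling"))    # starts with /s/
print(transcribe_c("christmas"))  # starts with /k/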

Lexicon

The lexicon described below was originally designed (Clark and Summerfield, 1985) with the intention of capturing as large a percentage of words in continuous running text as possible. This was achieved by examining two English language word count corpora (the Lancaster-Oslo-Bergen (LOB) British English corpus and the Brown American English corpus). It was found (Clark and Summerfield, 1985) that about 4500 words accounted for approaching 90% of all words used in continuous text, and so these words were added to the lexicon. This was done to reduce the use of the rather time consuming letter-to-sound rule system. The most common words for which the GTP rules produced erroneous transcriptions were also added, especially if it turned out that the word could only be transcribed correctly by creating a rule for it alone. Lexicon format and content has undergone considerable modification since it was first developed, however its ongoing development and compilation is still guided by the above considerations. The lexicon is fully indexed and data retrieval is very fast.

An example of a lexicon entry is as follows:-

"abstract       \P 'æbstrækt əb'strækt \G adj nnn vbt(2)"

In this format, the headword field is fixed (15 characters) to facilitate indexing, whilst the remaining fields (the "\P" pronunciation field and the "\G" grammar field) are of variable length.

Words which are pronounced differently for different parts of speech have each grammatically distinct pronunciation listed in the "\P" field. If a part of speech is not followed by a bracketed number it takes the first pronunciation; if it is followed by a bracketed number it takes the correspondingly numbered pronunciation. In the above example the noun and adjective take the first pronunciation /'æbstrækt/ and the transitive verb (tagged with a (2)) takes the second pronunciation /əb'strækt/.
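
This entry format lends itself to simple parsing. The following is a minimal sketch, assuming entries arrive as plain text lines; the function names and the way grammar tags are matched are assumptions made for illustration, not the system's actual data structures.

# Illustrative sketch only: parse a lexicon entry of the format described
# above (fixed 15-character headword, then "\P" and "\G" fields) and pick
# the pronunciation for a given part-of-speech tag.

def parse_entry(line):
    headword = line[:15].strip()
    rest = line[15:]
    prons = rest.split(r"\P", 1)[1].split(r"\G", 1)[0].split()
    grams = rest.split(r"\G", 1)[1].split()
    return headword, prons, grams

def pronunciation_for(entry, pos_tag):
    """A bracketed number after a part of speech selects the correspondingly
    numbered pronunciation; otherwise the first pronunciation is used."""
    headword, prons, grams = entry
    for g in grams:
        if g.startswith(pos_tag):
            if "(" in g:
                return prons[int(g[g.index("(") + 1:g.index(")")]) - 1]
            return prons[0]
    return None

entry = parse_entry("abstract       " + r"\P 'æbstrækt əb'strækt \G adj nnn vbt(2)")
print(pronunciation_for(entry, "nnn"))   # 'æbstrækt
print(pronunciation_for(entry, "vbt"))   # əb'strækt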

Suffix Stripper

A two-part suffix stripper rule-base and database has been designed to augment the lexicon and to increase its capture range. The first part of the rule/database contains words which frequently participate in compound word formation whilst the second part contains the majority of the normal suffixes.

A typical suffix database entry is as follows :-

"-est, [|ɜː|ə|]=/Rəst/, *=/əst/"

which means

"when the suffix "-est" is to be attached to a word whose phonetic representation ends with one of the phonemes //, /ɜː/, /ə/, or // then this suffix is pronounced /Rəst/, otherwise, it is pronounced /əst/." ("R" is a linking /r/)

A further type of rule included in the suffix database is one which states that when a certain suffix is added to a word ending in a certain phoneme, that phoneme is changed into another phoneme.

eg. " -ion, [z]=/ʒən/, *=/ʃən/, {s>}, {t>},{aɪz>ɪ},{z>}"

which means

"when the suffix "-ion" is to be attached to a word whose phonetic representation ends with the phoneme /z/, then this suffix is pronounced /ʒən/, otherwise, it is pronounced /ʃən/. A pre-suffix terminal /s/, /t/ or /z/ ( but not /aɪz/ ) is first deleted. A terminal /aɪz/ is converted to a /ɪ/."

eg. administrate → administration
    /...eɪt/ + "-ion" becomes /...eɪʃən/
    revise → revision
    /...aɪz/ + "-ion" becomes /...ɪʒən/

In other words, the rules in {   } describe the fate of terminal phonemes in the root word, before the addition of the suffix. The first rule to be accessed which matches the root's terminal phoneme(s) is the rule used. Subsequent rules are then ignored. Pairs such as divide/division are treated as separate lexicon entries.
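
As an illustration of how such an entry might be applied, the following Python sketch combines a root pronunciation with the "-ion" entry above. The data structures, field names and function name are assumptions made for this example and do not reflect the system's internal representation.

# A sketch of how one suffix entry might be applied, assuming the entry has
# already been parsed into Python data structures.

ION_RULE = {
    "suffix": "-ion",
    # (final-phoneme condition, suffix pronunciation); "*" is the default
    "pronunciations": [("z", "ʒən"), ("*", "ʃən")],
    # disjunctively ordered rewrites of the root's terminal phonemes:
    # the first matching {old > new} rule applies, the rest are ignored
    "rewrites": [("s", ""), ("t", ""), ("aɪz", "ɪ"), ("z", "")],
}

def attach_suffix(root_pron, rule):
    # choose the suffix pronunciation from the root's final phoneme(s)
    suffix_pron = None
    for condition, pron in rule["pronunciations"]:
        if condition == "*" or root_pron.endswith(condition):
            suffix_pron = pron
            break
    # apply the first terminal rewrite that matches, then stop
    for old, new in rule["rewrites"]:
        if root_pron.endswith(old):
            root_pron = root_pron[:len(root_pron) - len(old)] + new
            break
    return root_pron + suffix_pron

print(attach_suffix("əd'mɪnəstreɪt", ION_RULE))  # -> əd'mɪnəstreɪʃən
print(attach_suffix("rə'vaɪz", ION_RULE))        # -> rə'vɪʒən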

A further set of rules have been developed to deal with examples of the following type :-

running → runn + -ing
happiness → happi + -ness
saved → sav + -ed

ie. examples which, upon suffix deletion, result in a modified version of the root word. These disjunctively ordered rules state that if the word in the search buffer is not found and the suffix strip flag is set, then (as sketched in code after this list):

  1. IF the word ends in two identical consonants then delete one and search for the result in the lexicon.
  2. IF the word ends in an "i", then replace it with a "y" and search for the result in the lexicon.
  3. IF the word ends in a consonant, then add an "e" and search for the result in the lexicon.
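
A minimal sketch of these three repair rules follows, assuming a plain Python dictionary stands in for the indexed lexicon; the roots and pronunciations shown are purely illustrative.

# A sketch of the three root-restoration rules, assuming "lexicon" is a
# dictionary from orthographic root to pronunciation (a stand-in for the
# real indexed lexicon lookup).

lexicon = {"run": "rɐn", "happy": "hæpi", "save": "seɪv"}

CONSONANTS = set("bcdfghjklmnpqrstvwxz")

def restore_root(stripped):
    """Try the stripped form, then the three repair rules in order."""
    if stripped in lexicon:
        return stripped
    # 1. word ends in two identical consonants: delete one
    if len(stripped) >= 2 and stripped[-1] == stripped[-2] and stripped[-1] in CONSONANTS:
        candidate = stripped[:-1]
        if candidate in lexicon:
            return candidate
    # 2. word ends in "i": replace it with "y"
    if stripped.endswith("i"):
        candidate = stripped[:-1] + "y"
        if candidate in lexicon:
            return candidate
    # 3. word ends in a consonant: add an "e"
    if stripped and stripped[-1] in CONSONANTS:
        candidate = stripped + "e"
        if candidate in lexicon:
            return candidate
    return None

print(restore_root("runn"))    # -> run      (rule 1)
print(restore_root("happi"))   # -> happy    (rule 2)
print(restore_root("sav"))     # -> save     (rule 3)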

The suffix stripping algorithm also deals with situations where the root remaining after removing the suffix (but before applying the above three rules) is identical to another word in the lexicon.

eg. barred → barr + -ed → bar + -ed is correct, but
    bared → bar + -ed is not correct.

To overcome this problem, any words so affected have had an extra field added to their lexicon entry. This "\H" field refers the lexicon routine to the appropriate headword if the suffix strip flag is set and if the suffix belongs to the second part (see above) of the suffix database.

Example 1

bared → bar + -ed

In the lexicon the following entry is initially found.

" bar \H bare \P ba \G nnn prp vbt"

Since the suffix flag is set, and the suffix "-ed" is in the second part of the rule/database then the headword in the "\H" field is searched for,

ie. " bare \P beə \G adj vbt"

and the pronunciation /beəd/ is derived.

Example 2

barman → bar + -man

In this case the suffix flag is set but the suffix "-man" is in the first part of the rule/database and so the pronunciation for "bar" is selected, giving the final pronunciation /ba:mən/.
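
The redirection logic of these two examples can be sketched as follows. This is a simplified illustration: the entry layout, the suffix sets and the function name are assumptions, and in the actual system the "\H" redirection only applies when the root was matched directly in the lexicon (ie. before the three repair rules above were needed).

# Illustrative only: lexicon entries as small dicts; "alt_headword" stands in
# for the "\H" field described above.
LEXICON = {
    "bar":  {"pron": "ba:", "alt_headword": "bare"},
    "bare": {"pron": "beə", "alt_headword": None},
}
PART_TWO_SUFFIXES = {"-ed", "-ing", "-s", "-es"}   # normal suffixes
PART_ONE_SUFFIXES = {"-man"}                       # compound-forming words

def root_pronunciation(root, suffix, suffix_flag_set):
    entry = LEXICON[root]
    # follow the \H field only for part-two suffixes with the flag set
    if suffix_flag_set and suffix in PART_TWO_SUFFIXES and entry["alt_headword"]:
        entry = LEXICON[entry["alt_headword"]]
    return entry["pron"]

print(root_pronunciation("bar", "-ed", True))    # beə  ("bared"  -> /beəd/)
print(root_pronunciation("bar", "-man", True))   # ba:  ("barman" -> /ba:mən/)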

Overall Operation using Examples of "eath" Words

In order to demonstrate the operation of the various parts of the text-to-speech subsystem, it is instructive to examine a set of words which contain a common orthographic string yet have differing phonological realisations, derived using all the features of the system. For this purpose, words containing the string "EATH" were selected.

The lexicon, suffix and rule entries relevant to these examples are as follows:-

A) Lexicon

  1. breath \H breathe \P breθ
  2. breathe \P brið

B) Suffixes

  1. -y, [a:|ɜ:|ə|o:] = /ri:/, * = /i:/
  2. -es, [p|t|k|f|θ] = /s/, [s|z|ʃ|ʒ|tʃ|dʒ] = /əz/, * = /z/
  3. -s, [p|t|k|f|θ] = /s/, [s|z|ʃ|ʒ|tʃ|dʒ] = /əz/, * = /z/
  4. -ing, [a:|ɜ:|ə|o:] = /rɪŋ/, * = /ɪŋ/

nb. "|" means "or" . ie. "[p|t]" means following either /p/ or /t/.

C) Orthography to phoneme rules

  1. #BR(EA) = /e/
  2. (EA) = /i:/
  3. (TH)# = /θ/
  4. (THE)# = /ð/
  5. (THS)# = /θs/
  6. (TH)[SUFFIX]# = /ð/

The route taken by each "EATH" word through the system, and the pronunciation derived for the word from the "EA" onwards, are as follows:-

breath       lexicon 1                               /eθ/
breathe      lexicon 2                               /i:ð/
breaths      lexicon 1 + suffix 3                    /eθs/
breathes     lexicon 1 (\H)--> lexicon 2 + suffix 2  /i:ðz/
breathing    lexicon 1 (\H)--> lexicon 2 + suffix 4  /i:ðɪŋ/
breathy      lexicon 1 (\H)--> lexicon 2 + suffix 1  /i:ði:/ *
bequeath     rules 2 + 3                             /i:θ/ **
bequeathes   rules 2 + 6 + ...                       /i:ðz/
bequeathing  rules 2 + 6 + ...                       /i:ðɪŋ/
wreath       rules 2 + 3                             /i:θ/
wreathe      rules 2 + 4                             /i:ð/
wreaths      rules 2 + 5                             /i:θs/
wreathes     rules 2 + 6 + ...                       /i:ðz/
wreathing    rules 2 + 6 + ...                       /i:ðɪŋ/

The only mispronunciations produced by this combination of rules, suffixes and lexicon are indicated with * and ** (in Australian English, the pronunciation marked by ** is reasonably acceptable). Once such exceptions are identified, they may be placed in the lexicon if this is considered desirable. A highly unacceptable pronunciation is more likely to be added to the lexicon than a partially acceptable one such as for "bequeath". New rules are only profitably inserted if they solve more than one mispronunciation (ie. lexicon entries are considered preferable to single word rules because of lower computational overheads).

Evaluation and Development

Considerable improvement in the efficiency and accuracy of the rules has been achieved by producing an input word test list which represents all the words in the LOB and Brown corpora that are also in the Macquarie Dictionary (Delbridge 1981) database. Each entry includes the word frequency count from both corpora and all pronunciations given in the Macquarie Dictionary. A test version of the TTS program was then developed which passes each word in this list through the lexicon, suffix stripper and GTP rules and then compares the output phonemic string with the acceptable pronunciations derived from the Macquarie Dictionary. This allowed the system's transcription of each word to be scored as correct, almost correct (only schwa/vowel mismatches) or incorrect. Two sets of scores were derived as follows :-

 i) Raw scores (ie. a score of one for each word)

 ii) Corpus weighted scores (ie. a score for each word equal to the sum of the LOB and the Brown count for that word. This is used to predict the frequency of that word in average running English text.)

Further, it was determined which rules were used and whether each rule was applied correctly, almost correctly, or incorrectly. Raw and corpus-weighted scores were calculated.
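
The two scoring schemes can be illustrated with a small sketch. The record layout and the counts below are made up for illustration; in the actual system each test word carries its real LOB and Brown frequency counts.

# Illustrative only: each record carries a made-up combined LOB+Brown count
# and the outcome of comparing the system's output with the Macquarie
# Dictionary pronunciations.
test_words = [
    {"word": "the",      "corpus_count": 120000, "outcome": "correct"},
    {"word": "wreathes", "corpus_count": 2,      "outcome": "correct"},
    {"word": "breathy",  "corpus_count": 3,      "outcome": "incorrect"},
]

def percent_correct(items, weighted):
    """Raw score counts each word once; the weighted score counts its corpus frequency."""
    total = sum(it["corpus_count"] if weighted else 1 for it in items)
    correct = sum((it["corpus_count"] if weighted else 1)
                  for it in items if it["outcome"] == "correct")
    return 100.0 * correct / total

print(f"raw score:             {percent_correct(test_words, weighted=False):.2f}% correct")
print(f"corpus-weighted score: {percent_correct(test_words, weighted=True):.2f}% correct")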

a) String Matching Algorithm

Central to this procedure is a string matching algorithm which compares the phonemic string derived by the entire sub-system with all the acceptable Australian pronunciations listed in the Macquarie Dictionary database. Although on the surface this seems to be a fairly straightforward procedure, it is greatly complicated by the possibility of erroneous phoneme insertion and deletion as well as substitution of digraphs for single character phonemes (and vice versa). Such an algorithm is necessitated by the desire for accurate statistics on the performance of each rule rather than simply a tally of overall system performance as measured by gross word level errors.

The first step in the string matching algorithm is the marking of each phoneme as a vowel or consonant, and the identification of digraphs as single phonemes, e.g.

/foʊni:m/ |C<f>|V<oʊ>|C<n>|V<i:>|C<m>|

This analysis is applied to both the output phonemic string and also to all the Macquarie Dictionary pronunciations (test strings). The output string and the test strings are then aligned into consonant and vowel groupings.

If there are equal numbers of vowel and consonant groupings in each string and both strings start with the same type of group (ie. either both consonants or both vowels) then the match procedure proceeds on the (provisional) assumption that the aligned groups are in some way equivalent and the phonemes in each group can be compared with their counterparts in the equivalent group of the other string. e.g. comparison of two consonant clusters:

|spl| with |pl| best match = |spl| / |.pl|

indicating a deletion error (if /spl/ is the "correct" pronunciation) or an insertion error (if /pl/ is correct). The rule responsible for this insertion/deletion can then be flagged for an error (and the corpus weighted frequency added to the statistics). Conversely, if the output of a rule completely matches a substring of one of the test strings then it is flagged as a success.
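
The grouping and within-cluster comparison can be sketched roughly as follows. This is an illustration only, using difflib for the phoneme-level alignment; the real algorithm described here additionally handles digraph identification, group re-alignment and rule attribution, and the phoneme inventory below is only a partial stand-in.

from difflib import SequenceMatcher
from itertools import groupby

# Partial, illustrative vowel inventory used only to split C and V groups.
VOWELS = {"i:", "ɪ", "e", "æ", "a:", "ɐ", "ɔ", "o:", "ʊ", "ʉ:", "ə", "ɜ:",
          "eɪ", "aɪ", "oɪ", "oʊ", "aʊ", "ɪə", "eə", "ʊə"}

def cv_groups(phonemes):
    """Group a phoneme list into alternating consonant and vowel clusters."""
    return [(("V" if is_vowel else "C"), list(group))
            for is_vowel, group in groupby(phonemes, key=lambda p: p in VOWELS)]

def compare_cluster(output, test):
    """Report phoneme-level insertions/deletions within aligned clusters."""
    sm = SequenceMatcher(a=output, b=test)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op != "equal":
            print(op, output[i1:i2], "vs", test[j1:j2])

print(cv_groups(["f", "oʊ", "n", "i:", "m"]))
# [('C', ['f']), ('V', ['oʊ']), ('C', ['n']), ('V', ['i:']), ('C', ['m'])]

compare_cluster(["p", "l"], ["s", "p", "l"])   # reports the missing /s/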

The algorithm becomes especially complex if an entire consonant or vowel grouping is missing or a spurious one is inserted. e.g. compare |CC|V|C|V|C|V| (test string) with |C|V|CC|V| (output string). These could be matched up in the following ways (amongst others) :-

(1) and (2): two candidate alignments of the output groups against the test groups.

Match (1) is preferable to (2) as it involves only the deletion of one vowel group whilst (2) involves the deletion of one consonant group and one vowel group. In other words the best fitting CV group pattern is selected. Individual phoneme matching can then be carried out within each group.

Sometimes, two equally well matching CV group patterns may be discerned. eg. compare |C|V|C|V|C|V| (test string) with |C|V|CCC|V| (output string). This gives the following two possible matches :

(1) and (2): two equally good candidate alignments of the output groups against the test groups.

both of which involve one vowel group deletion and at the phoneme level one vowel deletion and one consonant insertion. Both patterns are tested at the phoneme level and the pattern with the best match is selected.

Finally, if any ambiguity remains, the pattern matching resorts to probable phoneme confusions. For example, /i:/, /ɪ/ and /aɪ/ are more likely to be derived from an orthographic "I" than, say, a /ʉ:/.

Occasionally the possible matches are so poor that all rules used to derive that output are flagged as in error.

b) Corpus Based Statistical Analysis.

All incorrect words were output to an errors file with their corpus scores, and this file was then sorted by corpus count with the most frequently occurring words at the top. Since it can be reasonably assumed that high corpus count words occur more frequently in normal running text than low corpus count words, it was decided to determine the cause of errors in high frequency words first and to make adjustments to the TTS rules or suffix rules if possible, or otherwise to add the word to the lexicon. As a further aid to TTS rule enhancement, the rules with the highest corpus-weighted error scores were examined closely, and in this way many rules were removed, repositioned, split into two or more rules of greater specificity, or replaced with a correct rule. In all, about one third of the present TTS rules did not exist at all, or not in their current form, before this process was undertaken.
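
A sketch of this prioritisation step is given below. The error records are made up for illustration, as is the idea of storing the blamed rule with each record; the point is simply that both the word list and the per-rule error totals are ordered by corpus count so that the highest-impact problems are dealt with first.

from collections import Counter

# Made-up error records for illustration: each mistranscribed word carries a
# fictitious combined LOB+Brown count and the rule blamed for the error.
errors = [
    {"word": "alpha", "corpus_count": 40, "rule": "rule A"},
    {"word": "beta",  "corpus_count": 3,  "rule": "rule B"},
    {"word": "gamma", "corpus_count": 11, "rule": "rule A"},
]

# words in descending corpus-count order: fix the highest-impact words first
for record in sorted(errors, key=lambda r: r["corpus_count"], reverse=True):
    print(record["corpus_count"], record["word"], record["rule"])

# corpus-weighted error score per rule: the worst offending rules are
# candidates for removal, repositioning, splitting or replacement
rule_scores = Counter()
for record in errors:
    rule_scores[record["rule"]] += record["corpus_count"]
print(rule_scores.most_common())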

Prior to adopting this methodology, several months were spent improving the raw word score (ie. for the total system of lexicon plus suffixes plus TTS rules) from about 50% correct to 69% correct using a trial and error approach. Increased effort was eventually met with drastically diminishing returns. Upon adopting the procedure described here, the raw score for words correct was easily improved by another 8% (to 77%).

These raw results do not, however, properly reflect the accuracy of the system, because many of the words would rarely be met in running text. At the commencement of the present procedure, the corpus-weighted results showed 97.13% of words correct. This improved to 99.14% by the end of the new period. In other words, in an average text of 10000 words (including, of course, many repetitions of the more common words) 9713 words originally, and 9914 words now, would be expected to be correct (assuming, of course, that such corpus frequencies reasonably predict the actual frequencies encountered by a text-to-speech system).

Figure 2 shows the percentage capture rate for words correctly output by the lexicon, lexicon plus suffixes, and the TTS rules for both raw and corrected scores. Clearly, the lexicon captures the largest number of individual tokens in average running text (ie. corpus-weighted score), whilst the TTS rules capture the largest number of unique words (ie. unweighted score).

Figure 2: Transcription statistics.

The suffix stripper in conjunction with the lexicon (providing root pronunciations) is now accurate 88.00% of the time (unweighted score) or 98.46% of the time (corpus-weighted score) for those words not already captured by the lexicon and for which a suffix is detected. The TTS rules now have a corpus-weighted score of 85% correct (of words not already captured by the lexicon and suffix stripper). It seems likely that with more sophisticated error analysis these results can be improved to better than 90%.

References

Clark, J.E. and Summerfield, C.D. (1985), "Developing a text to speech system dictionary", Festschrift in Honour of Arthur Delbridge, Beiträge zur Phonetik und Linguistik 48, pp 251-262.

Delbridge, A. (1981), The Macquarie Dictionary, Macquarie Library.

Elovitz, H., Johnson, R., McHugh, A. and Shore, J. (1976), "Letter-to-sound rules for automatic translation of English text to phonetics", IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-24, pp. 446-459.

Johansson, S. (1978), Manual of Information (to accompany the Lancaster-Oslo/Bergen Corpus of British English, for use with digital computers), University of Oslo.

Kucera, H. and Francis, W.N. (1967), Computational Analysis of Present Day American English, Brown University Press, Providence, Rhode Island.

Laver, J. and Clark, J.E. (1982), "Australian English letter-to-sound rules based on the NRL rules", unpublished.