
Department of Linguistics

MU-Talk: Grapheme-to-Phoneme Rules

Robert Mannell

GTP Rules and Word Regularity

It is clear that English has many complex grapheme-to-phoneme (GTP) rules. It is equally clear that many words don't obey these rules. Some rules capture only slightly more than 50% of the words that share a certain orthographic pattern, so a large minority of words are mis-analysed by such rules. A truly complete set of GTP rules would contain a very large number of rules, and some would be so rare that they would apply to only a small number of root words, or even to a single root word, plus their derived forms. In this extreme case the rules would effectively contain whole words, and so the rule set would also be part dictionary.

It appears that a reasonable rule of thumb is to consider English to be approximately 70% regular (and therefore 30% irregular). This means that in a word list of 100,000 unique words, about 70,000 words would be pronounced correctly and about 30,000 incorrectly if processed by a good GTP-rule system. However, in a normal written text of 100,000 running words, a very large majority of words (> 90%) would be processed correctly. In such a text, many common words appear many times and many uncommon words do not appear at all. So regularity is not merely a function of how many distinct words would be correct; it is also a function of how common those words are.
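
The difference between accuracy over a list of unique words and accuracy over running text can be sketched in a few lines of Python. This is only an illustration: the toy text, the word counts, and the set of words that a rule system gets wrong are all invented for the example, and the function name is hypothetical.

```python
from collections import Counter

def list_and_text_accuracy(tokens, is_correct):
    """Return (accuracy over unique words, accuracy over running words)."""
    counts = Counter(tokens)  # unique word -> number of occurrences
    correct_words = [w for w in counts if is_correct(w)]
    list_accuracy = len(correct_words) / len(counts)
    text_accuracy = sum(counts[w] for w in correct_words) / sum(counts.values())
    return list_accuracy, text_accuracy

# Invented toy text of 100 running words built from 5 unique words.
tokens = ["cat"] * 50 + ["dog"] * 30 + ["sat"] * 15 + ["yacht"] * 3 + ["colonel"] * 2

# Assumption: a hypothetical rule system mispronounces exactly these words.
mispronounced = {"yacht", "colonel"}

list_acc, text_acc = list_and_text_accuracy(tokens, lambda w: w not in mispronounced)
print(f"unique-word accuracy: {list_acc:.2f}, running-text accuracy: {text_acc:.2f}")
```

In this toy case two of the five unique words are wrong, but they account for only five of the hundred running words, so the score over the text is much higher than the score over the word list.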

It should be noted that the 5000 most common words in English, plus words derived from them with affixes, make up on average about 90% of English text. If a GTP system gets those 5000 words and their derived forms right, it will achieve a success score of about 90% on an average body of text. It is worth noting, however, that a very large number of irregular words occur in this list of 5000 words. The reason for this is simple: if a word is common, the speakers of the language are very familiar with it and are likely to be aware of, and to remember, its irregular spelling. Such words form part of the mental lexicon of the average reader, and their irregularity is no problem to speakers of the language because the more common a word is, the less problem we have recognising its written form. For this reason common words retain irregular relationships between spelling and pronunciation, whilst uncommon words with such irregular patterns are more likely to change over time and to become regular.

GTP systems avoid the problem of common irregular words by placing the most common words in the language into a dictionary. When an accurate affix processor is added to this, a dictionary of about 15,000 words will, by itself, correctly process over 97% of the words in an average large text. The present TTS dictionary includes a much larger word list, which contains not only a very large number of English root words but also those derived forms that the affix processor would otherwise pronounce incorrectly.
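
The lookup order described above (dictionary first, then affix processing, then GTP rules) might be sketched as follows. This is a minimal sketch under stated assumptions, not the MU-Talk implementation: the function name, the toy lexicon, the single suffix, the phoneme symbols, and the one-letter-per-phoneme fallback are all hypothetical placeholders.

```python
def pronounce(word, lexicon, suffixes, rules):
    """Try the dictionary, then root + suffix, then fall back to GTP rules."""
    word = word.lower()
    if word in lexicon:  # 1. whole-word dictionary entry
        return lexicon[word]
    for suffix, suffix_phonemes in suffixes.items():  # 2. affix processor
        if word.endswith(suffix):
            root = word[: -len(suffix)]
            if root in lexicon:
                return lexicon[root] + suffix_phonemes
    return rules(word)  # 3. GTP rules for everything else

# Hypothetical data; the phoneme symbols are illustrative only.
lexicon = {
    "dog": ["d", "o", "g"],
    "house": ["h", "au", "s"],
    # Derived form stored whole because root + suffix would give the wrong result.
    "houses": ["h", "au", "z", "@", "z"],
}
suffixes = {"s": ["z"]}

def toy_rules(word):
    # Placeholder for a real rule set: one "phoneme" per letter.
    return list(word)

print(pronounce("dogs", lexicon, suffixes, toy_rules))    # root + suffix
print(pronounce("houses", lexicon, suffixes, toy_rules))  # whole-word entry
print(pronounce("zib", lexicon, suffixes, toy_rules))     # rule fallback
```

Checking the whole word before stripping any affix is what allows exceptional derived forms such as the "houses" entry here to override the affix processor.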

Goals of a Good GTP Rule Set

So, what should a good Australian English GTP-rule system be able to do? It should be able to pronounce regular words accurately and to mis-pronounce unfamiliar irregular words in the same way that a native speaker would mis-pronounce that word if it were unfamiliar.

For an overview of how the GTP rules work, click here.