Department of Linguistics

MU-Talk: Grapheme-to-Phoneme Dictionary

Robert Mannell

The core functionality of a grapheme-to-phoneme dictionary is relatively simple. It consists of a mapping between an orthographic and a phonetic transcription of each word. A minimal TTS dictionary would consist of only two fields, a headword (orthographic) field and a pronunciation field.

In a text-to-speech (TTS) system a part-of-speech (POS) field could be used in a grammatical pre-processor. A POS field might also contain probabilistic information about, for example, the relative frequency of noun versus verb usage of a word. A grammatical pre-processor could be used in the determination of phrase boundaries, the selection of content words for tone placement or the selection of the appropriate alternative pronunciation in a grammatical stress-alternation pair.

A useful dictionary needs to facilitate morphological decomposition of complex words. A TTS system very often encounters words that contain common inflectional and derivational affixes (suffixes and prefixes). A TTS system needs to be able to derive the pronunciations of words like "unhappy", "happiness", "running", etc. from the root word pronunciation plus the affix pronunciation (and often some additional affix pronunciation rules). In the MU-Talk TTS system the problem of morphological decomposition of complex words is presently restricted to an affix processor.
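The idea of deriving a complex word's pronunciation from a root pronunciation plus an affix pronunciation can be sketched as follows. This is only a minimal illustration, not MU-Talk's affix processor: the function name `derive`, the two tables and the suffix pronunciations are assumptions made for the example (the root pronunciations are copied from the sample entries in this document). The sketch ignores spelling changes at the morpheme boundary (eg. "abase" + "-ing" = "abasing") and the additional affix pronunciation rules mentioned above.

```python
# Hypothetical mini-lexicon: root pronunciations copied from the sample
# dictionary entries in this document (ANDOSL machine-readable symbols).
ROOTS = {
    "abandon": "@'bAnd@n",
    "abash": "@'bAS",
    "abject": "'AbdZEkt",
}

# Hypothetical suffix table (an assumption made for this example).
SUFFIXES = {
    "ment": "m@nt",
    "ness": "n@s",
}


def derive(word):
    """Return a pronunciation for word, or None if it cannot be derived."""
    if word in ROOTS:
        return ROOTS[word]
    for suffix, suffix_pron in SUFFIXES.items():
        if word.endswith(suffix):
            root = word[: -len(suffix)]
            if root in ROOTS:
                # Plain concatenation: no stress shift, no spelling repair.
                return ROOTS[root] + suffix_pron
    return None


print(derive("abandonment"))  # @'bAnd@nm@nt
print(derive("abjectness"))   # 'AbdZEktn@s
```

Words whose stress or vowel quality changes under affixation (eg. "abdicate" versus "abdication") cannot be handled by plain concatenation of this kind.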

A planned future modification to MU-Talk's morphology processing capabilities is a module which would attempt to break up complex words into smaller fragments (text chunking) and then attempt to find complete sequences of chunks that can be identified as words in the dictionary (eg. "storeman" = "store" + "man"). A much harder task (not currently contemplated) is an algorithm which would attempt to analyse compound words for which only partial word chunk sequences can be found (eg. "brilligman" = *"brillig" + "man"). Without such an algorithm, these difficult-to-parse compound words would have to be processed using the grapheme-to-phoneme (GTP) rules. The problem is that compound words are very likely to be poorly processed by the GTP rules system, particularly when the orthographic representation of the first element ends in a "silent e".
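The planned text chunking step can be sketched as a recursive search for a complete sequence of dictionary words. This is a hypothetical illustration only (the module described above does not yet exist); the function name `split_compound`, the `min_len` parameter and the tiny lexicon are assumptions made for the example.

```python
def split_compound(word, lexicon, min_len=2):
    """Return a list of dictionary words that together spell word, or None."""
    if word in lexicon:
        return [word]
    # Try every split point, preferring the longest first chunk.
    for i in range(len(word) - min_len, min_len - 1, -1):
        head = word[:i]
        if head in lexicon:
            rest = split_compound(word[i:], lexicon, min_len)
            if rest is not None:
                return [head] + rest
    return None


lexicon = {"store", "man", "ware", "house"}
print(split_compound("storeman", lexicon))    # ['store', 'man']
print(split_compound("brilligman", lexicon))  # None
```

The second call fails because only a partial chunk sequence ("man") can be found, which is exactly the harder case described above.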

Another problem that a pronunciation dictionary needs to deal with is the homograph problem. There are three main types of words with more than one pronunciation. The first is the normal (semantic) homograph, such as "bow", where the word has more than one pronunciation and where the choice of pronunciation is dependent upon the word's meaning. The second is the grammar-based stress alternation (eg. "REcord" versus "reCORD"). Currently, there is no facility in MU-Talk to choose between such alternate pronunciations, and the first pronunciation is selected by default.

The third type is the set of words whose pronunciation depends upon the sociolect (eg. broad, general or cultivated Australian English) that the person speaks (eg. "dance", "either"). These variants are considered to be variations within a single dialect. In MU-Talk the simple solution has been to define the synthesised speaker as a speaker of General Australian English. This means that the appropriate pronunciation for this variety of Australian English has been pre-selected. This is a perfectly valid approach to this particular problem as a TTS system voice is a model of a single speaker.

There are a very large number of words in the dictionary with multiple pronunciations, but the vast majority of these are minor stress variants (mostly a matter of personal choice). At the time of writing this document, only a single pronunciation for such words has been selected and the remaining pronunciations have been deleted from the dictionary (this is clearly not the preferred option for a speech recognition system, where as many pronunciations as possible need to be predicted).

There is one further, and very major, source of multiple pronunciations: the speaker's English dialect. There are large pronunciation differences between dialects (eg. Australian English and American English) and between standard dialects and non-standard pronunciations (such as those of migrants). These issues lie outside the scope of MU-Talk, which is currently an exclusively Australian English TTS system. A system for American English, for example, would need to have its own dictionary and rules. Whilst it could share some of the features of MU-Talk's GTP rules and dictionary, large parts of it would require considerable modification.

Full Entries for the First Few Headwords in the MU-Talk TTS Dictionary

Key to Data Fields

  1. The first field is the headword field and is terminated by "|".
  2. The second field is a work-in-progress field which indicates the current status of each entry with respect to morphological analysis. The codes mean:-
    • "%reg" - This headword and its pronunciation(s) can be derived automatically from another headword and one or more affixes using regular rules. Words so marked are deleted from the working TTS dictionary because they are redundant.
    • "%irr" - This headword has been determined to be irregular, in that its root morpheme has been identified, but either the orthographic or the phonemic form can't be derived regularly from that root morpheme plus one or more affixes. Words so marked are retained in the working TTS dictionary as they cannot be derived by regular rules.
    • "%hp" - This headword has been "hand processed" and it has been determined that this headword is a single root morpheme.
    • "%nhp" - This means "not hand processed" and it indicates that the word has been examined by a morphological algorithm which has determined that it cannot be derived from some other headword. The accuracy of this algorithm depends upon the accuracy of the affix database and related affix stripping rules.
    • "%hom" - All words so marked have been hand-processed and have been determined to be either homographs or words which undergo grammatical stress alternation.
    • "%todo" - The status of this headword has not yet been determined. Words so marked remain in the working TTS dictionary until their status can be determined (by a person). This label is automatically applied to headwords whose orthographic representation can be derived from another headword by regular rules, but whose phonemic representation cannot.
  3. When "#1", "#2", ... markers exist, this entry is also marked "%hom" (see above). Each sub-entry following a "#1" (etc.) indicates the pronunciation, part-of-speech, and root word for this sub-entry.
  4. The "\P" field is the pronunciation field. One or more pronunciations can be found enclosed in square brackets. The pronunciations utilise ANDOSL machine-readable phoneme symbols.
  5. The "\G" field is the "part-of-speech" (POS) or grammatical category field. Many words have more than one such label. At present there is no indication of the probability of each of the POS labels and sometimes rare possibilities are listed along with a very common POS for a particular headword. Probabilistic POS labelling using a very large labelled corpus may occur some time in the future. Note that the affix labels in the "\R" field (see item 6) do not necessarily mean that the affix itself is used, but that this headword is semantically equivalent to the root word plus the affix. For example, "%suf -s" can also mean any other, irregular, plural noun or present tense verb form. Currently, the following POS labels are used:-
    • "adj." - adjective
    • "adv." - adverb
    • "art." - article
    • "aux." - auxiliary
    • "conj." - conjunction
    • "interj." - interjection
    • "n." - noun
    • "prep." - preposition
    • "prn." - pronoun
    • "v." - verb
  6. The final field is the "\R" field. The "R" in this label refers to the root word for this headword. Many words are self-referencing in that they are their own root word. The morphological analysis of each headword is always enclosed in round brackets. The suggested root morpheme comes first, followed (optionally) by a "%pre" and a "%suf" tag referring to prefixes and suffixes, respectively. Multiple prefixes and suffixes are space separated. Prefixes are followed by a "-" (e.g. "un-") and suffixes are preceded by a "-" (e.g. "-less").
a| %nhp. \P [ei] %red. [@] \G adj. art. n. \R (a)
aardvark| %nhp. \P ['a:dva:k] \G n. \R (aardvark)
aardwolf| %nhp. \P ['a:dwUlf] \G n. \R (aardwolf)
aardwolves| %irr. \P ['a:dwUlvz] \G n. \R (aardwolf %suf -s)
aaron| %nhp. \P [e:r@n] \G n. \R (aaron)
aaronic| %irr. \P [e:'rOnIk] \G adj. \R (aaron %suf -ic)
ab| %nhp. \P [Ab] \G prep. \R (ab)
aba| %nhp. \P ['Ab@] \G n. \R (aba)
abaca| %nhp. \P ['Ab@k@] \G n. \R (abaca)
abaci| %irr. \P ['Ab@si:] \G n. \R (abacus %suf -s)
aback| %nhp. \P [@'bAk] \G adv. \R (aback)
abacus| %nhp. \P ['Ab@k@s] \G n. \R (abacus)
abaft| %nhp. \P [@'ba:ft] \G prep. adv. \R (abaft)
abalone| %hp. \P [Ab@'l@uni:] \G n. \R (abalone)
abandon| %nhp. \P [@'bAnd@n] \G v. n. \R (abandon)
abandoned| %reg. \P [@'bAnd@nd] \G adj. \R (abandon %suf -ed)
abandonee| %todo. \P [@bAnd@'ni:] \G n. \R (abandon %suf -ee)
abandoner| %reg. \P [@'bAnd@n@] \G n. \R (abandon %suf -er)
abandonment| %reg. \P [@'bAnd@nm@nt] \G n. \R (abandon %suf -ment)
abase| %nhp. \P [@'beis] \G v. \R (abase)
abased| %reg. \P [@'beist] \G v. adj. \R (abase %suf -ed)
abasement| %reg. \P [@'beism@nt] \G n. \R (abase %suf -ment)
abaser| %reg. \P [@'beis@] \G n. \R (abase %suf -er)
abash| %nhp. \P [@'bAS] \G v. \R (abash)
abashment| %reg. \P [@'bASm@nt] \G n. \R (abash %suf -ment)
abasing| %reg. \P [@'beisIN] \G v. adj. \R (abase %suf -ing)
abatable| %todo. \P [@'beit@b@l] \G adj. \R (abate %suf -able)
abate| %hp. \P [@'beit] \G v. \R (abate)
abated| %reg. \P [@'beit@d] \G v. adj. \R (abate %suf -ed)
abatement| %reg. \P [@'beitm@nt] \G n. \R (abate %suf -ment)
abater| %reg. \P [@'beit@] \G n. \R (abate %suf -er)
abating| %reg. \P [@'beitIN] \G v. adj. \R (abate %suf -ing)
abatis| %hp. \P ['Ab@t@s Ab@'ti:] \G n. \R (abatis)
abator| %reg. \P [@'beit@] \G n. \R (abate %suf -er)
abattoirs| %nhp. \P ['Ab@twa:z 'Ab@to:z] \G n. \R (abattoirs)
abaxial| %nhp. \P [Ab'Aksi:@l] \G adj. \R (abaxial)
abb| %nhp. \P [Ab] \G n. \R (abb)
abba| %nhp. \P ['Ab@] \G n. \R (abba)
abbacies| %reg. \P ['Ab@si:z] \G n. \R (abbacy %suf -s)
abbacy| %nhp. \P ['Ab@si:] \G n. \R (abbacy)
abbatial| %hp. \P [@'beiS@l] \G adj. \R (abbatial)
abbess| %hp. \P ['AbEs] \G n. \R (abbess)
abbevillian| %nhp. \P [Ab@'vIli:@n Ab@'vIlj@n] \G adj. \R (abbevillian)
abbey| %nhp. \P ['Abi:] \G n. \R (abbey)
abbeys| %reg. \P ['Abi:z] \G n. \R (abbey %suf -s)
abbot| %nhp. \P ['Ab@t] \G n. \R (abbot)
abbotric| %nhp. \P ['Ab@trIk] \G n. \R (abbotric)
abbotship| %reg. \P ['Ab@tSIp] \G n. \R (abbot %suf -ship)
abbreviate| %hp. \P [@'bri:vi:eit] \G v. \R (abbreviate)
abbreviated| %reg. \P [@'bri:vi:eit@d] \G v. adj. \R (abbreviate %suf -ed)
abbreviating| %reg. \P [@'bri:vi:eitIN] \G v. adj. \R (abbreviate %suf -ing)
abbreviation| %irr. \P [@bri:vi:'eiS@n] \G n. \R (abbreviate %suf -ion)
abbreviator| %reg. \P [@'bri:vi:eit@] \G n. \R (abbreviate %suf -er)
abdicable| %nhp. \P [@b'dIk@b@l] \G adj. \R (abdicable)
abdicant| %nhp. \P [@b'dIk@k@nt] \G n. \R (abdicant)
abdicate| %nhp. \P ['Abd@keit] \G v. \R (abdicate)
abdicated| %reg. \P ['Abd@keit@d] \G v. adj. \R (abdicate %suf -ed)
abdicating| %reg. \P ['Abd@keitIN] \G v. adj. \R (abdicate %suf -ing)
abdication| %irr. \P [Abd@'keiS@n] \G n. \R (abdicate %suf -ion)
abdicative| %irr. \P [@b'dIk@tIv] \G adj. \R (abdicate %suf -ive)
abdicator| %reg. \P ['Abd@keit@] \G n. \R (abdicate %suf -er)
abdomen| %hp. \P ['Abd@m@n @b'd@um@n] \G n. \R (abdomen)
abdominal| %nhp. \P [@b'dOm@n@l Ab'dOm@n@l] \G adj. \R (abdominal)
abdominally| %reg. \P [@b'dOm@n@li: Ab'dOm@n@li:] \G adv. \R (abdominal %suf -ly)
abdominous| %nhp. \P [@b'dOm@n@s Ab'dOm@n@s] \G adj. \R (abdominous)
abduce| %nhp. \P [@b'dju:s Ab'dju:s] \G v. \R (abduce)
abduced| %reg. \P [@b'dju:st Ab'dju:st] \G v. adj. \R (abduce %suf -ed)
abducent| %nhp. \P [@b'dju:s@nt Ab'dju:s@nt] \G adj. \R (abducent)
abducing| %reg. \P [@b'dju:sIN Ab'dju:sIN] \G v. adj. \R (abduce %suf -ing)
abduct| %nhp. \P [@b'dVkt Ab'dVkt] \G v. \R (abduct)
abduction| %reg. \P [@b'dVkS@n Ab'dVkS@n] \G n. \R (abduct %suf -ion)
abductor| %reg. \P [@b'dVkt@ Ab'dVkt@] \G n. \R (abduct %suf -er)
abeam| %nhp. \P [@'bi:m] \G adv. \R (abeam)
abecedarian| %nhp. \P [eibi:si:'de:ri:@n] \G n. adj. \R (abecedarian)
abecedaries| %reg. \P [eibi:'si:d@ri:z] \G n. \R (abecedary %suf -s)
abecedary| %nhp. \P [eibi:'si:d@ri:] \G n. \R (abecedary)
abed| %hp. \P [@'bEd] \G adv. \R (abed)
abele| %nhp. \P [@'bi:l 'eib@l] \G n. \R (abele)
abelia| %nhp. \P [@'bi:li:@] \G n. \R (abelia)
abelmosk| %nhp. \P ['eib@lmOsk] \G n. \R (abelmosk)
aberdeen| %nhp. \P [Ab@di:n] \G n. \R (aberdeen)
aberrance| %nhp. \P ['Ab@r@ns @'bEr@ns] \G n. \R (aberrance)
aberrancy| %reg. \P ['Ab@r@nsi: @'bEr@nsi:] \G n. \R (aberrance %suf -y)
aberrant| %hp. \P ['Ab@r@nt @'bEr@nt] \G adj. \R (aberrant)
aberration| %hp. \P [Ab@'reiS@n] \G n. \R (aberration)
aberrational| %reg. \P [Ab@'reiS@n@l] \G adj. \R (aberration %suf -al)
abet| %nhp. \P [@'bEt] \G v. \R (abet)
abetment| %reg. \P [@'bEtm@nt] \G n. \R (abet %suf -ment)
abetted| %reg. \P [@'bEt@d] \G v. adj. \R (abet %suf -ed)
abetter| %reg. \P [@'bEt@] \G n. \R (abet %suf -er)
abetting| %reg. \P [@'bEtIN] \G v. adj. \R (abet %suf -ing)
abeyance| %nhp. \P [@'bei@ns] \G n. \R (abeyance)
abeyant| %nhp. \P [@'bei@nt] \G adj. \R (abeyant)
abhor| %nhp. \P [@b'ho:] \G v. \R (abhor)
abhorred| %reg. \P [@b'ho:d] \G v. adj. \R (abhor %suf -ed)
abhorrence| %todo. \P [@b'hOr@ns] \G n. \R (abhor %suf -ence)
abhorrent| %hp. \P [@b'hOr@nt] \G adj. \R (abhorrent)
abhorrently| %reg. \P [@b'hOr@ntli:] \G adv. \R (abhorrent %suf -ly)
abhorrer| %reg. \P [@b'ho:r@] \G n. \R (abhor %suf -er)
abhorring| %reg. \P [@b'ho:rIN] \G v. adj. \R (abhor %suf -ing)
abidance| %reg. \P [@'baid@ns] \G n. \R (abide %suf -ance)
abide| %nhp. \P [@'baid] \G v. \R (abide)
abided| %reg. \P [@'baid@d] \G v. adj. \R (abide %suf -ed)
abider| %reg. \P [@'baid@] \G n. \R (abide %suf -er)
abiding| %reg. \P [@'baidIN] \G v. adj. \R (abide %suf -ing)
abidingly| %reg. \P [@'baidINli:] \G adv. \R (abide %suf -ing -ly)
abidingness| %reg. \P [@'baidINn@s] \G n. \R (abide %suf -ing -ness)
abietic| %nhp. \P [Abi:'EtIk] \G n. \R (abietic)
abigail| %nhp. \P ['Ab@geil] \G n. \R (abigail)
abilities| %reg. \P [@'bIl@ti:z] \G n. \R (ability %suf -s)
ability| %hp. \P [@'bIl@ti:] \G n. \R (ability)
abiogenesis| %nhp. \P [.eibai@u'dZEn@s@s] \G n. \R (abiogenesis)
abiogenetic| %nhp. \P [ei.bai@udZ@'nEtIk] \G adj. \R (abiogenetic)
abiogenetically| %reg. \P [ei.bai@udZ@'nEtIkli:] \G adv. \R (abiogenetic %suf -ally)
abiogenist| %nhp. \P [eibai'OdZ@n@st] \G n. \R (abiogenist)
abirritant| %nhp. \P [Ab'Ir@t@nt] \G n. adj. \R (abirritant)
abirritate| %nhp. \P [Ab'Ir@teit] \G v. \R (abirritate)
abirritated| %reg. \P [Ab'Ir@teit@d] \G v. adj. \R (abirritate %suf -ed)
abirritating| %reg. \P [Ab'Ir@teitIN] \G v. adj. \R (abirritate %suf -ing)
abirritation| %irr. \P [Ab.Ir@'teiS@n] \G n. \R (abirritate %suf -ion)
abject| %nhp. \P ['AbdZEkt] \G adj. \R (abject)
abjection| %irr. \P [Ab'dZEkS@n] \G n. \R (abject %suf -ion)
abjectly| %reg. \P ['AbdZEktli:] \G adv. \R (abject %suf -ly)
abjectness| %reg. \P ['AbdZEktn@s] \G n. \R (abject %suf -ness)
abjuration| %irr. \P [AbdZ@'reiS@n] \G n. \R (abjure %suf -ation)
abjuratory| %todo. \P [@b'dZu:@r@tri:] \G adj. \R (abjure %suf -ate -ory)
abjure| %nhp. \P [@b'dZu:@] \G v. \R (abjure)
abjured| %reg. \P [@b'dZu:@d] \G v. adj. \R (abjure %suf -ed)
abjurer| %reg. \P [@b'dZu:@r@] \G n. \R (abjure %suf -er)
abjuring| %reg. \P [@b'dZu:@rIN] \G v. adj. \R (abjure %suf -ing)
ablactate| %nhp. \P [Ab'lAkteit] \G v. \R (ablactate)
ablactated| %reg. \P [Ab'lAkteit@d] \G v. adj. \R (ablactate %suf -ed)
ablactating| %reg. \P [Ab'lAkteitIN] \G v. adj. \R (ablactate %suf -ing)
ablactation| %irr. \P [.AblAk'teiS@n] \G n. \R (ablactate %suf -ion)
ablate| %hp. \P [@'bleit] \G v. \R (ablate)
ablated| %reg. \P [@'bleit@d] \G v. adj. \R (ablate %suf -ed)
ablating| %reg. \P [@'bleitIN] \G v. adj. \R (ablate %suf -ing)
ablation| %todo. \P [@'bleiS@n] \G n. \R (ablate %suf -ion)
ablative| %hom. #1 \P ['Abl@tIv] \G adj. n. \R (ablative) #2 \P [@'bleitIv] \G adj. n. \R (ablative)
ablator| %todo. \P [Ab'leit@] \G n. \R (ablate %suf -er)
ablaut| %nhp. \P ['Ablaut] \G n. \R (ablaut)
ablaze| %nhp. \P [@'bleiz] \G adj. adv. \R (ablaze)
able| %nhp. \P ['eib@l] \G adj. \R (able)
ablegate| %hp. \P ['Abl@geit] \G n. \R (ablegate)
abler| %reg. \P ['eib@l@] \G adj. \R (able %suf -er)
ablest| %reg. \P ['eib@l@st] \G adj. \R (able %suf -est)
abloom| %nhp. \P [@'blu:m] \G adj. adv. \R (abloom)
abluent| %nhp. \P ['Ablu:@nt] \G adj. n. \R (abluent)
ablution| %nhp. \P [@'blu:S@n] \G n. \R (ablution)
ablutionary| %reg. \P [@'blu:S@n@ri:] \G adj. \R (ablution %suf -ary)
ably| %reg. \P ['eibli:] \G adv. \R (able %suf -y)
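
The field layout described in the key can be read mechanically. The following is a minimal sketch of a parser for a single entry line, written against the sample entries above; the names `parse_entry`, `default_pronunciation` and `SUBENTRY` are hypothetical and this is not MU-Talk's own code. It ignores the "#1", "#2" numbering (sub-entries are simply returned in order) and any marker that appears between pronunciation brackets, such as the "%red." in the entry for "a".

```python
import re

# One sub-entry: \P [pronunciations] \G pos-labels \R (root analysis)
SUBENTRY = re.compile(
    r"\\P\s*(?P<p>.*?)\s*\\G\s*(?P<g>.*?)\s*\\R\s*\((?P<r>[^)]*)\)"
)


def parse_entry(line):
    """Split one dictionary line into headword, status code and sub-entries."""
    headword, rest = line.split("|", 1)
    status = re.match(r"\s*%(\w+)\.", rest).group(1)
    subentries = []
    for m in SUBENTRY.finditer(rest):
        prons = []
        # Each [...] group may hold several space-separated pronunciations.
        for group in re.findall(r"\[([^\]]*)\]", m.group("p")):
            prons.extend(group.split())
        subentries.append({
            "pronunciations": prons,
            "pos": m.group("g").split(),
            "root": m.group("r"),
        })
    return {
        "headword": headword.strip(),
        "status": status,
        "subentries": subentries,
    }


def default_pronunciation(entry):
    """Pick the first listed pronunciation, mirroring the current default."""
    return entry["subentries"][0]["pronunciations"][0]


entry = parse_entry(r"abatis| %hp. \P ['Ab@t@s Ab@'ti:] \G n. \R (abatis)")
print(entry["status"])               # hp
print(default_pronunciation(entry))  # 'Ab@t@s
```

For a "%hom" entry such as "ablative" the same function returns two sub-entries, one for each numbered pronunciation.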

Lexicon Root Word Field

The lexicon root word field is the most recently added field to the MU-Talk dictionary. This field provides the GTP dictionary with a great deal of morphological information.

Listed below is an outline of the treatment of different types of word in the lexicon root word field ("\R").

a) Simple entry

  1. single suffix: \R1 (rootword %suf -er)
  2. multiple suffixes: \R1 (rootword %suf -er-s)
  3. single prefix: \R1 (rootword %pre un-)
  4. multiple prefixes: \R1 (rootword %pre anti-dis)
  5. suffix and prefix: \R1 (rootword %pre un- %suf -er)

eg. antidisestablishmentarianism \R1 (establish %pre anti-dis %suf -ment-ary-an-ism)

(Note that since "-an" is not a currently utilised suffix, the actual entry would probably be
\R1 (establishmentarian %pre anti-dis %suf -ism).)

b) Modified root words

  1. women \R1 (woman %suf ~s)
  2. feet \R1 (foot %suf ~s)

Notes:

  1. %suf ~s means semantically equivalent to -s suffix
  2. in both cases the \G field would be \G n. and so the ~s would be plural rather than present tense.

(cf. foots| ... \G v. \R1 (foot %suf -s) -- present tense as in "he foots the bill")

c) Compound word

  1. uninflected: storewoman \R2 (store) & (woman)
  2. inflected: warehouses \R2 (ware) & (house %suf -s)
  3. modified: storewomen \R2 (store) & (woman %suf ~s)

Notes:

  1. \R2 indicates 2 root words
  2. "&" indicates that both root words are joined (ie. compounded)

d) Self-referencing root words

Simple root words: eg. run \R1 (run)

e) Referring to two root words

eg. banner| \P [bAn@] \G n. \R2 (banner) ! (ban %suf -er)

nb. "!" means "OR"

f) Headwords with multiple entries

eg. bow| #1 \P [b@u] \G ... \R1 (bow <1>) #2 \P [bau] \G ... \R1 (bow <2>)

Notes:

  1. #1 and #2 are the two types of bow
  2. each refers back to itself as a root word
  3. the self reference points back to the appropriately numbered entry eg. #1 points back to itself via <1>

g) Inflected multiple entries

Examples:

  1. flower| #1 \P [flau@] \G ... \R1 (flower) #2 \P [fl@u@] \G ... \R1 (flow %suf -er)
  2. bows| #1 \P [b@uz] \G ... \R1 (bow <1> %suf -s) #2 \P [bauz] \G ... \R1 (bow <2> %suf -s)
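
The notation outlined in (a) to (g) is regular enough to be parsed with a short routine. The following is a minimal sketch, assuming the bracketed text after "\R1" or "\R2" has already been isolated; the names `parse_root_field` and `PART` and the returned dictionary layout are assumptions made for this example. It accepts both the joined affix style used in this outline ("-er-s") and the space-separated style used in the dictionary entries ("-ing -ly").

```python
import re

# One bracketed analysis: (root <sense> %pre prefixes %suf suffixes)
PART = re.compile(
    r"\(\s*(?P<root>[a-z]+)"
    r"(?:\s*<(?P<sense>\d+)>)?"
    r"(?:\s*%pre\s+(?P<pre>[^%)]*))?"
    r"(?:\s*%suf\s+(?P<suf>[^%)]*))?"
    r"\s*\)"
)


def parse_root_field(text):
    """Parse the body of a root word field into a small dictionary."""
    parts = []
    for m in PART.finditer(text):
        parts.append({
            "root": m.group("root"),
            # <n> points at sub-entry #n of a multiple-entry root word.
            "sense": int(m.group("sense")) if m.group("sense") else None,
            "prefixes": re.findall(r"[a-z]+", m.group("pre") or ""),
            # "-s" is a literal suffix, "~s" a semantic equivalent of it.
            "suffixes": re.findall(r"[-~][a-z]+", m.group("suf") or ""),
        })
    if "&" in text:
        relation = "compound"      # root words joined together
    elif "!" in text:
        relation = "alternatives"  # "!" means OR
    else:
        relation = "single"
    return {"relation": relation, "parts": parts}


print(parse_root_field("(store) & (woman %suf ~s)")["relation"])    # compound
print(parse_root_field("(bow <1> %suf -s)")["parts"][0]["sense"])   # 1
```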

Root Word and Corpus Count File

At some time in the future it is intended that English corpus word-count frequencies be added to the dictionary database from which the TTS dictionary is generated. These corpus counts can be very helpful in evaluating Grapheme-to-Phoneme system performance in a way that relates to actual word usage.

It is proposed that the whole dictionary database be converted into a relational database and that one of the database tables (the "root word file") should contain word-count data in a format similar to the following proposal.

The root word file would contain root words as headwords followed by derived/inflected form pointer fields. A possible way of structuring this information is as follows:

  1. run| 126/222 %s (runs) 372/288 %ed ran 425/123 %ing running 126/155 %er runner 29/10 %other (runless) 5/0 %total 999/888
  2. bow| #1 24/22 %s 5/2 %total 29/24 #2 12/5 %total 12/5
  3. foot| 33/31 %s feet 22/19 foots 1/0 %total 56/50

Notes:

  1. The numbers represent the corpus counts for this word.
  2. There is provision for >1 count value representing >1 corpus. The origin of the various numbers will need to be carefully documented as the corpus is not explicitly indicated. If there is a count for a word from one corpus and not from another then the number "0" should be entered so that the corpus fields can be correctly interpreted.
  3. The first number is the count for the root word.
  4. The %total field shows the total count for all derived, inflected and root forms associated with that headword.
  5. Eventually there may be different counts for words such as "bow", but currently such words will use the simpler "A" format.
  6. If a field contains a word, that means that it is the derived form and can be found in one of the databases. If such a field contains >1 word then that means that there are >1 word types for that field (eg. feet/foots).
  7. If a field contains a number and a word enclosed in ( ), then that means that there is a count for that word but that the word does not occur in any of the lexical databases.

Legal Morphological Tags in the Current Corpus Count File Proposal

  1. %s (includes -es and any plural or simple present form)
  2. %ed
  3. %ing
  4. %er (includes -or and includes comparative)
  5. %est
  6. %other (all other suffixes, prefixes and compound words)
  7. %total (total corpus count for root word plus all derived/inflected forms)
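
As an illustration of how the proposed record format could be read and checked, the following is a minimal sketch that parses a single-sense record (such as the "foot" example above) and verifies that the %total field equals the sum of the other counts. The names `parse_counts` and `totals_consistent` are hypothetical, and since the format itself is only a proposal this is not an existing MU-Talk tool. Records with "#1", "#2" sub-entries (such as "bow") are not handled, and words enclosed in ( ) are kept with their brackets.

```python
import re

# A count token: one number per corpus, separated by "/" (eg. 33/31).
COUNT = re.compile(r"^\d+(/\d+)*$")


def parse_counts(record):
    """Return (headword, fields), where fields maps tag -> [(word, counts)]."""
    headword, rest = record.split("|", 1)
    fields = {"root": []}
    tag, word = "root", None
    for tok in rest.split():
        if tok.startswith("%"):
            tag, word = tok[1:], None
            fields.setdefault(tag, [])
        elif COUNT.match(tok):
            counts = [int(n) for n in tok.split("/")]
            fields[tag].append((word, counts))
            word = None
        else:
            word = tok  # a derived or inflected form preceding its count
    return headword.strip(), fields


def totals_consistent(fields):
    """Check that %total equals the per-corpus sum of all other counts."""
    rows = [c for tag, items in fields.items() if tag != "total"
            for _, c in items]
    summed = [sum(col) for col in zip(*rows)]
    return summed == fields["total"][0][1]


head, fields = parse_counts("foot| 33/31 %s feet 22/19 foots 1/0 %total 56/50")
print(fields["s"])                # [('feet', [22, 19]), ('foots', [1, 0])]
print(totals_consistent(fields))  # True
```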