Department of Linguistics

Robert Mannell

Speech Intelligibility Testing

Much of the work on speech intelligibility testing has had one of two aims: the assessment of hearing disorders (speech audiometry), or the assessment of speech synthesis and speech transmission systems. This paper examines speech intelligibility tests in general, without focusing on specific tests used in speech audiometry.

Speech Quality and Speech Intelligibility

Some writers (eg. Nakatani and Dukes, 1972) have described intelligibility as just one of many attributes which affect speech quality, adding that "...high intelligibility is necessary but not sufficient to assure that a speech sample is of high quality" (ibid, p1083). This conclusion appears to have been reached because of the lack of sensitivity of certain intelligibility tests that were then current. The situation would often arise that two high quality synthetic speech samples would both give very high or even perfect scores. In other words, their intelligibilities under optimal conditions were identical. Since intelligibility tests were commonly used by speech synthesis system designers as a measure of the quality of their systems, it might be concluded that the systems were of equal quality. This conclusion was not in accord with that of a group of tests which Nakatani and Dukes called "subjective scaling methods" (ibid) in which listeners were asked to rate the speech on a scale of one or more of "preference, comprehensibility, and naturalness" (ibid). Using these tests, systems which gave equal intelligibility scores often gave different quality scores. It seems more likely that certain features which define speech quality might also define speech intelligibility, and that the degree to which these various features contribute to one or the other would also vary. For example, it is likely that a listener would find speech with a familiar quality more intelligible than, and perhaps preferable to, speech with an unfamiliar quality.

Voiers (1980) examined the interdependencies between speech quality and intelligibility as measured by his Diagnostic Acceptability Measure and his Diagnostic Rhyme Test. He found "...that overall acceptability or "quality" is heavily but not totally dependent on measured intelligibility, and, moreover, that the discrepancies between results of intelligibility measures and acceptability can be attributed to a limited number of systematic factors rather than to chance or to the unreliability of our measurements of intelligibility and acceptability." (ibid, p705) According to this view it is reasonable to consider the quality of a synthetic system to be a measure of its naturalness. Further, naturalness might be considered to be the degree to which a speech synthesis system is able to model the target natural speech, or the degree to which it is degraded spectrally and temporally. Such degradations will have differing effects on the intelligibility of the speech. In hearing-impaired listeners the situation is quite complex, but it is clear that hearing loss causes a degradation in both the quality and the intelligibility of speech. For the hearing impaired, however, the major issue is not one of "naturalness", as it is with speech synthesis systems, but is especially one of intelligibility. Nevertheless, it is not uncommon for a hearing aid recipient to adjust a hearing aid's settings on the basis of preferred quality in spite of there being another setting that is likely to result in better intelligibility.


Egan (1948) recommended extensive training of listeners until their performance levels out, in order to control the effects of listener familiarity with the Harvard PB lists. Moser and Dreher (1955) studied the effects of listener training, and found that intelligibility test results are highly sensitive to training, that subject responses grow more stable with training, and that when a small number of subjects (eg. < 10) are used, training is essential for valid results. Miller and Nicely (1955) showed that some discriminations required for phoneme recognition are more difficult than others. This led Voiers (1977b) to suggest that familiarisation training might lead to unequal familiarity with certain features and thus desensitise the test to certain deficiencies in a system under test. These findings are of some relevance to the hearing impaired. For example, a typical hearing impaired client can be considered a trained listener, in the sense used above, as he or she is intimately familiar with the perception of speech utilising their own speech transmission channel, their impaired auditory system. The only times when they would behave similarly to an untrained listener in a speech technology evaluation study are immediately after the fitting of a new hearing aid or cochlear implant, or after the adoption of a new speech processing strategy in an existing auditory prosthesis. This may explain the desire of some clients to reset their prosthesis to a setting with a preferred speech quality rather than a setting which audiometry and psychoacoustics would predict to result in better intelligibility. The familiar, the hearing experience to which they are accustomed, may initially be preferable to the more intelligible condition.

Linguistic Effects

Two major contextual effects influence the listener's response. The first involves the phonological rules of the language and the inter-phonemic constraints that operate within a word. The second is the tendency for listeners to favour those words which occur with the greatest frequency in the language (Howes, 1957). These two effects can operate together as follows. The listener may identify perhaps two or three of the phonemes in a test word. There may be only two or three words in the language with that combination of phonemes and the listener's potential response will be limited to that list (nonsense words will be excluded). The listener is then most likely to select the word in that list which is the most familiar. Schultz (1964) found that even when words have been correctly identified initially, there is a considerable tendency for highly familiar words to be substituted for them. This is of great importance in the development of speech audiometry tests. Words in a list should, as far as possible, all be highly familiar words. Similar sounding words need to be as close as possible to being equally familiar.
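The way these two effects combine can be sketched in a few lines of Python. This is only an illustration: the mini-lexicon, its transcriptions, its frequency counts and the function name `likely_response` are all invented for the example and are not taken from any of the studies cited above.

```python
# Hypothetical mini-lexicon: word -> (phoneme tuple, relative frequency count).
# The transcriptions and the counts are illustrative assumptions only.
LEXICON = {
    "cat": (("k", "ae", "t"), 900),
    "cap": (("k", "ae", "p"), 300),
    "cab": (("k", "ae", "b"), 120),
    "kit": (("k", "i", "t"), 200),
}


def likely_response(heard):
    """Return the most frequent real word consistent with the phonemes heard.

    `heard` is a tuple of phonemes, with None for each phoneme that the
    listener failed to identify. Nonsense words are excluded because only
    lexicon entries are considered.
    """
    candidates = [
        (freq, word)
        for word, (phonemes, freq) in LEXICON.items()
        if len(phonemes) == len(heard)
        and all(h is None or h == p for h, p in zip(heard, phonemes))
    ]
    if not candidates:
        return None
    # The listener is assumed to pick the most familiar (most frequent) word.
    return max(candidates)[1]


# The final consonant was missed: "cat", "cap" and "cab" all fit,
# and the most frequent of them is chosen.
print(likely_response(("k", "ae", None)))
```

In this toy model a less frequent target such as "cab" would be reported as "cat" whenever its final consonant is missed, which is the kind of bias that equal familiarity within a word list is intended to reduce.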

The longer the test item, the more context it is likely to contain. Hirsh et al (1954) and Rubenstein and Decker (1959) examined the relationship between word length and intelligibility and found that as word length increased from one to three syllables, intelligibility increased. This tendency continues as the test material further increases in length to sentence lists (Giolas, 1966), and to continuous discourse (Giolas, 1963). Added to the phonemic context which occurs at the word level are syntactic and semantic context. The effects of these types of context are complex, and Giolas (1963) was unable to use the results of the word list tests to predict intelligibility of continuous speech with any reliability. Speech audiometry based upon sentences or other longer sequences of speech conflates auditory processing with cognitive and especially linguistic processing. Two hypothetical hearing impaired clients with identical hearing loss may perform differently in clause, sentence or utterance based hearing tests because one is more skilled at utilising contextual and grammatical cues than the other.

Phonetically (or rather phonemically) balanced (PB) word lists are a long established (eg. Egan, 1948) tool for the study of speech intelligibility. They generally contain monosyllabic CVC words that have been selected in such a way that the lists reflect the statistical distribution of the phonemes in that dialect. Most PB word lists have been designed for American English (eg. Tillman and Carhart, 1966), but more recently Clark (1981) produced a set of four PB word lists based on Australian English. PB word lists are often limited to monosyllables because of the effect of word length (syllable number) on intelligibility (Rubenstein and Decker, 1959, see above). Because of the limited size of typical PB word lists, repetition of the list is very likely to lead to the listener learning the list. This problem can be overcome by only presenting the list once, or by training the subjects first so that the effects of learning have levelled out before the actual tests. Once the list is learned, the PB word list is equivalent to a limited response set (effectively a multiple choice test).

The response set is the list of allowable responses in a test and may be limited to a list of words, segments or syllables presented to the subject, or it may be unlimited (or open ended) with responses selectable from (for example) the complete lexicon. Miller et al (1951) examined the effect of tests using different response set sizes ranging from two items to all possible monosyllables. Intelligibility scores were found to decrease as the response set size increased.
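One simple contributor to this trend is guessing: in a closed response set of n alternatives, a listener who hears nothing at all will still score about 1/n on average. The sketch below applies the conventional correction for guessing under the assumption that all alternatives are equally likely to be guessed. It is offered only as an illustration and is not the analysis used by Miller et al (1951); the function names are hypothetical.

```python
def chance_score(set_size):
    """Proportion correct expected from pure guessing in a closed response set."""
    if set_size < 2:
        raise ValueError("a closed response set needs at least two alternatives")
    return 1.0 / set_size


def correct_for_guessing(raw_score, set_size):
    """Rescale a raw proportion correct so that chance maps to 0 and perfect to 1.

    Assumes every alternative is equally likely to be guessed.
    """
    chance = chance_score(set_size)
    return (raw_score - chance) / (1.0 - chance)


# A raw score of 60% means much less in a two-alternative test
# than in a test with a large response set.
for n in (2, 5, 50):
    print(n, round(correct_for_guessing(0.60, n), 3))
```

The correction only accounts for blind guessing. It does not model partial information, where a listener who has heard some of the phonemes can rule out many of the alternatives.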

When a response set is utilised that has more than one uncertainty (eg. PB word lists in which all three phonemes in the CVC word are uncertain) it is very difficult to separate the various complex interactions that affect the test result. When that uncertainty is limited to a single phoneme, as in the Fairbanks Rhyme Test (Fairbanks, 1958), inter-phonemic context is controlled. Such tests are generally monosyllabic, and are usually CV lists or CVC lists. Most tests are designed to examine consonant intelligibility, as consonants are usually more prone to degradation by speech systems. Those tests designed to examine vowel intelligibility often use the long established /h_d/ frame (a varying vowel between an /h/ and a /d/ consonant). Most of these tests have as a response set all consonants (or all vowels). The Modified Rhyme Test (House et al, 1965) restricted the response set to a limited number of consonants. Voiers (1977b), in his Diagnostic Rhyme Test (DRT), limited the response set even further. Instead of limiting the set to a class of consonants, he limited the response set to two consonants which differed from one another by only one feature (using a feature set derived from Miller and Nicely (1955) and Jakobson, Fant and Halle (1952)). He argued that this avoided arbitrary restrictions of the listener's options and thus "...ambiguity as to the specific cause of an erroneous response can be eliminated" (Voiers, 1977b). This is a test for consonants only, and it only utilises the initial position. Pols (1983), in a test which scaled Dutch consonant confusions, found very little difference in the confusion patterns of initial, medial and final consonants.

Masking and Distortion or Filtering of Speech

It is often necessary to sensitise intelligibility tests to help identify potentially vulnerable acoustic cues in the speech signals being tested. Such sensitisation generally involves the masking or systematic distortion of the test items in such a way that the differences in the output of different speech production systems (both natural and synthetic) are increased to a point where they are more readily measurable. This is made necessary because it is commonly found that two systems with discernible quality differences may both give intelligibility scores so close to perfect in noiseless conditions that often, no significant differences between them can be measured (a "ceiling" effect). Sensitising aims to show differences in important information-bearing cues in the signals being compared. In other words, tests which result in close to 100% correct responses are not very sensitive to minor speech transmission differences. With some masking, for example, responses in the region of 20% to 80% might be achieved. Two conditions which resulted in 100% intelligibility utilising unmasked speech might be distinguished (eg. 80% versus 50% intelligibility) when utilising masked speech. In speech audiometry, an analogous result is achieved when presentation level of an unmasked word list is adjusted until a 50% correct response is achieved. Similar results could be achieved by adjusting masking noise levels until the 50% intelligibility point is achieved. Two hearing impaired clients' hearing would be discriminated on the basis of the presentation level or the masking level that each required to produce a 50% intelligibility result.
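The 50% point described above can be estimated from a handful of measured scores. The following is a minimal sketch that uses straight-line interpolation between adjacent measurements; the function name and the two sets of scores are invented for illustration, and a real procedure might instead fit a psychometric function or use an adaptive track.

```python
def level_at_target(points, target=50.0):
    """Estimate the level (dB) at which the score crosses `target` percent.

    `points` is a list of (level_dB, percent_correct) pairs. Adjacent pairs
    are joined by straight lines and the first crossing is returned, or None
    if the scores never reach the target.
    """
    pts = sorted(points)
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        crosses = (y0 - target) * (y1 - target) <= 0
        if crosses and y0 != y1:
            return x0 + (target - y0) * (x1 - x0) / (y1 - y0)
    return None


# Hypothetical scores for two clients at several S/N ratios (dB).
client_a = [(-6, 10), (0, 40), (6, 80), (12, 100)]
client_b = [(0, 5), (6, 30), (12, 70)]

print(level_at_target(client_a))  # crossing lies between 0 and +6 dB
print(level_at_target(client_b))  # crossing lies between +6 and +12 dB
```

With these made-up figures the second client needs a more favourable S/N ratio to reach 50% intelligibility, even though both clients might score close to 100% in quiet.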

Miller (1947) listed three categories of sounds which could interfere with speech: (i) tones (pure and complex), (ii) noises, and (iii) voices. The intelligibility of masked speech (regardless of whether the noise masked all frequencies or only certain bands) was shown to decrease as the intensity of the masking noise increased. He concluded that "...the greatest interference with vocal communication is produced by an uninterrupted noise which provides a relatively constant speech-to-noise ratio (S/N) over the entire range of frequencies involved in human speech" (ibid, pp124-125).

Hirsh et al (1954) examined the effect on natural speech intelligibility as the degree of noise masking is increased, and found that intelligibility decreased rapidly as the level of masking increased beyond 0 dB S/N (s.p.l.), with very poor intelligibility at -13 dB S/N. (nb. 0 dB S/N is when the signal and the noise intensities are identical. Negative S/N ratios occur when the noise intensity exceeds the signal (speech) intensity.)
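The S/N ratio in dB is ten times the base-10 logarithm of the ratio of signal power to noise power. The sketch below shows one way to scale a masker so that it sits at a chosen S/N ratio relative to a signal. The sine tone standing in for speech, the Gaussian noise, the sampling rate and all function names are assumptions made for the example. Power is taken as the long-term mean square of the samples, which is only one of several possible level measures.

```python
import math
import random


def mean_power(samples):
    """Mean square value of a list of samples."""
    return sum(s * s for s in samples) / len(samples)


def snr_db(signal, noise):
    """Signal-to-noise ratio in dB, from long-term mean power."""
    return 10.0 * math.log10(mean_power(signal) / mean_power(noise))


def scale_noise_to_snr(signal, noise, target_snr_db):
    """Return a copy of `noise` scaled so that snr_db(signal, result) equals the target."""
    current = snr_db(signal, noise)
    # An amplitude gain g changes the noise level by 20*log10(g) dB.
    gain = 10.0 ** ((current - target_snr_db) / 20.0)
    return [n * gain for n in noise]


def mix(signal, noise, target_snr_db):
    """Add the scaled masker to the signal, sample by sample."""
    scaled = scale_noise_to_snr(signal, noise, target_snr_db)
    return [s + n for s, n in zip(signal, scaled)]


# Stand-in material: a 200 Hz tone sampled at 8 kHz and seeded Gaussian noise.
rng = random.Random(0)
signal = [math.sin(2 * math.pi * 200 * t / 8000) for t in range(8000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(8000)]

masker_0 = scale_noise_to_snr(signal, noise, 0.0)       # equal signal and noise power
masker_m13 = scale_noise_to_snr(signal, noise, -13.0)   # noise 13 dB above the signal
print(round(snr_db(signal, masker_0), 6), round(snr_db(signal, masker_m13), 6))
```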

Pisoni and Koen (1982) demonstrated that as the S/N ratio decreases, synthetic speech shows a faster decrease in intelligibility than masked natural speech. This is also true for the hearing impaired, even people with very moderate hearing loss. As the noise level rises the intelligibility of speech declines for the hearing impaired more rapidly than it does for the normally hearing.

Clark (1983) used five different levels of noise masking (Quiet, +12 dB S/N, +6 dB S/N, 0 dB S/N, -6 dB S/N) to sensitise intelligibility tests of natural speech, analysis-synthesis speech derived from natural speech using a 12-parameter serial formant synthesiser, and synthesis-by-rule speech. His results for the vowels (in a /h_d/ frame) showed no significant differences between the speech types and the masking levels for quiet, +12 dB S/N, and +6 dB S/N conditions. There were significant drops in intelligibility from +6 dB S/N to 0 dB S/N to -6 dB S/N masking levels. Differences in intelligibility between the speech types only occurred at the -6 dB S/N masking level, where the synthetic vowels were slightly more intelligible than the natural vowels. This result, he felt, would "...probably result from the almost idealised formant structure in the synthesised vocalic sounds being marginally less affected by noise masking than their natural counterparts." (ibid, p43) The intelligibility of the natural consonantal sounds (/Ca/ context), on the other hand, was significantly better than both sets of synthesised consonants for all masking conditions. Further, the natural consonants remained "quite resistant to masking until 0 dB S/N is reached" (ibid, p45) whilst the intelligibility of the synthetic consonants fell rapidly even at +12 dB S/N. These trends were greatest for the synthesis-by-rule set, which is not surprising, for such a process is limited not only by the intrinsic synthesiser limitations (which are also shared by the analysis-synthesis method) but also by the validity of the rule structure and its related data base. Clearly, the limited number of parameters available on such synthesisers is adequate to model vowel structure, but is inadequate for the modelling of consonants.
Further, within the consonants, the perception of nasals and liquids is much less degraded by masking noise than is that of the stops and fricatives, probably because the former rely on vowel-like formant cues whilst the latter rely more on transient high frequency cues, zeros etc. which are not adequately modelled on this type of synthesiser. It is interesting to note that noise masking of the natural speech causes less degradation in the intelligibility of nasals and liquids than of stops and fricatives. This may be due to the use of white noise in which the intensity level at higher frequencies is the same as that at lower frequencies. In general, the low frequency intensity of a vowel nucleus is often greater than the high frequency intensity of an adjacent frication, and so lower intensity high frequency cues may be subjected to greater masking than low frequency cues when white noise is used as the masker.

Busch and Eldredge (1967) examined the effects of white noise and "speech shaped" (-9 dB / octave above 500 Hz) noise on the intelligibility of various consonants. They found that nasals and liquids (/m n ŋ w j r l/) were more severely affected by the speech-shaped noise, with its more intense low frequency components, than they were by white noise. Fricatives were more intelligible in speech-shaped noise than they were in white noise. When both types of noise are set to a certain intensity, the measurements are usually based on the total energy of that signal. Speech-shaped noise will have more intense low frequency components and less intense high frequency components than white noise of the same average intensity and so will have a greater effect on phonemes with low frequency cues than it will on phonemes with high frequency cues.
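The redistribution of energy between the two maskers can be checked numerically. The sketch below builds idealised power spectra for white noise and for noise that is flat up to 500 Hz and falls at 9 dB per octave above it, equalises their total power, and compares band levels. The 8 kHz bandwidth, the 10 Hz frequency grid, the choice of comparison bands and the function names are assumptions made for the example; they are not details reported by Busch and Eldredge.

```python
import math


def shaped_psd(freq_hz, corner_hz=500.0, slope_db_per_octave=-9.0):
    """Relative power density: flat up to the corner, then a fixed dB/octave slope."""
    if freq_hz <= corner_hz:
        return 1.0
    octaves = math.log2(freq_hz / corner_hz)
    return 10.0 ** (slope_db_per_octave * octaves / 10.0)


def band_level_db(psd, freqs, lo_hz, hi_hz):
    """Level in dB of the power summed over bins with lo_hz <= f < hi_hz."""
    power = sum(p for p, f in zip(psd, freqs) if lo_hz <= f < hi_hz)
    return 10.0 * math.log10(power)


# Frequency grid from 10 Hz to 8 kHz in 10 Hz steps (assumed bandwidth).
freqs = [10.0 * k for k in range(1, 801)]
white = [1.0 for _ in freqs]
shaped = [shaped_psd(f) for f in freqs]

# Scale the shaped noise so that both maskers have the same total power.
scale = sum(white) / sum(shaped)
shaped = [p * scale for p in shaped]

# Shaped level minus white level, in dB, in a low band and a high band.
low_diff = band_level_db(shaped, freqs, 0, 500) - band_level_db(white, freqs, 0, 500)
high_diff = band_level_db(shaped, freqs, 4000, 8000) - band_level_db(white, freqs, 4000, 8000)
print(round(low_diff, 1), round(high_diff, 1))
```

Under these assumptions `low_diff` is positive and `high_diff` is negative: at equal total power the shaped masker is more intense than white noise below 500 Hz and less intense in the region where frication cues lie.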


The discussion of the intelligibility tests clearly indicates that the results least contaminated with the effects of other contextual variables are those from tests which use test items that are as short as possible, and are the least familiar lexically. Monosyllabic words, such as those in PB word lists, are therefore preferable to polysyllabic words because of their minimal lexical and phonetic contextuality, and words are preferable to sentences and continuous discourse because they lack syntactic and semantic context. Such contextual effects may not be important in tests of overall system quality, but they make it very difficult to reach an opinion about the effect of adjustments of gross time and frequency resolution on the intelligibility of individual phonetic elements. Rhyme tests claim to reduce the number of phonetic variables being examined at any one time down to a single phonetic group or even distinctive feature. Their disadvantages include the large number of pairs that must be tested to examine all the consonants thoroughly, and the fact that they do not examine vowels. /h_d/ items and CV syllables might be chosen because they allow a smaller set of test words than the rhyme tests, and because they only examine one vowel, or one consonant, per item and therefore none of the other phonemes in each item can provide context which can help the listener identify the phoneme of interest. PB words, on the other hand, are not minimally contrasting (the two consonants and the vowel all vary) and further, PB results in a pilot test of synthesiser performance (Mannell et al, 1985) were shown to be unreliable predictors of phonetic level performance. Nonsense syllables such as /h_d/ and CV tokens, whilst allowing finely tuned phonetic examination of speech transmission performance, are undesirable in most speech audiometry settings.
Phonetically naive listeners often have great difficulty with non-words, and also show highly variable performance with PB word lists which are not controlled for word frequency. Some audiology clients may not be inclined to cooperate when subjected to a stream of nonsense syllables.

Noise masking has been utilised as a sensitisation procedure for some synthesis evaluation procedures (Mannell, 1994). The noise masker used was a speech-shaped noise, which was chosen as a more realistic simulation of real listening conditions. This masker was used to allow sensitised global comparisons between some types of conditions and might also be useful in speech audiometry for global comparisons between clients. Masked tests are difficult to interpret when analysing detailed phonetic confusion patterns, as masking creates its own characteristic confusion patterns which would be conflated with the effects of speech technology transmission methodology or impaired listener auditory transduction.


Details of the books and articles referred to in this paper can be found in:

  1. Mannell, R.H. (1994), bibliography
  2. A bibliography of Readings on Speech Perception, Speech Audiometry and Intelligibility Testing

Other Reading

  1. A more detailed version of this paper (but more focussed on speech technology testing).

The bibliography list (see (2) above) lists papers on this topic grouped by sub-topic. Interested students might wish to look at some of those papers.