
Natural and Synthetic Speech Intelligibility and Quality Testing

Robert H. Mannell
Department of Linguistics
Macquarie University, Sydney, Australia

Original Article: Mannell R.H., (1984), "Natural and synthetic speech intelligibility and quality testing", in Mannell R.H. (1984), Aspects of Speech Synthesis Performance, unpublished Honours dissertation, Macquarie University, Sydney, Australia, Chapter 3, pp 39-61. The introductory and concluding paragraphs have been greatly modified to remove the specific context of the original chapter. Minor editing has been applied to the remaining text of the chapter. Section and figure numbering has been changed and three of the figures have been brought forward from chapter 4.

The term "intelligibility", as it is used in the literature of phonetics and speech technology, refers to a continuum of states ranging from zero intelligibility to very high intelligibility. The intelligibility of speech is the degree to which that speech can be understood. A highly intelligible utterance is an utterance whose words can be correctly identified. In other words, intelligibility and the reliability of lexical access are positively correlated. To take this one step further, the higher the intelligibility of an utterance the more accurately the phonemes in that utterance are identified (a reasonable assumption if one assumes that accurate word recognition relies on the accurate identification of the phonemes in each word). In many of the papers reviewed in this article, accuracy of phoneme identification by listeners is seen to be a direct measure of intelligibility. Note, however, that this view of intelligibility ignores the contribution of prosody to the understanding of the meaning of an utterance.

The term "speech quality", on the other hand, is more resistant to simple definition. The word "quality" implies an aesthetic evaluation of speech, and to a certain extent this is true. Many measures of speech quality seek to determine listener preferences, usually by asking listeners to rank speech tokens from the most preferred to least preferred (often determined from a series of token pair preferences). Such tests of listener preference or opinion are quite common approaches to the testing of speech quality. Evaluation of the quality of synthetic speech often attempts to place speech tokens somewhere on a continuum ranging from the most "natural-sounding" to the least "natural-sounding" speech. Such an approach provides a simple linear, or binary, view of speech quality with a feature similar to "good" at one end and its opposite, similar to "bad", at the other end of the continuum. In this case, the quality "naturalness" is posited as the goal of speech synthesis systems designers (a not unreasonable assumption). The problem with this approach is that "naturalness" is itself not easily defined in ways that would be helpful to speech synthesis systems designers. Helpful definitions of speech naturalness (or quality) would most likely be multi-dimensional rather than linear. Some of the papers reviewed here examine multi-dimensional scales of speech quality.

The relationship between speech intelligibility and speech quality is quite complex, with some indicators of speech quality correlating with speech intelligibility and some being mostly unrelated to speech intelligibility.

Some writers (e.g. Nakatani and Dukes, 1972) have described intelligibility as just one of many attributes which affect speech quality, adding that "... high intelligibility is necessary but not sufficient to assure that a speech sample is of high quality" (ibid., p1083). This conclusion appears to have been reached because of the lack of sensitivity of certain intelligibility tests that were then current. The situation would often arise that two high quality synthetic samples would both give very high or even perfect scores. In other words their intelligibilities under optimal conditions were identical. Since intelligibility tests were commonly used by speech system designers as a measure of the quality of their systems, it might be concluded that the systems were of equal quality. This conclusion was not in accord with that of a group of tests which Nakatani and Dukes called "subjective scaling methods" (ibid.) in which listeners were asked to rate the speech on a scale of one or more of "preference, comprehensibility, and naturalness" (ibid.). Using these tests, systems which gave equal intelligibility scores often gave different quality scores.

It seems more likely that certain features which define speech quality might also define speech intelligibility, and that the degree to which these various features contribute to one or the other would also vary. The degree of variation may also vary from listener to listener. Two major causes of the differences in natural speech quality from speaker to speaker are physiology and habitual articulatory setting. Clearly, the length and area function of different speakers' vocal tracts vary greatly, as does the size of the larynx, and any variation in either source or filter in any acoustic system will affect the resultant acoustic signal. Further, different speakers (even those with very similar vocal tracts) adopt different habitual vocal settings. Laver (1980, p1) describes voice quality (in this latter sense) as "... a cumulative abstraction over a period of time of a speaker-characterizing quality, which is gathered from the momentary and spasmodic fluctuations of short-term articulations used by the speaker for linguistic and paralinguistic communication". For example, certain settings (e.g. nasality) might be used by all members of a certain sociolinguistic group. It is likely that a listener would find a familiar speech quality more intelligible than, and perhaps preferable to, an unfamiliar one.

Voiers (1980) examined the interdependencies between speech quality and intelligibility as measured by his Diagnostic Acceptability Measure and his Diagnostic Rhyme Test (see below). He found "... that overall acceptability or quality is heavily but not totally dependent on measured intelligibility, and, moreover, that the discrepancies between results of intelligibility measures and acceptability can be attributed to a limited number of systematic factors rather than to chance or to the unreliability of our measurements of intelligibility and acceptability."(ibid., p705)

For the present it seems reasonable to consider the quality of a synthetic system to be a measure of its naturalness. Further, naturalness might be considered to be the degree to which a speech synthesis system is able to model the target natural speech, or the degree to which that speech is degraded spectrally and temporally. Such degradations will have differing effects on the intelligibility of the speech.

1 Intelligibility Testing

1.1 Speaker differences

Smith (1979) found highly significant differences between the results of the Diagnostic Rhyme Test (Voiers, 1977b, see 1.3 below) for different speakers.

Silverstein et al (1953) found that the speech of untrained male speakers was more intelligible in the presence of high noise levels than was female speech, but speaker training (familiarisation with the task) eliminated those differences.

Pickett (1956) in a study on the effects of vocal force on intelligibility found "... less than a 5% deterioration in intelligibility over the range from a moderately low voice to a very loud voice ..." but found abrupt decreases in intelligibility as the speech moved from this range to a very soft whisper and to a very loud shout.

Pickett and Pollack (1963) examined the effects of rate of utterance and duration of excerpt on intelligibility. The same list of words was excerpted from three recordings of a text spoken at three rates (very fast, normal, very slow). Intelligibility was shown to increase as sample duration increased (i.e. as rate decreased). The fast samples were synthetically stretched to a length a little greater than that of the slow samples, and were shown to still be less intelligible. They concluded that it is not duration as such that caused all the intelligibility loss, but that in order to articulate quickly, some "speech context" is excluded.

Kahn and Garst (1983) examined the effects of five voice quality characteristics on the quality of LPC speech and found a strong correlation between certain vocal characteristics and a reduction in LPC quality. The LPC model, being an all-pole model, does not handle nasalised segments well. Creaky, whispery and harsh voices often have "... irregularity or noise in the pitch periods which may obscure fundamental frequency or cause poor spectral analysis" (ibid., p531) whilst pitch extreme voices often simply test the pitch range that the system can handle. For the system they were testing, nasality and whisper caused the greatest decrease in quality. For a system which utilises zeros, such as an FIR-filter-driven channel vocoder, the problems caused by nasality may not be as pronounced, but all of the other voice quality effects described by Kahn and Garst should still be applicable. They concluded "... that overall intelligibility for a particular voice is related not to discrete errors but rather to the overall precision with which the system models the speaker's voice." (ibid., p533).

1.2 Listener effects

Voiers (1977b) describes the listener's perception and understanding of the linguistic intent of a speaker as a "dual process". On the one hand it involves discrimination of the acoustic attributes of the speech, and on the other hand it involves the inferences that the listener makes from these acoustic features. The first of these processes, that of discrimination, is affected by the listener's physiological ability (eg. normal or abnormal hearing) and psychological orientation (eg. attention, motivation, boredom etc.). The second process, that of inference, is affected by the listener's knowledge of the language and of various sociolinguistic and dialectal aspects of the utterance as well as the listener's familiarity with the test structure and test materials. This process is probably also affected by the aspects of psychological orientation mentioned above. Egan (1948) recommended extensive training of listeners until their performance levels out, in order to control the effects of listener familiarity with the Harvard PB lists. Moser and Dreher (1955) studied the effects of listener training, and found that intelligibility test results are highly sensitive to training, that subject responses grow more stable with training, and that when a small number of subjects (eg. < 10) are used, training is essential for valid results. Voiers (1977b) argued that such familiarity with the PB word lists might effect "... qualitative changes in the listener's task and, ultimately ... his performance". Miller and Nicely (1955) showed that some discriminations required for phoneme recognition are more difficult than others. This led Voiers (1977b) to suggest that familiarization training might lead to unequal familiarity with certain features and thus desensitize the test to certain deficiencies in a system under test.

Two other contextual effects influence the listener's response. The first involves the phonological rules of the language and the inter-phonemic constraints that operate within a word. The second is the tendency for listeners to favour those words which occur with the greatest frequency in the language (Howes, 1957). These two effects can operate together as follows. The listener may identify perhaps two or three of the phonemes in a test word. There may be only two or three words in the language with that combination of phonemes and the listener's potential response will be limited to that list (nonsense words will be excluded). The listener is then most likely to select the word in that list which is the most familiar. Schultz (1964) found that even when words have been correctly identified initially, there is a considerable tendency for highly familiar words to be substituted for them.
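The interaction of these two effects can be sketched computationally. The following Python fragment (the word list, frequency counts and phoneme notation are invented for illustration; this is not a model from the literature) filters a lexicon down to the words consistent with the phonemes the listener has identified, then selects the most frequent survivor in the spirit of Howes (1957):

```python
# hypothetical frequency counts per million words
lexicon = {"cat": 30, "cab": 5, "cap": 18, "can": 250}

def best_guess(heard):
    """heard maps phoneme position -> the phoneme the listener identified."""
    # phonological constraint: only real words matching the heard phonemes
    candidates = [w for w in lexicon
                  if all(w[i] == ph for i, ph in heard.items())]
    # frequency bias (Howes, 1957): favour the most familiar survivor
    return max(candidates, key=lexicon.get)

# the listener caught only the first two phonemes (crudely written as letters)
guess = best_guess({0: "c", 1: "a"})
```

Here the listener's response is drawn from the constrained candidate set, and the most familiar member wins even though several alternatives fit the acoustic evidence equally well.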

The longer the test item, the more context it is likely to contain. Hirsh et al (1954) and Rubenstein et al (1959) examined the relationship between word length and intelligibility and found that as word length increased from one to three syllables, intelligibility increased. This tendency continues as the test material further increases in length to sentence lists (Giolas, 1966), and to continuous discourse (Giolas and Epstein, 1963). Added to the phonemic content which occurs at the word level, is syntactic and semantic content. The effects of these types of content are complex, and Giolas and Epstein (1963) were unable to use the results of the word list tests to predict the intelligibility of continuous speech with any reliability.

Silverstein et al (1953) found no effect of listener sex on intelligibility scores.

1.3 Intelligibility tests and test protocols

1.3.1 Word and syllable list tests

Phonetically (or rather phonemically) balanced (PB) word lists are a long established (eg. Egan, 1948) tool for the study of speech intelligibility. They generally contain monosyllabic CVC words that have been selected in such a way that the lists reflect the statistical distribution of the phonemes in that dialect. Most PB word lists have been designed for American English, but Clark (1981) produced a set of four PB word lists based on Australian English. These lists contain 50 words each and are derived from an American PB word list (the Northwestern University Auditory Test No. 6, Tillman and Carhart, 1966), taking into account differences in American and Australian English phonology (eg. post-vocalic /r/) and phoneme distribution. PB word lists are often limited to monosyllables because of the effect of word length (syllable number) on intelligibility (Rubenstein et al, 1959, see above). Because of the limited size of typical PB word lists, repetition of the list is very likely to lead to the listener learning the list. This problem can be overcome by only presenting the list once, or by training the subjects first so that the effects of learning have levelled out before the actual tests. Once the list is learned, the PB wordlist is equivalent to a limited response set (effectively a multiple choice test).

The response set is the list of allowable responses in a test and may be limited to a list of words, segments or syllables presented to the subject, or it may be unlimited (or open ended) with responses selectable from (for example) the complete lexicon. Miller et al (1951) examined the effect of tests using different response set sizes ranging from two items to all possible monosyllables. Intelligibility scores were found to decrease as the response set size increased.
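Part of this dependence on response set size is simply chance performance: with a closed set of k alternatives a guessing listener scores 1/k, so raw percent-correct figures from small sets are inflated. The following is a minimal sketch of the standard guessing correction (an illustration of the principle, not a procedure taken from Miller et al):

```python
def chance_corrected(p_raw: float, set_size: int) -> float:
    """Rescale a raw proportion correct so that pure guessing maps to zero."""
    chance = 1.0 / set_size
    return (p_raw - chance) / (1.0 - chance)

# the same raw score of 0.80 means progressively more as the set grows:
# two alternatives -> 0.60, six -> 0.76, fifty -> ~0.80
scores = {k: chance_corrected(0.80, k) for k in (2, 6, 50)}
```

On this view, part (though only part) of the drop in raw scores with larger response sets reflects the shrinking contribution of lucky guesses.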

Haggard (1973) examined the effect of differing response set sizes on the improvement in identification response with repeat presentation. He argued that if repeated presentations allow elimination of alternatives during a response list search, then the larger the list of possible responses, the greater the number of repeat presentations which would show improved response. His results, however, showed no significant variation with response list size (see 1.3.2 below).

When a response set is utilised that has more than one uncertainty (eg. PB word lists in which all three phonemes in the CVC word are uncertain) it is very difficult to separate the various complex interactions that affect the test result. When that uncertainty is limited to a single phoneme, as in the Fairbanks Rhyme Test (Fairbanks, 1958), interphonemic context is controlled. Such tests are generally monosyllabic, and are usually CV lists or CVC lists. Most tests are designed to examine consonant intelligibility, as consonants are usually more prone to degradation by speech systems. Many tests designed to examine vowel intelligibility use the long established /h_d/ frame. Most of these tests have as a response set all consonants (or all vowels). The Modified Rhyme Test (House et al, 1965) restricted the response set to a limited number of consonants. Voiers (1977b) in his Diagnostic Rhyme Test (DRT), limited the response set even further. Instead of limiting the set to a class of consonants, he limited the response set to two consonants which differed from one another by only one feature (using a feature set derived from Miller and Nicely (1955) and Jakobson, Fant and Halle (1952)). He argued that this avoided arbitrary restrictions of the listener's options and thus "... ambiguity as to the specific cause of an erroneous response can be eliminated" (Voiers, 1977b). This is a test for consonants only, and it only utilises the initial position. Pols (1983), in a test which scaled Dutch consonant confusions, found very little difference in the confusion patterns of initial, medial and final consonants.
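The diagnostic value of such a design is that every error can be charged to the single feature distinguishing the pair. The following sketch (the word pairs, feature labels and responses are invented; they are not Voiers' actual DRT materials) shows how errors might be tallied per feature:

```python
from collections import defaultdict

# (stimulus, alternative, distinguishing feature) -- hypothetical items,
# each pair differing in exactly one feature of the initial consonant
trials = [
    ("veal", "feel", "voicing"),
    ("zoo",  "sue",  "voicing"),
    ("meat", "beat", "nasality"),
    ("news", "dues", "nasality"),
]

# hypothetical listener responses, one per trial
responses = ["veal", "sue", "meat", "dues"]

errors = defaultdict(int)
totals = defaultdict(int)
for (stimulus, _alternative, feature), response in zip(trials, responses):
    totals[feature] += 1
    if response != stimulus:
        errors[feature] += 1

# proportion of errors attributable to each feature
error_rates = {f: errors[f] / totals[f] for f in totals}
```

Aggregated over many such pairs, the per-feature error rates localise a system's deficiencies rather than merely reporting an overall score.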

Voiers (1982) took this concept one step further when designing the Diagnostic Discrimination Test (DDT). This test uses the same test material as his DRT but instead of requiring an identification response it requires the listener to judge whether a pair of test words are the same or different. This test is designed to test whether a system has "intrinsic" or "cosmetic" deficiencies with regard to the transmission of the cues for distinctive features. An intrinsic deficiency is the failure to transmit the information whilst a cosmetic deficiency is a failure to correctly reassemble actually transmitted information. If the DDT gives consistently correct results in cases where the DRT gives only chance responses (eg. in spectrum inverted speech) for a set of pairs differing only by one feature, then it is assumed that the speech, although virtually unintelligible, still preserves "... all, or virtually all, of the intelligibility relevant information" (ibid., p1005).

Kryter and Whitman (1965) compared the Harvard PB test with the Fairbanks and the Modified Rhyme tests against variable-level noise masked speech. The results of the two rhyme tests were very similar, and both gave higher intelligibility scores at any masking level than did the PB scores. Williams and Hecker (1968) also compared these three tests, as well as the Harvard sentence test, for their performance with three types of speech distortion (noise-masking, peak-clipping, and vocodering). The degrees of masking and peak clipping, and the error rate of the vocoder were varied to sensitize the tests. They found that the relation between the four types of test results was not constant across the three types of distortion. The results of the two rhyme tests were in all cases very close. Further, the results of the vocoded and the noise masked speech were also very similar. For both the masked and the vocoded speech, the sentence tests provided the highest intelligibility curve, followed by the rhyme tests, and then the PB tests. The results for the clipped speech showed quite a different pattern, however. The highest intelligibility curves for the peak clipped speech were the rhyme tests, followed by the sentences and then the PB words. The rhyme test scores appeared to be independent of the degree of peak clipping, with a flat relationship hovering at around 90% intelligibility. All other relationships showed an intelligibility decay curve with increasing distortion. (see 3.1 below).

1.3.2 Stimulus repetition

Many early studies of the effect of stimulus repetition on intelligibility reported little (Miller et al, 1951, Thwing, 1956) or no (Traul and Black, 1965) improvement in results upon repeated presentation. If any improvement was noted, it usually resulted from the second presentation. In contrast, Pollack (1959) using -15 dB S/N masked items found significant improvements for up to six repetitions. In all these studies, the test items were lexically meaningful natural speech items sensitised to the test by the use of noise and presented to trained listeners. Clark et al (1982) examined the effect of repeated presentation on the intelligibility of nonsense syllable (CV) test items from both natural and synthetic speech, using a lowered presentation level and naive subjects with no prior experience and no training prior to the test. They argued that implicit evidence existed in some previous studies (eg. Miller et al, 1951) "that the greatest intelligibility improvement from repetition occurs with test items having the simplest phonological structure and the least lexical familiarity" (Clark et al, 1982, p94). These two criteria are achieved to a reasonable extent with CV syllables. A presentation level for each speech type was selected which gave about 50% intelligibility at one presentation. The natural speech showed a significant improvement from presentation one to two only, whilst the synthetic speech showed no significant improvements. Natural speech also showed a linear reduction in score variance up to the third presentation, whilst synthetic speech showed no such improvement. Clark concluded that nonsense syllables have less phonological and lexical redundancy than most lexically familiar items (such as those used in most previous tests) and so the listener is forced to rely more heavily on acoustic-parametric cues. Two repetitions are necessary in natural speech (at this presentation level) to provide sufficient cues to make a stable decision.
In synthetic speech, it was assumed that these cues are poorer, less consistent, and therefore not sufficient to allow a stable decision even on repetition. Because of the redundancy in the cues in lexically familiar and/or longer words, sufficient cues have been received in one presentation to give a listener a stable response. Such factors may be responsible for the contradictory nature of previous studies.

Haggard (1973) examined the effect of repeated presentation of noise masked stimuli on identification response. He was interested, not in the previously observed improvement in results with repeated presentation, but in "the way in which the first presentation of a stimulus may influence the process whereby the subsequent presentations are analysed" (ibid., p286). He tested two competing models. The first model, a "statistical averaging model", assumes that the complete signal of speech plus noise is stored and that with repeated presentations the speech plus noise signals are "added cumulatively such that the differing contributions of noise cancel out ... while the representations of the true signal add, giving an effective improvement in signal to noise ratio" (ibid., p287). The second model assumes that upon the first presentation of the stimulus, a "state of perceptual adjustment" (ibid., p286) or normalisation occurs so that on a repeated presentation, perceptual selectivity has been narrowed giving the highest weight to those aspects of the first stimulus that had the greatest "acoustical coherence" (ibid., p286).

The averaging model predicts that repeated presentations would give more improvement when the added noise samples are different, whilst the narrowing selectivity model predicts that improvement would be greater when stimuli are identical allowing easier convergence upon important information. Further, the averaging model predicts a continuous linear improvement with ever increasing presentations whilst the other model predicts the possibility of non-linear steps in performance with "the emergence of a new state of perceptual adjustment" (ibid., p289). Typically, such perceptual adjustments would be expected to be complete after one presentation and so improvement after the second presentation would be slight. Haggard's results show little improvement beyond the second presentation. Further, with differences in noise type or level, the same pattern occurs, there being no significant change in the improvement on the second presentation. The first result favours the narrowing selectivity theory whilst the second favours neither theory, suggesting that some statistical averaging may exist alongside adaptation.
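The contrast between the two predictions can be made concrete with a toy simulation (my own construction, not Haggard's procedure): averaging repeated presentations of a fixed signal reduces the residual noise power only when the noise differs between presentations, which is what the statistical averaging model requires.

```python
import numpy as np

rng = np.random.default_rng(0)
signal = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 1000))

def residual_noise_power(n_repeats: int, fresh_noise: bool) -> float:
    """Mean squared noise remaining after averaging n_repeats presentations."""
    fixed = rng.normal(0, 1, signal.size)  # one noise sample, reused if not fresh
    acc = np.zeros_like(signal)
    for _ in range(n_repeats):
        noise = rng.normal(0, 1, signal.size) if fresh_noise else fixed
        acc += signal + noise
    avg = acc / n_repeats
    return float(np.mean((avg - signal) ** 2))
```

With fresh noise on each presentation the residual noise power falls roughly as 1/n; with identical noise, averaging buys nothing. This is exactly why the averaging model predicts more improvement when the added noise samples differ.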

Stimuli that have been distorted in a fashion that actually removes, rather than merely masks, the information in the speech signal would not allow convergence of the most relevant information since much of that information would be missing. Repeated presentation would thus result in little improvement. Haggard (ibid.) used speech bandpassed between 500 and 2000 Hz and his results confirmed this expectation. Non-bandpassed speech was masked with noise of a sufficient level to reduce the performance on the first presentation to that of the bandpassed speech. It nevertheless still showed the same proportional increase upon the second presentation as did speech with lower noise levels. Clearly, the information still existed in the heavily masked but non-bandpassed speech.

1.3.3 Identification response rate and competing task tests

Pisoni et al (1982b, 1983, and Luce, Feustel and Pisoni, 1983) examined the limitations of the human information processing system, and in particular short term memory, on the perception and comprehension of natural and synthetic speech. They argued that because of "very severe limitations" on the ability of human memory to "encode and store raw sensory data" (1983, p535) it is necessary for the listener to rapidly "recode and transform sensory input into more abstract neural codes" (ibid.) and so reduce the pressure put on the short term memory. The storage capacity of short term memory is described as the major limiting factor in sensory processing. In one experiment they examined the time it took for listeners to answer "word" or "non-word" (by pressing one of two buttons) to a series of either natural or synthetic words and "permissible non-words". Mean reaction times were faster for words than for non-words, and faster for natural than for synthetic speech: responses to synthetic speech were 145 ms slower than to natural speech. They concluded that it takes more "effort" to process and encode synthetic speech than natural speech. Five days of practice improved all results, but the differences in reaction time remained the same. A similar experiment required the subject to name the words. The results again showed more errors and a much slower response time for the synthetic words than the natural words. They concluded that "these results demonstrate that the extra processing time needed for synthetic speech does not depend on the type of response made by the listener since the results were comparable for both manual and vocal responses" (ibid., p536).

Another series of tests examined the effects of competing tasks on the subjects' memory. Subjects were given a list of 0, 3, or 6 digits to memorise and retain whilst a list of ten natural or synthetic words was being read, and were then asked to repeat the digits in the order of presentation and to recall as many words from the list as possible. At each digit load, recall of the synthetic words was worse than recall of the natural words and, further, as the number of digits increased the recall of both natural and synthetic words deteriorated. Further, as the number of digits increased, significantly fewer subjects were able to recall all the digits when synthetic speech followed their visual presentation than when natural speech did. It was concluded that synthetic speech interferes much more with other competing cognitive processes than does natural speech.

Pisoni et al then examined the effects of presenting a list of ten natural or synthetic words to the subjects who were given the task of recalling the list when its presentation was complete. There was no difference in ability to recall the second half of the lists, but recall of the first few words of the synthetic list was significantly poorer than for the natural words. They concluded that the second half of the list actively interfered with the maintenance of the memory of previous words, and that this effect was greater for synthetic words.

A further study examined the ability of listeners to comprehend a passage of either natural or synthetic speech. They found that the more abstract levels of the passage were better understood by those who heard natural speech and that the surface structure of the passage was better remembered by subjects who heard synthetic speech. They concluded that it is harder to encode synthetic words than natural words, and so more time is spent on this encoding, whilst with the natural speech more time is left to allocate to the understanding of the ideas in the passage. The extra time and effort spent on the encoding of the synthetic words seems to have made them more memorable. An extension of this study showed that whilst training made both natural and synthetic speech more intelligible, it had no effect on the comprehension of the synthetic speech.

2 Speech distortion and masking

Nakatani and Dukes (1972) compared a number of "subjective" acceptability ratings and found that only the noise and distortion ratings were reasonably independent. They felt that "... it may be possible to represent many different types of speech degradation by points in a two-dimensional space defined by orthogonal axes corresponding to amount of signal distortion and amount of background noise".

2.1 Waveform distortion

Licklider (1946), and Licklider and Pollack (1948) examined the effect of various kinds of amplitude distortion on the intelligibility of speech. They examined the effects of peak clipping (finite, infinite, symmetrical and asymmetrical) as well as various degrees of linear rectification on speech intelligibility. Licklider (1946) found that moderate peak clipping, whether symmetrical (both positive and negative peaks clipped uniformly) or asymmetrical, produced words that were identified 96% correctly in silence. Even infinitely peak clipped speech, in which all values are either plus or minus a selected amplitude value and thus in which only zero crossing information is retained, maintained a 50% word intelligibility (90% sentence intelligibility) in silence. Even fairly moderate centre clipping, on the other hand, caused a considerable reduction in intelligibility and "... the words sounded more like atmospheric static than like speech." (ibid., p431) Half wave rectification had little effect on intelligibility, whilst full-wave rectification produced "badly garbled" speech of very low intelligibility. Both studies concluded that peak clipping impaired the quality of the speech more severely than its intelligibility. It is clear from these experiments that although only the temporal patterning of time axis crossings was maintained, this contained sufficient frequency domain information to allow reasonable intelligibility. In the second study, infinitely clipped speech was post-processed by either differentiating (upward frequency tilt; high frequency emphasis) or integrating (downward frequency tilt; low frequency emphasis). Whilst integration improved the perceived speech quality and differentiation decreased the quality, neither caused any appreciable improvement in intelligibility.
In other words, whilst both processes had considerable though opposite effects on the speech quality, neither process restored any cues removed by the clipper, nor removed any more cues.
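The amplitude distortions discussed above are easily stated as sample-by-sample operations. The following sketch (a modern restatement of the definitions, not Licklider's apparatus) implements them:

```python
import numpy as np

def peak_clip(x, limit):
    """Symmetrical finite peak clipping: limit both positive and negative peaks."""
    return np.clip(x, -limit, limit)

def infinite_clip(x):
    """Infinite peak clipping: keep only the sign, i.e. zero-crossing information."""
    return np.sign(x)

def centre_clip(x, threshold):
    """Centre clipping: zero all samples whose magnitude falls below threshold."""
    return np.where(np.abs(x) < threshold, 0.0, x)

def half_wave_rectify(x):
    """Pass positive half-cycles, zero the negative ones."""
    return np.maximum(x, 0.0)

def full_wave_rectify(x):
    """Fold negative half-cycles upward (absolute value)."""
    return np.abs(x)

# a 200 Hz test tone standing in for a speech waveform
t = np.linspace(0, 0.02, 320)
x = np.sin(2 * np.pi * 200 * t)
```

Infinite clipping discards all amplitude information, yet, as Licklider found, the retained zero-crossing pattern carries enough frequency-domain information for substantial intelligibility; centre clipping, by contrast, destroys the zero-crossing pattern itself.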

The results of Williams and Hecker (1968, see 1.3.1 above) showed that as the degree of peak clipping is increased from 0 dB to 22 dB the scores of both sentence tests and PB tests indicate decreasing intelligibility, but the results of the two rhyme tests show a flat, high intelligibility score. Perhaps peak clipping causes "cosmetic" distortion (see Voiers, 1982) which has a great effect on overall intelligibility as measured by PB tests, but does not actually remove the cues necessary for the perception of distinctive features. Williams and Hecker argue that the reason for these results is that the rhyme tests examine consonant intelligibility while moderate peak clipping mainly affects vowels. Because of the low energy level of most consonants, moderate degrees of clipping would not have as great an effect on consonants as on vowels, but with infinite clipping, which was not included in the Williams and Hecker tests, all phonetic classes would be equally affected.

Ainsworth (1967) examined the effect of different types of clipping on the intelligibility of different phonetic groups and found that infinite clipping caused the fewest confusions for vowels and the greatest number of confusions for fricatives. Stops had the second largest number of confusions. Between phoneme category confusions were non-existent for vowels and greatest for fricatives.

Figure 1 shows the effect of infinite clipping on the spectrum of a natural vowel /a:/, whilst figure 2 shows the effect of infinite clipping on the speech time domain waveform.

Figure 1: Comparison of spectra of the vowel /a:/ for natural speech and infinitely clipped speech.

Figure 2: Comparison of the waveforms of the token /ta:/ of natural and infinitely clipped speech.

2.2 Filtering

Haggard (1973) during an experiment on repeat presentation (see above) utilized speech bandpassed between 500 and 2000 Hz and concluded that the lack of improvement with the second presentation indicated that there was insufficient information in the signal to allow convergence on the important cues (ie. at least some of those cues had been completely removed). The nature of such filtered signals is therefore different from that of signals which have been masked to the same first-presentation intelligibility level (see below).

Hirsh et al (1954) examined the effect of low-pass (LP) or high-pass (HP) filtering at different frequencies on the intelligibility of natural speech and found that as the HP cutoff frequency increased from 200 to 1600 Hz, and also as the LP cutoff frequency decreased from 6000 Hz to 1600 Hz, the intelligibility scores remained nearly constant. When the cutoff frequency was moved above 1600 Hz for LP, or below 1600 Hz for HP, intelligibility dropped rapidly.

Agrawal and Wen (1975) examined the effect of selectively filtering out one of the four lowest formants. They found the greatest effect on intelligibility when F2 was filtered out, and some effect when F1 was filtered out, but none when F3 or F4 was filtered out. This may explain the results of Hirsh et al (above), because if the frequency response curve around the cutoff frequency was not too steep, when the cutoff frequency was around 1600 Hz there may always have been sufficient F2 remaining to assist in word identification.

2.3 Masking

Miller (1947) listed three categories of sounds which could interfere with speech: (i) tones (pure and complex), (ii) noises, and (iii) voices. All masking in these studies was additive (the masking sound was simply added to the speech signal). The intelligibility of masked speech (regardless of whether the noise masked all frequencies or only certain bands) was shown to decrease as the intensity of the masking noise increased. Similar results were reported for tones and for masking by other voices, although it was noted that a single voice was an inefficient masker and that multi-speaker babble was required (although beyond 4 voices no increase in masking ability was produced). Miller commented that "... the masking produced by a sound depends upon the spectrum of the sound and is independent of the phase relations among the component frequencies" (ibid., p118). He also noted that "... two noises with the same over-all level can produce quite different masking results depending on the spectra of the noises" (ibid., p122). He further examined the effect of interrupted masking noises at various intensities and found that even for very loud masking sounds, very little masking of speech intelligibility occurs if the masking sound is on for less than 50% of the time. He concluded that "... the greatest interference with vocal communication is produced by an uninterrupted noise which provides a relatively constant speech-to-noise ratio over the entire range of frequencies involved in human speech" (ibid., pp124-125).
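Additive masking at a controlled speech-to-noise ratio, the basic manipulation in these studies, can be sketched as follows (a NumPy illustration; `add_noise_at_snr` is a hypothetical helper, not a function from the literature). White noise is scaled so that the overall speech-to-noise power ratio equals a chosen value in dB.

```python
import numpy as np

def add_noise_at_snr(speech, snr_db, rng=None):
    """Additively mask `speech` with white noise scaled so that the overall
    speech-to-noise power ratio equals `snr_db` decibels."""
    if rng is None:
        rng = np.random.default_rng(0)
    noise = rng.standard_normal(len(speech))
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Choose the scale so that 10*log10(p_speech / (scale**2 * p_noise)) == snr_db.
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

t = np.linspace(0.0, 0.1, 1600, endpoint=False)
speech = np.sin(2 * np.pi * 150 * t)
masked = add_noise_at_snr(speech, snr_db=0.0)  # equal speech and noise power
```

Note that, as Miller observed, the same S/N ratio can produce very different masking depending on the noise spectrum; this sketch controls only the overall power ratio.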

Stevens et al (1946) studied the effect of masking by sine waves, square waves, and regular and modulated pulses on the "threshold of perceptibility" (the minimum level at which connected discourse can be followed). They found that the degree of masking by all three regular maskers was dependent upon their frequency, with the greatest masking occurring for masking signals with fundamental frequencies between 100 and 500 Hz. Both square waves and regular pulses are more effective in masking speech (ie. a lower intensity is required) than are sine waves, because of their richer harmonic structure. Pulses subjected to interval modulation have very complex spectra, and their degree of masking of the speech signal is greater than that of unmodulated pulses.

Miller and Licklider (1950) examined the effects of three types of speech interruption: (a) the speech was multiplied by a square wave alternating between +1 and 0 (ie. the speech was interrupted by silence); (b) noise multiplied by a similar square wave was added to the speech (ie. the speech was masked by interrupted noise); (c) interrupted speech and interrupted noise alternated. When the speech was interrupted by silence, the effect on intelligibility was dependent on the frequency of the interruption. Intelligibility was poor for fewer than 10 interruptions per second (10 Hz), as segments of 100 ms or more would be missing at a stretch, completely deleting whole phonemes. Intelligibility was then quite high until about 100-200 Hz was reached, at which point it dropped rapidly, only to start climbing again at about 1000 Hz. Interruption by silence at about 10000 Hz had no effect. A similar pattern was found with noise-interrupted speech and speech masked by interrupted noise, except that above about 1000 Hz the masking effect levelled out, giving a degree of intelligibility loss proportional to the S/N ratio (ie. the speech was now effectively masked by continuous noise). The large intelligibility loss between 200 and 1000-2000 Hz would be equivalent to the masking of the vowel F1 and F2.
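The interruption conditions above amount to gating with a 0/1 square wave. The following NumPy sketch (function and variable names are illustrative assumptions) shows conditions (a) and (b):

```python
import numpy as np

def square_gate(n_samples, rate_hz, fs, duty=0.5):
    """0/1 square wave with `rate_hz` on/off cycles per second at sampling
    rate `fs`; on for `duty` of each cycle (Miller and Licklider's
    interruptions used a 50% duty cycle)."""
    t = np.arange(n_samples) / fs
    phase = (t * rate_hz) % 1.0
    return (phase < duty).astype(float)

fs = 8000
speech = np.sin(2 * np.pi * 200 * np.arange(fs) / fs)  # 1 s toy "speech"
gate = square_gate(len(speech), rate_hz=10, fs=fs)     # 10 interruptions per second
noise = np.random.default_rng(1).standard_normal(len(speech))

interrupted_speech = speech * gate             # condition (a): interrupted by silence
masked_speech = speech + noise * (1.0 - gate)  # condition (b): masked by interrupted noise
```

At 10 Hz each silent gap lasts 50 ms; lowering `rate_hz` below 10 Hz lengthens the gaps past the duration of whole phonemes, which is the regime where Miller and Licklider found intelligibility collapsed.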

An early study (Egan and Hake, 1950) related the ability of a band of noise to mask a tone near its centre to the intensity of the noise and the width of the noise band. They found that the narrower the masking stimulus, the greater the spread of the masking effect (especially towards higher frequencies) because of the presence of beats. When the bandwidth of the masker was increased from 90 Hz (centred around 410 Hz) to a band from 0-1000 Hz, the subjects could no longer hear buzz or rattle in masked tones just above the masked threshold; however, the masked tone needed to be 4.2 dB more intense before it could be heard, even though the sound pressure level of the wider noise band was the same as that of the narrow band of noise. The wider noise band thus allowed a more accurate estimate of the critical band width.

Fastl (1976/77) examined temporal masking effects of a critical band noise masker showing forward (post) masking and backward (pre) masking effects and the interaction of pre and post masking between adjacent noise masker bursts. The longer the masking noise burst, the greater is the effect of both pre and post masking.

Schroeder et al (1979) examined the ability of tones to mask noise and of noise to mask tones. They cite two studies (Zwicker et al, 1967 and Zwicker, 1963) which found that "while a tone in a critical band of noise becomes inaudible at a level of 2 to 6 dB below the noise level, depending on the frequency ..., a critical band of noise is completely masked by a mid-band tone only when its level is 24 dB (at 1 kHz) below the tone level" (Schroeder et al, 1979, p222). This can be explained by the fact that a tone at -6 dB relative to the masking noise is of the same level as the fluctuations in the noise, whilst in the reverse case a small amount of noise fluctuation is detectable against the fluctuation-free pure tone.

Rothauser and Urbanek (1965) used multiplicative rather than additive noise allowing the noise level to be correlated with the speech signal.

Mermelstein (1982) examined the effect of band-limited multiplicative noise masking on the threshold of degradation, which is "... the point at which degradation becomes noticeable in the signal..." and is defined by "... the signal-to-noise ratio at which the unimpaired speech just becomes preferable to the degraded speech" (ibid., p1369). The white noise was introduced into one of four frequency bands, 0-500 Hz, 500-1000 Hz, 1000-2000 Hz, and 2000-4000 Hz. The threshold of degradation was highest for the 0-500 Hz band (18 dB S/N) and decreased with increasing frequency to the 1000-2000 Hz band (10.5 dB S/N) and then became fairly constant.

Hirsh et al (1954) examined the effect on natural speech intelligibility as the degree of noise masking is increased, and found that intelligibility decreased rapidly as the level of masking increased beyond 0 dB S/N (s.p.l.), with very poor intelligibility at -13 dB S/N.

Nakatani and Dukes (1972) also examined interference thresholds, but used interfering speech of various levels (and thus masking all bands). The results were compared with "subjective" acceptability ratings and were shown to have a high degree of sensitivity and reliability.

Pols (1983) examined the effects of interfering speech and of reverberation on the confusion patterns of Dutch consonants, and found that they had very similar effects. He assumed that this was because reverberation "... creates a speech-like background noise caused by earlier speech segments which gradually fades away" (ibid., p291).

Haggard (1973) introduced noise into his synthesised speech samples by interfering with the frequency and amplitude control data in such a way that he produced "structural level" noise in the signal during the synthesis process. He did this by adding random numbers to the bottom 4, 3, 2 or 1 bits of the 5 bit control parameters giving amplitude ratios of 2, 4, 8, and 16 to 1 and S/N ratios of approximately 6, 9, 12 and 18 dB.
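Haggard's "structural level" noise can be approximated by randomising the low-order bits of each 5-bit control value. The NumPy sketch below is one plausible reading of the procedure (replacing, rather than adding to, the low bits is an assumption made here to keep values in the 0-31 range; the function and variable names are illustrative):

```python
import numpy as np

def add_bit_noise(params, n_bits, rng=None):
    """Randomise the bottom `n_bits` of 5-bit (0-31) synthesiser control
    values, after Haggard (1973). Replacing (rather than adding to) the
    low-order bits is an assumption made here to keep values in range."""
    if rng is None:
        rng = np.random.default_rng(0)
    params = np.asarray(params, dtype=np.int64)
    noise = rng.integers(0, 2 ** n_bits, size=params.shape)
    # Keep the high-order bits, replace the low-order bits with random values.
    return (params & ~(2 ** n_bits - 1)) | noise

controls = np.array([5, 12, 19, 26, 31])   # hypothetical 5-bit control values
noisy = add_bit_noise(controls, n_bits=2)  # randomise the bottom 2 bits
```

Because the top bits are untouched, the disturbance is bounded by the weight of the randomised bits, which is what ties the chosen `n_bits` to an approximate S/N ratio.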

Haggard (1973, see 1.3.2 above) demonstrated that the type of degradation of speech intelligibility caused by noise masking and by bandpass filtering are different, with some of the acoustic cues necessary for improvement due to perceptual adjustment or normalisation being absent in the filtered speech, but not in the masked speech. In other words, repeat presentation improved the recognition scores of noise masked speech but not of the filtered speech.

Pisoni and Koen (1982a) demonstrated that as the S/N ratio decreases, synthetic speech shows a faster decrease in intelligibility than masked natural speech.

Clark (1983) used five different levels of noise masking (Quiet, +12 dB S/N, +6 dB S/N, 0 dB S/N, -6 dB S/N) to sensitise intelligibility tests of natural speech, analysis-synthesis speech derived from natural speech using a 12-parameter serial formant synthesizer, and synthesis-by-rule speech. His results for the vowels (in a /h_d/ frame) showed no significant differences between the speech types and the masking levels for quiet, +12 dB S/N, and +6 dB S/N conditions. There were significant drops in intelligibility from +6 dB S/N to 0 dB S/N to -6 dB S/N masking levels. Differences in intelligibility between the speech types only occurred at the -6 dB S/N masking level, where the synthetic vowels were slightly more intelligible than the natural vowels. This result, he felt, would "... probably result from the almost idealised formant structure in the synthesized vocalic sounds being marginally less affected by noise masking than their natural counterparts" (ibid., p43). The intelligibility of the natural consonantal sounds (/Ca/ context), on the other hand, was significantly better than both sets of synthesized consonants for all masking conditions. Further, the natural consonants remained "quite resistant to masking until 0 dB S/N is reached" (ibid., p45) whilst the intelligibility of the synthetic consonants fell rapidly even at +12 dB S/N. These trends were greatest for the synthesis-by-rule set, which is not surprising, for such a process is limited not only by the intrinsic synthesiser limitations (which are also shared by the analysis-synthesis method) but also by the validity of the rule structure and its related data base. Clearly, the limited number of parameters available on such synthesizers is adequate to model vowel structure, but is inadequate for the modelling of consonants. 
Further, within the consonants, the perception of nasals and liquids is much less degraded by masking noise than that of the stops and fricatives, probably due to the reliance of the former on vowel-like formant cues whilst the latter rely more on transient high frequency cues, zeros, etc., which are not adequately modelled on this type of synthesizer. It is interesting to note that noise masking of the natural speech causes less degradation in the intelligibility of nasals and liquids than of stops and fricatives. This may be due to the use of white noise, in which the intensity level at higher frequencies is the same as that at lower frequencies. In general, the low frequency intensity of a vowel nucleus is often greater than the high frequency intensity of an adjacent frication, and so lower intensity high frequency cues may be subjected to greater masking than low frequency cues when white noise is used as the masker.

Busch and Eldredge (1967) examined the effects of white noise and "speech-shaped" noise (-9 dB/octave above 500 Hz) on the intelligibility of various consonants. They found that nasals and liquids /m, n, ng, w, j, r, l/ were more severely affected by the speech-shaped noise, with its more intense low frequency components, than they were by white noise. Fricatives, on the other hand, were more intelligible in speech-shaped noise than in white noise. When both types of noise are set to a certain intensity, the measurements are usually based on the total energy of the signal. Speech-shaped noise will have more intense low frequency components and less intense high frequency components than white noise of the same average intensity, and so will have a greater effect on phonemes with low frequency cues than on phonemes with high frequency cues.
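A "speech-shaped" masker of the kind Busch and Eldredge describe can be approximated by shaping white noise in the frequency domain: flat below the corner frequency and rolling off at -9 dB/octave above it. The following NumPy sketch is illustrative (the FFT-based method and function name are assumptions, not the original noise generator):

```python
import numpy as np

def speech_shaped_noise(n_samples, fs, corner_hz=500.0, slope_db_oct=-9.0, rng=None):
    """White noise shaped to be flat below corner_hz and to roll off at
    slope_db_oct dB per octave above it, approximating the masker of
    Busch and Eldredge (1967)."""
    if rng is None:
        rng = np.random.default_rng(0)
    white = rng.standard_normal(n_samples)
    spectrum = np.fft.rfft(white)
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    gain = np.ones_like(freqs)
    above = freqs > corner_hz
    # -9 dB/octave: amplitude gain of 10**(slope * octaves / 20) above the corner.
    gain[above] = 10.0 ** (slope_db_oct * np.log2(freqs[above] / corner_hz) / 20.0)
    return np.fft.irfft(spectrum * gain, n=n_samples)

fs = 16000
noise = speech_shaped_noise(fs, fs)  # one second of shaped noise
```

Because most of the shaped noise's energy sits at low frequencies, equating its total energy with that of white noise concentrates the masking on low frequency cues, which is exactly the asymmetry the study reports.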

Figure 3 shows the effect of different levels of "speech-shaped" noise on the time-domain waveform of a natural word /ta/. Figure 4 depicts the effect of noise masking on the frequency domain of a sample vowel.

Figure 3: Comparison of the waveforms of a natural speech token /ta:/ for unmasked and progressively greater degrees of masking. The noise used in the masking was USASI S1.4 speech-shaped noise.

Figure 4: Comparison of the spectra of the vowel /3:/ for unmasked natural and channel vocoded speech and for progressively greater degrees of masking of the vocoded speech. The noise used in the masking was USASI S1.4 speech-shaped noise.

3 Tests for measuring speech quality

In general, two major categories of approach have been used by phoneticians, psychologists and engineers when testing the success of synthetic speech algorithms. The first approach has focused on some form of auditory test, whilst the second has attempted to define automatic computational methods (ie. acoustic testing).

3.1 Auditory or perceptual tests of speech quality

Hecker and Guttman (1967) in a survey of speech quality tests, especially of speech processing systems, considered that such speech quality testing is only necessary for systems that perform well in word-intelligibility tests. They distinguished two types of such test. Firstly, tests which are concerned with obtaining an over-all measure of speech quality and secondly, tests in which the various psychological components of speech quality are explored. They felt that although the former group of tests are more utilitarian, especially for engineers, it is not really possible to reduce speech quality to a unidimensional scale and so the second group of tests are probably more accurate. Hecker and Williams (1966) described the first set of tests as involving a "unidimensional measure of relative merit" (ibid., p946) in which systems that are equally preferred have equal quality, regardless of any individual physical differences, whilst tests of the second type they defined as "multidimensional measures in perceptual space" (ibid., p946). Both types of test are therefore auditory tests. It is also possible to have tests of speech quality that are acoustically based as can be seen from the section on automatic computational methods below (sect. 3.2).

Preference tests are a typical example of the first kind of test, and involve the presentation of a series of pairs of speech samples (usually one from the system being tested and one reference sample). The subject is required to indicate which of the pair is preferred, without any indication of the degree of preference. Munson and Karlin (1962) originated a version of this procedure called the "isopreference test" in which the test samples were presented unaltered whilst the reference samples were varied either in their speech level or their noise level. The results were then plotted as isopreference contours, which are lines joining systems of equal preference on a 2D plane of reference speech level against reference noise level. Each contour was then given a rank order number as a measure of relative preference or quality. In effect this test gave a measure of speech produced by any particular system as a function of a particular series of S/N ratios of a reference system (usually natural speech). If two systems appeared on the same isopreference contour, ie. if they were equally preferred to the particular reference system, then when compared directly to each other they were also equally preferred. In other words, if system A is equally preferred to system B, and system C is equally preferred to system B, then it follows that system A is also equally preferred to system C. This allowed the identification of "isopreference sets" which are sets of equally preferred systems which may be in all other senses different. Members of these sets could then be used randomly in the paired comparison tests as references and would avoid the problem of the test deteriorating into a discrimination test (ie. the situation in which previous responses colour the present response and in which the subject then merely discriminates which of the two signals, the test signal or the masked reference, is being presented).

The IEEE Subcommittee on Subjective Measurements (1969) criticised the Munson-Karlin test on three grounds: i) attempts at replication of the transitivity results (ie. if A and B are similar, and A and C are similar, then B and C are similar) have not been successful; ii) the reasons for the need for a speech level/noise level plane are not clear; and iii) it is unclear why a large range of loudness needs to be examined. They further describe an alternative version of the isopreference test in which test and reference signal levels are held constant and only the S/N ratio is varied. This method, developed by Rothauser et al (1968), was considered necessary as it was felt that subject responses would be influenced by the loudness of previous pairs.

Voiers (1977a) developed a test he called the Diagnostic Acceptability Measure (DAM), which combined 100-point rating scale judgements of six qualities of the signal (fluttering, thin, rasping, muffled, interrupted and nasal) and four qualities of the background (hissing, buzzing, babbling and rumbling). The results of these tests were compared with the listeners' overall judgements of intelligibility, pleasantness and overall acceptability. Each of these three overall judgement scores was plotted against the DAM results indicating a high degree of correlation in all cases. Only echoic speech showed a poor relationship between its DAM scores and overall intelligibility. The effects of echo are therefore not captured by the signal and background descriptors.

Hecker and Williams (1966) suggested using a variable reference system in which the reference speech would be distorted in a number of fundamentally different ways. This was also designed to avoid discrimination responses and to encourage preference responses, but was designed to be simpler to set up than the isopreference test (ie. the development of the isopreference sets was avoided). They argued that "a set of reference conditions representing different types of distortions produces less variance in the results of preference tests than a set of reference conditions representing one type of distortion." (ibid., p947). Their test employed five different reference systems:-

  1. bandwidth 0-10000 Hz, 45dB S/N
  2. bandwidth 900-3000 Hz
  3. LP filtered (3000 Hz) speech plus LP filtered (500 Hz) white noise, 10 dB S/N
  4. speech plus reverberation echo
  5. peak clipped speech BP filtered (300-2000 Hz).

They found that the results of their test produced less variance than control tests using varying levels of the same distortion.

McDermott (1969) examined both preference tests and similarity tests (pairs scored according to degree of perceived similarity) to examine speech produced with different types of distortion, and found a high degree of correspondence between the two types of tests when plotted onto a three dimensional perceptual space. She considered this method capable of indicating those types of signal variations which showed the greatest personal variation in preference.

3.2 Automatic computational methods

Much early testing of new synthesis techniques and devices, especially by engineers, was deemed to be too "subjective" to be reliable when comparing newly developed systems with their predecessors or competitors. Indeed, much of that testing had been very perfunctory, consisting of little more than informal listening sessions. Further, those developers who undertook more systematic perceptual tests nevertheless found them "tedious and expensive" (Billi & Scagliola, 1980). The desire for some sort of "objective" tests led some engineers (eg. Licklider et al, 1959) to consider various machine-based computational algorithms.

Typically, it was necessary that the speech coder or vocoder being tested be able to be modelled as "an additive noise source" (ibid.), and that any noise so produced be "uncorrelated with the input signal" (ibid.). The notion of noise is central to this type of evaluation method and refers to the measured difference between the input and output signals. Clearly, only those systems which involve some sort of coding of an input speech signal can be evaluated in this way. Of the various synthesis systems of interest to phoneticians, only vocoders would be amenable to this sort of evaluation. Unfortunately, the noise introduced by a vocoder system is very difficult to define and measure (Makhoul et al, 1976). For example, although the input and output signals of a vocoder may appear quite different in terms of wave shape, the actual difference perceived by a listener may be judged to be insignificant. Makhoul et al (ibid.) considered it necessary to relate any method of "objective" vocoder evaluation to the vocoding process and also to perception. They identified "analysis, encoding, transmission and synthesis" as those components of a vocoder system which "contribute to the degradation of vocoded speech quality" (ibid.). In analysis, they considered the type of excitation used to be the most important factor causing degradation. Coding was seen to be the most important of the above components in determining output speech quality of narrow band or low bit rate vocoders (<5000 bps) because of the heavy quantization that occurs. For such vocoders they proposed comparing the input and output values of the encoder.

The success of "objective" evaluation of vocoders is dependent upon the selection of appropriate measures. Viswanathan et al (1983) used Barnwell's measures (see below) to compare the performance of medium and narrow band waveform coders and vocoders, and found that substantially poorer results for vocoders improved greatly when (for example) a common sampling frequency was used. Schroeder et al (1979) made calculations of noise loudness based on human auditory perception (critical bands) in which the lower frequencies are weighted more heavily than higher frequencies. This explains Viswanathan's results, since Schroeder et al (ibid.) note that if higher frequencies are zeroed the lower frequency noise level can (as an approximation) be treated as linear. By limiting the sampling frequency to 6.6 kHz, Viswanathan et al (1983) had effectively set frequencies above 3.3 kHz to zero.

Barnwell (1980a, 1980b) and Barnwell and Quackenbush (1982) compared the success of various "objective measures" with the Diagnostic Acceptability Measure (Voiers, 1977a, see above), a measure of perceived speech quality, using a distorted and an undistorted speech database. The distorted database was created by

  1. running the speech through various coding algorithms (eg. APCM, ADPCM, LP coding, vocoder etc.)
  2. filtering, additive noise, interruption, clipping, etc.
  3. frequency distortion and masking

giving a total of 264 distorted tokens.

Four classes of objective quality measures were compared for their ability to predict the subjective results.

  1. "spectral distance measures" (frequency domain distortion measures, using "spectral envelopes estimated using 10th order LPC analysis". ie. linear and log spectral distortion).
  2. "parametric distance measures" (comparison of measures extracted from the LPC analysis of distorted and undistorted speech, including, area ratios, feedback coefficients, PARCOR coefficients, energy ratio, and also log area ratios, log feedback ratios, and log PARCOR values).
  3. "noise measures" (S/N measurements)
  4. "Composite measures" (combinations of the above)

The tests were further divided into frequency variant and invariant tests, and unframed and framed tests. The frequency variant tests divided the spectrum into six bands and weighted these results according to frequency before computing a combined result. The framed, or short-time, tests divided the signal into 10-30 ms segments and weighted each segment's results according to its energy before computing a combined result. In general, it was found that frequency-variant tests performed better than frequency-invariant tests, and that framed tests performed better than unframed tests. Frequency variance had the greater effect. The log-area-ratio distance and the energy ratio measures were the only frequency invariant parametric distance measures which performed well, and both performed better than the frequency invariant spectral distance measures. Frequency variant methods greatly improved the linear spectral distance scores to a level comparable with the moderately improved log spectral distance scores. Unframed S/N tests performed poorly, whilst framed S/N tests performed well. The best result of all was obtained by the frequency variant framed S/N test. A further result was that "often more improvement was obtained by combining a good measure with a bad measure of a vastly different type than from combining two or more similar good measures." (Barnwell and Quackenbush, 1982, p998)
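A framed (short-time) S/N measure of the kind described above can be sketched as follows. This NumPy illustration is one plausible reading of Barnwell's procedure, not the exact published recipe; in particular the per-frame energy weighting shown here is an assumption.

```python
import numpy as np

def framed_snr_db(clean, degraded, frame_len):
    """Framed ("segmental") S/N measure: split both signals into short
    frames, compute a per-frame S/N in dB, and combine the frames
    weighted by their energy (the weighting is an assumption here)."""
    n_frames = len(clean) // frame_len
    snrs, weights = [], []
    for i in range(n_frames):
        s = clean[i * frame_len:(i + 1) * frame_len]
        e = s - degraded[i * frame_len:(i + 1) * frame_len]
        p_s = np.sum(s ** 2)
        p_e = np.sum(e ** 2)
        if p_e == 0.0:        # identical frames: skip to avoid log(0)
            continue
        snrs.append(10.0 * np.log10(p_s / p_e))
        weights.append(p_s)
    return np.average(snrs, weights=weights)

fs = 8000
t = np.arange(fs) / fs
clean = np.sin(2 * np.pi * 200 * t)
degraded = clean + 0.1 * np.random.default_rng(0).standard_normal(len(clean))
score = framed_snr_db(clean, degraded, frame_len=160)  # 20 ms frames at 8 kHz
```

An unframed S/N test would compute a single global power ratio; framing prevents loud segments from swamping the contribution of quiet ones, which is one reason the framed tests predicted subjective quality better.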

The results of both Barnwell and of Schroeder et al point to the need to consider the time and (especially) frequency domain aspects of human auditory perception when designing computational methods of assessing speech coding success. Unfortunately, the prerequisites for such evaluations do not allow the comparison of systems which have natural speech input (such as vocoders) with systems which have some other form of input, such as synthesis-by-rule systems.

4 Conclusion

Speech quality tests, whether auditory or automatic, are designed to examine the overall acceptability of the output of a speech processing system. The reviewed automatic speech quality tests could have been applied to a vocoder (but not to a TTS system) since vocoders have input and output speech signals which can be compared and tested in the manner described, but an estimate of, for example, the "noise" added to a signal by the system would not allow a detailed examination of the phonetic-level effects of modifications to system design parameters. The same thing applies to the auditory tests of speech quality, since they only measure listener preference between systems, or listener approval of the system based on fairly informal judgements of the presence and severity of certain gross quality parameters.

The intelligibility tests whose results are least contaminated with the effects of other contextual variables are those which use test items that are as short as possible, and are the least familiar lexically. Monosyllabic words, such as those in PB word lists, are preferable to polysyllabic words because of their minimal lexical and phonetic contextuality, and words are preferable to sentences and continuous discourse because they lack syntactic and semantic context. Such contextual effects may not be important in tests of overall system quality, but they make it very difficult to reach an opinion about the effect of adjustments to various phonetic and acoustic parameters on the intelligibility of individual phonetic elements. Rhyme tests claim to reduce the number of phonetic variables being examined at any one time down to a single phonetic group or even distinctive feature. Their disadvantages include the large number of pairs that must be tested to examine all of the consonants thoroughly, and the fact that they do not examine vowels. Nonsense syllables, such as /h_d/ and CV syllables, have the advantage of having a smaller set of test tokens than the rhyme tests. Further, because only a single vowel (/h_d/) or consonant (CV) is varied from token to token and the remaining phonetic context remains constant, none of the other phonemes in each item can provide any context which can help the listener identify the word and thus the phoneme of interest.

In some cases it is also desirable to sensitize tests of high quality systems by using certain levels of systematic distortion (such as noise masking or infinite clipping). Such sensitisation avoids ceiling effects.

It should be noted that some listeners tend to experience increasing difficulty with test items as they are reduced in size or familiarity (infrequent words and non-words) or are masked or distorted. This is not normally a problem with most listener groups, but this may be an issue in the design of tests targeting the very young, the very old, or people with cognitive or auditory impairments. Designers of tests of the performance of speech processors in hearing aids or cochlear implants, for example, need to be aware of such listener constraints.

Footnote: This paper is a review of natural and synthetic speech quality testing up to the original date of publication in 1984. In the intervening years this field has progressed substantially. For example, a lot of work has recently been carried out on the assessment of TTS systems. Such work examines not just the phoneme-level intelligibility of a TTS system, but TTS prosody and other aspects of TTS performance. Also, since 1984 new standards for the assessment of voice coder quality have been adopted. A review of work in this field since 1984 should be forthcoming at some time in the future.

5 Bibliography

Agrawal A and Wen C. Lin, (1975), "Effects of voiced speech parameters on the intelligibility of PB words", JASA 57(1), 1975, 217-222.

Ainsworth W.A., (1967), "Relative intelligibility of different transforms of clipped speech", JASA 41(5), 1967, 1272-1276.

Barnwell III T.P., (1980a), "Correlation analysis of subjective and objective measures for speech quality", IEEE ICASSP 1980, 706-709.

Barnwell III T.P., (1980b), "A comparison of parametrically different objective speech quality measures using correlation analysis with subjective quality results", IEEE ICASSP 1980, 710-713.

Barnwell III T.P., and Quackenbush S.R., (1982), "An analysis of objectively computable measures for speech quality testing", IEEE ICASSP 1982, 996-999.

Billi R., and Scagliola C., (1980), "An identification method for objective quality measurements on speech waveform coders", IEEE ICASSP 1980, 712-718.

Busch A.C., and Eldredge D., (1967), "The effect of differing noise spectra on the consistency of identification of consonants", Language and Speech 10, 1967, 194-202

Clark J.E., (1981), "Four PB word lists for Australian English", Aust. J. Audiology 3(1), 1981, 21-31

Clark J.E., (1983), "Intelligibility comparisons for two synthetic and one natural speech source", J. Phonetics 11, 37-49.

Clark J.E., Dermody P., and Palethorpe S., (1982), "Effects of repeated stimulus presentation on natural and synthetic speech intelligibility", SLRC Working Papers (Macquarie University) 3(3) 1982, 91-109

Egan J.P., (1948), "Articulation testing methods", Laryngoscope 58, 1948, 955-991.

Egan J.P., and Hake H.W., (1950), "On the masking pattern of a simple auditory stimulus", JASA 22(5) 1950, 622-630

Fairbanks G., (1958), "Test of phonemic differentiation: the rhyme test", JASA 30, 1958, 596-600

Fastl H., (1976/77), "Temporal masking effects: II. Critical band noise masker", Acustica, 36, 1976/77, 317-331

Giolas T.G., and Epstein A., (1963), "Comparative intelligibility of word lists and continuous discourse", JSHR 6, 1963, 349-358

Giolas T.G., (1966), "Comparative intelligibility scores of sentence lists and continuous discourse", J. Auditory Research 6, 1966, 31-38

Haggard M., (1973), "Selectivity versus summation in multiple observation tasks: Evidence with spectrum parameter noise in speech", Acta Psychologica 37, 1973, 285-299

Hecker M.H.L., and Guttman N., (1967), "Survey methods for measuring speech quality", J. Audio Engineering Soc., Vol 15, 1967, 400-403

Hecker M.H.L., and Williams C.E., (1966), "Choice of reference conditions for speech reference tests", JASA 39(5), 1966, 946-952

Hirsh I.J., Reynolds E.G., and Joseph M., (1954), "The intelligibility of different speech materials", JASA 26(4), 1954, 530-538

House A.S., Williams C.E., Hecker M.H.L., and Kryter K.D., (1965), "Articulation testing methods: Consonantal differentiation with a closed response set", JASA 37, 1965, 158-166

Howes D., (1957), "On the relation between the intelligibility and frequency of occurrence of English words", JASA 29, 1957, 296-305

IEEE subcommittee on subjective measurements, (1969), "IEEE recommended practice for speech quality measurements", IEEE Trans. Audio and Electroacoustics, Vol AU-17, no. 3, 1969

Jakobson R., Fant C.G.M., and Halle M., (1952), "Preliminaries to speech analysis: The distinctive features and their correlates", Tech. Report No.13, Acoustics Lab., MIT, 1952

Kahn M., and Garst P., (1983), "The Effects of five voice quality characteristics on LPC quality", IEEE ICASSP-83, 1983, 531-534

Kryter K.D., and Whitman E.C., (1965), "Some comparisons between rhyme and PB word intelligibility tests", JASA, 1965, p.1146

Laver J., (1980), The Phonetic Description of Voice Quality, Cambridge Uni. Press, 1980

Licklider J.C.R., (1946), "Effects of amplitude distortion upon the intelligibility of speech", JASA 18(2) 1946 429-434

Licklider J.C.R., Bisberg A., and Schwartzlander H., (1959), "An electronic device to measure the intelligibility of speech", Nat. Electronics Conf. Proc. 15, 1959, 329-334

Licklider J.C.R., and Pollack I., (1948), "Effects of differentiation, integration and infinite peak clipping upon the intelligibility of speech", JASA 20(1), 1948, 42-51

Luce, P.A., Feustel T.C., and Pisoni D.B., (1983), "Capacity demands in short-term memory for synthetic and natural speech", Human Factors 25(1), 1983, 17-32

McDermott B.J., (1969), "Multidimensional analysis of circuit quality judgments", JASA 45(3), 1969, 774-781

Makhoul J., Viswanathan R., and Russell W., (1976), "A framework for the objective evaluation of vocoder speech quality", IEEE ICASSP-76, 1976, 103-106

Mermelstein P., (1982), "Threshold of degradation for frequency-distributed band-limited noise in continuous speech", JASA 72(5), 1982 1368-1373

Miller G.A., (1947), "The masking of speech", Psychological Bulletin 44(2), 1947, 105-129

Miller G.A., and Licklider J.C.R., (1950), "The intelligibility of interrupted speech", JASA 22(2), 1950, 167-173

Miller G.A., Heise G.A., and Lichten W., (1951), "The intelligibility of speech as a function of the context of test materials", J. of Exper. Psychology 41, 1951, 329-335

Miller G.A., and Nicely P., (1955), "An analysis of perceptual confusions among some English consonants", JASA 27, 1955, 338-352

Moser H.M., and Dreher J.J., (1955), "Effects of training on listeners in intelligibility studies", JASA 27(6), 1955, 1213-1219

Munson W.A., and Karlin J.E., (1961), "Isopreference method for evaluating speech transmission circuits", JASA 34(6), 1961, 762-774

Nakatani L.H., and Dukes K.D., (1972), "A sensitive test of speech communication quality", 1972, 1083-1092

Pickett J.M., (1956), "Effects of vocal force on the intelligibility of speech sounds", JASA 28(5), 1956, 902-905

Pickett J.M., and Pollack I., (1963), "Intelligibility of excerpts from fluent speech: Effects of rate of utterance and duration of excerpt", Lang. & Speech 3, 1963, 151-164

Pisoni D.B., and Koen E., (1982a), "Some comparisons of intelligibility of synthetic and natural speech at different speech-to-noise ratios", JASA 71 (Suppl 1), 1982, S94

Pisoni D.B., (1982b), "Perception of speech: The human listener as a cognitive interface", Speech Technology 1(2), 1982, 10-23

Pisoni D.B., Nusbaum H.C., Luce P.A., and Schwab E.C., (1983), "Perceptual evaluation of synthetic speech: Some considerations of the user/system interface", IEEE ICASSP-83, 1983, 535-538

Pollack I., (1959), "Message repetition and message reception", JASA 31, 1959, 1509-1515

Pols L.C.W., (1983), "Three-mode principal component analysis of confusion matrices, based on the identification of Dutch consonants, under various conditions of noise and reverberation", Speech Communication 2, 1983, 275-293

Rothauser E.H., and Urbanek G.E., (1965), "New reference signal for speech quality measurements", JASA 38, 940 (A), 1965

Rothauser E.H., Urbanek G.E., and Pachl W.P., (1968), "Isopreference methods for speech evaluation", JASA 44, 1968, 408-418

Rubenstein H., Decker L., and Pollack I., (1959), "Word length and intelligibility", Language and Speech 2, 1959, 175-178

Schroeder M.R., Atal B.S., and Hall J.L., (1979), "Objective measures of certain speech signal degradations based on masking properties of human auditory perception", in Lindblom B., and Ohman S., (eds), Frontiers of Speech Communication Research, Academic Press, 1979

Schultz M.C., (1964), "Word familiarity influences in speech discrimination", JSHR 7, 1964, 395-400

Silverstein B., Bilger R.C., Hanley T.D., and Steer M.D., (1953), "The relative intelligibility of male and female talkers", J. of Educational Psychology 44, 1953, 418-428

Smith C.P., (1979), "Talker variance and phonetic feature variance in diagnostic intelligibility scores for digital voice communications processors", IEEE ICASSP-79, 1979, 456-459

Stevens S.S., Miller J., and Truscott I., (1946), "The masking of speech by sine waves, square waves, and regular and modulated pulses", JASA 18(2), 1946, 418-424

Thwing E.J., (1956), "Effects of repetition on articulation scores for PB words", JASA 28, 1956, 302-303

Traul G., and Black J., (1965), "Effects of context on aural perception of words", JSHR 8, 1965, 363-369

Viswanathan et al (1983), "Objective speech quality evaluation of mediumband and narrowband real-time speech coders", IEEE ICASSP-83, 1983, 543-546

Voiers W.D., (1977a), "Diagnostic acceptability measure for speech communication systems", IEEE ICASSP-77, 1977, 204-207

Voiers W.D., (1977b), "Diagnostic evaluation of speech intelligibility", in Hawley M.E., (Ed.), Speech Intelligibility and Speaker Recognition, Vol 11, Benchmark Papers in Acoustics, Dowden, Hutchinson and Ross, 1977

Voiers W.D., (1980), "Interdependencies among measures of speech intelligibility and speech quality", IEEE ICASSP-80, 1980, 703-705

Voiers W.D., (1982), "Measurement of intrinsic deficiency in transmitted speech: The diagnostic discrimination test (DDT)", IEEE ICASSP-82, 1982, 1004-1007.

Williams, C.E., and Hecker M., (1968), "Relation between intelligibility scores for four test methods and three types of distortion", JASA 44(4), 1968, 1002-1006

Zwicker E., (1963), "Über die Lautheit von ungedrosselten und gedrosselten Schallen", Acustica 13, 194-211. Cited by Schroeder et al (1979)

Zwicker E., and Feldtkeller R., (1967), "Das Ohr als Nachrichtenempfänger", Hirzel, Stuttgart, 1967. Cited by Schroeder et al (1979)