Department of Linguistics

Robert Mannell

Consonant Perception

Various studies (e.g. Agrawal and Wen, 1975) have indicated that, of the lowest four formants, only F1 and F2 are absolutely essential for good intelligibility. Agrawal and Wen (ibid) filtered out one of the bottom four formants at a time and found that removing F3 or F4 had no measurable effect on intelligibility, whilst removing F1 decreased intelligibility by 7.6% and removing F2 decreased it by 27.6%. A closer examination of the confusions indicated that it was not the vowels that lost intelligibility, but certain of the consonants. This is no doubt explained by the importance of vowel onglide and offglide transitions as cues to consonant place of articulation. The direction of the F2 transition is the most important cue in this respect; however, F3 transitions are sometimes necessary to disambiguate two consonants with very similar F1 and F2 transitions.

These formant transition cues have both frequency and time domain components. The actual frequencies at the start of the onglide (for example) are the primary cue to place of articulation, whilst the length of the transition can be an important cue to manner of articulation. For example, the glides /w, j/ can be distinguished from the oral occlusives /b, d, m, n/ by their much longer formant transitions (esp. F1 and F2). /r/ can be discriminated from the other glides and nasals by its rapid F3 transition (time domain cue) from a low frequency (frequency domain cue) (1). Liberman et al (1956) found that, for tokens with various transition durations that could be identified as either /b, ɡ/ or /w, j/, the stops were perceived when the transition durations were less than about 40 ms.
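The duration cue described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the formant tracks are synthetic linear glides, the function names are hypothetical, and the "within 10% of the total excursion" settling criterion is an arbitrary choice made here, not one taken from Liberman et al (1956). Only the boundary of about 40 ms comes from the text.

```python
STOP_MAX_TRANSITION_MS = 40.0  # approximate boundary reported by Liberman et al (1956)


def linear_track(onset_hz, target_hz, transition_ms, total_ms, frame_ms=5.0):
    """Synthetic formant track: a linear glide to the target, then steady state."""
    n_frames = int(total_ms / frame_ms) + 1
    track = []
    for i in range(n_frames):
        progress = min(i * frame_ms / transition_ms, 1.0)
        track.append(onset_hz + (target_hz - onset_hz) * progress)
    return track


def transition_duration_ms(track_hz, frame_ms=5.0, settle_fraction=0.1):
    """Time until the track first comes within settle_fraction of its total
    excursion from the final (vowel target) value.

    The settling criterion is an assumption made for this sketch.
    """
    onset, target = track_hz[0], track_hz[-1]
    excursion = abs(target - onset)
    if excursion == 0:
        return 0.0
    for i, freq in enumerate(track_hz):
        if abs(target - freq) <= settle_fraction * excursion:
            return i * frame_ms
    return (len(track_hz) - 1) * frame_ms


def manner_from_transition(duration_ms):
    """Short transitions suggest a stop, longer ones a glide."""
    return "stop" if duration_ms < STOP_MAX_TRANSITION_MS else "glide"


# Illustrative F2 tracks (frequencies and durations are made-up values).
fast_f2 = linear_track(1000.0, 1500.0, transition_ms=30.0, total_ms=150.0)
slow_f2 = linear_track(600.0, 1500.0, transition_ms=100.0, total_ms=150.0)

for name, track in (("fast", fast_f2), ("slow", slow_f2)):
    duration = transition_duration_ms(track)
    print(name, duration, manner_from_transition(duration))
```

A real analysis would take its tracks from a formant tracker and would need a more robust definition of where the transition ends; the point here is only that the same frequency excursion is heard differently depending on how quickly it is traversed.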

Fricatives and stop bursts are both characterised by high frequency random noise. The location of this noise spectrum is a cue to place of articulation. Fine frequency domain detail of this noise does not seem to be as important as the overall position of the major fricative resonance peak. Because of pole continuity, all fricatives have formants (which in English, /h/ excluded, represent the greatly muted resonances of the cavity posterior to the constriction). These formants are important for /h/ in English and may be important for some non-fronted fricatives (e.g. [x]) in other languages. Homorganic voiceless stop/fricative pairs are discriminated mainly by temporal cues. Stops are characterised by highly transient cues. Their release burst is very brief, whilst the noise of a fricative lasts a great deal longer and rises to its target amplitude more gradually than a stop burst does. Stops, like affricates, are also characterised by a (sometimes very brief) period of occlusion or silence. Affricates are discriminated from stops by their usually much longer and often more intense release burst.

Phonological voicing in stops and fricatives is often, but not always, accompanied by audible vocal fold vibration. Many of the cues to voicing, however, are time and intensity cues. In general, for both stops and fricatives, the duration of the high frequency noise is longer and its intensity is greater for voiceless than for voiced consonants. One of the most intensively studied cues to stop consonant voicing is the voice onset time (VOT). Generally, VOT is earlier for voiced stops than it is for voiceless stops. Various studies (e.g. Carney et al, 1977; Pisoni, 1977) have shown, for English CV stops, that if the VOT is either negative (i.e. voicing precedes the burst) or occurs no more than 20 ms after the release, then the stop is perceived as voiced, whilst if the VOT is more than 20 ms after the burst the stop is perceived as voiceless. Pisoni (ibid) demonstrated that the ear is capable of perceiving the onsets of two stimuli as separate only if these onsets are more than 20 ms apart. These results have been confirmed on adults, infants and chinchillas for synthetic speech, and on adults for non-speech stimuli.
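The VOT boundary described above amounts to a simple threshold rule, sketched below in Python. The function and constant names are hypothetical, and the hard boundary is a simplification: only the figure of about 20 ms for English CV stops comes from the text, and real listeners' responses change gradually around the boundary rather than switching at a single value.

```python
VOT_BOUNDARY_MS = 20.0  # approximate English CV boundary reported in the text


def perceived_voicing(vot_ms, boundary_ms=VOT_BOUNDARY_MS):
    """Return the predicted percept for an English CV stop.

    vot_ms is the voice onset time relative to the release burst;
    negative values mean that voicing starts before the burst.
    """
    return "voiced" if vot_ms <= boundary_ms else "voiceless"


# Illustrative VOT values in ms (not measurements from any study).
for vot in (-60.0, 0.0, 15.0, 35.0, 80.0):
    print(f"VOT {vot:+6.1f} ms -> {perceived_voicing(vot)}")
```

The boundary is kept as a parameter because the 20 ms figure is reported here for English only; other languages may place the boundary elsewhere.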

Van Heuven (1987) lists eight potential cues for perception of the fricative-affricate distinction, three of which have generally been considered to be the main cues: noise amplitude rise time, noise duration, and silent interval duration. Noise duration appears to be a more important cue than rise time, as it can only be traded with noise amplitude rise time if the noise duration is in a narrow range (90-130 ms in isolated CV context). Van Heuven (1987) presents evidence that the existence of a silent gap is a necessary condition for affricate perception in VCV context and that gaps in excess of 20 ms result in affricate rather than fricative perception. For 15 ms gaps it is possible to examine perceptual crossovers between noise duration and rise time. Evidence is also presented to support the hypothesis that, even when the gap is shorter than required or is absent, a sufficient length of low intensity noise that is registered peripherally as noise can be centrally reinterpreted as silence, so that the sound is classified as an affricate (although not as strongly as it would be if there were a true gap that registered peripherally as silence).

Stevens (1970) suggested that the distinctive feature [+consonantal] (as defined by Jakobson, Fant & Halle (1963)) is "... associated with the rapid spectrum change in the 20-odd ms following the release of a consonant into a following vowel ..." (ibid, p304), particularly in prestressed position. This abrupt spectrum change can be found in most consonants, especially the stops, but not in /w, j/, which show much slower changes and, in any case, are [-consonantal]. In stops, the timing of these rapid spectral changes relative to the onset of voicing determines whether the stop is perceived as voiced or voiceless: if there are substantial spectral changes remaining after the onset of voicing, the stop is perceived as voiceless in English. Further, cues to the place of articulation of stops and nasals are largely found in the same period of rapid spectral change. Stevens (1970, pp314-315) presented evidence that for coronal consonants /d, n/ "... the onset of energy at high frequencies precedes the onset at lower frequencies ...", for labial consonants /b, m/ "... the spectrum at the initial onset of energy has an energy concentration that is lower in frequency than in the spectrum sampled a few milliseconds later ...", and for velar consonants /ɡ, ŋ/ "... the major energy concentration at onset is in the middle frequency range ... [followed by] a spreading ... of the spectral energy to frequency regions above and below this middle range." Stevens proposed two possible detectors which would permit detection of these places of articulation (at least in some vowel contexts): a detector that responds to rising spectral energy, and a detector that responds to falling spectral energy (both would be active for the velar).
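The two-detector idea can be illustrated with a toy Python sketch. It is not Stevens' model: the three-band (low, mid, high) representation, the 0.05 threshold, the function names and all of the energy values are assumptions made here purely for illustration. The "rising" detector fires when the share of energy in the high band grows between the onset spectrum and a spectrum sampled a few milliseconds later, and the "falling" detector fires when the share in the low band grows.

```python
def band_shares(energies):
    """Normalise (low, mid, high) band energies so that they sum to 1."""
    total = sum(energies)
    return [e / total for e in energies]


def detect(onset, later, threshold=0.05):
    """Return (rising, falling) detector outputs for two spectral snapshots.

    onset and later are (low, mid, high) band energies at the release and
    a few milliseconds afterwards. The threshold is an arbitrary assumption.
    """
    on, la = band_shares(onset), band_shares(later)
    rising = (la[2] - on[2]) > threshold   # energy spreads upward
    falling = (la[0] - on[0]) > threshold  # energy spreads downward
    return rising, falling


def place_from_onset(onset, later, threshold=0.05):
    """Map the detector pattern to a place label, or None if neither fires."""
    rising, falling = detect(onset, later, threshold)
    if rising and falling:
        return "velar"
    if rising:
        return "labial"
    if falling:
        return "coronal"
    return None


# Made-up band energies chosen to mimic the three patterns described above.
examples = {
    "low onset, energy moves up": ([0.80, 0.15, 0.05], [0.50, 0.35, 0.15]),
    "high onset, energy moves down": ([0.05, 0.15, 0.80], [0.30, 0.30, 0.40]),
    "mid onset, energy spreads both ways": ([0.10, 0.80, 0.10], [0.25, 0.50, 0.25]),
}
for label, (onset, later) in examples.items():
    print(label, "->", place_from_onset(onset, later))
```

With only three bands the sketch ignores vowel context entirely, whereas the text notes that such detectors would be expected to work only in some vowel contexts.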

From the above, it seems likely that some of the most severe effects on consonants will occur when poor frequency resolution reduces the differences between the onsets and offsets of vowels. The Agrawal and Wen (1975) results suggest that consonants are more sensitive to degradation of vowel formant cues than are the vowels themselves. Further, time resolution could have a severe effect on the ability to discriminate between fricatives, stops and affricates if the resolution is low enough. Time resolution would probably have to be worse than 20 ms to affect stop and fricative VOT cues. The more vowel-like consonants, on the other hand, are also likely to be greatly affected by increases in transmission channel bandwidth, in much the same way as vowels. Also, transitional place cues for stops should be affected not just by temporal degradation, but also by frequency degradation, which would hinder the accurate tracking of transition centre frequencies.

References

Agrawal A., and Wen C. Lin (1975) "Aspects of voiced speech parameters on the intelligibility of PB words", JASA 57(1), 217-222.

Carney A.E., Widin G.P., and Viemeister N.F. (1977) "Noncategorical perception of stop consonants differing in VOT", JASA 62(4), 961-970.

van Heuven, V.J. (1987) "Reversal of the rise-time cue in the affricate-fricative contrast: An experiment on the silence of sound", In M.E.H. Schouten (ed) The Psychophysics of Speech Perception, Martinus Nijhoff, Dordrecht, pp181-187

Jakobson R., Fant C.G.M., and Halle M. (1963) Preliminaries to Speech Analysis, MIT Press, Cambridge, Mass.

Liberman, A.M., Delattre, P.C., Gerstman, L.J. & Cooper, F.S. (1956) "Tempo of frequency change as a cue for distinguishing classes of speech sounds", J. Exptl. Psychol. 52, 127-137

Pisoni D.B. (1977) "Identification and discrimination of the relative onset time of two component tones: Implications for voicing perception in stops", JASA 61(5), 1352-1361

Stevens, K.N. (1970) "The potential role of property detectors in the perception of consonants", In G. Fant & M.A.A. Tatham (eds) Auditory Analysis and Perception of Speech, Academic, London

Footnotes

1. It is, of course, rather artificial to divide a single cue (a dynamic F3 movement in time and frequency) into separate time and frequency cues but this is of practical value in a series of experiments that attempt to distort the time and frequency domain separately.