Department of Linguistics
Speech Perception and the Hearing Impaired
An examination of speech perception by the hearing impaired is very difficult for a number of reasons:
- Hearing impairment varies greatly from subject to subject, even for impaired
hearers with identical audiograms. Impairment may vary not merely in terms
of the threshold of pure tone audibility at different frequencies, but also
in terms of impaired frequency selectivity, frequency discrimination, auditory
filter asymmetry, temporal acuity and susceptibility to auditory filter
saturation at high presentation intensities.
- Many studies of speech perception by the hearing impaired have relied
only on audiograms for the categorisation of subjects, often into groups
as broad as "normally-hearing" vs. "hearing-impaired". No attempt has been
made in the majority of cases to independently determine the frequency selectivity
(filter resolution), for example, of the subjects in the studies. The complex,
and often conflicting, results of such studies might be attributed to the
failure to control a number of important independent variables, such as
severity of hearing impairment, frequency selectivity, etc.
- Speech perception can be said to be a combination of both sensory and
cognitive processes. Even when sensory factors of the type mentioned above
are fully understood and their effect on speech perception is fully characterised
they only account for the peripheral part of the process of speech perception.
Some aspects of the speech perception process that we may observe when studying
the sensory aspects of speech perception may result from a conflation of
auditory and cognitive effects. Further, cognitive factors may vary greatly
from one hearing-impaired subject to another.
- Even the sensory aspects of speech perception are not the result of auditory processing alone. Speech perception is multi-modal and also involves vision.
Frequency Selectivity, Frequency Discrimination and Pitch Perception
In the following discussion a distinction is made between frequency selectivity, frequency discrimination and pitch perception.
- Frequency selectivity is the ability to separate or resolve multiple spectral
peaks in a complex sound. It is most directly related to the bandwidth and
tuning of the auditory filters and thus to the place principle of hearing
and so to the Bark or ERB frequency scales. There is a great deal of variation
in the frequency selectivity of people with the same pure tone thresholds.
- Frequency discrimination is the ability of a hearer to reliably (above
chance) perceive that two sounds which differ only in frequency are different.
It is measured as a frequency jnd (just noticeable difference). Frequency
discrimination is the result of both the place and frequency principles of
hearing, with the frequency principle dominating at low frequencies and the
place principle increasingly dominating at higher frequencies (and taking
over entirely at very high frequencies).
- Pitch is the perceptual correlate of fundamental frequency (F0). Pitch perception is based on a complex combination of place and frequency principles. Often, when researchers refer to "pitch perception", they are actually referring to frequency discrimination. Pitch perception can also refer to the perception of relative pitch (judgements of halving and doubling of pitch, for example). The relative pitch scale is known as the mel scale. For pure tones the frequency principle (which assumes the phase locking of the sensory auditory fibres attached to the inner hair cells) is effectively solely responsible for pitch discrimination below 1000 Hz. The place principle gradually takes over from 1000 to 3000 Hz and is solely responsible for pitch discrimination at very high frequencies. If the discrimination of pitch in complex tones (harmonic complexes) were identical to that of pure tones then the frequency principle would account for the perception of pitch over the range of possible fundamental frequencies in the human voice (up to about 1000 Hz). Pitch perception of complex tones, such as occur in voiced speech, is more complex (see the next section).
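The place-based frequency scales mentioned above can be made concrete. A minimal sketch of the ERB scale, using the widely used Glasberg and Moore formula for the equivalent rectangular bandwidth of the normal auditory filter (the specific constants are from that formula, not from the studies discussed here):

```python
import math

def erb_bandwidth(f_hz):
    """Equivalent rectangular bandwidth (Hz) of the normal auditory
    filter centred at f_hz (Glasberg and Moore formula)."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

def erb_number(f_hz):
    """ERB-rate scale value: the number of ERBs below f_hz."""
    return 21.4 * math.log10(4.37 * f_hz / 1000.0 + 1.0)

# Filters broaden with centre frequency: ~36 Hz wide at 100 Hz,
# ~133 Hz wide at 1000 Hz.
print(round(erb_bandwidth(100.0), 1), round(erb_bandwidth(1000.0), 1))
```

Note how the bandwidth grows roughly linearly with centre frequency; this is why harmonic resolvability is best for the lowest harmonics, a point taken up below.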
Pitch and Voicing Perception and the Hearing Impaired
Fundamental frequency analysis as well as the identification of voicing in speech is crucial for the linguistic and paralinguistic identification of numerous aspects of speech and voice including: the phonetic voiced-voiceless distinction; intonation; tone; speaker sex, age and identity; and speaker emotional state.
Voiced speech is effectively a complex tone or harmonic complex consisting of the fundamental frequency and its harmonics at integer multiples of F0.
Fundamental frequency discrimination studies of the normally-hearing have indicated jnds of around 0.25-0.3% of F0 for sinusoids (at F0 values typical of speech). A study by Moore et al. (1984) found that F0 jnds of harmonic complexes for normal hearers are much finer, at about 0.13-0.22%. For a complex tone with an F0 of 200 Hz, the first three harmonics tested individually have jnds about the same as for sinusoids around the F0. For harmonics around 2000 Hz the jnds rise to about 2-5%. Since the discriminability of the F0 of a harmonic complex is better than that of any of its individual components, it is clear that information is combined across harmonics in such a way that discriminability is improved.
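One simple way to see how combining information across harmonics could sharpen F0 discriminability is to treat each resolved harmonic as an independent estimate of F0 and combine the estimates optimally. This is an illustrative calculation, not a model taken from the studies cited above:

```python
import math

def combined_jnd(jnds_percent):
    """Optimal combination of independent estimates: the combined jnd
    is the reciprocal of the root-sum-of-squares of the reciprocals of
    the individual jnds (assumes independent Gaussian errors)."""
    return 1.0 / math.sqrt(sum(1.0 / j ** 2 for j in jnds_percent))

# Three harmonics, each individually discriminable to about 0.25%,
# jointly support a finer F0 jnd -- within the 0.13-0.22% range
# reported for harmonic complexes.
print(round(combined_jnd([0.25, 0.25, 0.25]), 3))  # → 0.144
```

The point of the sketch is only that pooling several noisy estimates yields a jnd finer than any single component's jnd, consistent with the across-harmonic combination described above.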
Moore and Glasberg (1986) described a model of pitch perception for complex tones, such as occur in voiced speech, which uses a combination of both place and temporal information. In this model, an initial frequency analysis divides the spectrum into frequency bands or channels (the auditory filters). The temporal information (fibre firing patterns) of each channel is then processed separately. Temporal analysis of F0 is most accurate in the low frequency region, where the harmonics are resolved, permitting temporal analysis of a single harmonic per auditory filter. Figure 7.15 (from Rosen and Fourcin, 1986) shows a model of the temporal information across adjacent auditory filters. Note that the first four auditory filters align with the first four harmonics and their temporal information is sinusoidal. Up to harmonic 7, a step of one auditory filter is about the same as one harmonic, although not perfectly, as some deviation from a sinusoid is clearly seen for filters/harmonics 5 and 7. For higher auditory filters the harmonics are no longer resolved and the temporal information encoded in each filter includes the information of more than one harmonic. Extraction of accurate pitch information from an auditory filter is most accurate when only a single harmonic is represented, as a simple sinusoid.
For the hearing impaired, as frequency selectivity decreases, the bandwidths of the filters increase, eventually resulting in the non-resolution of harmonics even in the lowest auditory filters. This produces non-sinusoidal temporal information in all auditory filters and so a decrease in pitch discrimination. Figure 7.16 (from Rosen and Fourcin, 1986) shows an impaired auditory filter bank with bandwidths three times normal. Even the fundamental is not adequately resolved, the temporal information in all filters is non-sinusoidal, and pitch discrimination is much reduced.
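The resolvability argument can be sketched numerically. Assuming the Glasberg and Moore ERB formula for the normal filter bandwidth, and a crude criterion that a harmonic is resolved when the harmonic spacing (F0) exceeds the local filter bandwidth (both assumptions are illustrative simplifications):

```python
def erb_bandwidth(f_hz, broadening=1.0):
    """ERB (Hz) of the auditory filter at f_hz (Glasberg and Moore
    formula); `broadening` scales the bandwidth to model impaired
    frequency selectivity."""
    return broadening * 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

def resolved_harmonics(f0_hz, n_harmonics=20, broadening=1.0):
    """Harmonic numbers resolved under a crude criterion: the spacing
    between harmonics (f0) must exceed the filter bandwidth at that
    harmonic's frequency."""
    return [n for n in range(1, n_harmonics + 1)
            if f0_hz > erb_bandwidth(n * f0_hz, broadening)]

print(resolved_harmonics(100.0))                   # → [1, 2, 3, 4, 5, 6]
print(resolved_harmonics(100.0, broadening=3.0))   # → []
```

With normal bandwidths, only the lowest few harmonics of a 100 Hz voice pass the criterion; with bandwidths three times normal, not even the fundamental does, echoing the description of Figure 7.16 above.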
Whilst the pitch discrimination of sinusoids by normally-hearing subjects is inferior to the pitch discrimination of harmonic complexes, the reverse is often true for the hearing impaired. For hearing impaired subjects with poor frequency selectivity, only pure tones avoid the problem of multiple harmonics contributing to the temporal information in the relevant auditory filters.
Often, hearing impaired subjects are unable to label pitch contours as falling, level or rising, whilst normally-hearing subjects are able to (n.b. hearing impaired subjects are often much better able to label sinusoidal than harmonic-complex F0 transitions). According to this model, this may be due to reduced frequency selectivity.
Reduced pitch discrimination may also be due, however, to poor temporal discrimination which affects the accuracy of temporal information encoded in the auditory filters and so reduces (or further reduces) the pitch discrimination of harmonic complexes such as vowels. A possibly relevant aspect of reduced temporal discrimination may be impaired phase locking of inner hair cells, but this has its greatest effect on frequencies above about 750 Hz.
Phase sensitivity of auditory filters may also affect pitch discrimination. "Phase sensitivity is constrained by auditory filtering ... a loss in frequency selectivity would lead to more acute phase sensitivity in the hearing impaired." (Rosen and Fourcin, 1986, pp. 417-418) This increase in phase sensitivity interferes with the accurate processing of auditory filter temporal information. Hearing impaired subjects who have good pitch sensitivity in ideal conditions (headphones, or free-field listening in dead rooms), where phase relationships between harmonics are predictable, may have highly impaired pitch sensitivity in reverberant rooms, where the phase relations between harmonics are rendered random. In such cases pitch may sound "indistinct" (ibid.).
Formant and Spectral Contrast Discrimination
In normally hearing subjects, formant frequency jnds may be as low as 0.5% in psychoacoustic tests on sounds that are only moderately speech-like. For more speech-like sounds, and for speech itself, formant jnds are about 2-5% of formant frequency for normally-hearing listeners. Rosen and Fourcin (1986, p. 425) suggest that it "... seems reasonable to suppose that only formant frequency differences (fx) of 6% or more will play an important role in speech contrasts." Therefore, if the formant frequency discrimination of hearing impaired subjects degrades to the point where jnds are significantly greater than 6%, one would expect phonetically significant degradations in the perception of vowels and vowel-like sounds. Pickett and Martony (1970) tested the vowel perception and F1 jnds of a number of hearing impaired subjects. These subjects had very poor (0-16% correct) speech perception scores; they had close to normal F1 jnds at 200-400 Hz (where loss was <60 dB HL) but F1 jnds above 10% around 825 Hz (where loss was >85 dB HL). This degradation in F1 jnds may be due to the combined effect of flatter filters (lower frequency selectivity) at these frequencies and reduced audibility of F1.
Reduced frequency selectivity has the effect of flattening peaks in a speech sound's spectrum, by broadening the apparent bandwidth of formant peaks and by reducing the peak-to-valley differences. This reduces the distinctiveness of the auditory representations of spectra (based on the place principle), and so reduces the listener's ability to discriminate between spectra and thus to identify speech sounds. Figure 7.28 (Rosen and Fourcin, 1986, p. 443) shows the effect of good and poor frequency selectivity on the auditory representation of speech spectra. Some phonetically important peaks are visible in the representation of the unimpaired ear but not in that of the impaired ear. This can have the effect of greatly reducing the intelligibility of speech sounds.
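The flattening effect can be illustrated by smoothing a toy spectrum with a broad filter. This is a crude stand-in for excitation-pattern broadening, not an auditory model; all numbers are illustrative:

```python
import math

def smear(spectrum, bw_bins):
    """Smooth a linear-power spectrum with a Gaussian of standard
    deviation bw_bins (edge bins are replicated) -- a crude stand-in
    for the broader excitation pattern of an impaired ear."""
    half = int(3 * bw_bins)
    kernel = [math.exp(-0.5 * (k / bw_bins) ** 2)
              for k in range(-half, half + 1)]
    total = sum(kernel)
    n = len(spectrum)
    return [sum(spectrum[min(max(i + k, 0), n - 1)] * kernel[k + half]
                for k in range(-half, half + 1)) / total
            for i in range(n)]

# A toy two-formant spectrum: sharp peaks at bins 15 and 35 on a low floor.
spec = [10.0 if i in (15, 35) else 1.0 for i in range(50)]
broad = smear(spec, 5.0)

# Peak-to-valley contrast shrinks when the "filters" broaden.
print(max(spec) / min(spec) > max(broad) / min(broad))  # → True
```

The broadened version keeps the peaks in roughly the right places but with far smaller peak-to-valley differences, which is the sense in which spectra become less discriminable from one another.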
The formant jnds can also be affected by a reduction in intensity discrimination which further degrades the discrimination of peak to valley differences in intensity.
In the last paragraph of the previous section reference was made to the effect of heightened phase sensitivity in hearing impaired subjects with reduced frequency selectivity. This increase in phase sensitivity with reducing frequency selectivity is a direct consequence of the interaction of adjacent harmonics in the wider auditory filters. Interaction of harmonics means that as phase relationships between the harmonics change the shape of the temporal information for that filter changes. This does not occur in the low frequency auditory filters of normally hearing subjects as harmonics are resolved and so cannot interact to provide different temporal patterns as phase relations between harmonics change.
Increased phase sensitivity can also affect the analysis of the first formant (F1) of vowels. F1 is generally represented by the most intense local harmonic in the low frequency part of the spectrum (200-850 Hz for adult male speakers, ~20% higher for adult females). Depending on the actual F1 frequency and on the F0, these harmonics are often resolved by the unimpaired auditory system. The F1 frequency can therefore be determined by both temporal and place mechanisms. The poor frequency selectivity of many impaired auditory systems simultaneously fails to resolve the harmonics in the region of F1 and also, because of increased phase sensitivity, causes the temporal information of the auditory channel nearest F1 to lose its salience. As a consequence, the place mechanism comes to dominate in the discrimination of F1. There is evidence (e.g. Darwin and Gardner, 1986) that phase manipulations can change the perception of F1 in vowels. The increased phase sensitivity of impaired listeners has some effect on the perception of F1 and may affect vowel identification, but the main effect on F1 perception is the spectral effect of reduced frequency selectivity. Phase sensitivity effects on F1 perception may, however, have a significant effect on the perception of F1 in competing speech and competing noise conditions, and may greatly affect the impaired listener's ability to segregate competing voice cues.
Some studies (e.g. Rosen and Ball, 1986) have shown that single channel cochlear and extra-cochlear implants, which have no facility for providing a place mapping of vowel spectra, can nevertheless, for a small set of patients, provide sufficient information to permit the discrimination of vowel F1 (but not F2) spectral changes. Such non-place perception of spectral quality must rely entirely upon temporal analysis of nerve firing patterns. It is this temporal processing that is impaired by the increased phase sensitivity that results from reduced frequency selectivity.
A further process that affects the perception of spectral contrasts is auditory filter saturation at high presentation levels (an effect of intensity recruitment). Hearing impaired subjects are typically fitted with hearing aids that very often amplify the spectrum of a speech sound to very high presentation levels. At 100 dB presentation levels the auditory system tends to saturate, especially if outer hair cell loss is considerable as is very often the case in hearing impaired subjects. Saturation of auditory filters also results in less distinct and therefore less intelligible spectral patterns. Figure 7.29 compares the effects of 100 dB presentation on normally hearing and hearing impaired listeners. The impaired auditory spectral representation is greatly affected by saturation and its spectral peaks are indistinct.
Upward spread of masking is another process which affects the perception of spectral contrasts. Upward spread of masking is well known to audiologists, particularly with respect to the rising amplitude frequency response which is often recommended for hearing aids. Such settings cannot be assumed to be universally successful, however, owing to the high variability that occurs in the auditory filter characteristics of hearing impaired people. Upward spread of masking is generally suspected in clients whose audiogram indicates a steeply sloping high frequency loss but residual low frequency hearing. A consequence of this is that low frequency spectral peaks such as F1 are passed through the auditory system with less attenuation than higher frequency peaks such as F2, which may be greatly attenuated or even inaudible. If a hearing aid provides the same level of amplification across the spectrum this results, for many clients, in potentially audible F1 and F2. Speech tests show that, for many clients, the perception of F2 contrasts is much poorer than would be predicted by audiograms alone. This occurs especially when F2 is close to F1. The reason for this is that the much louder F1 masks the F2, and this occurs for two reasons. Firstly, clients with audiograms of the type mentioned above often have reduced frequency selectivity. This increases the bandwidth of the affected auditory filters, with the result that the intense F1 peak is spread out across a much broader range of frequencies. Secondly, people with reduced frequency selectivity often exhibit auditory filter asymmetry. The most common asymmetric profile is a much greater spread at the high frequency end than at the low frequency end. This exacerbates the tendency of an intense F1 to mask a weaker F2 and can cause even F2s that are fairly distant from F1 to be masked.
Such people can often be assisted by giving a greater intensity boost around the F2 frequency region than around the F1 region. This reduces the relative auditory intensity of F1 and so reduces its tendency to mask F2. Hearing impaired clients with such an audiogram do not, however, show consistent patterns of either frequency selectivity or auditory filter asymmetry. Some have fairly good frequency selectivity; others have no filter asymmetry, or have asymmetry with a steep high frequency edge and a shallow low frequency edge. Clients such as these are unlikely to display upward spread of masking.
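The masking geometry described above can be sketched with a toy excitation pattern whose upper skirt is shallower than its lower skirt. The slope values and formant levels are purely illustrative, not measured data:

```python
import math

def masked_threshold_db(f_hz, masker_hz, masker_db,
                        lower_slope=40.0, upper_slope=10.0):
    """Toy excitation-pattern masking: the masker's excitation falls
    off at lower_slope dB/octave below the masker frequency and
    upper_slope dB/octave above it. A shallow upper slope models the
    asymmetric upward spread of masking; all values illustrative."""
    octaves = math.log2(f_hz / masker_hz)
    slope = upper_slope if octaves > 0 else lower_slope
    return masker_db - slope * abs(octaves)

# An intense 90 dB F1 at 500 Hz versus a 60 dB F2 at 1500 Hz:
# with a shallow (impaired) upper skirt F2 falls below the masked
# threshold; with a steeper (normal-like) skirt it remains audible.
print(masked_threshold_db(1500, 500, 90) > 60)                   # → True (F2 masked)
print(masked_threshold_db(1500, 500, 90, upper_slope=40) > 60)   # → False
```

The remedy discussed above falls out of the same sketch: boosting the F2 region (or attenuating F1) raises F2 relative to the masked threshold without changing the masking geometry.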
The final, and in some cases the most important, effect on the perception of spectral contrasts is the issue of audibility. For many hearing impaired people loss is so great at some frequencies that no safe level of amplification will make acoustic features at those frequencies audible. Inaudible acoustic features cannot contribute to the process of spectral contrast discrimination as the features have been effectively removed. The processes examined in this section assume that features in a particular frequency band are audible, at least with amplification.
Perception of Dynamic Cues
Speech is essentially a dynamically changing acoustic phenomenon. Steady state acoustic cues in speech are only relatively steady state, in that they last for up to about 300 ms rather than being of the order of, say, 10-50 ms.
Dynamic cues can obviously be affected by poor auditory acuity, but can also be affected by a poor frequency selectivity. Auditory temporal processing is reviewed by Mannell (1994, pp34-43).
Temporal integration refers to the temporal response of primary auditory neurons to a sudden burst of noise. In normal auditory systems the onset response is very quick, whilst the offset response decays slowly. This means that the offset response to a prior burst can mask the onset response to a following burst if the two are close enough together. This is known as forward masking. Backward masking (a following sound masking a preceding one) can also occur, but because the onset response of the auditory system to a burst is so brief, the following burst must be very close to the preceding sound in order to mask it. A gap between two noise (or tone) bursts will be obscured if the temporal offset response to the preceding sound overlaps sufficiently with the temporal onset response to the following sound. The gap detection threshold is a measure of the auditory system's ability to detect gaps in noise. In hearing impaired subjects with very poor gap detection thresholds, gaps as wide as typical stop occlusions in continuous speech may be obscured, and so word pairs such as "speed" and "seed" may not be correctly distinguished.
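The forward-masking account of gap detection can be sketched with a toy model in which the offset response decays exponentially and a gap is heard only once the decayed level has fallen some criterion amount below the burst level. The time constants and the 10 dB criterion are illustrative assumptions, not measured values:

```python
def gap_detected(gap_ms, decay_tau_ms, criterion_db=10.0):
    """Toy forward-masking model of gap detection. The offset response
    of the first burst decays as exp(-t/tau); the gap is 'heard' only
    if the level drops criterion_db below the burst level before the
    next onset. All parameter values are illustrative."""
    drop_db = 8.686 * gap_ms / decay_tau_ms  # dB drop of exp(-t/tau); 8.686 = 20/ln(10)
    return drop_db >= criterion_db

# A 50 ms stop-occlusion-like gap: heard with a fast (normal-like)
# offset decay, obscured when the offset response decays slowly.
print(gap_detected(50, decay_tau_ms=5))    # → True
print(gap_detected(50, decay_tau_ms=100))  # → False
```

Under these assumptions a slow offset decay obscures exactly the kind of silent interval that cues a stop occlusion, which is the "speed"/"seed" confusion described above.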
The distinction between sequences such as /be/ - /we/ - /ue/ can be perceived by the rapidity of the formant transitions (/be/ very fast, /we/ slower, /ue/ slowest) or by the rapidity of change of the amplitude envelope at release (again, /be/ very fast, /we/ slower, /ue/ slowest). Auditory temporal impairment, if severe enough, may cause the faster changes to be "smeared" in time and so to appear less distinct from the slower changes.
Formant transitions are important acoustic cues to the place of articulation of many consonants. Reduced auditory temporal acuity may cause a rapid transition to be mis-perceived. Perhaps surprisingly, frequency selectivity often has an even greater effect on the perception of dynamic cues than temporal acuity.
Formant transition perception can be affected by the broadened auditory bandwidths that are a consequence of reduced frequency selectivity. The formants can be so broadened that it becomes difficult to accurately identify their trajectories. Further, transitions may be difficult to perceive when adjacent formants fail to resolve because of spectral flattening. F2 and higher frequency F1 values also suffer from audibility problems for many hearing impaired subjects. F2 transitions may also be obscured by the upward spread of masking from F1. Finally, the increased phase sensitivity that accompanies reduced frequency selectivity may considerably degrade F1 tracking, because of a reduced ability to analyse the temporal firing patterns of auditory fibres.
Speech Perception in Noise
The speech perception of hearing impaired subjects is very often affected much more by noise than is the speech perception of normally hearing subjects. This is largely because of the two most important effects of reduced frequency selectivity. The first of these effects is the broadening and flattening of peaks as a consequence of broader and often asymmetric auditory filters. The second of these effects is related to the increased phase sensitivity that co-occurs with reduced frequency selectivity.
The flat, less detailed spectra of people with reduced frequency selectivity are already much less discriminable from each other than is the case for normally hearing people. Peaks that are still evident typically have a much reduced peak-to-valley distance, and these valleys are more likely to be filled by noise than would be the case for the auditory spectra of normally hearing people.
F0 discrimination is best when auditory filters are able to fully resolve the first few harmonics in voiced speech. Segregation of speech from noise is much easier in such auditory systems because it is possible to attend to the auditory filters which are centred over the appropriate harmonics and to ignore the auditory filters in between, which would contain relatively more noise. In an auditory system with reduced frequency selectivity, even low characteristic frequency auditory filters will have temporal information which is based on non-resolved adjacent harmonics. This temporal information is already "noisy", and analysis of this information is even more susceptible to noise masking than is the case for temporal information based on resolved harmonics. Further, segregation of speech from noise is extremely difficult because errors in F0 analysis prevent accurate selection of only those auditory filters which are centred over harmonics.
Normally hearing people rely on time and frequency gaps in noise to assist in the analysis of masked speech. Noise, being random, is not constant, and so at certain times the intensity of noise over the various bands of speech information randomly reduces to a level where the speech acoustic cues become relatively more prominent and so more analysable. Poor temporal and frequency acuity may have the effect of spreading the more intense components of the noise over adjacent regions of temporarily lower noise intensity (e.g. forward masking and upward spread of masking). The hearing impaired person is therefore less able to utilise these gaps in the segregation of speech from noise.
References
Darwin, C.J. & Gardner, R.B. (1986) "Mistuning a harmonic of a vowel: grouping and phase effects on vowel quality", J. Acoust. Soc. Am. 79, 838-845.
Mannell, R.H. (1994) The Perceptual and Auditory Implications of Parametric Scaling in Synthetic Speech, Ph.D. dissertation, Macquarie University (especially chapter 2).
Moore, B.C.J. (1995) Perceptual Consequences of Cochlear Damage, Oxford: Oxford University Press (especially chapter 7, "Speech perception by people with cochlear damage").
Moore, B.C.J., Glasberg, B.R. & Shailer, M.J. (1984) "Frequency and intensity difference limens for harmonics within complex tones", J. Acoust. Soc. Am. 75, 550-561.
Moore, B.C.J. & Glasberg, B.R. (1986) "The role of frequency selectivity in the perception of loudness, pitch and time", in Moore, B.C.J. (ed.) Frequency Selectivity and Hearing, London: Academic Press.
Pickett, J.M. & Martony, J. (1970) "Low-frequency vowel formant discrimination in hearing impaired listeners", J. Speech Hear. Res. 13, 347-359.
Rosen, S.M. & Fourcin, A.J. (1986) "Frequency selectivity and the perception of speech", in Moore, B.C.J. (ed.) Frequency Selectivity and Hearing, London: Academic Press.
Rosen, S.M. & Ball, V. (1986) "Speech perception with the Vienna extra-cochlear single-channel implant: a comparison of two approaches to speech coding", Br. J. Audiol. 20, 61-83.