
Department of Linguistics

Speech Perception
Background and Some Classic Theories

Robert Mannell


Speech perception research has not yet arrived at a single theory of speech perception that adequately explains all experimental observations and is clearly superior to all the other competing theories and models. None of the theories or models presented below can be considered the final word in speech perception. Some of them, however, undoubtedly provide insights into the process of speech perception.

The acoustic signal is itself very complex, possessing extreme inter-speaker and intra-speaker variability even when the sounds being compared are finally recognised by the listener as the same phoneme and occur in the same phonemic environment. Further, a phoneme's realisation varies dramatically as its phonemic environment varies. Speech is a continuous, unsegmented sequence, and yet each phoneme appears to be perceived as a discrete segmented entity. A single phoneme in a constant phonemic environment may vary in the cues present in the acoustic signal from one sample to another (eg. voiced stops may or may not have voicing during the occlusion). Also, one person's utterance of one phoneme may be acoustically identical to another person's utterance of a different phoneme, and yet both are correctly perceived. A theory of speech perception must explain how this extreme acoustic variability can result in perceptual phonemic constancy.

Speech Perception as Pattern Processing

The perception of speech involves the recognition of patterns in the acoustic signal in both time and frequency dimensions (domains). Such patterns are realised acoustically as changes in amplitude at each frequency over a period of time.

Figure 1. Serial processing

Most theories of pattern processing involve series, arrays or networks of binary decisions. In other words, at each step in the recognition process a yes/no decision is made as to whether the signal conforms to one of two perceptual categories. The decision thus made usually affects which following steps will be made in a series of decisions. If the decision steps are all part of a serial processing (figure 1) chain then a wrong decision at an early stage in the pattern recognition process may cause the wrong questions to be asked in subsequent steps (ie. each step may be under the influence of previous steps). Therefore, the earlier in the pattern recognition process that an error occurs the greater the chances of an incorrect decision. A serial processing system also requires a facility to store each decision (short-term memory?) so that all the decisions can be passed to the decision centre when all the steps have been completed. Clearly, extremely complex signal processing tasks, such as speech perception, could potentially require so many steps that the decision could not be reached quickly enough (ie. processing would not be in real time) and the next speech segment would have arrived before the one being processed was finished. Further, there is also the possibility in a long and complex task of the memory of earlier decisions fading and being distorted or lost.

Figure 2. Parallel processing

Because of all these problems with serial processing strategies, most speech perception theorists prefer at least some sort of parallel processing (figure 2). In parallel processing all questions are asked simultaneously (ie. all cues or features are examined at the same time) and so processing time is very short no matter how many features are examined. Since all tests are processed at the same time, there is no need for the short-term memory facility and further there is also no effect of early steps on following steps (ie. no step is under the influence of a preceding step).

Many theorists prefer a combination of both parallel and serial processing of auditory data. This might be in the form of a series of parallel processing banks (figure 3). Some theorists suggest that infants might start with purely serial processes and that as their knowledge of language improves parallel processes (which reflect that knowledge) may gradually take over. This might explain the slow speech response time of young children when compared to adults and suggests that part of the process of learning might involve the re-organisation of speech perception into increasingly more efficient parallel systems.

Figure 3. A series of parallel processing arrays
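
To make the contrast concrete, the following sketch (in Python) sets out a toy version of the two strategies. The cue names, feature values and category definitions are invented purely for illustration and are not drawn from any particular model: in the serial version each yes/no answer determines which question is asked next, whilst in the parallel version all cues are examined at once and the best-matching category is chosen.

```python
# A toy contrast between serial and parallel binary-decision strategies.
# Cue names, feature values and category definitions are invented for illustration.

cues = {"voiced": True, "nasal": False, "continuant": False}   # hypothetical cue values

def serial_decision(cues):
    """Each yes/no answer decides which question is asked next, so an early
    error would send the whole chain down the wrong branch."""
    if cues["nasal"]:
        return "nasal consonant"
    if cues["continuant"]:
        return "voiced fricative" if cues["voiced"] else "voiceless fricative"
    return "voiced stop" if cues["voiced"] else "voiceless stop"

def parallel_decision(cues):
    """All cues are examined at once and the best-matching category wins;
    no step depends on the outcome of an earlier step."""
    categories = {
        "voiced stop":      {"voiced": True,  "nasal": False, "continuant": False},
        "voiceless stop":   {"voiced": False, "nasal": False, "continuant": False},
        "nasal consonant":  {"voiced": True,  "nasal": True,  "continuant": False},
        "voiced fricative": {"voiced": True,  "nasal": False, "continuant": True},
    }
    score = lambda spec: sum(cues[f] == value for f, value in spec.items())
    return max(categories, key=lambda c: score(categories[c]))

print(serial_decision(cues), "|", parallel_decision(cues))   # voiced stop | voiced stop
```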

There are four major types of pattern recognition theory of relevance to speech perception (after Sanders, 1977).

  1. Template theories: where input is matched to one of a series of internal standard patterns or templates. The range of such a system can be extended by a process known as normalisation. Normalisation overcomes the need for a separate template for each speaker's production of each phoneme in each context, as it performs a transform on the input signal which causes the present speaker's speech to fit more neatly into the listener's template system (see the sketch following this list).
  2. Filtering theories: where information is passed through banks of perceptual filters to facilitate decoding.
  3. Feature detection theories: active selection of information utilising active neural units or detectors tuned to specific patterns.
  4. Analysis-by-synthesis theories: based on both internal rules and information gleaned from a crude analysis of the input signal, an expected or probable pattern is internally synthesised and then compared with the input signal. If the match is not close enough the synthesised pattern is modified until an acceptable match is achieved.
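
As a concrete (and deliberately simplified) illustration of the first of these types, the sketch below matches a pair of formant values against stored vowel templates after a crude normalisation step. The templates, the formant values and the idea of dividing by a per-speaker reference frequency are all hypothetical choices made for this example, not a claim about how human normalisation actually works.

```python
import math

# A toy template-matching system with speaker normalisation (type 1 above).
# Vowel templates, formant values and the per-speaker reference frequency
# are hypothetical; real systems would use far richer representations.

templates = {                 # speaker-independent (normalised) F1/F2 targets
    "i": (0.6, 1.9),
    "a": (1.6, 1.1),
    "u": (0.7, 0.8),
}

def normalise(f1_hz, f2_hz, reference_hz):
    """Crude normalisation: express formants relative to a per-speaker
    reference frequency so one template set serves many speakers."""
    return f1_hz / reference_hz, f2_hz / reference_hz

def match(f1_hz, f2_hz, reference_hz):
    nf1, nf2 = normalise(f1_hz, f2_hz, reference_hz)
    distance = lambda t: math.hypot(nf1 - t[0], nf2 - t[1])
    return min(templates, key=lambda vowel: distance(templates[vowel]))

# The "same" vowel from two very different speakers maps to the same template
print(match(300, 950, 500))    # 'i' (hypothetical adult values)
print(match(390, 1235, 650))   # 'i' (hypothetical child values)
```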

Further, speech perception theories can be considered to be of two types, or a combination of both (after Sanders, 1977). They are:

  1. Passive or Non-mediated theories. These theories are based on the assumption that there is some sort of direct relationship between the acoustic signal and the perceived phoneme. In other words, perceptual constancy is in some way matched to a real acoustic constancy. These theories tend to concentrate on discovering the identity of such constant perceptual cues and on the ways the auditory system might extract them from the acoustic signal. In one way or another, these theories are basically filtering theories and do not involve the mediation of higher cognitive processes in the extraction of these cues. These higher processes are restricted to making a decision based on the features or cues which have been detected or extracted closer to the periphery of the auditory system.
  2. Active or mediated theories. These theories, on the other hand, suggest that there is no direct relationship between the acoustic signal and the perceived phoneme but rather that some higher level mediation is involved in which the input pattern is compared with an internally generated pattern.

In practice, however, most theorists concede the possibility that speech perception may operate as a combination of both active and passive processes, and some suggest that they may even be alternative methods of perception, each operating under certain conditions.

Passive (Non-Mediated) Theories

1) Distinctive Feature Theory

The notion of distinctive features has a long history within phonological theory, but most of the early work is of little use in a serious study of speech perception as the proposed features bear little relation to the actual acoustic signal. Jakobson and Halle (1956) and Jakobson, Fant and Halle (1952) proposed a set of distinctive features which combined both acoustic and articulatory features and which could be used as part of a binary code system of yes/no decisions to allow the identification of speech sounds at a phonemic level. The number and nature of the distinctive features would vary from language to language, taking account of the speech sounds used distinctively in each language. The number of distinctive features would be at least sufficient to allow the separation of all phonemes in the language. Such a system would include features such as vocalic/non-vocalic, consonantal/non-consonantal, compact/diffuse, grave/acute, flat/plain, nasal/oral, tense/lax, continuant/interrupted, strident/mellow.

It is clear from the above list that not all of these features correspond closely with actual acoustic features as visible in a speech spectrogram. The interrupted member of the continuant/interrupted pair could be detected acoustically as a break in the flow of speech, which the listener could interpret as a stop or affricate. This would of course need to be distinguished from a pause. The nasal member of the nasal/oral pair could be detected by looking for the low frequency nasal resonance peak or for the nasal anti-resonance at 1000 Hz. Vocalic segments might be distinguished from non-vocalic segments by their greater amplitude. Tense consonant segments might be distinguished from lax consonant segments by their gentler spectral slope (ie. higher intensity high frequency components), but this would have to take into account the resonance patterns of each of the segments. For some languages, the tense category consists of the voiceless consonants, and it seems likely that it might be more profitable to examine the signal for voicing. This would of course be complicated by considerations of VOT and adjacent vowel length, etc. On the other hand, some languages are described as having tense vowels. In some cases tense vowels are more peripheral (closer to [i], [u] or [a]) than lax vowels. In some languages tense and lax vowels are mostly distinguished by their duration. The tense/lax distinction may have clear acoustic correlates in some languages, but the terms appear to mean different things in different languages. Finally, there is no really clear relationship between many of the other pairs of features and any readily identifiable features in the acoustic signal.
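
The binary, yes/no character of such a feature system can be illustrated with a small sketch. The feature assignments below are simplified and purely illustrative (real feature systems are larger and language-specific); the point is simply that a handful of binary decisions is enough to separate a set of phonemes.

```python
# A toy binary distinctive-feature identifier. The feature assignments are
# simplified and purely illustrative, not a full Jakobsonian feature set.

features = {
    "m": {"nasal": True,  "continuant": False, "voiced": True},
    "b": {"nasal": False, "continuant": False, "voiced": True},
    "p": {"nasal": False, "continuant": False, "voiced": False},
    "v": {"nasal": False, "continuant": True,  "voiced": True},
}

def identify(decisions):
    """Return every phoneme whose feature values agree with the yes/no
    decisions made about the signal."""
    return [p for p, spec in features.items()
            if all(spec[f] == value for f, value in decisions.items())]

# Three binary decisions are sufficient to separate these four phonemes
print(identify({"nasal": False, "continuant": False, "voiced": True}))   # ['b']
```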

2) The acoustic theory

Fant (1960, 1962, 1967) links both the production and perception of speech to the concept of distinctive features discussed earlier. This model is based on the source/filter model developed by Fant to explain production acoustics. The distinctive features are largely production based and are encoded onto the acoustic wave and then subsequently internalised by the listener as articulatory maps in the auditory system. It is important to note that this distinctive feature information is seen as being encoded directly and passively at the periphery of the auditory system.

3) Selfridge's (1959) Pandemonium model

This model, which was originally designed for character recognition in reading, is a feature detection model expressed in a rather bizarre terminology but nevertheless worthy of serious attention. This is a multi-level model with processing at each level being carried out in parallel. At each level a number of "demons" carry out work appropriate to that level and then pass the results of their work to the next higher level. At the lowest level a set of data demons store a copy of the input pattern presumably in a neural analogue form. At the next level computational demons analyse this data and extract frequency and amplitude parameters relevant to their particular function. These parameters are then passed up to the next level again presumably in an analogue form. At the next level are a number of narcissistic cognitive demons which shriek proportionally to the degree to which they see a reflection of themselves in the input pattern. At the highest level a decision demon is faced with pandemonium. The decision demon's task is to attend to all the shrieking cognitive demons, determine which one is shrieking the loudest and accept that one as the correct response.

This model can be restated in a more sober form as follows. The data demons are presumably short term memory. They are linked by deliberately vague linkages to the next level allowing for complex neural interconnections. The cognitive demons can be either template storage or feature detectors which can be tuned by weighting the value of each demon's shrieks and by rearranging the relationship between them and the decision demon. That is, some cognitive demons may have naturally louder voices than others and they may be more or less closely linked to the decision demon. These weightings would be based on long term experience (linguistic knowledge) and on short term knowledge (selective attention based on immediately preceding data). The decision demon would correspond to a superordinate awareness.
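
The following sketch restates this "sober form" in code. The stored patterns, the similarity measure and the weights are all invented for illustration; they merely show the core mechanism of weighted parallel detectors feeding a single decision stage.

```python
# A toy Pandemonium: each "cognitive demon" shrieks in proportion to how well
# it sees itself in the input, scaled by an experience-based weight, and the
# decision demon picks the loudest. All patterns and weights are invented.

input_pattern = [0.9, 0.1, 0.8, 0.2]      # toy output of the computational demons

cognitive_demons = {
    # phoneme: (stored pattern, weight = natural "loudness of voice")
    "s":  ([0.9, 0.1, 0.9, 0.1], 1.0),
    "f":  ([0.7, 0.2, 0.6, 0.3], 0.8),
    "sh": ([0.2, 0.9, 0.1, 0.8], 1.2),
}

def shriek(pattern, weight):
    """Loudness = weighted similarity between the demon's pattern and the input."""
    similarity = 1.0 - sum(abs(a - b) for a, b in zip(input_pattern, pattern)) / len(pattern)
    return weight * similarity

# the decision demon attends to all the shrieking and accepts the loudest demon
decision = max(cognitive_demons, key=lambda d: shriek(*cognitive_demons[d]))
print(decision)   # 's'
```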

4) Uttley (1959, 1966)

Uttley's model is a modification of the feature detection approach adopted by Selfridge. This model proposes a hierarchy of feature detection units increasingly more remote from the original signal. This is basically a system in which serial processing paths can be traced upwards through a network of parallel processing arrays. The input to each level of feature detection units is the output of the previous level. Each unit can only fire when a specific complex of lower units has fired before it. In other words, the lowest level of units fires when a particular acoustic pattern is detected, and the next level of units will only fire when a specific pattern of basic detection units has fired, and so on up through the network. This model is based on the notion that peripheral neurons have a high degree of specialisation whilst higher level neurons do not and must therefore respond to multiple inputs. Some of the inputs to higher levels may actually be the delayed output of lower levels processed at some prior point in time. Such delays are essential if the system is to be capable of detecting the temporal aspects of the signal (VOT, formant transitions, etc.).
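
A toy version of this idea is sketched below. The unit names, the frame features and the particular conjunction used are hypothetical; the sketch simply shows a higher-level unit that fires only when a specific lower-level unit fired in an earlier frame and another fires in the current frame, i.e. the detection of a temporal pattern by means of delayed lower-level outputs.

```python
# A toy Uttley-style hierarchy: a higher-level unit fires only when a specific
# combination of lower-level units has fired, one of which is a *delayed*
# output from an earlier frame, so that a temporal pattern can be detected.
# The unit names and frame features are invented for illustration.

frames = [                                   # toy sequence of short analysis frames
    {"silence": True,  "burst": False},
    {"silence": True,  "burst": False},
    {"silence": False, "burst": True},
]

def stop_like_event(frames):
    delayed = {"silence": False, "burst": False}   # delayed lower-level outputs
    for frame in frames:
        # higher-level unit: fires only if the silence detector fired in the
        # previous frame AND the burst detector fires in the current frame
        if delayed["silence"] and frame["burst"]:
            return True
        delayed = dict(frame)
    return False

print(stop_like_event(frames))   # True: an occlusion followed by a release burst
```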

5) Abbs and Sussman (1971)

This model, like other neurological models, postulates the existence of specialised groups of nerve and receptor cells. These detectors must be able to respond to spatial-temporal changes in the signal and must also have a wide tolerance to the large variation known to exist in acoustic signals. Such patterns may be made clearer by a neural feedback mechanism which selectively inhibits nerve cells adjacent to or in some way related to the neuron(s) which have the greatest rate of firing. Such inhibition may not be instantaneous but may allow the neural response at one point in time to direct attention to related following events by such selective inhibition. Further, the travelling wave on the basilar membrane does not propagate instantaneously and so a certain temporal span is available at any moment for direct analysis.
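
The selective inhibition idea can be illustrated with a very small sketch. The firing rates and the inhibition strength are invented; the sketch only shows how suppressing the neighbours of the most active unit sharpens a broad excitation pattern into a clearer peak.

```python
# A toy version of selective (lateral) inhibition: units adjacent to the most
# active unit are suppressed, sharpening a broad excitation pattern into a
# clearer peak. Firing rates and inhibition strength are invented.

rates = [10, 30, 55, 60, 40, 15]          # toy firing rates along a frequency axis

def laterally_inhibit(rates, strength=0.4):
    peak = rates.index(max(rates))
    sharpened = []
    for i, rate in enumerate(rates):
        if i in (peak - 1, peak + 1):     # neighbours of the most active unit
            sharpened.append(rate * (1 - strength))
        else:
            sharpened.append(rate)
    return sharpened

print(laterally_inhibit(rates))           # [10, 30, 33.0, 60, 24.0, 15]
```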

This model assumes that speech sounds and non-speech sounds must be processed differently. This idea is supported by the observation that speech sounds can be processed more rapidly than can non-speech sounds. For example, listeners can remember accurately the order of rapidly presented short-duration (70-80 msecs) speech sounds but cannot remember the order of presentation of non-speech sounds even when they have been increased in length to 200 msecs (Warren et al, 1969). Studies of categorical perception have also lent support to the notion of a special mode of perception for speech sounds. Categorical perception is the inability to discriminate between two sounds which belong to a particular (linguistic) category whilst being able to clearly distinguish between two sounds differing by exactly the same degree as the first pair but belonging on either side of the category boundary. This means that some speech sounds (eg. stops) which can be produced synthetically as a continuously changing series of tokens are not perceived as a continuum but merely as belonging to two or more distinct categories. This is in contrast to other speech sounds (vowels) and to non-speech sounds which can be perceived as a continuum even if they are placed into classes.
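
Categorical perception can be summarised in a few lines of code. The 25 ms voicing boundary used below is a rough illustrative figure, not a measured value; the essential point is that two stimuli separated by the same physical step are discriminable only when they receive different category labels.

```python
# A toy model of categorical perception. The 25 ms VOT boundary is a rough
# illustrative figure only. Two stimuli a fixed physical step apart are
# discriminable only when they fall on opposite sides of the boundary.

BOUNDARY_MS = 25                          # hypothetical voiced/voiceless boundary

def identify(vot_ms):
    return "voiceless" if vot_ms >= BOUNDARY_MS else "voiced"

def discriminable(vot_a_ms, vot_b_ms):
    """A purely categorical listener can tell a pair apart only when the two
    tokens receive different labels."""
    return identify(vot_a_ms) != identify(vot_b_ms)

print(discriminable(10, 20))   # False: both heard as voiced, despite a 10 ms step
print(discriminable(20, 30))   # True:  the same 10 ms step, but it crosses the boundary
```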

Active (Mediated) Theories

1) The Motor Theory

This theory, developed by Liberman and colleagues, has as its basic premise the notion that speech is perceived in terms of the place and manner of production of the acoustic signal. What a listener does, according to this theory, is to refer the incoming signal back to the articulatory instructions that the listener would give to the articulators in order to produce the same sequence. One of the reasons for this approach is the belief that it would seem wasteful for a speaker/listener to possess two equally important and equally complex processes for language coding and language decoding. At some point the two systems must merge into a unified system but most theorists would place this point above the phoneme production/perception level. The motor theory argues that the level of articulatory or motor commands is analogous to the perceptual process of phoneme perception and that a large part of both the descending pathway (phoneme to articulation) and the ascending pathway (acoustic signal to phoneme identification) overlap. The two processes represent merely two way traffic on the same neural pathways. The motor theory points out that there is a great deal of variance in the acoustic signal and that the most peripheral step in the speech chain which possesses a high degree of invariance is at the level of the motor commands to the articulatory organs. The encoding of this invariant linguistic information is articulatory and so the decoding process in the auditory system must at least be analogous. The motor theorists propose that there exists an
"...overlapping activity of several neural networks - those that supply control signals to the articulators, and those that process incoming neural patterns from the ear..." and "... that information can be correlated by these networks and passed through them in either direction." (Liberman et al, 1967)

It is clear that good production depends upon normal perception. The pre-lingually deaf are clearly at a disadvantage in the learning of speech. Since normal hearers are both transmitters and perceivers of speech, they constantly perceive their own speech as they utter it, and on the basis of this auditory feedback they instantly correct any slips or errors. Infants learn speech articulation by constantly repeating their articulations and listening to the sounds produced. This continuous feedback loop eventually results in the perfection of the articulation process. Early in life infants go through a babbling stage where the content of their articulations bears no relationship to the language of their elders. They can and do produce all the speech sounds of their language as well as many sounds which do not form part of the phonemic structure of their language and which they will be unable to reproduce later in life. At a later stage they auditorily compare their articulations with those of their elders and eventually weed out the sounds which do not belong to their parents' language. People who become deaf post-lingually are deprived of the constant auditory feedback of their own speech, and their speech gradually deteriorates. These facts indicate a joining of the speech production and perception pathways at some point, but do not in themselves directly support the notion that the two pathways are identical below the level of the phoneme. They do not, however, contradict the assertion of the Motor theorists that the act of perceiving someone else's speech is analogous to the act of producing and listening to one's own speech.

This theory suggests that there exists a special speech code (or set of rules) which is specific to speech and which bridges the gap between the acoustic data and the highly abstract higher linguistic levels. Such a speech code is unnecessary in passive theories, where each speech segment would need to be represented by a discrete pattern (eg. a template) somehow coded into the auditory system at some point. The advantage of a speech code or rule set is that there is no need for a vast storage of templates, since the input signal is converted into a linguistic entity using those rules. The rules achieve their task by a drastic restructuring of the input signal. The acoustic signal does not in itself contain phonemes which can be extracted from the speech signal (as suggested by passive theories), but rather contains cues or features which can be used in conjunction with the rules to permit the recovery of the phoneme, which last existed as a phonemic entity at some point in the neuromuscular events which led to its articulation. This, it is argued, is made evident by the fact that speech can be processed 10 times faster than non-speech sounds. Speech perception is seen as a distinct form of perception quite separate from that of non-speech sound perception. Speech is perceived by means of categorical boundaries whilst non-speech sounds are tracked continuously (see the discussion on categorical perception in the section on Abbs and Sussman, 1971, above). Like the proponents of the neurological theories (such as Abbs and Sussman, 1971), the motor theorists believe that speech is perceived by means of a special mode, but they believe that this is not based directly on the recognition of phonemes embedded in the acoustic signal but rather on the gating of phonetically processed signals into specialised neural units. Before this gating, both phonetic and non-phonetic processing have been performed in parallel, and the non-phonetic processing is abandoned when the signal is identified as speech.

Speech is received serially by the ear and yet must be processed in some sort of parallel fashion, since not all cues for a single phoneme coexist at the same point in time and the boundaries of the cues do not correspond to any phonemic boundaries. For example, a voiced stop requires several cues to enable the listener to distinguish it. VOT (voice onset time) is examined to enable a voiced/voiceless decision, but this is essentially a temporal cue and can only be measured relative to the position of the release burst. The occlusion itself is a necessary cue for stops in general in a VCV environment, and yet it does not coexist in time with any other cue (except perhaps voicing when VOT is negative). The burst is another general cue for stops, whilst the following aspiration (if present) contains a certain amount of acoustic information to help in the identification of the stop's place of articulation and further helps in the identification of positive VOT and thus of voiceless stops. The main cues to the stop's place of articulation are the formant transitions into the vowel, and these in no way co-occur with the remainder of the stop's cues. This folding over of information onto adjacent segments, which we know as coarticulation, far from making the process of speech perception more confusing, actually helps in the determination of the temporal order of the individual speech segments (so argue proponents of the Motor theory), as it permits the parallel transmission via the acoustic signal of more than one segment at a time.

Early versions of the motor theory required the existence of a specialised mode of categorical speech perception in order to work. One of the cues which was at one time seen to be perceived categorically was the voice onset time (VOT) of stops. A continuum of VOTs was perceived by English speakers to belong to either the voiced or the voiceless category. Speakers of languages with a third category of stops, unaspirated, perceived the points on the VOT continuum as belonging to one of three categories. This was seen as strong evidence of a special speech mode of perception. Pisoni (1977), Carney et al. (1977) and others carried out a series of experiments which showed that VOT perception may actually be conditioned by the temporal resolution properties of the auditory system itself and not be related to any special mode of perception. The same categories were perceived by pre-linguistic infants and by chinchillas. Further, synthetic non-speech sounds were also perceived in an apparently categorical manner. If one example of so-called categorical perception can be seen to rely on the physical responses of the peripheral auditory system, then is this also the case for other posited examples of a special speech mode of perception?

2) Analysis-by-synthesis model

Stevens and Halle (1967) have postulated that
"... the perception of speech involves the internal synthesis of patterns according to certain rules, and a matching of these internally generated patterns against the pattern under analysis. ..moreover, ...the generative rules in the perception of speech [are] in large measure identical to those utilized in speech production, and that fundamental to both processes [is] an abstract representation of the speech event."

In this model the incoming acoustic signal is subjected to an initial analysis at the periphery of the auditory system. This information is then passed upward to a master control unit and is there processed along with certain contextual constraints derived from preceding segments. This produces a hypothesised abstract representation defined in terms of a set of generative rules. This is then used to generate motor commands, but during speech perception articulation is inhibited and instead the commands produce a hypothetical auditory pattern, which is then passed to a comparator module that compares it with the original signal held in a temporary store. If a mismatch occurs, the procedure is repeated until a suitable match is found (see figure 4).

Figure 4. Analysis-by-synthesis Model (after Stevens, 1972)
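
The loop in figure 4 can be sketched very roughly as follows. The candidate segments, their synthesised "auditory patterns" and the mismatch tolerance are all invented for illustration; the sketch only captures the hypothesise-synthesise-compare-revise cycle, not the actual generative rules.

```python
# A toy analysis-by-synthesis loop: hypothesise a segment, synthesise the
# auditory pattern it would produce, compare it with the stored input and
# revise until the comparator accepts the match. The candidate segments,
# patterns and tolerance are invented for illustration.

stored_input = [0.2, 0.9, 0.4]            # temporary store of the analysed signal

candidates = {                             # abstract representations and the toy
    "ba": [0.8, 0.3, 0.5],                 # patterns the generative rules would
    "da": [0.2, 0.9, 0.4],                 # synthesise for them
    "ga": [0.1, 0.6, 0.9],
}

def mismatch(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def analysis_by_synthesis(stored_input, candidates, tolerance=0.3):
    # a preliminary peripheral analysis plus context would normally order the
    # candidates; here they are simply tried in turn until one is accepted
    for segment, pattern in candidates.items():
        if mismatch(pattern, stored_input) <= tolerance:    # comparator module
            return segment
    return None                            # no acceptable match: revise the hypothesis set

print(analysis_by_synthesis(stored_input, candidates))      # 'da'
```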

References

Abbs, J.H. & Sussman, H.M., 1971, "Neurophysiological feature detectors and speech perception: A discussion of theoretical implications", J. Speech and Hearing Res., 14, 23-36.

Carney, A.E., Widin, G.P., & Viemeister, N.F., 1977, "Noncategorical perception of stop consonants differing in VOT", J. Acoust. Soc. Am., 62(6), 961-970.

Fant, G., 1960, Acoustic Theory of Speech Production, Mouton.

Fant, G., 1962, "Descriptive analysis of the acoustic aspects of speech", Logos, 5, 3-17.

Fant, G., 1967, "Auditory patterns of speech", in Models for the Perception of Speech and Visual Form, ed. W. Wathen-Dunn, Cambridge, Mass.: MIT Press.

Jakobson, R., Fant, G. & Halle, M., 1952, Preliminaries to Speech Analysis: The Distinctive Features and their Correlates, Cambridge, Mass.: MIT Press. (MIT Acoustics Laboratory Technical Report 13.)

Jakobson, R. & Halle, M., 1956, Fundamentals of Language, The Hague: Mouton.

Liberman, A.M., 1970, "The grammars of speech and language", Cognitive Psychology, 1, 301-323.

Liberman, A.M., Mattingly, I.G., & Turvey, M.T., 1967, "Language codes and memory codes", in Coding Processes in Human Memory, eds. Melton, A.W., & Martin, E., Washington: V.H. Winston.

Pisoni, D.B., 1977, "Identification and discrimination of the relative onset of two component tones: Implications for voicing perception in stops", J. Acoust. Soc. Am., 61(5), 1352-1361.

Sanders, D.A., 1977, Auditory Perception of Speech: An introduction to principles and problems, Prentice-Hall, London.

Selfridge, O.G., 1959, "Pandemonium: a paradigm for learning", in D.V. Blake & A.M. Uttley, eds., Proceedings of the Symposium on Mechanisation of Thought Processes, pp. 511-529, London: H.M. Stationery Office.

Stevens, K.N., 1972, "Segments features and analysis-by-synthesis", in Language by Ear and Eye, eds. J.F. Kavanagh & I.G. Mattingly, Cambridge, Mass.: MIT Press.

Stevens, K.N. & Halle, M., 1967, Remarks on analysis-by-synthesis and distinctive features", in Models for the Perception of Speech and Visual Form, ed. W. Wathen-Dunn, Cambridge, Mass.: MIT Press.

Uttley, A.M., 1959, "Conditional probability computing in the nervous system", in D.V. Blake & A.M. Uttley, eds., Proceedings of the Symposium on Mechanisation of Thought Processes, London: H.M. Stationery Office.

Uttley, A.M., 1966, "The transmission of information and the effects of local feedback in theoretical and neural networks", Brain Research, 2, 21-50.

Warren, R.M., Obusek, C.J., Farmer, R.M., & Warren, R.T., 1969, "Auditory Sequence: Confusions of patterns other than speech and music", Science, 164, 586-587.