Department of Linguistics

The Perceptual and Auditory Implications of Parametric Scaling in Synthetic Speech

Robert Mannell

Chapter 2: Auditory Processing and Speech Perception

Original Article: Mannell, R.H. (1994), The perceptual and auditory implications of parametric scaling in synthetic speech, Unpublished Ph.D. dissertation, Macquarie University (Chapter 2)

2.0 Introduction

Before being processed linguistically, speech sounds must pass through the auditory system where the perceptually-salient cues or features present in the acoustic signal are transformed in various, mostly non-linear, ways. This chapter examines, from psychoacoustic, physiological and phonetic perspectives, the auditory system and its non-linear transduction of the acoustic dimensions of frequency, time, intensity and phase. The goal of this chapter is to give an overview of what is currently known about the auditory transduction of sounds in general and of speech in particular with a view to determining the nature of the representations of speech that are passed to phonetic and linguistic centres of the brain. A particular focus of this chapter is an examination of the various auditory scales that each describe some aspect of auditory transduction with the goal of determining which of those scales represent the most phonetically meaningful representations of speech across the relevant acoustic/auditory dimensions.

This chapter commences with an examination of auditory physiology and theories of hearing, focussing especially on the mechanical and neural transduction of sound. This is followed by an examination of psychophysical approaches to auditory perception and a discussion of the potential relevance of psychophysics to the phonetic categorisation of speech. There is also an examination of various measurements of neural representations of sound and of speech from the auditory nerve to the auditory cortex, again with an emphasis upon the representations of speech that may be presented to hypothesised phonetic processing centre(s) in the brain. These possible representations are then compared to various auditory models of speech processing as well as to the results of studies that examine the perception of parametrically scaled speech or the relationship between auditorily-modelled spectral distance measures and speech perception.

"[T]he crucial thing is to determine the extent to which general auditory processes can account for the phenomena observed in speech [perception], and where they are inadequate." (Rosen & Fourcin, 1986, p439)

2.1 Auditory Physiology and Theories of Hearing

The ear is a non-linear transducer of sound. This non-linearity causes time, phase, frequency and amplitude distortions during the transduction of sound from acoustic energy in the air to electro-chemical energy in the auditory nerve. For more than 50 years, the linear display of the speech spectrogram has given a distorted picture of the speech signal. A linear display of the physical dimensions of time, intensity and frequency is a true representation of the physical signal, but it is a very misleading representation of the type of information that is presented to higher processing centres after passing through the peripheral auditory system.

"As far as speech perception research is concerned, it is not inconceivable that the sound spectrograph has had an overall detrimental influence over the last 40 years by emphasising aspects of speech spectra that are probably not direct perceptual cues (and in some cases may not even be resolved by the ear)." (Klatt, 1982b)

When sound enters the outer ear it is affected by the resonances of the pinna (the visible outer ear), concha (funnel-like opening to the outer ear canal), and external auditory meatus (outer ear canal). The main effect of these resonances is to produce a broad peak of 15-20 dB at around 2500 Hz, spreading relatively uniformly from 2000-7000 Hz (Pickles, 1988). This has the effect of amplifying the sound pressure of the mid-frequencies relative to the low and high frequencies. (1) Another important effect is the pattern of directionally sensitive interferences in the pinna and the concha which, when coupled with binaural phase differences, are largely responsible for our sense of directionality in hearing. These interference patterns also modify the acoustic signal. The function of the middle ear is to overcome the normal effect of a transfer of energy from a low impedance medium (air) to a high impedance medium (fluid), which would normally result in the reflection of a great deal of the acoustic wave. This would result in about a 30 dB loss in sound pressure (Glattke, 1973), but this is overcome by the middle ear which "...increases sound pressure by approximately 30 dB." (Borden & Harris, 1980, p165). This impedance transformer action results from a combination of middle ear ossicle lever effects and (more importantly) the ratio of about 17:1 (Békésy, 1960, cited in Glattke, 1973) between the surface areas of the tympanic membrane and the oval window. What is of particular relevance to the present discussion of the non-linear transduction of the ear is the finding that the pressure gain transfer function of the middle ear is not uniform but shows a peak at 1000 Hz and gradually drops off to about 20 dB below peak level at 100 Hz and 10,000 Hz (Nedzelnitsky, 1980). The transfer function is relatively flat, however, over most of the frequency range containing speech cues (<10 dB variation from 300 Hz to 7000 Hz).
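The approximate size of this impedance-matching gain can be checked with a little arithmetic. The sketch below assumes the 17:1 area ratio cited above together with an ossicular lever ratio of roughly 1.3, a commonly quoted round figure that is an assumption here, not a value from the sources above:

```python
import math

# Illustrative middle-ear pressure gain estimate.
area_ratio = 17.0    # tympanic membrane : oval window area (Bekesy, 1960)
lever_ratio = 1.3    # assumed ossicular lever advantage (hypothetical round figure)

# Pressure is multiplied by both ratios; express the combined gain in decibels.
pressure_gain = area_ratio * lever_ratio
gain_db = 20.0 * math.log10(pressure_gain)

print(round(gain_db, 1))  # roughly 27 dB, close to the ~30 dB loss it must offset
```

The area ratio alone accounts for about 24.6 dB, so on these assumptions the lever effect contributes only a couple of decibels, consistent with the text's remark that the area ratio is the more important factor.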

The fluid displacement generated at the oval window propagates almost instantaneously throughout the cochlea, travelling up the scala vestibuli to the cochlear apex and then back down the parallel scala tympani to the round window, a membrane which relieves the pressure by bulging into the middle ear. The round window is necessary as the fluid in the inner ear is "...virtually incompressible..." (Gulick, 1971, p34) and so the round window provides a pressure relief point. The movements of the oval and round windows are therefore reciprocal (Gulick, ibid.). This fluid movement generates a travelling wave of displacement on the basilar membrane. The mechanics of this travelling wave form the basis of the place principle of frequency perception (see below). This travelling wave is communicated to the completely enclosed third parallel passage, the scala media or cochlear duct, by causing movements in either of the two membranes which separate it from the scala vestibuli and the scala tympani. The cochlear duct is separated from the scala vestibuli by the very thin Reissner's membrane and from the scala tympani by the basilar membrane (2). The organ of Corti, the ultimate organ of hearing, is found in the cochlear duct and rests on the basilar membrane. The most prominent feature of the organ of Corti is the arch of Corti, adjacent arches of which form a tunnel extending the entire length of the basilar membrane. This arch divides the organ of Corti into an inner and an outer region, each with its own set of hair cells. These hair cells are supported (via certain specialised cells) by the basilar membrane on one side and by the very fine reticular lamina on the other side. Extending from each hair cell, beyond the reticular lamina, are about 40 to 60 cilia which give the hair cells their name.
These cilia of the outer hair cells are "...shallowly but firmly embedded in the under surface of the tectorial membrane" (Pickles, 1988, p 30, citing Engström & Engström, 1979) which, unlike the basilar membrane, is only firmly attached at the one end, to a structure on the medial side of the arch of Corti. The inner hair cells "... are probably not embedded ..." (Pickles, 1988, p30) in the tectorial membrane and form a single row which runs from the base to the apex of the cochlea and number approximately 3,400 (Gulick, 1971). The outer hair cells are in three to five rows (the number increases closer to the cochlear apex) and number about 12,000 (Ulehlova et al., 1987, Pickles, 1988). "The hair cells play a critical role in the transduction of acoustic energy into a graded electrical trigger potential which initiates neural impulses in the auditory nerve" (Gulick, 1971, p41). In order for hearing to occur, the pressure waves must in some way be communicated to the hair cells. The travelling wave on the basilar membrane " ... begins at the basal end and travels toward the apex, increasing its amplitude until it reaches a particular point [which depends on the wave's frequency]. The amplitude of the wave then falls off sharply... . The point of maximum displacement along the basilar membrane is a continuous function of the frequency of the stimulus, with the higher frequencies stimulating the basal region and the lower frequencies stimulating the more apical portions" (Sanders, 1977). At the same time, the pressure wave displaces the tectorial membrane. Because of the different manner of attachment of the two membranes they are caused to slide past each other and the resulting shearing motion causes the cilia on the hair cells to be bent. This bending directly triggers the production of a cochlear potential.

Humans have approximately 30,000 afferent auditory nerve fibres as well as about 1,800 centrifugal or efferent fibres (Pickles, 1988). About 90-95% of the afferent, or spiral ganglion, fibres are connected to the inner hair cells with only the remaining 5-10% connecting with the outer hair cells (Spoendlin, 1972, cited in Pickles, 1988). There are two very distinct types of afferent spiral ganglion fibres, with type-I (bipolar, myelinated) innervating the inner hair cells and type-II (monopolar, non-myelinated) innervating the outer hair cells (ibid). About 20 afferent fibres innervate each inner hair cell and about 6 afferent fibres innervate each outer hair cell; further, each type-I fibre innervates only one inner hair cell, whilst each type-II fibre innervates about 10 outer hair cells (Pickles, 1988). About 800 efferent fibres terminate at the outer hair cells and about 1000 terminate in the region of (but not contacting) the inner hair cells (Warr, 1978, cited by Pickles, 1988). These two sets of efferent fibres are also morphologically distinct, originating in different parts of the superior olivary complex in the brain stem, suggesting that they are functionally distinct (Warr, 1975, cited by Pickles, 1988).

Modern theories of hearing combine two classical notions, firing frequency and place of stimulation, which date back to the last century (see Wever, 1949, for a review of early theories). The frequency principle assumes that the entire time domain signal is coded in the firing rate of the nerve fibres and that frequency analysis will be performed at a higher level. This is only possible, however, up to about 300-500 Hz which is the maximum firing rate of the fibres. To overcome this, it was assumed that if several phase-locked fibres fired in a volley, they would together be able to encode much higher frequencies. It has been found, however, that phase locking can only maintain synchrony up to about 1000 Hz (Russell & Sellick, 1983, cited in Pickles, 1986), although a less accurate picture can be given up to about 5000 Hz (Pickles, 1988). The place principle, on the other hand, takes into account the property of the travelling wave which moves up the basilar membrane to peak in amplitude at a position on the membrane which is related to the frequency of the stimulus. The frequency principle is now assumed to handle frequencies up to about 3000-5000 Hz whilst the place principle handles frequencies from perhaps as low as 200-500 Hz up to 20000 Hz. Thus, at low frequencies the two principles overlap. Javel (1984) stated that "... the auditory system uses both place and time mechanisms to extract information about stimulus frequency. Place-coded information is best at low intensities, and is lost at high intensities." This intensity relationship refers to the increased width of the activated area with increased intensity causing lower frequency resolution. The time mechanisms that he refers to are a function of the "...up-and-down motions of the basilar membrane..."; he adds that "...time coding of frequency is derived from the fact that auditory receptor cells are activated only when the basilar membrane is moving upward."
This has the effect of phase locking the responses of adjacent nerve fibres. "Phase-locking persists to 3-4 kHz, thereby allowing specific frequency information to exist when place-coding is poor" (ibid). Clearly, when place-coding is poorest, that is when intensity is greatest, the up-and-down movement of the basilar membrane will be greatest and so the time mechanism will be at its strongest.

Eddington et al (1978, cited in Klinke, 1979) examined multi-channel cochlear implants and were able to demonstrate directly for the first time "the existence of both place and periodicity pitch. ...[Firstly] multielectrode stimulation of the cochlear leads to pitch sensations that are clearly related to the place of stimulation. ...[Secondly,] for a certain place of stimulation the perceived pitch was higher with higher repetition rate of the electrical stimulus. ... [Further, it was shown that] ... place could be traded against periodicity." (Klinke, 1979, p9)

The reactions of neurons to bands of noise (or speech) show quite complex patterns which depend on the bandwidth of the noise and on whether any rapid changes occur. For example, Koch and Piper (1979, pp131-132) showed that some "...neurons may exclusively respond to fast changes of the envelope of spectral signal components that fall into the tuning range of this neuron. Different neurons may represent different parameters of the envelope such as rise time, repetition rate and precision in time. Such a mechanism may make auditory signal analysis to some extent independent of carrier frequency as long as the temporal pattern is preserved." This appears to imply that there are some feature detectors at the auditory periphery that specialise in temporal patterns, and that this response is only active when bands of sound, rather than pure tones, are the stimulus. Miller (1979) also noted similar results, stating that "many nerve cells in the cochlear nucleus respond poorly or not at all to bandpass filtered noise when the bandwidth is larger than a certain value... . When, however, the centre frequency of such a band of noise is varied rapidly, these nerve cells respond vigorously and show a pronounced frequency selectivity. This is so even for units that do not respond at all to the same noise when its centre frequency is varied slowly." (ibid, p52) Such results clearly demonstrate that the response of such a neuron to a pure tone could not be used to predict the response to a complex sound such as speech.

Flock (1982), on examining the sensory hairs and their responses, identified three different hair bundle structural types and found that these three structures showed three different response times to stimulation, with one group returning rapidly to its zero position, a second group responding slowly, and the third group responding at an intermediate rate. He further found that the stereocilia bent at the base and were stiffer when bent in the excitatory direction than when bent in the inhibitory direction. Further, he found that they increased in stiffness from the apex to the base of the basilar membrane, and from the inner to the outer row of the outer hair cells. He felt that "this observation may relate to nonlinear properties in mechanics as well as in neural responses noted by several investigators" (ibid, p3).

Flock (1982) also found protein fibres in the stereocilia which are similar to muscle fibres, and further, he found that conditions that induce relaxation in muscles produced relaxation in the cilia, and that conditions that normally induce muscular tension also induced tension in the cilia. This, compared with the observation in other studies that the stimulation of the efferent nerve fibres connected to the outer hair cells caused changes in the responses of the system to acoustic stimulation, led him to comment that "this implies that efferent innervation possibly controls organ of Corti mechanical properties through a contraction-like mechanism in the sensory hair region, exerting its influence through coupling to the tectorial membrane." (ibid, p5). He also observed rigid protein "cables" which provide a supporting frame in the organ of Corti, mainly supporting the three outer hair cells. The inner hair cells were not attached to this framework. He felt that when this structural evidence is combined with the evidence that the inner hair cells receive about 95% of the afferent innervation whilst the outer hair cells receive mainly efferent innervation, and further, that stimulation of the outer hair cells via these efferent fibres produces a definite mechanical effect on the operation of the sense organ, one is led to wonder "...if the inner hair cell system is mainly sensory perhaps the outer hair cell assembly serves a "motor" function". This means that it is possible that there is "a motor capacity at the periphery involved [in] mechanical sensitivity and its modulation" (ibid, pp6-7).

More recent work suggested that this motor function might be responsible for the sharper tuning of the peripheral auditory system than is indicated by studies of passive cochlear mechanics.

"It is current consensus opinion that the sharp tuning of the [cochlear] mechanics is probably due to some physiologically vulnerable mechanism which depends upon and modifies somehow the 'passive' or linear wave motion of the cochlear. ... It would appear possible that some active process in the cochlear, capable of supplying energy to the BM [basilar membrane], is providing a positive feedback of mechanical energy into the travelling wave. A possible model for this would be a source of mechanical energy, perhaps at the organ of Corti, which detects the presence of a travelling wave and pushes the energy back into the BM in phase with the wave. Such a mechanism would need to determine that the frequency of the travelling wave motion matched the local CF frequency in order that the feedback could take place at the correct point along the partition." (Yates, 1986, pp21 & 30)

LePage (1989) showed, by direct measurement, that changes in efferent activity to the outer hair cells result in changes in outer hair cell mechanics. LePage (1987) provided evidence for stimulus intensity-dependent dynamic sharpening of the tuning curve and allocation of frequency to place. Further, LePage (1990) proposed a physical model, effectively a modified Helmholtz model, that explains how dynamic mapping of frequency to place in the cochlea occurs. In this model the outer hair cells (OHC):-

"...generate shear between the reticular membrane and the tectorial membrane ... the force acting at the base of the OHC stereocilia constitutes a turning moment about the base of the inner pillar cell, the point at which the arches are hinged. So, the OHC act to cause rotation of the rigid arch. ... a resolved part TR of the tonic force exerted by the OHC on the arch will appear in the plane of the basilar membrane. ... Modulating OHC tonus varies the shear, changing the radial tension both in the tectorial membrane and in the Pars pectinata, resulting in small variations in the angle of the arch, and in turn, the position of the tectorial membrane and the deflection of the inner hair cell stereocilia. ... The implication of the model is that quite small variations in OHC tonus ... can result in substantial local changes in the gradient of the frequency-place map." (LePage, 1990)

One of the potential implications of LePage's model is that it now seems possible that top-down processes may selectively modify the localised tuning of the cochlea. This would then allow for the possibility of selective attentional processes at the very peripheral level of the basilar membrane. It also means that the frequency tuning curves obtained by physiological and psychoacoustic methods may only give a partial view of peripheral auditory behaviour. The function of these processes in some kind of intensity-related feedback loop seems to be well established. Their function in the sharpening of the frequency response of the basilar membrane is also accepted by many workers (see Pickles, 1988, pp 51-53 and pp 136-148). Their function in a selective attentional process is, on the other hand, highly speculative (see Pickles, 1988, p245).

Schreiner (1979) examined the phenomenon of "poststimulatory inhibition" or "temporal suppression" in which the response of a neuron to a stimulus is reduced if it is preceded by another stimulus. This effect is "an inhibitory process and not an effect of peripheral adaptation or even of mechanical/cochlear origin" (ibid, pp138-139).

The ability of the peripheral auditory system to be affected by feedback and perhaps also by excitatory and inhibitory control from higher auditory levels suggests that there exist pathways to this most peripheral level which might be utilised by top-down processes. "Very probably, both [bottom-up and top-down] modes are simultaneously engaged in any real act of speech perception..." and the mistake of thinking that either mode may predominate at any time "...derives from our limited ability to conceptualise any process in a highly parallel network" (Haggard, 1975, p8).

Many speech perception models predict the existence of feature detectors and there is a growing body of evidence to suggest that such feature or property detectors may exist even at the level of the basilar membrane and its innervation, although it is acoustic (or auditory) features and not phonetic features that they detect. Pisoni and Sawusch (1975, p21) commented that "...there seems to be fairly good evidence for the existence of property detectors which respond to certain types of acoustic information in the signal." Newman and Symmes (1979) described studies that appear to have located neurons in the primary auditory cortex of the squirrel monkey that responded specifically to a certain temporal feature of the animal's call and so could be described as feature detectors. Koch and Piper (1979, see above) also noted physiological evidence which suggested that certain neurones may respond selectively to certain temporal properties. Further, Flock (1982, see above) discovered that different stereocilia respond to different rates of stimulation and that, because of the different response rates of different hair cell clusters, different parts of the periphery respond with different time resolutions. Most of these studies seem to suggest that it is the temporal changes in the spectrum that are most closely attended to by peripheral auditory processes.

Creutzfeldt (1979, pp XV-XVII) went so far as to say that

"...only the transients within a complex stimulus (or signal) are represented in the primary sensory projection areas of the cortex, with preservation of the spatial (or spectral) location of the appearance of such transients relative to each other. ...A complex auditory signal will thus be dissected into frequency regions as well as into temporal sequences of transients. [It] is my feeling that the information contained in the sequences of transients in complex auditory signals including language appears to be under represented in neuro-physiological considerations about hearing and speech. The shadow of Helmholtz and the fascination by spectral tuning curves and feature detection lie heavily over this field. The problem of spatial, that is spectral, AND of temporal integration is, of course, the major problem of what we may call higher order analysis of auditory signals."

2.2 Psychoacoustics

2.2.1 Frequency

There are three major types of auditory behaviour that are of interest when examining auditory processing of the frequency dimension. They are frequency discrimination, frequency selectivity (or resolution) and judgements of relative pitch. These three types of frequency perception, the relationship between them and their relation to physiological processes will be examined below. The relationship of these three types of auditory frequency processing to speech perception will also be examined.

Frequency Discrimination

"Frequency discrimination ... refers to our ability to detect differences in the frequencies of sounds which are presented successively." (Moore & Glasberg, 1986, p265). They are also referred to as frequency difference thresholds (df), frequency difference limen (DLF) or just noticeable differences in frequency (or frequency jnd). Shower and Biddulph (1931) and Wever and Wedell (1941) examined the frequency jnds (or df) of pure tones at different frequencies presented sequentially. These studies showed that from 125-2000 Hz df is constant at about 3 Hz. It rises to about 12 Hz by 5000 Hz, 30 Hz by 10000 Hz, and 187 Hz by 15000 Hz. This suggests that phase-locking is a more accurate process than the place principle for resolving frequencies since frequency thresholds are smallest at frequencies where the phase-locking principle predominates over the place principle. These studies also showed that df at any frequency increased as sensation level decreased. In general (and especially for frequencies below about 5000 Hz) this tendency is only marked for intensities below about 25 dB SL. Zwicker & Fastl (1990) imply, more generally, that there is no marked intensity effect above 25 dB SL for any frequency. This would imply that the 40 dB curve in figure 2.1 should also be valid for higher sensation levels (but see Zwicker's (1970) data in figure 2.1).

Figure 2.1: Frequency jnds (df) for pure tones at three presentation levels. (diagram after Gulick, 1971, p129, data from Shower & Biddulph, 1931; Wever & Wedell, 1941). Zwicker's (1970) data appears as the dotted line.

Zwicker & Fastl (1990), in a summary of work on frequency jnds, state that df is about 3.6 Hz at low frequencies and increases approximately in proportion to frequency above 500 Hz, where df is about 0.7% of frequency. It is important to note that all of these results refer to frequency modulation experiments. When frequency jnds are measured for tones presented in sequence with intervening short gaps the size of df decreases by about a factor of 3, so that below 500 Hz listeners are able to discriminate frequencies with a difference of only about 1 Hz and above 500 Hz can discriminate (approx.) 0.2% differences in frequency. (ibid, p 166) Zwicker's (1970; Zwicker & Fastl, 1990) curve is closer to the 10 dB SL (< 1000 Hz) and 5 dB SL (>3000 Hz) curves (see figure 2.1) but his curve represents df at higher intensities. The differences are probably due to experimental methodology, with Zwicker & Fastl (ibid) continuously modulating the frequency (at a rate of 4 Hz) whilst the earlier studies (Shower & Biddulph, 1931; Wever & Wedell, 1941) simply slid the frequency from one value to the other. A single change may be easier to perceive than a continuous modulation resulting in lower frequency jnd values. It is certainly true that earlier experiments (eg. Luft, 1888; Vance, 1914: both cited by Gulick, 1971) which involved sudden changes from one tone to another produced lower jnds (presumably because of audible transients at the point of change).
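The Zwicker & Fastl (1990) summary values for frequency-modulated tones can be expressed as a simple piecewise rule of thumb. This is only a sketch of the figures quoted above; the real function is smooth and, as noted, varies with sensation level and with measurement method:

```python
def frequency_jnd(f_hz):
    """Rough frequency jnd (df, in Hz) for frequency-modulated pure tones,
    after the summary values of Zwicker & Fastl (1990): roughly constant
    (~3.6 Hz) below 500 Hz, about 0.7% of frequency above 500 Hz."""
    return 3.6 if f_hz < 500 else 0.007 * f_hz

print(frequency_jnd(250))   # 3.6
print(frequency_jnd(2000))  # 14.0
```

For tones presented in sequence with intervening gaps, the text notes that df is smaller by about a factor of 3, so these values would be divided accordingly.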

Moore et al (1984) examined the frequency jnds for the harmonics of a 12 tone equal-amplitude complex and found that the jnds for the first three harmonics were similar to the frequency jnds of pure tones (0.25-0.3 %) but for increasing harmonics (excluding the highest) the jnd rose gradually to 2-5% for harmonics 9-11. The discrimination for the entire complex was about 0.13-0.22%. The lower harmonics were found to contribute most greatly to the pitch discrimination of the entire tone when all harmonics are of equal amplitude. Increasing the amplitude of a harmonic, relative to the surrounding harmonics, increases its contribution to pitch discrimination.

Flanagan (1955a, 1955b, 1957, 1972) examined formant synthesised vowels to determine the jnds for formant centre frequency (3 - 5%), formant bandwidth (20 - 40% of BW), formant intensity (1-3 dB), and inter-formant valley depth (10 dB) and Flanagan & Saslow (1958) similarly determined the jnd for F0 (0.3-0.5 %, average 0.32 Hz). Terhardt (1979) predicted formant centre frequency and F0 jnds from Zwicker's (1970) model which stated that a 1 dB change in the spectrum at any point will cause a just noticeable change in the signal. These predictions are comparable to Flanagan's (1955) measurements at low frequencies but are much smaller than his measurements at higher frequencies. All of the above jnd measures were obtained from steady-state vowel signals. Klatt (1973) determined F0 jnds for synthetic speech with steady F0, a ramp F0 and a steep rate of F0 change (32 Hz/s). The F0 jnd for steady F0 was 0.3 Hz, for the ramp F0 the jnd was 2 Hz, and for the steep F0 the jnd was 4 Hz (an order of magnitude greater than the steady state values). Various estimates of pitch contour discrimination, ranging from 7-12 Hz (in the region of ~120 Hz), have been found (Pierrehumbert, 1979; t'Hart, 1981). Harris & Umeda (1987) determined F0 jnds for natural sentences of between 5 and 16 Hz, or about 20 times the results of Flanagan & Saslow (1958), with significant variations of F0 jnd as a function of stimulus complexity and speaker. Ghitza & Goldstein (1983) asked whether similar increases in the jnds of other dimensions would occur for more natural dynamically changing speech and to what extent such jnds are frequency dependent. They utilised an LPC vocoder and a quantisation model of parameter jnds based on Flanagan's jnd data and came to the conclusion that:-

"Spectral distortions in synthesized speech sounds corresponding to more than 4 times the steady state JNDs can be tolerated without distortion in quality." (ibid, p356)

Ghitza added, in the discussion that followed, that the jnd approach is relevant to considerations of speech quality and not necessarily of speech intelligibility for which "...much broader distortions of different kinds are to be considered." (ibid. p357) Further, Rosen & Fourcin (1986) argue that

"it is highly unlikely that barely perceptible difference (only 75% detectable) would be used to convey information. ... Clearly, experiments which are executed in the psychophysical tradition overestimate the degree to which changes in fundamental frequency are used linguistically. It seems likely that this disparity is yet another reflection of the way in which speech is robust, using only acoustic contrasts which are highly discriminable." (ibid., pp 398-399)

Pitch

"Pitch is a qualitative dimension of hearing which varies primarily as a function of frequency" (Gulick, 1971, p135). In other words, pitch is the perceptual correlate of frequency. The pitch of pure tones has been derived in two main ways. One method involves adjusting the frequency of one tone until it sounds half or twice as high as a second tone (Stevens et al, 1937). The second method involves the selection of one (or more) tone(s) that divide the interval between two reference tones into two (or more) perceptually equal intervals (Stevens & Volkmann, 1940). These two methods result in similar scales (the 1940 paper was able to reconcile the differences). They proposed the unit "mel" for that scale and arbitrarily defined 1000 mels as the pitch at 1000 Hz and one mel to be 1/1000 of that pitch on the subjective scale. The results of the 1940 experiment are displayed in figure 2.2.

Figure 2.2: Pitch (mels) versus frequency. (data from Stevens & Volkmann, 1940) Lines indicate arbitrary reference of 1000 Hz = 1000 mels.
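Later literature often approximates the empirical Stevens & Volkmann (1940) curve with an analytic formula (the fit commonly attributed to O'Shaughnessy, 1987). This formula is a convenient later approximation, not part of the original data:

```python
import math

def hz_to_mel(f_hz):
    # Common analytic approximation to the empirical mel curve;
    # calibrated so that 1000 Hz maps to (almost exactly) 1000 mels.
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

print(round(hz_to_mel(1000.0)))  # 1000, the arbitrary reference point
print(round(hz_to_mel(4000.0)))  # 2146
```

Note how the mapping compresses the higher frequencies: a quadrupling of frequency from 1000 Hz to 4000 Hz only slightly more than doubles the pitch in mels, in keeping with the shape of the curve in figure 2.2.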

Pitch can also be perceived for noise bands. Zwicker & Fastl (1990) note that noise with steep spectral slopes elicits pitch sensations, with the pitch corresponding to both the high-pass and the low-pass frequencies if the band is broad enough, or to the centre frequency of narrow band noise.

As stated above, pitch varies primarily with frequency. The pitch of pure tones also varies with intensity (eg. Stevens, 1935). Gulick (1971) showed that for tones below 2500 Hz pitch decreased as intensity increased and that above 2500 Hz pitch increased with increasing intensity. Gulick's results showed variations with intensity of about 2-3% (up to 7 jnds) as intensity increased from 30 to 70 dB SL. Verschuure & van Meeteren (1975) found variations with intensity of less than 1% for tones in the 1000-2000 Hz range and outside that range variations of up to 5% occurred. Fortunately, the pitch of complex sounds such as speech and music is not as affected by changes in intensity (Durrant & Lovrinic, 1984), presumably because pitch is extracted from several harmonics. Moore & Glasberg (1986) show that subjects can "hear out" the first five to eight harmonics of complex tones and that higher harmonics can sometimes be heard out if their amplitude exceeds that of adjacent harmonics. Moore & Glasberg (1986) present a model that assumes that pitch perception for complex tones is achieved by first passing the complex through a bank of filters (place analysis) and then carrying out a temporal analysis of the phase locking patterns at the output of each filter. The time analysis searches across the characteristic frequencies (CFs) of the auditory nerve fibres for the most common inter-spike interval which represents the period of the perceived pitch.

Pitch perception also varies with tone duration, with tones of a few milliseconds being heard as clicks. "Stable and recognizable pitch quality requires some minimal tonal duration [or critical duration]. ... Below 1000 [Hz] the critical duration is a fixed number of cycles (6 ± 3), whereas above 1000 [Hz] the critical duration is a fixed length of time." (Gulick, 1971, pp141-142) We will return to a consideration of pitch below, after considering some aspects of frequency selectivity.

Frequency Selectivity

"Frequency selectivity refers to the filtering processes which take place in the auditory system and which underlie our ability to detect one sound in the presence of another." (Moore & Glasberg, 1986, p264-265) Frequency selectivity is also referred to as frequency resolution (ie. the ability to resolve two or more frequency components in a complex sound). Behind this behaviour is the commonplace experience that one sound may obscure another sound and so render it inaudible, or, to put it another way, one sound may mask another sound. Given this fact, it should not be surprising that much early work on frequency selectivity was based on masking experiments which attempted to show the extent to which one sound (tone, tone complex or noise band) at one level and frequency would cause the threshold of audibility of a second sound at a second frequency to be raised. One of the earliest scientific references to these phenomena was by Mayer (1894, cited by Moore, 1977) who observed "... that it is easier to mask a tone by a second tone of lower frequency than by one of higher frequency, and that frequencies near the signal are more effective than frequencies farther removed." (Moore, 1977, p93) The earliest experiment (Wegel & Lane, 1924, cited by Moore, 1977 and by Gulick, 1971) examined the masking of one pure tone by another and attempted to relate their results to inner ear frequency scaling. This and later studies confirmed Mayer's (op.cit.) observations. Most later studies (eg. Fletcher, 1940, 1953; Egan & Hake, 1950; Greenwood, 1961a, 1961b; Scharf, 1961) utilised narrow noise bands as the masker to avoid experimental artefacts in the Wegel & Lane (op.cit.) study which were caused by beating when the two tones approached each other in frequency.

For example, Egan & Hake (1950) examined the threshold of pure tones in the presence of a narrow band of noise as a function of the frequency of the tone. This was repeated at several noise band intensities, producing a masking pattern (masked audiogram) which is assumed to correlate with the shape of the auditory filter at various stimulus intensities at the cochlear position with a CF equal to the centre frequency of the noise band. At high stimulus intensities the masking pattern spreads more at high than at low frequencies.

Psychophysical tuning curves (PTCs) (Zwicker, 1974), on the other hand, utilise a stationary probe tone of low intensity (eg. 10 dB SL) and a moving masker whose intensity is adjusted so that the tone remains just at threshold. This results in an inverted version of the previous masking pattern, with very large masker intensities required at frequencies remote from the tone, much lower masker intensities required near the tone frequency, and a masker intensity minimum at the tone frequency. These curves "... appear very similar to the tuning curves of auditory nerve fibres." (Pickles, 1988, p260)

Fletcher (1940) originated the concept of critical bands which referred to the effective range of frequencies that each place on the basilar membrane responded to. His measurements were indirect and based on a false assumption (a tone is masked by a critical band of noise of the same SPL as the tone) and so his results (which are too narrow) are now referred to as critical ratios. (see Moore, 1977 for an overview) Since then there have been numerous experiments that have examined critical bandwidths directly in experiments on the threshold of complex sounds (Gässler, 1954; Zwicker & Fastl, 1990), on masking (Zwicker, 1954; Zwicker & Fastl, 1990; Bilger, 1960; Greenwood, 1961), on the perception of phase (Zwicker, 1952; Zwicker & Fastl, 1990) and on the loudness of complex sounds (Zwicker & Feldtkeller, 1955; Zwicker et al, 1957; Zwicker & Fastl, 1990; Scharf, 1959a, 1959b, 1961). The procedure for many of these experiments is summarised by Zwicker & Fastl (1990). Perhaps the best known procedure (Zwicker & Feldtkeller, 1955; Zwicker et al, 1957; Zwicker & Fastl, 1990) involves increasing the bandwidth of a band of noise without changing its SPL. There is no change in perceived loudness until a critical bandwidth is reached, beyond which loudness continues to increase. The noise band is assumed to excite an auditory filter with a characteristic frequency (CF) equal to the centre frequency of the band of noise. Whilst the bandwidth of the noise is less than the bandwidth of the auditory filter, the same set of nerve fibres are being excited by the same stimulus SPL and so the loudness remains constant. When the bandwidth of the noise increases to a value greater than the bandwidth of the auditory filter, adjacent filters and so extra nerve fibres are activated, with a resulting increase in perceived loudness.
The critical bands measured by this method are the same as those produced by the various methods referred to above and typical values are summarised in Zwicker (1962) and appear as one of the two curves in figure 2.3.

Figure 2.3: Critical bandwidth (Zwicker, 1962) and Equivalent Rectangular Bandwidth (ERB: Moore & Glasberg, 1983) versus auditory filter centre frequency.

A more recent, and quite successful, technique for measuring auditory filter shapes psychoacoustically is based on notched noise maskers (Patterson, 1976; Patterson et al, 1982; Moore & Glasberg, 1983; Patterson & Moore, 1986; Moore et al, 1990; Shailer et al, 1990; Glasberg & Moore, 1990). Such noise bands have notches in them which are centred at the frequency of a pure tone. Tone threshold is then measured as a function of notch width. Patterson & Nimmo-Smith (1980) have also measured auditory filter asymmetry "... by placing the notch in the noise both symmetrically and non-symmetrically about the signal frequency." (Patterson & Moore, 1986, p138) They concluded "... that the filter has the same basic shape above and below its centre frequency but the lower half is stretched with respect to the upper half." (ibid. p139) From the filters so derived it is possible to determine the equivalent rectangular bandwidth (ERB) of the auditory filters (Patterson et al, 1982; Patterson & Moore, 1986; Moore & Glasberg, 1983, 1986). ERBs are typically broader than psychophysical tuning curve (PTC) bandwidths (Patterson & Moore, 1986). This is due to off-frequency listening and suppression effects on the skirts of the filters during the measurement of the PTCs (but see Moore & O'Loughlin, 1986). On the other hand, ERBs are narrower than traditional critical bandwidths (see figure 2.3). This tendency is only slight at high frequencies (> 1000 Hz, where the curves are roughly parallel) but is quite marked at low frequencies. Patterson & Moore (1986) argue that the differences in bandwidth are due to decreases in "... the efficiency of the detection processes that follow the filter ..." (ibid, p150) as frequency decreases below 1000 Hz. Actual filter bandwidths do decrease, but efficiency also decreases, cancelling out the effects of narrower bandwidth and causing bandwidths determined by traditional methods to appear to level out at a bandwidth of about 100 Hz.

Shailer & Moore (1983) demonstrated that bandwidths determined from the inverse of the gap detection thresholds (see section 2.2.2 below) of narrow band noise centred below 1000 Hz were almost identical to ERB values even at the lowest frequencies measured (200 Hz) and thus provided independent support for the accuracy of the ERB relative to the critical band (but see section 2.2.2, below).

Patterson & Moore (1986) summarise their work on auditory filters as follows:-

"The shape of the auditory filter can be characterised by a rounded exponential function; the skirts of the passband are close to exponential but the top is flattened. For a young normal listener, a moderate level, and a 1.0-kHz centre frequency, the equivalent rectangular bandwidth of the filter is about 130 Hz and it is approximately symmetrical on a linear frequency scale. The filter applies an attenuation of about 25 dB 300 Hz above or below the signal frequency. To a first approximation, filter bandwidth is a constant proportion of the centre frequency; however, as centre frequency decreases below 1.0 kHz there is some increase in the relative bandwidth. The filter broadens slowly with age, the equivalent rectangular bandwidth rising from about 11% of the centre frequency at age 20 to around 18% at age 60. As stimulus level increases the filter becomes asymmetric, primarily because the lower skirt of the filter becomes shallower. There is a corresponding increase in bandwidth." (ibid, p173)
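The rounded-exponential description quoted above can be illustrated with a short sketch. The symmetric roex(p) shape of Patterson et al (1982) is W(g) = (1 + pg)exp(-pg), where g is the deviation from the centre frequency as a proportion of that frequency and the ERB equals 4·CF/p. The figures below are indicative only, since the real filter's top is flattened differently in full fits and its lower skirt is shallower and level dependent:

```python
import math

def roex_weight(f_hz: float, cf_hz: float, p: float) -> float:
    """Rounded-exponential, roex(p), auditory filter weight (power units).

    W(g) = (1 + p*g) * exp(-p*g), with g the normalised deviation from
    the centre frequency (Patterson et al., 1982). Symmetric idealisation.
    """
    g = abs(f_hz - cf_hz) / cf_hz
    return (1.0 + p * g) * math.exp(-p * g)

# Choose p so that the ERB (= 4*CF/p) is 130 Hz at a 1 kHz centre
# frequency, the value quoted by Patterson & Moore (1986).
cf = 1000.0
p = 4.0 * cf / 130.0

attenuation_db = -10.0 * math.log10(roex_weight(cf + 300.0, cf, p))
print(f"attenuation 300 Hz above CF: {attenuation_db:.1f} dB")
```

The simple symmetric form gives an attenuation of roughly 30 dB at 300 Hz from a 1 kHz centre frequency, of the same order as the approximately 25 dB figure quoted above.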

Note that such a proportional increase in bandwidth between ages 20 and 60 is approximately equivalent to an increase from about 1 ERB to about 1.2 Bark (at least for frequencies above 500 Hz). If cues to speech perception require a frequency resolution equal to that of a young ear then speech intelligibility would be expected to decrease over that age range for normal listeners.

The Relationship between Frequency Discrimination, Pitch and Frequency Selectivity

Figure 2.4: Bark (Zwicker, 1962) compared with ERB-rate (Moore & Glasberg, 1986). ERB-rate has also been rescaled to Bark-scale range to facilitate curve comparison.

The auditory system was likened to a bank of overlapping bandpass filters as long ago as Helmholtz (1863). A simplification of that approach is to assume that the auditory periphery can be modelled by a series of adjacent filters which, when added together, result in a flat response (see also chapter 3, for vocoder channel filter design). These filters can be characterised by a function that specifies filter number (numbering from filter 1 at the lowest frequencies) as a function of filter centre frequency. Such a relationship can be obtained by integrating the reciprocal of the auditory filter bandwidth as a function of centre frequency. Integration of the critical band scale results in the Bark or critical-band-rate scale (Zwicker & Terhardt, 1980) or Frequenzgruppen (Zwicker, 1962). Similarly, the ERB-rate scale can be determined by integrating the ERB function (Moore & Glasberg, 1986). These two functions are compared in figure 2.4. The ERB-rate scale has also been rescaled to the Bark curve range to facilitate curve shape comparison. It can be seen that in terms of the Bark scale there are 24 adjacent filters and in terms of the ERB-rate scale there are about 37 adjacent filters over the range of hearing.
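The integration described above can be sketched numerically. The script below uses later analytic fits, the Zwicker & Terhardt (1980) Bark formula and the Glasberg & Moore (1990) ERB-rate formula (a slight revision of the 1986 function discussed in the text); the exact filter counts depend on the particular fit and on the assumed upper frequency limit of hearing:

```python
import math

def bark_rate(f_hz: float) -> float:
    """Critical-band rate (Bark); analytic fit of Zwicker & Terhardt (1980)."""
    return (13.0 * math.atan(0.00076 * f_hz)
            + 3.5 * math.atan((f_hz / 7500.0) ** 2))

def erb_rate(f_hz: float) -> float:
    """ERB-rate (number of ERBs below f); Glasberg & Moore (1990) fit.

    A slight revision of the Moore & Glasberg (1986) function cited in
    the text, but close to it over the speech range.
    """
    return 21.4 * math.log10(0.00437 * f_hz + 1.0)

# Over the range of hearing, the Bark scale spans ~24 filters and the
# ERB-rate scale considerably more (high 30s, depending on the fit and
# the upper frequency limit assumed).
print(f"Bark at 15.5 kHz:     {bark_rate(15500.0):.1f}")
print(f"ERB-rate at 15.5 kHz: {erb_rate(15500.0):.1f}")
```

The ERB-rate count exceeds the Bark count because ERBs are narrower than critical bands, especially at low frequencies, so more of them fit into the same frequency range.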

Zwicker (1970; Zwicker & Fastl, 1990), who sees frequency discrimination, frequency selectivity and ratio pitch all in terms of cochlear place, presents a relationship between these phenomena that is outlined in table 2.1.

Number of    Distance along basilar    Number of frequency    Number of      Number of inner
Bark         membrane (mm)             jnd steps              pitch steps    hair cells
24           32                        640                    2400           3600
1            1.3                       27                     100            150
0.7          1                         20                     75             110
0.04         0.05                      1                      3.8            5.6
0.01         0.013                     0.26                   1              1.5
0.007        0.009                     0.18                   0.7            1

Table 2.1 Relationship of frequency selectivity, frequency discrimination and ratio pitch to cochlear place, according to Zwicker's model of their interrelationship (Zwicker, 1970; Zwicker & Fastl, 1990). Table adapted from Zwicker & Fastl (1990, p145)

Zwicker & Fastl (1990) cite a strong correlation between frequency jnd, ratio pitch and critical bands as evidence for a place explanation of frequency discrimination. Further:-

"Because just-noticeable variations in frequency ... lead to constant values of the corresponding steps in pitch ... we are able to construct a relationship between frequency and pitch by integrating just-noticeable variations. This way, a pitch function very similar to that constructed from data of pitch doubling or halving can be calculated from JND's." (ibid. p163)

They conclude that there are 640 adjacent frequency jnd steps across the range of hearing and that, since there are 3600 inner hair cells, each step corresponds to about 6 inner hair cells (the inner hair cells being spaced roughly 9 µm apart along the basilar membrane). In other words, they see constant steps of both pitch and frequency jnds in terms of constant numbers of inner hair cells. This is a controversial position in more than one way, as other workers (eg. Moore & Glasberg, 1986) see both frequency discrimination and pitch intervals as being determined by temporal information up to frequencies where phase locking is lost and then by place information only at higher frequencies (this will be discussed below).

Figure 2.5: Comparison of pitch values of Stevens & Volkmann (1940) with mels derived by multiplying Bark (Zwicker, 1962) by 100.

Another problem with the data in table 2.1 is that the mel scale implied here is not identical with the scale derived by Stevens & Volkmann (1940, see figure 2.2). It is not uncommon to see authors displaying Stevens & Volkmann's data on one page and then shortly after equating 100 mels to 1 Bark (eg. Gulick, 1971). This derives from an observation that 1 Bark "approximately" equals 100 mels (this observation appears to originate in Zwicker (1956), and is repeated in Zwicker, 1970; Zwicker & Fastl, 1990). Figure 2.5 displays the Stevens & Volkmann (1940) data against mels derived by multiplying Bark by 100. Clearly, 100 mels is not approximately equal to 1 Bark. The relationship is closer to 135-140 mels equals 1 Bark. It can also be seen that when the Stevens & Volkmann (op.cit.) curve is rescaled to the Bark x 100 curve (so that the values at 4000 Hz are equal) the two curves are not exactly the same. Whether the deviations are significant or within the range of experimental error is not clear. However, Zwicker and colleagues appear to have redefined the mel. Zwicker & Fastl (1990, p104) choose 125 Hz (instead of 1000 Hz) as their arbitrary reference, defining its pitch to be 125 mels. Their diagram (ibid, p104) indicates a linear function (pitch in mels numerically equal to frequency in Hz) up to about 500 Hz, and overall their function appears to be a mathematical approximation of some unspecified experimental data which, when divided by 100, is equivalent to the Bark scale. Zwicker (1970; Zwicker & Fastl, 1990), in positing a pure place model for frequency selectivity, frequency discrimination and relative pitch, is claiming that, ignoring obvious scale differences, the pitch curve and the integrated forms of the frequency discrimination and selectivity curves are not significantly different. In other words, the differences in the curves produced by psychoacoustic experiments are caused by experimental artefacts and normal subject variance.
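The mismatch can also be seen with later analytic fits. The sketch below combines the O'Shaughnessy (1987) mel fit with the Zwicker & Terhardt (1980) Bark fit; both are later approximations rather than the experimental curves plotted in figure 2.5, but even with these fits the ratio sits well above 100 mels per Bark across the speech range:

```python
import math

def mel_oshaughnessy(f_hz: float) -> float:
    """Mels; O'Shaughnessy (1987) fit to the Stevens & Volkmann data."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def bark_rate(f_hz: float) -> float:
    """Bark; analytic fit of Zwicker & Terhardt (1980)."""
    return (13.0 * math.atan(0.00076 * f_hz)
            + 3.5 * math.atan((f_hz / 7500.0) ** 2))

# Mels per Bark at several frequencies: consistently well above 100,
# in line with the text's observation that 100 mels != 1 Bark.
for f in (500.0, 1000.0, 2000.0, 4000.0):
    ratio = mel_oshaughnessy(f) / bark_rate(f)
    print(f"{f:6.0f} Hz: {ratio:6.1f} mels per Bark")
```

Because these are different fits from the raw Stevens & Volkmann and Zwicker data, the exact ratio differs somewhat from the 135-140 figure read off figure 2.5, but the qualitative conclusion is the same.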

Figure 2.6: Comparison of pitch in mels (Stevens & Volkmann, 1940) with two rescaled jnd-rate curves (derived from: Gulick, 1971; Zwicker, 1970)

We will first examine the relationship between integrated frequency jnd (jnd-rate) curves and the pitch curve derived from the experiments of Stevens & Volkmann (1940). These comparisons are made in figure 2.6. There are two jnd-rate curves derived from two sources of data. One source is the frequency jnd curve presented in Zwicker & Fastl (1990) and also in Zwicker (1970). This data is labelled in figure 2.6 as "jnd-rate Zwicker". The second source of frequency jnd data is found in Gulick (1971), which combines data from Shower & Biddulph (1931) and Wever & Wedell (1941). The 40 dB SL curve has been integrated and is labelled "jnd-rate Gulick". These two curves appear unscaled in the bottom right hand corner of figure 2.6 and are clearly very different in scale. Both curves have been rescaled to fit the maximum and minimum values of the Stevens & Volkmann (1940) pitch curve. This facilitates curve shape comparison. Again, the three curves are clearly not identical. It might be argued that the deviations are non-significant, but in the mid part of the curve the rescaled Zwicker curve would result in errors of up to 250 mels when used to predict the mel values derived by Stevens & Volkmann. Both Zwicker & Fastl (1990) and Moore & Glasberg (1986, see their conclusion below) consider ratio pitch (musical intervals) and frequency jnds to be mediated by the same sort of information. They disagree, however, on the type of information, with Zwicker & Fastl (1990) favouring place mechanisms and Moore & Glasberg (1986) favouring timing mechanisms below 5000 Hz.

Moore & Glasberg (1986) examine both place (Zwicker, 1970; Zwicker & Fastl, 1990) and temporal explanations of frequency discrimination. They examine the two explanations of frequency discrimination in a number of different ways and test the predictions of Zwicker's (1970) model. Zwicker predicted a large number (640) of greatly overlapping filters spread evenly along the basilar membrane. Two tones varying by one filter step (one jnd) presented in sequence can be discriminated because at some place on the skirts of the filter response (most likely the steeper low frequency slope) there is a greater than 1 dB change in response, and it is this minimally perceptible change in activity for the corresponding nerve fibres that results in the discrimination of the two tones. Moore & Glasberg (1986) demonstrate a lack of agreement between the predictions of Zwicker's place model and actual experimental results. Firstly, they point out uncertainties in the low-frequency filter slope owing to factors such as suppression. They then demonstrate that Zwicker's prediction that Δf/f should be constant (0.007) over the range 500-8000 Hz is contradicted by their version of Zwicker's model (based on careful measurements of filter responses) which indicates a marked increase in Δf/f over that frequency range. Zwicker's predictions are also examined with respect to short duration tones and to monaural versus dichotic modes of presentation, and Zwicker's model is only found to hold at frequencies above the point where phase-locking fails. Perhaps the most compelling evidence that they cite comes from various studies of the relationship between frequency selectivity and frequency discrimination in the hearing impaired. Some patients who show poor frequency selectivity show almost normal frequency discrimination. Tyler et al (1983), for example, show that when the effects of threshold are removed, there is no significant relationship between frequency selectivity and frequency discrimination.
Moore & Glasberg (1986) come to the conclusion that:-

"... frequency discrimination and musical interval recognition are mediated primarily by temporal information for frequencies up to 4-5 kHz. Above this only place information is available, and performance worsens." (ibid. p279)

Figure 2.7: Comparison of Bark (Zwicker, 1962), ERB-rate (Moore & Glasberg, 1986), and two jnd-rate curves (derived from Gulick, 1971; Zwicker, 1970). The jnd-rate and ERB-rate curves have been rescaled to the Bark curve range.

In figure 2.7 two integrated frequency jnd curves ("jnd-rate Zwicker" and "jnd-rate Gulick", see the explanation of figure 2.6, above) are compared to two representations of filter frequency, viz. Bark (Zwicker, 1962) and ERB-rate (Moore & Glasberg, 1986) rescaled to the Bark-scale range (see figure 2.4 above). The two jnd-rate curves have also been rescaled to match the Bark range. The jnd-rate curve derived from Zwicker's (1970) jnd data, when rescaled to the Bark curve's range, is identical to the Bark curve. The jnd-rate curve derived from Gulick's (1971) jnd data does not coincide with the Bark curve in the frequency range where phase locking is active and converges with it only as phase locking breaks down. The deviations between the two curves suggest that the rescaled jnd-rate curve derived from the data reviewed in Gulick (1971) would result in errors of between 0.5 and 1 Bark if it were used to predict the Bark scale. These deviations of the jnd-rate curve from the Bark curve favour the position of Moore & Glasberg (1986), but only if the differences between the two curves can be shown to be significant. We must ultimately rely on other types of experimental work to determine the comparative merits of the two models, and, as outlined above, Moore & Glasberg (1986) have been able to criticise Zwicker's (1970; Zwicker & Fastl, 1990) model from a number of independent perspectives. It would appear that Zwicker's model has resulted in curves for pitch and frequency jnds that are mathematical abstractions derived from his critical band measurements, following observations that the three curves are similar.

The model that describes the peripheral auditory system as a bank of overlapping filters is not discarded as a result of the findings of Moore & Glasberg (1986) and others. What is very much in doubt is that aspect of the model that suggests that the spacing of those filters can be defined by the frequency jnd steps which in turn are defined by uniform steps along the basilar membrane. Frequency discrimination (jnd) steps below 5000 Hz are increasingly defined by temporal information (phase locking) as frequency decreases. If frequency jnd data is used to determine filter spacing then the resulting filter bank centre positions are not a linear function of cochlear place. Alternatively, if it is assumed that the filters are spaced linearly with respect to cochlear place then this spacing cannot be determined at low frequencies by frequency jnd experiments as jnds are not a simple function of cochlear place at such frequencies. This uniform spacing can only be implied by measurements of jnd at high frequencies where only place information is available and then assuming the same spacing in terms of inner hair cells (or cochlear distance) for the low frequencies.

One problem remains when determining the characteristics of the overlapping filter bank described in the two models referred to in the preceding paragraph. That problem is the determination of which filter bandwidth model to use: critical bands or equivalent rectangular bandwidths (ERBs). Both models of bandwidth are derived from psychoacoustic tests. ERBs are calculated from a detailed and more directly measured model of filter shape. ERBs are consistently narrower than critical bandwidths and this is especially so at low frequencies. On the other hand, the determination of critical bands from very diverse experimental methods has resulted in reasonably consistent bandwidths. Moore & Glasberg (1986) explain that the increasing discrepancy between the two models (as frequency decreases below 1000 Hz) is due to a gradual decrease in the detection efficiency of processes following the filter (see the description of the ERB, above), causing broader bandwidths to be measured in the critical band experiments. Moore et al (1990) suggest that one possible reason for the discrepancy between critical bands and ERBs is "... the strong variation of absolute threshold with frequency at low centre frequencies ... [which] could have different effects on different psychoacoustical measures." (ibid. p 132) Experiments with a rippled-noise masker (Houtgast, 1977) and two-tone maskers (ie. a notched masker utilising pure tones instead of noise bands: Patterson & Henning, 1977) have produced bandwidths very similar to those derived from notched noise experiments. Patterson & Henning (1977) were able to show that notched masking experiments produce stable filter shapes regardless of the stimulus type, whilst in unnotched masking experiments the filter shape varied with stimulus type. This suggests that bandwidths derived from notched masking experiments are more reliable than those derived from other methods.
The rippled-noise direct masking experiments (Houtgast, 1977) produced very similar bandwidths to critical bands at high frequencies, but they continued to decrease at low frequencies (parallel to ERBs). When the rippled noise was used in forward masking or pulsation threshold experiments (ibid.) the bandwidths derived were very similar to ERBs at all frequencies. Further, the ability to "hear out" partials in complex tones at low frequencies (Plomp, 1964) suggests a greater selectivity at low frequencies than that suggested by classical critical bandwidths. Moore & Glasberg (1986) were able to simulate loudness summation experiments for bands of noise centred at 1000 Hz using their filter shape model. They determined excitation levels for the various bands of equal intensity noise using their filter shape formulae, converted these excitation patterns into specific loudness patterns, and then integrated the specific loudness patterns to estimate the loudness of the band. The resulting bandwidth versus loudness curves showed constant loudness up to 160 Hz (the critical bandwidth for 1000 Hz) and then a linear increase in loudness above that bandwidth. In other words, they were able to demonstrate why loudness summation experiments result in broader bandwidths than the ERB for filters of known shape. Moore et al (1990) claim that "... most workers would view the ERB of the auditory filter as corresponding to the CB ... [and] that the 'classical' values of the CB need to be revised." (ibid, p139)

The critical band scale was selected for the experiments described in the present study. One reason for this was that most relevant previous studies (see especially section 2.5) utilised this scale and so the use of the Bark scale in this study would permit the more ready comparison of this study with previous studies. The narrowest filterbank modelled utilising the Bark scale had a bandwidth of 0.75 Bark. This is very close to 1 ERB for frequencies above 500-600 Hz and so the results for this filterbank should be very similar to those for a 1 ERB filterbank, although there will be some inaccuracy below 500 Hz where the 0.75 Bark filter is somewhat broader than a 1 ERB filter. It is likely, in any case, that the degrees of discrimination required by the speech intelligibility tasks described in the following chapters are somewhat coarser than either bandwidth scale.
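The claim that a 0.75 Bark bandwidth approximates 1 ERB above about 500-600 Hz can be checked with standard analytic fits. The sketch below assumes the Zwicker & Terhardt (1980) critical bandwidth formula and the Glasberg & Moore (1990) ERB formula, which are published approximations rather than the scales as implemented in this study:

```python
def critical_bandwidth(f_hz: float) -> float:
    """Critical bandwidth in Hz; Zwicker & Terhardt (1980) fit."""
    return 25.0 + 75.0 * (1.0 + 1.4 * (f_hz / 1000.0) ** 2) ** 0.69

def erb(f_hz: float) -> float:
    """Equivalent rectangular bandwidth in Hz; Glasberg & Moore (1990) fit."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

# 0.75 of a critical band tracks 1 ERB closely above ~500-600 Hz but
# is noticeably broader below that.
for f in (250, 500, 1000, 2000, 4000):
    print(f"{f:5d} Hz: 0.75 CB = {0.75 * critical_bandwidth(f):6.1f} Hz, "
          f"1 ERB = {erb(f):6.1f} Hz")
```

With these fits, the two bandwidths agree to within about 15% at 1-2 kHz, while at 250 Hz the 0.75 Bark filter is roughly half as broad again as 1 ERB, consistent with the caveat about inaccuracy below 500 Hz.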

When attempting to analyse the transduction of complex signals such as speech it is necessary to determine the extent to which our knowledge of auditory filter shapes and bandwidths and the results of psychoacoustic experiments utilising pure tones can be used to predict the type of information presented to the auditory cortex for acoustic cue analysis prior to eventual higher level phonetic analysis. There are probably two main approaches to this task. One approach would be to examine the representation of speech sounds somewhere in the nervous system. It would be necessary to ensure that the point in the nervous system utilised is not followed by filtering processes that would affect frequency resolution. The other approach involves presenting listeners with parametrically manipulated speech signals and testing the response of those subjects to the distorted speech signals in terms of changes in speech quality or speech intelligibility. Section 2.3 will examine the first approach whilst the experiments carried out in this study, as well as experiments outlined in section 2.5, examine the second approach.

2.2.2 Time

Although it is convenient to separate the time and frequency dimensions in hearing and in speech perception for the purposes of examining auditory parametric representations, it must be remembered that it is quite artificial to do so: time and frequency domain effects constantly interact, even for simple stimuli, and all the more so for dynamically changing speech signals.

"We must remember ... that the contrast between time- and frequency-based structures is more a dichotomy of our explanatory framework than a dichotomy in the reality we seek to describe. Auditory events are spectrotemporal ..." (Haggard, 1985, p215)

Green (1985) distinguished two broad classes of temporal auditory phenomena, temporal integration and temporal acuity (see also Moore, 1989).

Studies of temporal integration attempt to discover the length of the interval over which the auditory system integrates acoustic information. This is of particular importance in the detection of barely audible signals and is realised psychoacoustically by the decrease in threshold of a signal with increasing duration up to about 200 ms. For example, Fastl (1976) showed that the effects of simultaneous, forward and backward masking of a tone by a noise reduced as tone duration increased up to 200 ms, beyond which there was no change in the masking threshold. Similar effects of stimulus duration are also found in experiments on frequency discrimination and pitch (see section 2.2.1 above). Beyond a certain duration integration fails and detection is based entirely upon intensity. Chistovich (1985) discussed several experiments that examined two hypotheses relating to temporal integration of spectral information over the total duration of vowel stimuli. One hypothesis suggested that frequency ("spatial") integration occurs first and then the results of that integration (analogous to the two formant stimuli described in sections 2.3.3 and 2.3.4 below) are then temporally integrated over the stimulus duration. The second hypothesis involved the possibility of temporal integration of each frequency band followed by frequency integration of the results. Neither hypothesis was supported by her experimental results and so she concluded that the central auditory system is involved in "running recognition" by "detectors" with a "small time constant".
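The classical account of threshold falling with duration up to about 200 ms can be idealised as perfect integration of signal energy over a fixed window. The following sketch assumes that idealisation, with roughly 3 dB of threshold reduction per doubling of duration up to the 200 ms limit; real data deviate from this, and the function and its parameters are illustrative only:

```python
import math

def detection_threshold_db(duration_ms: float,
                           integration_window_ms: float = 200.0) -> float:
    """Threshold (dB re the long-duration threshold) of a brief signal,
    assuming perfect energy integration up to a fixed window.

    Idealised textbook model only: energy (intensity x time) at threshold
    is constant for durations up to the window, so halving the duration
    raises the threshold by ~3 dB; beyond the window, additional duration
    no longer helps and detection depends on intensity alone.
    """
    effective_ms = min(duration_ms, integration_window_ms)
    return 10.0 * math.log10(integration_window_ms / effective_ms)

for d in (10, 25, 50, 100, 200, 400):
    print(f"{d:4d} ms: threshold {detection_threshold_db(d):+5.1f} dB")
```

The flat portion of the function beyond 200 ms corresponds to Fastl's (1976) observation that masking thresholds stop changing once tone duration exceeds about 200 ms.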

Studies of temporal acuity attempt to discover how quickly the auditory system can respond to brief acoustic events. What is the minimum temporal separation that will allow us to determine that two events have occurred and what was their order of presentation?

If auditory filters were similar to simple filters (resonators, electronic filters, etc.) then there would be a simple inverse relationship between filter bandwidth and temporal resolution as defined by filter ringing (ie. the filter's impulse response). As auditory filters increase in frequency there is a progressive broadening of their bandwidth and so there should be an equivalent decrease in their ringing time. This is presumably so with respect to the mechanical filtering processes of the basilar membrane. The problem with any relationship between psychoacoustic temporal acuity and filter bandwidth is the possibility of the effects of some sort of neural non-linearity which may appear to reduce or even neutralise the inverse relationship between filter frequency resolution and filter time resolution. One might ask, for example, why most hearing impaired people, with their generally broader auditory filters, don't also have better temporal acuity than normal hearers for the frequency regions so affected (as would be predicted by those broader bandwidths). These issues will be examined below.
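The inverse relationship between bandwidth and ringing time described above can be illustrated for an idealised simple resonator, whose impulse-response envelope decays as exp(-πBt) for a -3 dB bandwidth B. The sketch below substitutes the ERB (Glasberg & Moore, 1990 fit) for B purely for illustration; it is not a model of cochlear mechanics and, as the text notes, neural non-linearities may obscure any such simple relationship:

```python
import math

def erb(f_hz: float) -> float:
    """Equivalent rectangular bandwidth in Hz (Glasberg & Moore, 1990 fit)."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

def ring_time_ms(bandwidth_hz: float, decay_db: float = 20.0) -> float:
    """Time (ms) for a simple resonator's impulse-response envelope to fall
    by decay_db, assuming an exponential envelope exp(-pi * B * t) for a
    resonator of -3 dB bandwidth B (a textbook idealisation, not a claim
    about the cochlea)."""
    return 1000.0 * (decay_db / 20.0) * math.log(10.0) / (math.pi * bandwidth_hz)

# Broader high-frequency filters should, on this idealisation alone,
# ring down much faster than narrow low-frequency filters.
for f in (250, 1000, 4000):
    print(f"{f:5d} Hz: ERB = {erb(f):6.1f} Hz, "
          f"~{ring_time_ms(erb(f)):.2f} ms to ring down 20 dB")
```

On this idealisation, temporal acuity should improve by roughly an order of magnitude between 250 Hz and 4 kHz, which is the prediction that the hearing-impaired gap-detection results discussed below fail to confirm.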

Perhaps the most powerful method for examining auditory temporal acuity is the determination of gap duration thresholds in otherwise continuous noise (narrow or broadband) or sinusoidal signals. Gap durations are gradually increased until a subject can either distinguish between two signals that are identical except for the presence of a gap, or actually hear the gap as a discontinuity in the signal. One major problem is that a gap in a signal is associated with a broadband spectral artefact (spectral splatter). This is not a major problem for broadband noise experiments, since such noise already has characteristics similar to spectral splatter. Broadband noise is not, however, useful in determining any relationship between frequency and temporal acuity; such experiments require narrow band noise or sinusoids. The effects of spectral splatter become more pronounced as the bandwidth of the test stimulus is decreased, and this provides an extra cue for the subjects to attend to. Because of the high perceptual salience of the spectral splatter, this could lead to significant underestimation of the gap threshold. This potentially misleading utilisation of irrelevant cues is generally avoided by lightly masking the signal so that the spectral splatter is just obscured without the gap itself also being masked (eg. 40 dB S/N, Moore et al, 1993, see below).

Fitzgibbons & Wightman (1982) found that hearing impaired subjects had significantly greater gap detection thresholds than normal hearing subjects, contrary to what would be predicted from the inverse of their broader filter bandwidths. Further, gap thresholds reduced as noise bandwidth increased, and increased as stimulus level increased. Fitzgibbons (1983) demonstrated that for intensities above 25-30 dB, gap detection thresholds in noise bands reduced (temporal acuity increased) as frequency increased.

Buus & Florentine (1985) compared the gap detection of impaired subjects and normally hearing subjects. The normal subjects were given signals masked in such a way that their thresholds simulated impaired audiograms. Some of the impaired subjects showed greater gap thresholds than their simulated-loss counterparts which was interpreted as evidence that elevated thresholds are not sufficient to explain enlarged gap detection thresholds.

Glasberg et al (1987) examined gap detection thresholds in narrow band noise at 500 Hz, 1 kHz and 2 kHz for 9 unilaterally and 8 bilaterally cochlear impaired subjects. Impaired ears showed gap detection thresholds ranging from normal values up to 20 ms at 500 and 1000 Hz and up to 35 ms at 2 kHz. There was a significant correlation between gap thresholds and absolute (audiometric) thresholds, with gap thresholds tending to deteriorate as absolute thresholds deteriorated. For some subjects, however, elevated absolute threshold did not entirely account for increases in gap threshold. For a few subjects, impaired ears had better gap detection at some frequencies than normal ears at the same SL, though this was true at equivalent SPL for only 2 subjects. Subjects with unilateral impairment showed a slower rate of recovery from forward masking in their impaired ear than in their normal ear for comparisons made at equal SPL, but this difference was much reduced for comparisons made at equal SL. Further, large gap thresholds are related to slow rates of recovery from forward masking, but as both tend to be related to absolute threshold, absolute threshold may mediate both effects.

Shailer & Moore (1983) examined the frequency and intensity dependent aspects of auditory time resolution by examining subjects' gap detection ability in narrow band (critical bandwidth) noise centred at a number of frequencies. They showed that temporal acuity is higher at higher centre frequencies owing to increased filter bandwidth. They compared the reciprocal of the gap detection threshold with filter bandwidths (ERBs) determined using Patterson's (1976) method of bandwidth determination. There is a very close correlation between measured ERBs and the inverse of the gap detection thresholds from 200 Hz (80 ms) to 1000 Hz (8 ms). These results suggest that gap detection thresholds below 1000 Hz are limited by the impulse response of the auditory filters and that the inverse of the gap detection thresholds provides a close estimate of the filters' ERBs below 1000 Hz. Above 1000 Hz the gap detection thresholds are greater than the filter impulse response (inverse of ERB), which, they suggested, is probably due to the limitations of neural processes rather than peripheral filter characteristics.

Shailer & Moore (1983, 1985) and Eddins et al (1992) found that gap detection thresholds in constant bandwidth noise are independent of band centre frequency and that gap detection is dependent upon bandwidth. Eddins et al (1992) examined bandlimited noise with HP cutoff at 600, 2200 and 4400 Hz and with bandwidths ranging from 50 to 1600 Hz. Their results for six normally hearing subjects showed that gap detection improved with increasing bandwidth but that for equivalent bandwidths there was no significant change in gap detection for different frequencies. They explain that as bandwidth increases "...relative fluctuations of the sample noise energy decreases, and a perturbation in the signal, such as a gap, becomes easier to detect." (ibid., p1073)
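The bandwidth effect described by Eddins et al can be sketched numerically: the relative fluctuation of short-term energy in bandlimited Gaussian noise shrinks roughly as 1/√(bandwidth × window duration), so a silent gap stands out more clearly against a broad band than against a narrow one. The following minimal simulation illustrates this; the sampling rate, band edges and window length are illustrative choices, not values taken from the cited studies.

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 16000  # illustrative sampling rate (Hz)

def band_noise(bw_hz, lo=600.0, dur=1.0):
    """Gaussian noise bandlimited to [lo, lo + bw_hz] via FFT masking."""
    n = int(fs * dur)
    spec = rng.standard_normal(n) + 1j * rng.standard_normal(n)
    f = np.abs(np.fft.fftfreq(n, 1.0 / fs))
    spec[(f < lo) | (f > lo + bw_hz)] = 0.0
    return np.fft.ifft(spec).real

def energy_fluctuation(x, win_s=0.010):
    """Std/mean of energy measured in consecutive 10 ms windows."""
    win = int(win_s * fs)
    e = np.add.reduceat(x ** 2, np.arange(0, len(x), win))
    return e.std() / e.mean()

fluct = {bw: energy_fluctuation(band_noise(bw)) for bw in (50, 400, 1600)}
for bw, v in fluct.items():
    print(f"bandwidth {bw:4d} Hz: relative energy fluctuation {v:.2f}")
# The 50 Hz band fluctuates far more than the 1600 Hz band, so a gap is
# harder to distinguish from the narrow band's intrinsic energy dips.
```

The printed fluctuation values fall steadily as bandwidth increases, consistent with the explanation quoted above.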

Moore et al (1993) examined detection thresholds for gaps in sinusoids (presented in a 40 dB S/N background noise to mask the spectral splatter associated with the gap). For frequencies from 400 to 2000 Hz gap thresholds were approximately constant at 6-8 ms, whilst gap thresholds were slightly higher at 200 Hz (8-10 ms) and markedly higher at 100 Hz (~18 ms). They measured the auditory filter bandwidths of each subject at each of the test frequencies to determine the relationship between filter bandwidth (ERB) and gap threshold. The data showed no significant correlation between ERB and gap thresholds. They also examined the relationship between gap thresholds and individual differences in the efficiency of the central detection processes. The efficiency factor (K) is determined as the signal-to-noise ratio at the output of the auditory filter required for threshold, and is the reason postulated by Moore & Glasberg (1986) for the differences between ERBs and critical bandwidths at low frequencies (where efficiency is low). Moore et al (1993) found a significant relationship between gap detection thresholds and detection efficiency. They propose the existence of a "central sliding integrator" whose integration time increases at lower frequencies.

Glasberg & Moore (1989) in an examination of subjects with unilateral and bilateral hearing loss, found that the ability to detect gaps in sinusoids is significantly correlated with measures of intensity discrimination. They present a model in which

"... the output of each auditory filter is subject to a non-linearity (e.g. rectification) and is then smoothed by a sliding temporal integrator. A gap in a stimulus causes a temporary decrease in the output of the temporal window, and it is assumed that the gap will be detected when this decrease exceeds a criterion amount, which may vary from subject to subject. Measures of intensity discrimination may be related to this criterion amount." (ibid., p 11)

On the other hand, they found no significant relationship between gap detection thresholds in noise and intensity discrimination measures. They also found that gap detection thresholds in noise are highly correlated with the frequency discrimination of pure and complex tones, which suggests a common underlying factor (such as the relationship between filter bandwidth and impulse response).

A second approach to the examination of auditory temporal acuity involves experiments that attempt to discern whether events with small temporal separations are interpreted as simultaneous or non-simultaneous. Pisoni (1977) examined subjects' ability to both identify and discriminate between two-tone (500 and 1500 Hz) stimuli that differed only in the onset of the 500 Hz tone relative to the onset of the 1500 Hz tone. In one of the experiments reported in that paper he examined ABX discrimination behaviour with stimuli consisting of pairs of tokens differing in relative onset by 20 ms. Even though the 12 subjects were completely unfamiliar with the stimuli and had not been trained to identify (categorise) them, most nevertheless showed peaks in their ABX discrimination functions corresponding to relative tone onsets of -20 and +20 ms. These results, as well as the results of Stevens & Klatt (1974), suggest that

"... 20 msec is about the minimal difference in onset time needed to identify the temporal order of two distinct events. Stimuli with onset times greater than about 20 msec are perceived as successive events; stimuli with onset times less than about 20 msec are perceived as simultaneous events." (Pisoni, 1977, p 1360)

These estimates are more than twice the gap detection threshold for sinusoids or noise bands (see above).

A third approach to the determination of auditory temporal acuity is to examine the extent to which non-simultaneous signals can mask each other (ie. forward and backward masking).

Fastl (1976) examined the temporal effects of forward and backward masking. The masker used in the experiment was a noise band centred on 8.5 kHz with a bandwidth of 1 Bark (1.8 kHz). The test tones examined were at 6.5, 8.5 and 11 kHz. For a 1 ms test tone centred on the masking noise (ie. 8.5 kHz) the forward masking threshold was about the same as for simultaneous masking for a separation of up to 2 ms and then dropped gradually (-10 dB at 10 ms, -20 dB at 20 ms) to an asymptote at a separation of about 200 ms. The backward masking threshold was about the same as for simultaneous masking up to a separation of about 3 ms, dropping (-10 dB at -6 ms, -20 dB at -10 ms) to an asymptote at a separation of about 20 ms. The gap detection threshold at 8.5 kHz is about 2-3 ms (Shailer & Moore, 1983) and so separations of less than that amount should (and do) produce masking thresholds similar to those for simultaneous masking. It is clear from these results that the auditory system responds rapidly to the onset of an impulse or step function but its response dies away much more slowly at the offset of an impulse or step function. A stop burst would therefore have only a minimal effect on the detectability of a preceding occlusion unless the occlusion length was very short (< 6-10 ms). It is also possible that a stop burst could effectively forward mask the first 10 ms or so of a following formant transition, especially when it is remembered that the first one or two glottal cycles are relatively weak. A preceding vowel could potentially have forward masking effects on a following occlusion but this would only be significant in the very unlikely event that the vowel intensity remained high until just before the occlusion and then dropped very quickly in intensity into the occlusion.

Studies of frequency transitions for tone-glides have indicated that perception may be based on different cues depending upon the duration, frequency extent or context of the transition (Pollack, 1968; Nabelek & Hirsh, 1969; Tsumura et al, 1973; Fujisaki & Sekimoto, 1975; see Porter et al, 1991 for a survey). When the duration of the transition is greater than about 300 ms the jnd for the frequency extent of the transition approaches the frequency jnd of frequency modulated tones. As durations decrease below 300 ms, transition detection thresholds increase and transition frequency extent discrimination becomes poorer (Pollack, 1968; Nabelek & Hirsh, 1969; Nabelek, 1978; Collins & Cullen, 1978; Porter et al, 1991), in a similar way to the deterioration in frequency discrimination with decreasing stimulus duration. Very short tone-glides (<50 ms) appear to display detection and discrimination performance which is due to a combination of temporal integration limits and other factors (Porter et al, 1991). The threshold for the detection of a transition depends upon both the direction and the rate of frequency change (Collins & Cullen, 1978; Nabelek, 1978; Cullen & Collins, 1982), whilst the discrimination of tone glides is more dependent upon the rate of frequency change than upon the extent of the frequency change (Nabelek & Hirsh, 1969). Porter et al (1991) suggest that:

"...the changes in performance seen at extremely short durations, and/or high rates of change, suggest listeners use psychoacoustic cues based on the dispersion of energy along the cochlea (e.g., perceived signal bandwidth) for discrimination of these signals rather than the pitch/timbre cues available at longer durations and/or lower rates of change." (ibid, p1299)

Lacerda (1987) examined stop locus discrimination in vowel-like stimuli with transitions of varying duration and rate. Abrupt transitions were better discriminated than gradual transitions or steady-state vowels. Abrupt, gradual and steady-state tokens were better discriminated as their durations increased. Gradual and steady-state tokens were approximately the same in their discrimination functions. He concluded that the auditory system adapted similarly to steady-state and very gradual transitions.

Porter et al (1991) utilised synthetic speech-like complex signals which contained analogues of 45 to 120 ms F2 transitions in order to examine the influences of transition duration, extent, rate of change, and direction upon the discrimination of transition onsets. The discrimination results approached but did not reach steady-state frequency jnds for the 120 ms transitions, and the jnd increased as the transition duration decreased to 45 ms. As with the tone-glide experiments reported above, falling transitions had smaller jnds than rising transitions, increments in rate (relative to a comparison standard) were discriminated better than decrements in rate, and 30 ms transitions with high rates of change were discriminated better than 30 ms transitions with low rates of change. Better discrimination for increments in rate seems to be related to a general tendency for any physical increase in a signal to result in a greater change in excitation than an identical physical decrease in the signal. The 30 ms stimuli were interpreted as approaching the auditory system's temporal resolution, so that changes in frequency over very short times are effectively interpreted as a non-time-distributed wide-band signal. The lower rate 30 ms signals would appear to have narrower bandwidths than the faster 30 ms signals because they traverse a smaller frequency extent, and so would be expected to be better discriminated because their frequency extents ("cochlear dispersion") would overlap to a lesser degree. Since the 30 ms signals seem to be processed as single wideband events, rising and falling transitions should, and do, produce identical discrimination.

Pols & Schouten (1987) examined the discrimination and identification of single tone sweeps, "band sweeps" (a 200 Hz wide band sweeping through the harmonics of a 200 Hz pulse train) and synthetic formant sweeps (a single formant sweeping in four-formant tokens). They noted training effects, with trained listeners having a greater sensitivity for tone and band sweeps. There was a tendency for zero-sweep-rate and low-sweep-rate stimuli to be identified as "down" sweeps, but this tendency was weaker for band-sweeps than for tone-sweeps. The subjects' ability to label tone and band sweeps was inferior to their ability to label stop place of articulation based on the synthetic speech tokens. They concluded that there were two main reasons for the differences between tone-sweep labelling and speech labelling. Firstly, the tone sweeps lacked a following steady-state component, which is common in speech; secondly, and they felt more importantly, the subject training effect on the tone-sweeps suggests that life-long training on speech formant-sweeps may be a most important factor.

Jamieson (1987) asked whether the auditory system processes 40-60 ms formant transitions in a way that makes them more salient than shorter or longer transitions. He points out that with isolated transitions, it is easier to discriminate between two short transitions differing by 10 ms (20 vs 30 ms) than it is to discriminate between two longer transitions of the same difference (40 vs 50 ms). However, although the vocal tract is capable of faster transitions, 40-60 ms transitions in voiced syllable-initial stops resist compression. When transitions are followed by steady-state continuations of the final frequency, 30 ms glides are better discriminated than 10 or 100 ms glides (Nabelek & Hirsh, 1969). Jamieson & Slawinska (1983, 1984) extended that experiment to include glide lengths of 10-90 ms in 10 ms steps (for frequency ranges approximating F1 and F2 transitions in /ba/) and found that there was a discrimination maximum at 40-60 ms. /ba/ and /wa/ percepts in these experiments are differentiated by transition duration, with /wa/ perceptions dominating above 60 ms. When the vowel duration is reduced significantly, however, the durational boundary between /ba/ and /wa/ shifts to shorter durations. This was interpreted by Miller & Liberman (1979) as being due to rate normalisation. Jamieson (1987), on the other hand, presented evidence that as the vowel intensity was attenuated the duration above which /wa/ perceptions dominated was reduced. He concluded that backward masking was responsible for the preference of 40-60 ms transitions over shorter transitions. Shortening of the durational boundary between /ba/ and /wa/ is caused by both intensity attenuation and shortening of the vowel, as both of these changes result in a reduction of the backward masking effect. Durations of 40-60 ms maximise the perceptual salience of cues to direction and rate of transition change.
Shorter durations are masked by following steady state signals and longer durations become increasingly less salient as they become more like steady-state signals. Sorin (1987) examined the auditory representations of stop bursts and the effects on them of backward masking of /k/ in /ækæ/ (with and without gaps inserted between the stop and the following vowel). Backward masking effects of 5 to 10 dB on the burst spectrum below 1400 Hz were noted as was the possibility of <5 dB masking from 1500 - 3000 Hz. Gaps between burst and vowel > 15 ms resulted in negligible masking and, in general, backward masking was found to be weaker than forward masking.

The perception of the temporal dimension of sound appears to involve a range of scales, depending upon the task being examined. Temporal integration occurs over periods of up to about 200 ms. Forward masking thresholds gradually reduce over a period of about 200 ms and so there seems to be a relationship between forward masking and temporal integration. Backward masking, on the other hand, has no effect on a stimulus preceding the masker by more than 20 ms. Forward masking at a separation of 10 ms has a masking threshold 10 dB lower than simultaneous masking involving the same stimuli, whilst backward masking at a separation of 6 ms has a masking threshold of -10 dB relative to simultaneous masking. It is clear from this that the auditory system responds much more quickly to stimulus onset than it recovers after stimulus offset. Gap detection results appear contradictory, but the differences can be traced to differences in methodology. There is some evidence for reduced gap thresholds at higher frequencies, but results supporting that idea are contaminated by the effects of noise bandwidth. There is clear evidence for better gap thresholds when utilising broader bands of noise, and this is explained as an artefact caused by noise fluctuations interfering with gap detection. For noise of constant bandwidth there appears to be no frequency effect on gap detection thresholds. Gap detection in broadband noise is possible for gaps as small as 2 ms, whilst just discernible gaps in sinusoids are 6-8 ms for medium frequencies (400-2000 Hz), with gap thresholds increasing greatly below 200 Hz. Moore et al (1993) explain that the increased gap thresholds at lower frequencies are correlated with a reduction in detection efficiency. Gap detection, and temporal acuity in general, is probably related to temporal integration and to nonsimultaneous masking. If the gap is too narrow then the preceding stimulus forward masks the gap and the following stimulus backward masks the gap.
At a certain separation the effects of forward and backward masking will combine to fill the gap sufficiently that it is not perceivable. The temporal acuity and temporal integration of the auditory system are not strongly related to peripheral filter ringing times (as predicted from the inverse of their bandwidths) but rather to neural processing and efficiency. Some psychoacoustic studies of tone sweeps and simulated formant transitions have suggested that the auditory system is more sensitive to dynamic stimuli than it is to static stimuli. This behaviour is even more evident in studies of neural representations of dynamic and static sounds, and so this issue will be examined in more detail in sections 2.3.1 and 2.3.2.

The auditory system is thus composed of a bank of overlapping bandpass filters with increasing bandwidth as frequency increases. This is not accompanied by filter time resolutions that become finer with increasing frequency. Neural processing limitations result in approximately constant temporal acuity at most frequencies. The time resolution of the auditory system appears to be of the order of 6-8 ms for sinusoids and narrow band signals with perhaps even finer resolution for broadband noise. Only around 100 Hz does the time resolution appear to be worse than 10 ms. Further, the temporal response of these filters to stimuli is non-symmetric with a much faster response to stimulus onset than to stimulus offset.

2.2.3 Intensity

Ernst Weber (1795-1878) formulated a "law" of psychophysics, now known as "Weber's law", which states that just noticeable increases in sensation are related not to fixed stimulus quantities but to the ratio of the just noticeable increase to the original stimulus quantity, ΔS/S, a ratio known as Weber's ratio. Weber's law can be stated in its general form as:

ΔR = k · ΔS/S
Weber's law. General form.

where ΔR is the just noticeable change in psychological response, k is a constant of proportionality and ΔS/S is Weber's ratio, which is constant for constant conditions (eg. frequency) but varying stimulus level. Weber's law can be restated for acoustic intensity discrimination as follows:

ΔL = k · ΔI/I
Weber's law. Acoustic intensity discrimination.

where ΔL is the just noticeable change in loudness, k is a constant of proportionality and ΔI/I is the ratio of the just noticeable intensity change to the original acoustic intensity.
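Weber's law can be illustrated in a few lines of code: the just noticeable increment ΔI grows in proportion to I, so the Weber fraction ΔI/I, and hence the jnd expressed in decibels, is the same at every level. The Weber fraction of 0.1 used below is an illustrative value, not a measured one.

```python
import math

def weber_jnd(i, weber_fraction=0.1):
    """Just noticeable intensity increment under Weber's law: ΔI = w·I."""
    return weber_fraction * i

for i in (1e-8, 1e-6, 1e-4):  # arbitrary intensities
    d = weber_jnd(i)
    jnd_db = 10.0 * math.log10((i + d) / i)  # the same jnd expressed in dB
    print(f"I = {i:.0e}: ΔI/I = {d / i:.2f}, jnd = {jnd_db:.3f} dB")
# ΔI/I (0.10) and the jnd in dB (0.414 dB) are constant across all levels.
```

A constant Weber fraction is thus equivalent to a constant jnd on a decibel scale, which is one reason logarithmic intensity scales are psychophysically convenient.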

Gustav Fechner (1801-1887) formulated a "law" of psychophysics, "Fechner's law", which he presented in his classic 1860 book Elements of Psychophysics (English translation: Fechner (1966)). Fechner's law states a general relationship between the magnitude of a physical stimulus (S) and the magnitude of the perceptual response (R):

R = k · log(S)
Fechner's law. General form.

where k is a constant of proportionality. This relationship was intended to apply generally to sensation and can be rewritten to apply to the relationship between acoustic stimulus intensity and loudness:

L = k · log(I)
Fechner's law, relating loudness and acoustic intensity.

where L refers to loudness and I to acoustic intensity.
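A logarithmic law has the property that equal stimulus ratios map onto equal response steps, which is precisely the property the decibel scale encodes. A minimal sketch of this property (the reference intensity and k below are arbitrary illustrative constants, not values from Fechner):

```python
import math

def fechner_loudness(i, k=1.0, i0=1e-12):
    """Fechner's law, L = k·log(I/I0): response grows with the log of
    intensity. k and i0 are arbitrary illustrative constants."""
    return k * math.log10(i / i0)

# Every tenfold increase in intensity adds the same constant to the
# response, whatever the starting level:
step_low = fechner_loudness(1e-7) - fechner_loudness(1e-8)
step_high = fechner_loudness(1e-3) - fechner_loudness(1e-4)
print(step_low, step_high)  # both steps ≈ 1.0 (times k)
```

This level-independence of the response step is what made dB-above-threshold seem a natural candidate loudness scale (see the discussion of Fletcher & Munson below).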

More recently Stevens (1959) proposed a psychophysical relation which has come to be known as "Stevens' power law" and which can be expressed as follows:

R = k · S^θ   or   ΔR/R = θ · ΔS/S
Stevens' power law. General forms.

L = k · I^θ   or   ΔL/L = θ · ΔI/I
Stevens' power law. Acoustic intensity forms.

where θ is the exponent of the power function, approximately 0.3 for the relationship between loudness and intensity above 40 dB SL (but see section below).
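With θ ≈ 0.3, Stevens' power law predicts that a tenfold increase in intensity (a 10 dB step) roughly doubles loudness, since 10^0.3 ≈ 2. A quick numerical check (k is an arbitrary illustrative constant):

```python
def stevens_loudness(i, k=1.0, theta=0.3):
    """Stevens' power law: L = k·I^θ, with θ ≈ 0.3 above about 40 dB SL."""
    return k * i ** theta

# +10 dB corresponds to a tenfold intensity increase; the loudness ratio
# is then 10**0.3, i.e. loudness approximately doubles per 10 dB step.
ratio = stevens_loudness(10.0) / stevens_loudness(1.0)
print(round(ratio, 2))  # 2.0
```

This "doubling per 10 dB" behaviour is the basis of the sone scale discussed later in this section.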

The decibel scale was originally devised by early telephone engineers as a convenient scale for sound amplitude as it was considered to relate closely to human perception of loudness as expressed by Fechner's law. This logarithmic relationship of intensity to loudness was not derived from a thorough empirical measurement of human loudness perception as the technology of Fechner's time did not allow accurate measurements of sound intensity. Close observation by Weber, Fechner and others, however, had led to a general recognition that the relationship between human loudness perception and intensity was not a linear function of either intensity or pressure but rather something more closely approximating a logarithmic relationship.
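For reference, the decibel scale is defined logarithmically relative to a reference level. The conventional modern references (20 µPa for pressure, 10⁻¹² W/m² for intensity) are used in the sketch below for illustration; they are not part of the historical account above.

```python
import math

def intensity_db(i, i0=1e-12):
    """Intensity level in dB re 10^-12 W/m^2."""
    return 10.0 * math.log10(i / i0)

def spl_db(p, p0=20e-6):
    """Sound pressure level in dB re 20 µPa; intensity is proportional to
    pressure squared, hence the factor of 20 rather than 10."""
    return 20.0 * math.log10(p / p0)

print(round(intensity_db(2e-12), 1))  # doubling intensity adds ~3.0 dB
print(round(spl_db(40e-6), 1))        # doubling pressure adds ~6.0 dB
```

The two definitions are consistent with each other: a doubling of pressure quadruples intensity, and both correspond to the same ~6 dB step.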

Since that time there have been a number of detailed examinations of human perception of intensity. There is more than one way of measuring human perception of sound intensity. Apart from the measurement of intensity thresholds (audiograms), there are three main procedures. The first involves the measurement of just noticeable differences (jnds) (Knudsen, 1923; Riesz, 1928). The second involves determining which intensities at different frequencies sound equally loud, yielding the equal loudness or phon scale (Fletcher & Munson, 1933). The third asks what change in intensity is required to produce, for example, a doubling in perceived loudness; the derived scale is known as the sone scale (Stevens, 1938). In the following sections the current state of research in acoustic intensity perception will be examined, as will the relationship between intensity discrimination and loudness perception.

Intensity Discrimination

There are two main methods of examining intensity jnds. The first involves the presentation of one sound followed immediately by a second sound that is identical to the first in every way except that it may differ slightly in intensity. Subjects are asked to report whether they can detect a difference in loudness. Zwicker & Fastl (1990) point out that any abrupt change in intensity results in a relatively broadband noise which may be heard as an audible click. This problem can be partly overcome by placing a gap between the two signals, which then results in a click regardless of whether the levels of the two signals have been varied or not. Some researchers (Zwicker & Fastl, 1990) have preferred the method of amplitude modulation to avoid these problems (cf. the frequency modulation method for frequency discrimination determination, section above). In amplitude modulation the signal amplitude is continuously modulated at a low rate (a few Hz) and the subjects' task is to identify whether they can hear the continuous fluctuations or not.

Perhaps the earliest systematic measurements of intensity discrimination were made by Knudsen (1923), who showed that ΔI/I is not constant, in contradiction of Weber's law. Knudsen's experiments were carried out using sinusoids which abruptly changed in level, and so it is very likely that the subjects were able to attend to spectral splatter artefacts; the results therefore cannot be considered reliable. Riesz (1928) utilised the amplitude modulation method to obtain more reliable results across a number of intensities. This study also showed ΔI/I to be non-constant with intensity, decreasing gradually with increasing intensity. Numerous more recent studies have examined intensity discrimination using discrete sinusoids presented with a silent gap between them (eg. Harris, 1963; Bilger et al, 1971; Florentine et al, 1987) or pulsed sinusoids (Jesteadt et al, 1977; Florentine, 1983). Several of these studies have also confirmed what has become known as the "near miss" to Weber's law (eg. McGill & Goldberg, 1968b; Henning, 1970; Rabinowitz et al, 1976; Jesteadt et al, 1977), and Carlyon & Moore (1984) showed a "severe departure" from Weber's law for very short stimuli (30 ms) at high frequencies.

The "near-miss" to Weber's law has been demonstrated for sinusoids, but not for intensity discrimination in noise. Miller (1947b) examined intensity discrimination in white noise and showed that Weber's law holds up to 100 dB SL.
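The "near miss" is often summarised by letting the jnd grow slightly more slowly than intensity, ΔI ∝ I^β with β just below 1, so that the Weber fraction ΔI/I falls gradually with level rather than staying constant. The sketch below uses this common parameterisation; the values of c and β are purely illustrative and are not fitted to any of the cited data.

```python
def weber_fraction_near_miss(i, c=0.1, beta=0.9):
    """'Near miss' to Weber's law: ΔI = c·I^β with β slightly below 1, so
    the Weber fraction ΔI/I declines slowly as level rises (c and beta
    are illustrative values only, not fitted constants)."""
    return c * i ** beta / i

for i in (1.0, 10.0, 100.0, 1000.0):
    print(f"I = {i:6.0f}: ΔI/I = {weber_fraction_near_miss(i):.3f}")
# The Weber fraction shrinks by a constant factor per decade of intensity,
# unlike the strictly constant fraction that Weber's law predicts.
```

Setting β = 1 recovers Weber's law exactly, which makes the "near miss" label transparent: the data depart from the law by only a small exponent.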

Figure 2.8 compares ΔI/I values at 1000 Hz from Riesz (1928) and Jesteadt et al (1977). The two curves are comparable at 20 dB SL and above and only deviate at low SL. There is also a slight tendency towards frequency dependency in Riesz's data (not shown in figure 2.8), with higher values of ΔI/I at high and low frequencies than at middle frequencies.

Figure 2.8: Comparison of ΔI/I values taken at 1000 Hz from Riesz (1928) and Jesteadt et al (1977), demonstrating the "near-miss" to Weber's law.

This tendency for higher values at low SL in Riesz's data can also be seen in figure 2.9, which compares ΔI versus I at 1000 Hz as derived by Riesz (1928), Jesteadt et al (1977) and Florentine et al (1987). The two lowest data points on the Riesz curve represent 10 and 20 dB SL. Riesz's amplitude modulation results are clearly higher at 10 dB SL than results derived from pulsed sinusoids (Jesteadt et al, 1977) or discrete sinusoids (Florentine et al, 1987). Riesz's results at 20 dB SL are well within 1 standard deviation of the results of Florentine et al. The results of Riesz and of Florentine et al are in very close agreement for 20 dB SL and higher, with the results of Jesteadt et al only converging with them at around 40 dB SL.

Figure 2.9: Comparison of intensity discrimination results at 1000 Hz from Riesz (1928), Jesteadt et al (1977) and Florentine et al (1987).

Figures 2.10 (Riesz, 1928) and 2.11 (Florentine et al, 1987) compare intensity jnd measures across a range of intensities and frequencies. Florentine et al (1987) showed a large range of inter- and intra-subject variation, with typical standard deviations of about 1.5 dB. It should be evident that Riesz's data is within 1 standard deviation of the data of Florentine et al over the frequency range of 200 Hz to 1 kHz at all intensities except 10 dB SL (the lowest data point on each curve). Florentine's data shows a slight tendency for higher ΔI at higher frequencies (unfilled symbols). Riesz's data shows this tendency for non-central frequencies (70, 200 and 10,000 Hz), and it is especially marked at the lowest frequencies measured, 35 Hz (not displayed) and 70 Hz.

Figure 2.10: Intensity discrimination results across a range of intensities and frequencies (Riesz, 1928)

Figure 2.11: Intensity discrimination results across a range of intensities and frequencies (Florentine et al, 1987)

"Relative intensity levels at different regions of the spectrum, the definition of peaks and valleys in the spectrum, and the frequency region where the energy is present are thought to be the most important aspects of the speech code" (Green & Bernstein, 1987, p314). Green & Bernstein (1987) call the discrimination of spectral shape changes "profile analysis" and contrast this with "pure intensity discrimination", which is the simple discrimination between two spectra on the basis of either global or local differences in intensity. They point out that for many experimental designs it is difficult to be sure whether a listener is discriminating between two spectra on the basis of global changes in shape or is basing the discrimination on a simple local change in intensity. Zwicker's (1970) model of spectral discrimination, on the other hand, is based on the assumption that an intensity difference of only about 1 dB anywhere in the spectrum is sufficient for two spectra to be discriminated, thus implying that spectral discrimination is actually based upon what Green & Bernstein (1987) have referred to as "pure" intensity discrimination. Green & Bernstein (1987) see "profile" or spectral shape discrimination as distinct from simple intensity based discrimination. They utilised an experimental design which was intended to force the listener to make a simultaneous comparison of two or more spectral regions. They were able to demonstrate (citing an earlier study: Green, Mason & Kidd, 1984) that "...profile analysis is a global process which relies upon the integration of information across many critical bands" (Green & Bernstein, op.cit., p323) by showing that profile discrimination improved as extra components were added that were remote from the centre frequency in terms of critical band distance.
They also found a large variation in observer performance with "profile experienced" subjects performing better on profile tasks than on the "pure-intensity" discrimination of a 1000 Hz sinusoid, whilst "profile inexperienced" subjects performed far better on the "pure-intensity" discrimination task. The signals utilised in this study, however, don't closely resemble speech signals and so it is not clear whether they can reliably be used to predict speech perception behaviour. Nevertheless, the results do suggest that large-scale spectral integration may be an important factor in the whole spectrum or profile classification of speech (see section 2.3.3) as well as emphasising the importance of subject familiarity with the spectral profiles being analysed (as would be the case in the perception of speech by a normally hearing, linguistically competent subject).

Loudness

Figure 2.12: Equal-loudness or phon contours. (after Robinson & Dadson, 1956)

Fechner's law would predict a direct logarithmic relationship between intensity and loudness, or alternatively a linear relationship between decibels and loudness, and indeed, Fletcher & Munson (1933) state that one of them (1921, details not specified) had proposed that decibels above threshold (dB SL) should be used as a measure of loudness. Implicit in this proposal is the assumption that stimuli at threshold are equally loud (presumably loudness is uniformly just above zero at threshold, but called zero for convenience). Some models of loudness (eg. Lim et al, 1977) have taken this assumption one step further to assume that stimuli at the lower limit of hearing are equally loud and also that stimuli at the upper limit of hearing are equally loud. Fletcher & Munson (1933) showed that points of equal loudness at different frequencies are not equal numbers of dB above threshold and developed a set of equal loudness contours that described the points of equal loudness across the range of human hearing. Subjects were asked to determine the intensities at various frequencies that sounded as loud as standard intensities at 1000 Hz. Loudness level, in phons, has the same numerical value as the level in dB of an equally loud 1000 Hz tone. Robinson & Dadson (1956) updated the loudness level scale, correcting numerous prior inaccuracies at high and low frequencies, and their results (see figure 2.12) now form the basis of standard definitions of loudness level (ISO/R226-1961). Fletcher & Munson (1933) suggest that the 120 phon contour is very close to the threshold of feeling. Unfortunately, since 1000 Hz was used as the standard, the 0 phon contour is actually below threshold and the lowest contour displayed in figure 2.12 is not 0 phons, but the threshold (minimum audible field or MAF).

Figure 2.13: Loudness levels (phons) versus loudness (sones) after Fletcher & Munson (1937). The dotted line is derived from equation VII.

The loudness level or phon scale enables the equating of loudness across frequencies, but it makes no predictions about relative levels of loudness. What does it mean to say that a sound is twice as loud or half as loud as another sound? 40 phon is not twice as loud as 20 phon, neither is it half as loud as 80 phon. Fletcher & Munson (1937) calculated loudness levels and relative loudness from masked audiograms, which relate shifts in pure tone threshold to the distance from a noise band and to the intensity of that band. The major assumption is that these audiograms are a function of neural excitation and that loudness is also a function of neural excitation. They produced a relationship between loudness level (phon) and loudness that is shown in figure 2.13. Note that there is a linear relationship between phons and the log of loudness for intensities above 40 phon, with loudness tending to zero as loudness level approaches zero. The relationship for intensities above 40 phon is described by the formulae:

Equation VII:  N = 0.046 × 10^(Ln/30)

Equation VII(b):  N = 2^((Ln - 40)/9)

where N is loudness and Ln is loudness level. The two formulae are approximately equivalent. It can readily be seen from equation VII(b) that a doubling in loudness results from an increase in loudness level of 9 phon.
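These formulae can be checked numerically. The following Python sketch (purely illustrative, not part of the original study) implements both forms, confirms the 9 phon doubling that is explicit in equation VII(b), and shows that the two forms agree to within a few percent above 40 phon:

```python
def sones_vii(phon):
    # Equation VII: N = 0.046 * 10^(Ln/30)
    return 0.046 * 10 ** (phon / 30.0)

def sones_vii_b(phon):
    # Equation VII(b): N = 2^((Ln - 40)/9)
    return 2 ** ((phon - 40.0) / 9.0)

# VII(b) makes the 9-phon doubling explicit:
assert abs(sones_vii_b(49.0) / sones_vii_b(40.0) - 2.0) < 1e-9

# The two forms agree to within a few percent above 40 phon:
for p in range(40, 101, 10):
    assert abs(sones_vii(p) - sones_vii_b(p)) / sones_vii_b(p) < 0.05
```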

The unit of loudness, the sone, was proposed by Stevens (1936). One sone was defined as the loudness of a 1000 Hz tone at 40 phons; a sound judged twice as loud has a loudness of 2 sones, and so forth. Stevens & Davis (1938) derived a relationship between loudness level and loudness which has been developed through several iterations to a standardised form (ISO/R 532-1966(E)) as described by Stevens (1961) and designated "Mark VI".

Fletcher (1953) provided approximately equivalent formulae which more accurately describe this relationship than formula VII:

Equation VI:  N = 0.0625 × 10^(0.03 × Ln)

Equation VI(b):  N = 2^((Ln - 40)/10)

It can readily be seen from equation VI(b) that a doubling in loudness results from an increase in loudness level of 10 phon. Figure 2.14 shows Stevens' (1961) "Mark VI" data with the dotted line indicating equation VI.
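Fletcher's pair of formulae can likewise be checked numerically (an illustrative sketch only):

```python
def sones_vi(phon):
    # Fletcher (1953), equation VI: N = 0.0625 * 10^(0.03 * Ln)
    return 0.0625 * 10 ** (0.03 * phon)

def sones_vi_b(phon):
    # Equation VI(b): N = 2^((Ln - 40)/10)
    return 2 ** ((phon - 40.0) / 10.0)

# VI(b) makes the 10-phon doubling explicit:
assert abs(sones_vi_b(50.0) / sones_vi_b(40.0) - 2.0) < 1e-9

# The two forms are near-equivalent above 40 phon:
for p in range(40, 101, 10):
    assert abs(sones_vi(p) - sones_vi_b(p)) / sones_vi_b(p) < 0.05
```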

Finally, Stevens (1972) presented a revised version ("Mark VII") of the relationship between phon and sone scales which he showed to be even more in accord with psychoacoustic results and which took into account other studies of perceived magnitude, "loudness", "annoyance", "noisiness", etc. This relationship is displayed in figure 2.15 against the line derived from formula VII. It is clear that formula VII is a closer match to "Mark VII" than it is to the data of Fletcher & Munson (1937), with the Mark VII values only beginning to diverge from the formula below 20 phons. It would appear that the formula (VII) which predicts a doubling in loudness with each increase of 9 phons is more in accord with the most recent data than the alternative formula which predicts a doubling in loudness with each increase of 10 phons. (Equation VII is used subsequently in this study for the modelling of loudness; see section .)

Loudness is independent of duration for tone bursts longer than 200 ms; for shorter bursts, each reduction in duration by a factor of 10 reduces the loudness level by about 10 phons (Zwicker & Fastl, 1990).
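This rule of thumb can be written as a small function. The hard 200 ms breakpoint is a simplification of the empirical data and is used here purely for illustration:

```python
import math

def loudness_level_shift(duration_ms):
    """Approximate change in loudness level (phons) for a short tone burst:
    no change above 200 ms, and roughly a 10 phon drop for each factor-of-10
    reduction in duration below that (after Zwicker & Fastl, 1990).
    The sharp 200 ms breakpoint is a simplification."""
    if duration_ms >= 200.0:
        return 0.0
    return -10.0 * math.log10(200.0 / duration_ms)

assert loudness_level_shift(500.0) == 0.0          # long bursts unaffected
assert abs(loudness_level_shift(20.0) + 10.0) < 1e-9  # 10x shorter: -10 phons
```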

Figure 2.14: Loudness level (phon) versus loudness (sone) "Mark VI" after Stevens (1961). The dotted line is derived from equation VI.

Uniform exciting noise (equal intensity per Bark across the frequency range of hearing) is louder than a 1000 Hz tone of the same SPL whilst narrow bands of noise (less than 1 Bark bandwidth) sound as loud as a 1000 Hz tone of the same SPL. (Zwicker & Fastl, 1990) (This observation is the basis for one of the methods of estimating critical bandwidths. See section above.)

Figure 2.15: Loudness level (phon) versus loudness (sone) "Mark VII" after Stevens (1972). The dotted line is derived from equation VII.

Some studies (eg. Marks, 1987) have shown that diotic loudness summation can, under "appropriate" conditions, be complete. Marks (ibid.) found that when a 1000 Hz tone was presented in quiet at the same intensity to both ears, its loudness in sones was twice the loudness of the same tone presented to one ear. For complete summation to occur the sound had to be a pure tone or a noise band narrower than a critical bandwidth, and loudness needed to be computed in sones. Zwicker & Zwicker (1991), on the other hand, showed that summation of loudness for diotic presentation is only partial. That is, sound presented diotically is louder than the same sound presented monaurally but is perceived as being less than twice as loud. When the difference in the level presented to the two ears is zero, the loudness is about 1.5 times the loudness for one ear, and as the difference in level between the two ears increases, the increase in loudness becomes smaller. This behaviour showed no dependence upon frequency, level or spectral shape.

Zwicker and colleagues (see Zwicker & Fastl, 1990) have argued for a specific loudness scale (sone/Bark) which is used in the determination of the loudness of complex signals. Moore & Glasberg (1986) have also used specific loudness, but defined as sone/ERB. The relation of specific loudness versus critical-band rate is referred to as a "loudness distribution" and is related to auditory nerve excitation. Integration of a specific loudness versus critical-band-rate curve results in the loudness of the total signal in sones. For example, the excitation produced by a 1000 Hz tone spreads to a greater extent as intensity is increased, according to a pattern which is defined by filter shape. The height of the peak of the loudness distribution for a 70 dB 1000 Hz tone is not 10 sones/Bark (a 70 dB 1 kHz tone has a loudness of 10 sones) but rather a little under 3 sones/Bark. Because of the spread of excitation, the total number of sones is spread over more than one critical band (or ERB), and it is the total number of sones under the distribution that equals 10 sones. An alternative approach (Bladon & Lindblom, 1981) is to Bark filter and scale a spectrum, then convert dB to the phon and then to the sone scales. The conversion to phon and then to sone should involve an integration over one Bark (or preferably an auditory filter) for each frequency step, and it is the intensity in each Bark band that is converted, not the intensities at each arbitrarily spaced frequency point. The latter approach is the one followed in the present study.

Relationship Between Intensity Discrimination and Loudness

Houtsma et al (1980) suggested that two sounds are equally loud if their intensities are the same number of intensity jnds above the threshold at their respective frequencies.
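This proposal can be made concrete with a toy calculation. The constant 1 dB jnd assumed below is purely illustrative, since real intensity jnds vary with both level and frequency:

```python
def jnds_above_threshold(level_db, threshold_db, jnd_db=1.0):
    """Number of intensity jnds separating a tone from its threshold.
    A constant jnd in dB is assumed purely for illustration."""
    return int((level_db - threshold_db) / jnd_db)

# Under Houtsma et al's (1980) proposal, tones at different frequencies are
# equally loud when they sit the same number of jnds above their respective
# thresholds, even though their absolute levels differ:
assert jnds_above_threshold(40.0, 10.0) == jnds_above_threshold(55.0, 25.0)
```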

Zwislocki & Jordan (1986) examined the relationship between loudness and intensity discrimination for both normal listeners and listeners with monaural hearing loss (to allow loudness matching between the normal and impaired ear). They found no relationship between the slope of the loudness function and the Weber fraction (ΔI/I), but found that jnds are approximately equal when the loudnesses across ears or subjects are equal for any given frequency. In other words, intensity jnds are closely correlated with loudness, but the relationship contradicts Fechner's law (L = k log I), Weber's law (ΔL = k ΔI/I) and Stevens' power law (ΔL/L = θ ΔI/I, where θ is the exponent of the power function, ie. L = k I^θ).

Hellman et al (1987) examined loudness growth of a tone in narrow and wide band noise. This produced two loudness functions with different slopes. When intensity jnd was measured for these two stimuli at the point where the two loudness functions crossed (equal SPL and equal loudness) they were found to be the same in spite of the very different slopes of the loudness functions. This confirmed the conclusions of Zwislocki & Jordan (1986) that there is no relationship between intensity discrimination and the slope of the loudness function.

Rankovic et al (1988) presented evidence that pure tones of the same frequency (masked to simulate recruitment, or in quiet) that are judged to be equally loud do not necessarily have equal intensity jnds. Instead, their evidence supports the model of Houtsma et al (1980), which claims that loudness is related not to the number of jnds above threshold, but rather to the proportion of jnds contained in the range.

Zwicker & Fastl (1990) consider intensity jnds to be "...received as just-noticeable differences in loudness ... [and that] it is not the absolute increment of loudness that is responsible for JNDL, but a relative increment." (ibid, p180) They also note that loudness cannot be derived directly from intensity jnd values, in contrast to their claim that frequency discrimination (Bark) is directly related to frequency jnds (which represent the separation of the overlapping 1 Bark filters).

Johnson et al (1993) concluded that equal loudnesses are associated with equal jnds regardless of the SPL or SL which, they argue, suggests that jnds are coupled directly to loudness. They confirmed that there is no relationship between the slope of the loudness function and the size of jnds, ie. the jnd is not the derivative of the loudness function. They also suggest that the equal-loudness, equal-jnd theory and the proportional-jnd theory are not mutually exclusive to a first approximation.

It seems clear from the above that the relationship between loudness and intensity discrimination is not yet fully resolved. Another question which has not been resolved is the relationship between loudness, intensity discrimination and timbre perception, especially as it relates to speech perception. As Lindblom (1986) points out, "It is by no means an established fact that timbre (vowel quality) and loudness judgements use identical inputs. They might tap partly parallel and partly different processes." (ibid., p29)

The present study examines both intensity-jnd and sone scaling in experiments that examine the intelligibility of parametrically scaled speech and in a second series of experiments that examine the use of both scales in spectral distance measures. The intention of the present experiments is to examine the extent to which these two ways of scaling the intensity dimension are related to the phonetic processing of auditorily transduced speech.

2.2.4 Phase

Hermann Helmholtz, in the mid-to-late 1800s, extended Ohm's Acoustic Law by observing that the ear is effectively insensitive to phase. Helmholtz, however, "confined his conclusion to the 'musical' portion of the sound" (Wever 1949, p419). Wever outlines various early studies of phase perception which generally support Helmholtz's notion when confined to periodic signals with steady state phase and amplitude characteristics. Some of the studies he reviewed, however, indicated that rapidly changing phase can be perceived due to the effects of phase relationships (reinforcement and cancellation) on signal amplitude.

Licklider (1957) altered the phase of the first 16 harmonics of a complex tone and found that these changes nearly always resulted in a discriminable difference in timbre. The effect was greater for higher harmonics than for lower ones, and for lower fundamentals. Plomp & Steeneken (1969) presented triadic comparisons of complex tones of equal pitch and loudness but different phase patterns and found that the maximum timbre differences occurred when comparing complexes with all harmonics in sine or cosine phase with otherwise identical complexes with alternating sine and cosine phase. Complex tones with slope (-6 dB/octave) and pitch (150 Hz) resembling speech had a maximal phase effect on timbre which was less than the effect of changing the slope by 2 dB/octave. The effect was strongest for lower fundamentals and was independent of amplitude pattern and SPL.
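The key point, that such phase manipulations radically alter the waveform while leaving the magnitude spectrum untouched, is easy to demonstrate. The parameters below (150 Hz fundamental, 16 equal-amplitude harmonics) are illustrative rather than a reconstruction of Plomp & Steeneken's stimuli:

```python
import numpy as np

fs, f0, n_harm = 16000, 150, 16
t = np.arange(fs) / fs  # exactly one second, so FFT bins fall at 1 Hz spacing

def complex_tone(phases):
    # Sum of equal-amplitude harmonics of f0 with the given phases (radians).
    return sum(np.sin(2 * np.pi * f0 * (k + 1) * t + ph)
               for k, ph in enumerate(phases))

all_sine = complex_tone([0.0] * n_harm)                       # all-sine phase
alternating = complex_tone([0.0, np.pi / 2] * (n_harm // 2))  # alternating sine/cosine

# The waveforms differ markedly ...
assert not np.allclose(all_sine, alternating)

# ... but the energy at each harmonic is identical: phase manipulation leaves
# the magnitude spectrum unchanged.
bins = [(k + 1) * f0 for k in range(n_harm)]
mag_a = np.abs(np.fft.rfft(all_sine))[bins]
mag_b = np.abs(np.fft.rfft(alternating))[bins]
assert np.allclose(mag_a, mag_b)
```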

Green & Mason (1985) produced stimulus pairs with components made up of logarithmically spaced tones but with different spectral shapes ("profiles") and presented the pairs either with fixed phase or with random phase. They found that phase had no effect on their discrimination results. This may suggest that the effect of phase changes on harmonic spectra (Licklider, 1957, above) is perceptually significant whilst for non-harmonic spectra (Green & Mason, ibid.) it is less significant (if significant at all).

Carlson et al (1979) noted that the randomisation of the phase of vowel harmonics caused large changes in the psychoacoustic (non-phonetic) quality of vowels, but only very small effects on phonetic quality. Palmer et al (1987) demonstrated, however, that the phase shifting of a single harmonic just above F1 can elevate the perceived F1 by about 20 Hz. This effect is of a similar order to increasing the same harmonic's amplitude by 4 dB.

Traunmüller (1987) demonstrated that phase information alone is sufficient to convey enough information to allow vowel discrimination at low (adult male F0 and below) but not high F0. This discrimination is based on the ear's sensitivity to spectral components (harmonics) within the same critical band (preferably at least three partials, although two may suffice). Presentation through speakers in a normal room of "phase vowels" (that were intelligible when presented via headphones) resulted in unintelligible signals, because of the phase distortions created by the room.

It is well known that the shape of the time domain waveform of a complex signal is very sensitive to changes in the phase relationships of its various frequency components. For example, the waveform of the vowel /i/ varies greatly with changing phase. This variation occurs without any significant change in the intelligibility of the vowel. This well known perceptual insensitivity to phase in vowels has been used as one of the basic assumptions of channel vocoder techniques. Channel vocoders (which are used extensively in the present study) typically consist of a bank of bandpass filters with linear zero phase spectra. The omission of phase from the transmitted signal is one of the major reasons for the moderate transmission bandwidth savings available when using this type of vocoder.
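The vocoder assumption that short-time phase can be discarded can be illustrated with a deliberately crude magnitude-only resynthesis. Real channel vocoders use filter banks, overlapping frames and an explicit excitation source, none of which are modelled in this sketch:

```python
import numpy as np

def zero_phase_resynthesis(signal, frame_len=256):
    """Crude illustration of the channel-vocoder assumption: keep only the
    short-time magnitude spectrum of each (non-overlapping) frame and
    resynthesise with zero phase."""
    out = np.zeros_like(signal)
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        mag = np.abs(np.fft.rfft(frame))                # amplitude kept
        out[start:start + frame_len] = np.fft.irfft(mag, n=frame_len)  # phase discarded
    return out

# The waveform changes, but each frame's magnitude spectrum is preserved:
rng = np.random.default_rng(0)
x = rng.standard_normal(1024)
y = zero_phase_resynthesis(x)
assert not np.allclose(x, y)
assert np.allclose(np.abs(np.fft.rfft(x[:256])),
                   np.abs(np.fft.rfft(y[:256])), atol=1e-9)
```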

Gold (1964) cited two earlier studies which indicated that "there is some evidence that severe phase distortion introduced by both pitch-excited and voice-excited vocoders causes deterioration in the quality of the synthetic speech" (ibid. p1892). Gold's study itself produced an unintelligible wave derived from three simple formant trackers, but which had "speechlike phase" and used this wave to excite the vocoder synthesiser (rather than the usual zero phase buzz and random phase hiss). Gold found that his listeners (in informal listening tests) reported that the output speech sounded natural and no longer possessed the then typical vocoder quality.

Flanagan and Golden (1966) extracted both phase and amplitude information from natural speech in their "phase vocoder" and recombined the two signals on resynthesis, thus avoiding the need for separate voicing and pitch analysis and excitation. In this system, both phase and amplitude were band limited and transmitted (rather than just the amplitude information as in normal channel vocoders). The main difficulty with this system was the need to produce a differential phase spectrum which could be band-limited (unlike the normal phase spectrum, which is unbounded). The bandlimited differential phase values could then be transmitted (along with the bandlimited amplitude values) and the phase could be restored by integration before recombining with amplitude at resynthesis. The synthetic speech produced by this system was claimed to "considerably surpass[ ]" the quality of normal channel vocoders.

Oppenheim et al (1979) demonstrated that speech with the amplitude spectrum set to unity and the phase spectrum retained intact produced "phase-only" speech with unnatural (noisy) quality but with a high degree of intelligibility. Spectrograms of this speech showed that the formant structure of the speech had been maintained. These results parallel similar findings with phase in image reconstruction. Oppenheim & Lim (1981) further demonstrated that speech with phase set to zero (amplitude maintained) is less intelligible than speech with amplitude set to unity (phase maintained). These results were obtained by manipulation of the long-time Fourier transforms of these signals. They concluded that phase in short-time spectra is insignificant. This is presumably because short-time spectra could be said to be modelling speech as a quasi-stationary signal in which the frequency components could be said to be stationary. Further, in the case of long-time spectra, speech is no longer being modelled as a quasi-stationary signal, but as a signal in which the frequency components change dynamically in time. Oppenheim & Lim concluded that for both speech and images phase information preserves the "location of events" such as "lines, edges and other narrow events" (ibid. p534). In other words, the long-time phase spectrum encodes the location of major changes in the signal, and in speech this implies changes in amplitude of frequency components. It is not surprising, therefore, that "phase-only" speech is reasonably intelligible as normal continuous speech is continuously changing and it is these changes that are preserved by the long-time phase spectrum. The long-time amplitude spectrum only encodes the average values for each of the frequency components and so effectively time-smears the signal when the phase information is omitted.
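The "phase-only" versus "amplitude-only" contrast is easily reproduced for a toy signal; a single click stands in for the "narrow events" whose locations the phase spectrum preserves:

```python
import numpy as np

def phase_only(x):
    """Set the long-time amplitude spectrum to unity, keep phase."""
    X = np.fft.fft(x)
    return np.real(np.fft.ifft(np.exp(1j * np.angle(X))))

def magnitude_only(x):
    """Set phase to zero, keep the amplitude spectrum."""
    X = np.fft.fft(x)
    return np.real(np.fft.ifft(np.abs(X)))

# A single click at sample 300: the phase-only reconstruction keeps the event
# at its original location, while the amplitude-only reconstruction smears it
# to time zero (the "location of events" is carried by the phase spectrum).
x = np.zeros(1024)
x[300] = 1.0
assert np.argmax(np.abs(phase_only(x))) == 300
assert np.argmax(np.abs(magnitude_only(x))) == 0
```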

Even though channel vocoding might be considered to be a short-time analysis, and as such the phase spectrum should therefore be unimportant, there are several phonemes which involve rapid changes in amplitude (associated with rapid opening, closing, coupling or uncoupling of resonator chambers). It is reasonable to assume that such rapid changes might constitute important perceptual cues. Further, it is possible that at least some of these cues might require precise identification of their location (analogous to edge location in images) or of the shape of their time-domain waveform amplitude envelope. It is exactly these phonemes that short-time Fourier analysis is least able to model accurately as a quasi-stationary system. It is therefore likely that it is these phonemes which most require the edge location information supplied by the phase spectrum and these phonemes which will suffer most from inadequate phase information. Such consonants include the stops and affricates (opening or closing of the oral cavity), the nasal consonants (coupling or decoupling of the nasal cavity and opening or closing of the oral cavity), and the lateral /l/ (coupling or decoupling of the two parallel oral cavities).

2.3 Central Auditory Processing of Non-speech and Speech Sounds

2.3.1 Auditory Nerve Representations of Sound

There are two main problems with the examination of the representations of speech signals at some point in the nervous system.

Firstly, these methods generally involve surgical intervention and so the experiments are limited to work on animal auditory systems. Such work most often utilises the cat, as its auditory system is judged to be similar to the human auditory system, but there are numerous known differences. For example, there are about 30,000 auditory nerve fibres in humans and about 50,000 in cats (Pickles, 1988). It is difficult to predict the ways in which the neural representation of speech sounds in the cat (or other experimental animals) will diverge from neural representation in humans. It is likely, however, that it will be possible to determine many general principles from animal studies, even if some of the details will remain problematic.

The second problem arose during early studies of auditory physiology and psychoacoustics (see Pickles, 1986, 1988 for an overview). Cochlear mechanics was at first considered to be broadly tuned (von Békésy, 1943, 1947). Furthermore, early studies (Tasaki, 1954) of frequency selectivity in auditory nerve fibres showed similar broad tuning. This was contrasted with much sharper psychoacoustic frequency selectivity. Von Békésy (1960) suggested that lateral inhibition in the cochlea and more centrally might be responsible for the observed narrower psychoacoustic tuning curves, and Katsuki et al (1959) suggested that progressively sharper tuning occurred as the signal moved towards the auditory cortex. This was sometimes characterised as a two filter system with a poorly tuned peripheral mechanical filter and a second, sharper neural filter. The search for the "second filter" has proven to be fruitless: many of the early studies utilised poorly preserved specimens, and in vitro experiments were unable to observe the mechanical feedback via the outer hair cells which results in much sharper mechanical tuning on the basilar membrane (see discussion above). The early observations of poor peripheral tuning have been shown to be incorrect.

"Recent results have shown that there is a close relationship between the mechanical frequency resolution of the basilar membrane, the frequency resolution of inner hair cells, the resolution of the auditory nerve, and psychophysical frequency resolution. The behaviour of the auditory nerve as a function of intensity and the nonlinear behaviour of the auditory nerve also have close relations to the corresponding mechanical behaviour. ... The frequency selectivity of auditory nerve fibres can be seen both in the mean rate of firing and, for stimulus frequencies below about 5 kHz, in the phase locking of their responses. At low intensities, the mean rate of firing in the auditory nerve array as a function of fibre CF (the rate-place profile) gives an adequate representation of the stimulus spectrum. At higher intensities, where mean rates have saturated, temporal information provides a more robust representation of the stimulus spectrum." (Pickles, 1986, p111)

There have been numerous studies of the responses of auditory nerve fibres to speech sounds. These studies have examined steady-state vowels in quiet (Young & Sachs, 1979; Sachs & Young, 1979, 1980; Delgutte & Kiang, 1984a; Sachs et al, 1988), whispered vowels in quiet (Voight et al, 1982), steady-state vowels in noise (Sachs et al, 1983; Delgutte & Kiang, 1984d), speech syllables in noise (Silkes & Geisler, 1991; Geisler & Silkes, 1991), consonants (Miller & Sachs, 1983; Delgutte & Kiang, 1984b, 1984c; Carney & Geisler, 1986; Deng & Geisler, 1987; Sinex & Geisler, 1983, 1984; Sinex & McDonald, 1988, 1989; Sinex, 1993), and the coding of voice pitch (Miller & Sachs, 1984; Geisler & Silkes, 1991).

These studies usually rely on the plotting of results in some kind of nerve fibre response histogram. The post-stimulus-time (PST) histogram (Kiang et al, 1965) is a summation of the responses of a fibre over numerous presentations and is intended to represent the collective response of a group of CF-related fibres. The PST histogram gives a representation of the average rate of the nerve fibre. The period histogram (Rose et al, 1967; Young & Sachs, 1979; Delgutte & Kiang, 1984a) is similar to the PST histogram except that it plots the time between each spike and the start of the current fundamental period. Again, the data is computed from numerous presentations and is intended to represent the collective response of a group of similar fibres. The period histogram gives the instantaneous or synchronised rate of the fibre. For example, if five clear peaks appear in the period histogram (whose time scale spans exactly one fundamental period) it means that the peak response of the neural unit occurs five times each fundamental period and that the nerve fibre is therefore synchronised to the fifth harmonic of the fundamental. A Fourier transform of the period histogram normally shows a major peak which indicates the frequency to which the neural unit is synchronised. The interspike-interval histogram (Kiang et al, 1965; Rose et al, 1967; Sachs & Young, 1980) registers the period between successive firings of a fibre and plots a histogram of the number of occurrences of each interspike-interval time. Whilst the period histogram and the interspike-interval histogram both

"... provide information about the relative effectiveness of the stimulus components activating a fibre, the latter technique has the advantage that it does not require a reference signal from the stimulus. It may correspond more closely to the analysis done by the central nervous system, which of course does not have such a reference signal." (Pickles, 1986, p96) Auditory Nerve Representations of Signal Frequency

There is now considerable physiological evidence that the place principle is inadequate as the sole process involved in frequency discrimination and selectivity. At high intensities the place encoding of frequency deteriorates, as measured by the mean-rate profile of auditory nerve fibre activity (across fibres of the same CF), and yet for frequencies below 5000 Hz a similar deterioration of frequency selectivity is not displayed psychoacoustically.

"The inability of the mean-rate profile to code the spectra of complex auditory stimuli at high intensities is the single most important piece of evidence in favour of the temporal coding of spectral information, and against the pure place coding of auditory stimuli." (Pickles 1986, p102)

"... electrophysiological experiments show that frequency resolution, when measured by mean-rate codes in primary neurones, deteriorates at intensities at which psychophysical frequency resolution is maintained. On the other hand, frequency resolution as shown by temporal information does not deteriorate to the same extent. This has led to the suggestion that temporal information is used as a basis of psychophysical frequency resolution. However, closer analysis of the electrophysiological experiments shows that factors such as the loss of the middle ear reflex and loss of outer resonances in the electrophysiological experiments may have accounted for some of the differences. The inadequacy of the mean rate-place code cannot therefore be taken as proven." (Pickles, 1986, p105)

Sachs & Young (1979) utilised mean firing rates (PST histograms) to examine the response of many cat auditory fibres to synthetic three formant vowels. As this method best simulates the place/rate principle, the patterns obtained when the stimulus presentation intensity was varied showed a marked saturation even at 68 dB, at which point the separate formant peaks disappeared. F1 was seen to suppress F2 and F3 whilst simultaneously F1 evoked responses in the F1-F2 gap (due to spread of excitation of spectral peaks to higher frequencies) so that the response in the gap was enhanced. Simultaneous enhancement of the F1-F2 gap response and suppression of F2 and F3 led to the loss of peak resolution.

Delgutte & Kiang (1984a) utilised the Fourier transform of the period histograms of more than 300 cat auditory nerve fibres in response to two formant synthetic vowels. They found that

"...the largest spectral components in the response patterns of auditory nerve fibers for the two-formant vowel stimuli usually are harmonics that are close in frequency to a formant, the fundamental or the CF. The principal factor that determines which of these components will be the largest is the relation of the fiber CF to the formant frequencies." (ibid, p870)

There were characteristic patterns on their histograms that followed the line described by the function "f = CF" (frequency equals characteristic frequency) with local maxima near the formant frequencies, as well as prominent peaks corresponding to F0 and F1 for fibres of higher CF. The F0 and f = CF peaks tended to occur only for fibres with CF remote from F1 and F2, and then mainly below F1 for vowels with high F1 and between F1 and F2 for vowels with large F1-F2 spacing. F1 excitation also spread to higher frequencies but, unlike the results of Sachs & Young (1979) and Young & Sachs (1979), F1 excitation dropped off for fibres with CF close to F2. In other words F1 did not suppress F2 (possibly, they explain, because their F2 was quite high in amplitude). They also found that

"... when a harmonic was sufficiently close to a formant frequency so that its amplitude exceeded that of its neighbors by about 6 dB, the components synchronized to the neighboring harmonics were considerably suppressed. When the two largest harmonics near a formant frequency had amplitudes within 6 dB, considerable response components were found at both harmonics, so that a central averaging scheme [to determine the 'true' formant value] would in principle be possible." (Delgutte & Kiang, 1984a, 875)

Young & Sachs (1979) performed a similar analysis but discarded all spikes with inter-spike periods more than ¼ octave from the f = CF line and then averaged over the remaining ½ octave band to compute a measure called the ALSR (average localised synchronised rate). This representation overcomes the problem of saturation at high intensities: they found that peak contrast actually increased with intensity, which they attributed to suppression. High-frequency peaks such as a high F1 suppress CF responses at higher frequencies, such as the frequencies in the F1-F2 gap (fibres in that region responding instead to F1). The ALSR then rejects F1 responses outside the selected ½ octave band, so the F1-F2 gap responses are much reduced, increasing the distinction between peaks and dips. Although F1 will suppress F2 and F3 to some extent, the responses to these formants are relatively stronger than the responses to frequencies in the inter-formant gaps, and so F2 and F3 peak-to-dip contrast is also increased. The ALSR can also maintain vowel spectra in noise, whereas mean-rate profiles cannot (Sachs et al, 1983; Delgutte & Kiang, 1984d), and it produces good spectral contrast for whispered vowels (Voigt et al, 1982). Pickles (1986) argues that the ALSR is not currently a good candidate for central auditory processing, for three reasons. Firstly, there is no known neural mechanism which could perform the required Fourier transform. Nevertheless, Sachs et al (1988) suggest that there is some evidence that this type of information is being processed in some way. Further, Shamma (1985, 1988) demonstrates how physiologically more plausible lateral inhibition neural networks (LINs) are able to simulate processes that are functionally analogous to the ALSR (but perform the same task in a quite different way). In other words, the ALSR may simulate what is done in the cochlear nucleus without simulating how it is done.
Secondly, the ALSR is insensitive to distortion components (including, for example, harmonics of F1) which would greatly reduce the clarity of the profiles, especially at high intensities, and it is likely (he argues) that the nervous system, unlike the ALSR, would include these distortion components if it carried out a comparable analysis. The LIN approach of Shamma (ibid.) does not seem to suffer from this problem. Thirdly, the ALSR is not appropriate for voiceless fricative consonants, which have high-frequency peaks at frequencies above the limit of phase locking. The ALSR does not produce peaks for these fricative peaks whilst mean-rate profiles do, and even at high intensities these are visible in the responses to the stimulus onset (Delgutte & Kiang, 1984b).
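As a concrete illustration, the ALSR computation described above (responses synchronised to frequencies within ¼ octave of a fibre's CF, averaged over the resulting ½ octave band) might be sketched as follows. This is a hypothetical outline only: the function name, array layout and band-selection details are assumptions for illustration, not Young & Sachs' (1979) implementation.

```python
import numpy as np

def alsr(cfs, sync_rates, freqs):
    """Sketch of an Average Localised Synchronised Rate (ALSR) profile.

    cfs        -- characteristic frequencies of the fibres (Hz)
    sync_rates -- matrix [fibre, frequency] of synchronised-rate
                  magnitudes (Fourier components of period histograms)
    freqs      -- the frequencies (Hz) at which sync_rates were measured
    """
    profile = np.zeros(len(freqs))
    for j, f in enumerate(freqs):
        # keep only fibres whose CF lies within +/- 1/4 octave of f,
        # i.e. within the 1/2 octave band centred on the f = CF line
        in_band = np.abs(np.log2(cfs / f)) <= 0.25
        if in_band.any():
            profile[j] = sync_rates[in_band, j].mean()
    return profile
```

Because off-CF responses (such as the spread of F1 excitation into the F1-F2 gap) fall outside the ½ octave band, they are rejected, which is how the representation preserves peak-to-dip contrast at high intensities.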

Auditory nerve fibres "... discharge in the absence of external acoustic stimulation" (Greenberg, 1988a, p9) and are often divided into three groups on the basis of their spontaneous discharge rates: high (>18 spikes/s), medium (0.5-18 spikes/s) and low (<0.5 spikes/s) (ibid.). High spontaneous rate fibres comprise about 60% of the total, medium fibres 25% and low spontaneous rate fibres the remaining 15% (Liberman, 1978; Greenberg, 1988a). Low and medium spontaneous fibres are often grouped together for practical purposes, as medium fibres tend to behave more like low spontaneous fibres (Greenberg, 1988a). Compared with the low and medium rate fibres, the high spontaneous rate fibres have lower average-rate thresholds (Geisler et al, 1985), show rapid adaptation (high activity at stimulus onset with a rapid decline 10-15 ms later) whilst low spontaneous fibres do not (Rhode & Smith, 1985), are relatively less able to phase lock strongly (Greenberg, 1986, 1988a; Horst et al, 1986), and are less affected by rate suppression than medium and low spontaneous units (Schalk & Sachs, 1980). The differing behaviour of these fibres affects their frequency selectivity and their participation in place versus timing processes. The high spontaneous rate (high-SR) fibres are likely to participate more effectively in place rather than timing mechanisms; they have a smaller dynamic range, are more susceptible to the effects of noise (Silkes & Geisler, 1991), and saturate more readily and so are vulnerable to a deterioration of frequency selectivity at high stimulus levels, although they show greater selectivity at stimulus onset than they do for steady-state stimuli presented at high levels.
Low spontaneous rate (low-SR) fibres have a larger dynamic range, are less susceptible to saturation at high stimulus levels (and are therefore more able than high-SR fibres to maintain a place representation at high stimulus levels (Sachs et al, 1988)), show similar response levels for both stimulus onset and steady-state stimuli (ie. they do not exhibit rapid adaptation), and are far more capable of supplying useful information to timing processes because of their more reliable phase locking. Low-SR fibres have a lower sensitivity than high-SR fibres and so "... have a smaller effective response area for any given sound pressure level. ... the lower-SR fibre has a narrower bandwidth at any absolute signal level." (Silkes & Geisler, 1991). Low-SR fibres have also been shown to be superior to high-SR fibres in encoding the fundamental frequency of voiced speech sounds (Geisler & Silkes, 1991).

Sachs et al (1988) suggested that a pure place principle (rate/place) can operate at low intensities (via high-spontaneous fibres) and at high intensities (via medium- and low-spontaneous fibres). Seneff (1988) proposed a joint synchrony/mean-rate model (synchrony/place) in which some hypothesised neural processor would determine the magnitude of only those responses with synchrony appropriate to the fibre CF. Shamma (1988) proposed a lateral inhibitory network (synchrony/quasi-place) in which adjacent channels with common activity cancel out, whilst adjacent channels with abrupt changes between them produce a large response, thus enhancing signal edge detection. Ghitza (1988) suggested a process whereby synchrony is processed independently of fibre CF (synchrony/place-independent) and patterns of synchrony are computed across a large number of channels.
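Shamma's edge-enhancement idea can be illustrated with a toy sketch in which each tonotopic channel is inhibited by its immediate neighbours. The function name and inhibition strength below are illustrative assumptions, not Shamma's actual network:

```python
import numpy as np

def lateral_inhibition(channels, strength=0.5):
    """Toy lateral inhibitory network across tonotopic channels.

    Each channel's output is its own activity minus a fraction of the
    mean activity of its two neighbours: regions of uniform activity
    are attenuated, whilst abrupt changes (spectral edges) stand out.
    """
    padded = np.pad(channels, 1, mode="edge")      # replicate the end channels
    neighbours = 0.5 * (padded[:-2] + padded[2:])  # mean of the two neighbours
    return channels - strength * neighbours
```

A flat input is uniformly attenuated, whereas a step in activity between adjacent channels yields a relatively enhanced response at the edge.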

Greenberg (1988b) proposed that:

"The optimal representational form may vary as a function of the acoustic environment. It is suggested that the rate/place representation operates primarily at low sound-pressure levels and for encoding the gross spectral characteristics of the high frequency portion of aperiodic signals, such as stops and fricatives. The synchrony/place and synchrony/quasi-place representations may also play an important role primarily at low sound pressure levels and under conditions of low signal-to-noise ratio in the high-frequency auditory channels. The place-independent synchrony pattern may provide the basis for the representation of voiced sounds (particularly vowels) at moderate-to-high sound-pressure levels and for conditions where the signal-to-noise ratio in the low-frequency channels is particularly poor." (ibid., p139)

Auditory Nerve Representations of Signal Temporal Attributes

Miller & Sachs (1983) examined auditory fibre responses to /ba/ and /da/ and found peaks in the mean-rate profile whilst formant frequencies were changing which disappeared when the formants became steady state. This is probably related to the findings of Smith & Brachman (1980) who showed that auditory fibres have a greater dynamic range in response to temporally changing stimuli. Pickles (1986) concluded that stimuli that are rapidly changing result in mean rate profile peaks even at high intensities where the mean rate profile for steady state stimuli shows saturation.

Delgutte & Kiang (1984c) examined the effects of short-term auditory nerve fibre adaptation on the dynamic characteristics of consonant-like sounds. The profiles of the average nerve fibre discharge rates show a strong dependency on the preceding context. If the CF being examined has had a high level of excitation immediately preceding a test interval, then fibre adaptation will result in a lower response to the test interval than would occur if there had been a low preceding stimulus level at that frequency. Also, the discharge pattern tends to be very prominent for the first 10 ms or so following a rapid onset and much less prominent following a slow onset. The prominent discharge at a rapid onset is followed immediately by an adaptation mechanism resulting in a rapid reduction of discharge rate, often to ½ of the onset discharge rate in the case of stops or affricates. This strong adaptation of the discharge rates in response to transient stimuli is, they argue, likely to be of importance in distinguishing certain consonants that are contrasted especially with respect to their onset characteristics (eg. // and /t/). They argue that "... short-term adaptation tends to increase contrast between successive speech segments separated by an abrupt change in spectral characteristics." (ibid, p 904) These results are similar to those of Kiang et al (1965), who examined cat auditory nerve discharge rates in response to short (50 ms) tone pips (see figure 2.16).

Figure 2.16: Simplified representation of the discharge rate of a cat auditory nerve fibre in response to a 50 ms tone pip (after Kiang et al, 1965).

Auditory Nerve Representations of Signal Intensity

It has long been thought that loudness is a direct function of total auditory nerve activity or mean net firing rate (eg. Wever, 1949; Békésy, 1960). "[L]oudness depends upon the number of nerve fibers acting and their individual rates of firing" (Wever, 1949, p 302). Pickles (1983a,b) tested this classic model against the hypothesis that the total number of action potentials should vary in the same way as loudness when bands of noise are varied in bandwidth at constant SPL (cf. Zwicker et al, 1957). He found that the integrated fibre-array firing rate did increase with increasing bandwidth, but there was no steady-state region followed by a break-point beyond which the firing rate increased. Unlike the psychoacoustic data, the firing-rate data showed a steady increase with increasing bandwidth. Also, unlike the psychophysical results, the slope of the whole relation increased with intensity. These results do not support a direct relation between loudness and the total amount of auditory nerve activity.

Florentine & Buus (1981) demonstrated that a multiple-channel model of intensity discrimination produces predictions that are significantly closer to experimental results than an otherwise equivalent single-channel model, suggesting that spread of excitation is significant in pure tone intensity discrimination. Cacace & Margolis (1985) provided supporting evidence and concluded that loudness "... is proportional to the area of excitation." (ibid., p1572)

Teich & Lachs (1979) and Lachs & Teich (1981) present a model of pure-tone intensity perception based upon neural counting that takes into account both spread of excitation and the firing-rate nonlinearity caused by saturation (which is a result of fibre refractoriness or "dead-time"). They assume that phase locking is not strongly related to intensity discrimination. Intensity discrimination is, they argue, more closely related to mean-rate counts for each neural channel (ie. for fibres of the same CF) or, more accurately, to "...the number of neural impulses observed on a collection of parallel channels during an unspecified, but fixed, counting interval (observation time)." (Teich & Lachs, 1979, p1740) Intensity jnds could be determined, they argue, either by determining which of two signals produces the larger number of counts in a given time or, alternatively, by determining which signal results in a given number of counts in the shorter time. The model's predictions are in good agreement with experimentally derived pure-tone intensity jnds. They suggested that future versions of the model might include a representation of the varying neural density at different frequencies. Lachs & Teich (1981) showed that the same model could also account for experimental pure-tone loudness results: they simply equated loudness with the average total number of impulses observed across a number of adjacent channels during a fixed (but unspecified) counting period, in line with the much earlier suggestions of Békésy (1960) and Fletcher & Munson (1933). In other words, their model demonstrates a single neural mechanism underlying both intensity jnds and loudness.

"...loudness is proportional to sound intensity near threshold. At somewhat higher stimulus levels, from about 10 dB to 35 dB, saturation from refractoriness sets in, driving the slope of the intensity discrimination curve towards unity... . This decrease of discriminability is manifested in the loudness function as a gradual decrease in the slope below unity. ... At levels above 35 dB, channels with characteristic frequencies near the stimulus frequency are largely saturated. Nevertheless, the loudness function continues to grow because of spread of excitation." (Lachs & Teich, 1981, p777)

In other words, the shape of the loudness curve, especially at higher intensities, is dependent upon a nonlinear encoding of energy into frequency spread. The slope m of the intensity discrimination curve (log(ΔE) vs log(E)) is given by the formula m = 1 - 1/(4N), where N is the number of poles required to describe the filter shape (and is thus an indirect function of the filter skirt characteristics). In other words, Weber's law is approached as N approaches infinity. The slope of the loudness curve is related to that of the intensity discrimination curve by the formula m = 1 - p/2, where m is the slope of the intensity discrimination function and p is the slope of the loudness function.
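The algebra linking the two slopes can be checked directly. The sketch below simply encodes the two formulae quoted above (it is not an implementation of the Teich & Lachs model itself):

```python
def discrimination_slope(n_poles):
    # m = 1 - 1/(4N): slope of the log(delta-E) vs log(E) function
    return 1.0 - 1.0 / (4.0 * n_poles)

def loudness_slope(n_poles):
    # rearranging m = 1 - p/2 gives p = 2(1 - m) = 1/(2N)
    return 2.0 * (1.0 - discrimination_slope(n_poles))
```

For a single-pole filter m = 0.75 and p = 0.5; as N grows, m approaches 1 (Weber's law) and the loudness exponent p flattens towards zero.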

Delgutte (1987) examined the intensity difference limens (DLs) of single cat auditory-nerve fibres with different threshold and dynamic range characteristics and incorporated the measurements into a model of intensity discrimination that included "...both multiple frequency-selective channels and a physiologically realistic distribution of fiber thresholds within each channel ... [which] provides a stable representation of the spectra of speech sounds over variations in stimulus level" (ibid., p 334). Medium-SR (spontaneous discharge rate) fibres have, on average, the smallest intensity DLs and the broadest dynamic range. Neither DL nor dynamic range showed any CF-related trends. Single-fibre DLs and psychophysical DLs only correspond over a small range of intensities, and so it is necessary to account for psychophysical DLs by combining the intensity information of several fibres of different sensitivities. Delgutte's (1987) model combines a model for predicting single-fibre intensity DLs at different levels, intensity information for fibres with the same CF but different sensitivity (the "single-channel" stage), and information across fibres with different CFs (the "multiple-channel" stage). Initially, the model assumed 30,000 auditory fibres consisting of 60% high-SR fibres, 25% medium-SR fibres and 15% low-SR fibres, but this predicted a degradation of performance relative to psychophysical results at high intensities. He then incorporated the assumption that low-SR fibres process intensity information more efficiently than high-SR fibres, weighting the proportional numbers of low-SR (× 3) and high-SR fibres (× 0.5) to produce predictions that did not deteriorate relative to the psychophysical data as intensity increased. The predictions, although producing a function that paralleled the psychophysical results, nevertheless predicted performance well exceeding psychophysical performance. He concluded that

"...there is more than enough information in the discharge rates (spike counts) of auditory-nerve fibers to account for psychophysical performance in intensity discrimination. In other words, psychophysical performance is not limited by saturation of auditory-nerve fibers, but by the processing efficiency of more central stages of the auditory system" (ibid., p 347)

The results also suggest that low-SR fibres are more efficiently processed centrally than are medium-SR and high-SR fibres and this conclusion is supported by the observation that low-SR fibres branch more profusely into larger numbers of endings per fibre in the cochlear nucleus than do high-SR fibres. This model is a "rate-place" model which Delgutte (ibid) considers could, if it included suppression and efferent feedback effects, provide a spectral representation sufficient to account for the processing of speech sounds under certain conditions although he allows that representations based on fine temporal patterns of discharge may still be necessary under certain conditions.

Hellman & Hellman (1990) derived a number of mathematical relations between neural spike counts, loudness and intensity jnds. They determined that there is a high degree of relationship between the log-neural-count vs intensity (dB) and log-loudness vs intensity (dB) functions, with both functions having about the same shape and slope (about 0.25) at higher intensities for a 1000 Hz pure tone. Further, the neural counts (over an assumed 200 ms count time) required to encode loudness range from 14 counts at 20 dB IL to 2000 counts at 90 dB IL, which translates into 70 spikes/s to 10,000 spikes/s over the same range. As the average saturation firing rate for auditory fibres is about 200 spikes/s, it follows that as few as 50 active fibres distributed across several channels should be sufficient to encode the loudness of a pure tone at 90 dB IL.
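The arithmetic behind these figures can be verified directly (the 200 ms counting window and 200 spikes/s saturation rate are the quoted assumptions, not new data):

```python
COUNT_WINDOW = 0.2                  # assumed counting time: 200 ms
SATURATION_RATE = 200.0             # average saturation firing rate (spikes/s)

counts_20dB, counts_90dB = 14, 2000          # counts needed at 20 and 90 dB IL
rate_20dB = counts_20dB / COUNT_WINDOW       # -> 70.0 spikes/s
rate_90dB = counts_90dB / COUNT_WINDOW       # -> 10000.0 spikes/s
fibres_needed = rate_90dB / SATURATION_RATE  # -> 50.0 saturated fibres
```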

2.3.2 Representations of Sound in the Central Nervous System

The auditory cortex "... refers generally to that temporal region of cerebral cortex made up of a number of highly organised and interrelated fields, each containing neurons responsive to acoustic stimulation. ... Acoustic information processed in the auditory centers of the brain stem and thalamus reaches these areas over highly organized pathways. At the level of the cortex, this information is distributed over networks of associational and commissural fibers that link together, often in reciprocal fashion, the cortical fields of the same and opposite hemispheres." (Brugge, 1985, p353)

The majority of neurons in the auditory cortex are responsive to a restricted range of stimulus frequency and intensity (their response area). Neurons of the same CF are organised within each of the auditory fields into columns of cells orthogonal to the surface, in what are referred to as isofrequency lines. Brugge (ibid) describes tracer experiments which have demonstrated reciprocal connections to cells with the same CF in other auditory fields of the same and opposite hemisphere. The heaviest pattern of interconnections is along the isofrequency lines of the same auditory field, indicating that these columns of cells are preferentially connected. Similarly, connections have been traced between cells in the auditory centres of the midbrain and cells of the same CF in the auditory cortex.

A few studies have been able to detect specialised behaviourally-important feature detectors in the auditory cortex. The best examples are of animals that have highly specialised auditory systems, such as bats. Suga and colleagues (Suga 1990; Suga et al, 1978, 1979; Suga & Tsuzuki, 1985) have been able to map neurons in the bat auditory cortex that are specific for different aspects of the bat's echolocation signals. Some studies have been able to find evidence of auditory neurons specialised to the species-specific calls of animals without special auditory adaptation (eg. Guinea fowl calls: Scheich et al, 1979; "talking" mynah birds and vowel sounds: Langner et al, 1979; squirrel monkey: Newman & Symmes, 1979; Müller-Preuss, 1979).

Watson & Foyle (1985) consider the resolution of specific fine details of complex sounds to involve a combination of bottom-up and top-down processes. They argue that fine resolution of specific spectral components of complex sounds requires subject familiarity with overall stimulus patterns and knowledge of the specific components that will be subject to change. In other words, fine resolution requires "minimal stimulus uncertainty". Under conditions of "high stimulus uncertainty" the subjects are expected to discriminate change in any component of the stimulus. Time and intensity resolution deteriorated by up to an order of magnitude as stimulus uncertainty increased from low to high whilst frequency resolution deteriorated by up to two orders of magnitude. Tasks requiring discrimination in high stimulus uncertainty conditions required enormous amounts of training (10 - 30 hours for discrimination tasks). Large-set identification tasks of complex non-speech stimuli (analogous to phoneme identification in speech) can involve training for up to thousands of hours before asymptotic performance is achieved. One of the functions of top-down processing in the perception of complex signals is, they argue, to reduce the stimulus uncertainty by attending to selected components of the signal.

2.3.3 Central Spectral Integration

Chistovich (1985) suggested that the three major types of information conveyed by a vowel sound, phonetic quality, personal quality and transmission quality, are likely to be processed by "...different programs of peripheral auditory pattern processing". (ibid., p789) Chistovich and colleagues carried out numerous experiments on a hypothesised large-scale integration of vowel sounds with integration bandwidths of the order of 3-3.5 Bark. They argued that this large scale spectral integration was the process responsible (wholly or in part) for the phonetic analysis of vowels.

Chistovich & Lublinskaja (1979) used two stimuli: a variable two-formant stimulus and a standard one-formant stimulus with F* (this symbol refers to the single formant in a one-formant stimulus) set to the mid-point of F1 and F2 of the two-formant stimulus. The subjects were asked to adjust the relative amplitudes A2/A1 so that the two-formant stimulus was closer to the standard than to the end-point stimuli (ie. where either A1 or A2 equalled zero). A critical distance between the two formants was reached at which 50% of responses indicated that matching was not possible. This distance was found to be 3-3.5 Bark. When peaks were separated by more than 3.5-4 Bark, F* matches could only be made when one formant was much more intense than the other, with F* being set to match the more intense formant. One possible explanation (Chistovich, 1985) is that phonetic rules require that, if local peaks are closer than the critical distance, the centre of gravity be used, and that otherwise the peaks be treated separately. Another experiment (Lublinskaja et al, 1981, reported by Chistovich, 1985) found that spectral tilt also affected the detection of formant centre of gravity, contradicting the hypothesis that spectral centre of gravity is only a function of peak amplitude. The centre of gravity of the entire spectrum, however, does not seem to be involved in vowel identification (Chistovich, 1985).
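The critical-distance rule suggested here can be sketched numerically. The sketch below is an illustration under stated assumptions, not Chistovich's procedure: it uses Traunmüller's (1990) Hz-to-Bark approximation (the original work used its own Bark tables) and a simple amplitude-weighted mean of the two peak positions.

```python
def hz_to_bark(f):
    # Traunmüller's (1990) approximation of the Bark scale (an assumption here)
    return 26.81 * f / (1960.0 + f) - 0.53

def two_formant_percept(f1, a1, f2, a2, critical_bark=3.5):
    """If two spectral peaks lie within the critical distance, return their
    amplitude-weighted centre of gravity (in Bark) as a single merged peak;
    otherwise return both peak positions, treated as separate peaks."""
    z1, z2 = hz_to_bark(f1), hz_to_bark(f2)
    if abs(z2 - z1) <= critical_bark:
        return (a1 * z1 + a2 * z2) / (a1 + a2)   # merged: centre of gravity
    return (z1, z2)                              # resolved as separate peaks
```

Raising A2 relative to A1 pulls the merged percept towards the upper peak, mirroring the matching behaviour reported above.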

Chistovich (1985) reported a further experiment which examined the phonetic categorisation into vowel phonemes of three types of stimulus:

  • S0 - one formant (F*) stimuli
  • S1 - two formant stimuli with a fixed 350 Hz formant separation and the F2 peak 5 dB lower than the F1 peak
  • S2 - as for S1 except that the F1 peak was 5 dB lower than F2.

The experiment examined the notion of spectral centre of gravity by matching S1 and S2 responses to S0 responses. The S0 responses that best matched the S1 responses were those with F* a little higher (~80 Hz) than the F1 value of the S1 token. The S0 responses that best matched the S2 responses were those with F* a little lower (~60 Hz) than the F2 value of the S2 token. In other words, the two-formant decisions were made largely on the basis of spectral centre of gravity, from which Chistovich (ibid) concluded that some process of frequency-scale integration was occurring.

Traunmüller (1982), utilising two-formant noise and comparison tones, came to a similar conclusion to Chistovich and colleagues, to the effect that 3 Bark spectral integration seemed to be occurring, but the matches were closer to F1 when it was more than 3 Bark from F2, whilst the reverse occurred in Chistovich's experiments. He also found considerable inter-subject variation, with some subjects seeming to integrate over bandwidths of up to 6 Bark.

Schwartz & Escudier (1987) examined whether 3.5 Bark large-scale spectral integration is required for an understanding of F'2 matching tests (F'2 is the second formant in two-formant stimuli and represents the large-scale integration of F2-F4), whether large-scale spectral integration can explain some apparently contradictory results about the part played by formant amplitude in vowel identification, and whether it could contribute to an understanding of the auditory representation of labialisation in French front vowels. They attempted to answer these questions in the light of four possible models of F'2 generation.

  1. Large-scale 3.5 Bark spectral integration interposed between peripheral auditory analysis and phoneme classification which outputs a 3.5 Bark representation as the only input into the phonemic classifier. F'2 is a part of the 3.5 Bark representation.
  2. Large-scale 3.5 Bark spectral integration interposed between peripheral auditory analysis and phoneme classification, which outputs a 3.5 Bark integration as one of several spectral representations (including 1 Bark) which feeds into the phonemic classifier. F'2 is a part of the 3.5 Bark representation.
  3. F'2 is a mere by-product of a spectral distance computation carried out in the phonemic classification module (and potentially used directly in the classification).
  4. Unspecified (presumably not a 3.5 Bark integrator) distance extractor placed between the peripheral analysis and the phonemic classifier. In that model F'2 was a representation inherent in the output of the distance extractor.

Since it had been shown (Bladon & Lindblom, 1981) that the F'2 values can be predicted from sone/Bark representations, Schwartz & Escudier (op.cit.) suggested that a reasonable conclusion was that F'2 values corresponded to the model (#3) which did not have some kind of large scale integration between peripheral analysis and phoneme classification and that F'2 was merely a by-product of a distance calculation internal to the classifier and "...hence would have no intrinsic reality." (ibid. p285)

Carlson et al (1970), when studying F'2 variations in the [i]-[y] region using four-formant vowels with fixed F1 (255 Hz), F2 (2000 Hz) and F4 (3350 Hz) and F3 values varying between 2300 and 3000 Hz, found a sudden 900 Hz shift in F'2 from 2400 to 3300 Hz when F3 was varied by 300 Hz around 2700 Hz (the approximate mid-point in Bark between F2 and F4). They assumed that it would be difficult to explain this sudden shift purely by auditory mechanisms and so assumed that it was a by-product of central phonemic categorisation (cf. model #3, above). That is, subjects would first identify the vowel, and then adjust F'2 values according to their knowledge of each category. Schwartz & Escudier (1987) reanalysed the data and found that it could be explained entirely by a spectral integration model, with F3 movements causing sudden shifts of the main energy mass from the F2-F3 region to the F3-F4 region, as would be predicted by 3.5 Bark integration, once the F2-F3 distance increased to the point where it exceeded the F3-F4 distance, which in turn was less than 3.5 Bark. When they repeated the experiment with a different F1 (450 Hz instead of 250 Hz) but with the same values for F2 and F4 and the same variations in F3, they obtained exactly the same F'2 even though the /ø/-/e/ and the /i/-/y/ categorisation boundaries occurred for completely different F3 values. This suggested to them that F'2 was not dependent upon categorisation, supporting model #2, in which a 3.5 Bark spectral representation is produced before the classification module and utilised by that module as one of its inputs. In other words, the boundary was defined by different F'2 values (/i/-/y/: F'2 = 2700 Hz; /ø/-/e/: F'2 = 2100 Hz) and so F'2 was a prerequisite of categorisation rather than a by-product of it. Schwartz & Escudier (1987) argue that the F'2 parameter may be a useful perceptual parameter for phonemic classification but admit that various problems remain linked to the exact choice of critical distance.

Carlson et al (1970) also found that their F'2 results were almost the same even when F2 and F4 levels were varied by up to 24 dB. They felt that it would be difficult to explain this by a large scale integration procedure, that it would probably be dependent on the vowel identification process and so be a by-product of high-level process dependent upon completed vowel identification. Aaltonen (1985), on the other hand, found a clear formant level effect. Schwartz & Escudier (1987) claim that the difference between the two results is based on the fact that the Carlson et al (op.cit.) data is for four formant vowels and the Aaltonen (op.cit.) data is for three formant vowels. In the three formant data there can be no F'2 switch from the F2-F3 to the F3-F4 mid-point as F4 does not exist and so amplitude effects are not "hidden by more important energy grouping characteristics" (ibid, p288) as is the case (so they conclude) with the four formant data.

Karnickaya et al (1975) claim that the largest two peaks in their sone/Bark function are closely correlated with vowel identity. This is similar to the two-formant model of Carlson et al (1970, 1975), which showed that four-formant vowels could be modelled by two-formant vowels. The F'2 peaks in the two-formant vowels often did not correspond to formants in the four-formant model but to intermediate values for closely spaced formants (F2-F3-F4 region) which are not resolved by a Bark-scaled filter. Carlson et al (1970) also presented F1 to one ear and F2, F3 and F4 to the other ear and found that vowel identity was retained. They interpreted this to indicate that vowel timbre integration occurs at a non-peripheral level of the auditory system. They also presented evidence which suggested that listeners used some sort of harmonic interpolation method to determine the true resonance frequency (see the discussion of auditory nerve representations, above) rather than simply selecting the highest amplitude harmonic as suggested by Chistovich (1971).

Mathematical models of two formant vowels which numerically derive the effective second formant from some sort of weighted average of the first four formants have been presented by Carlson et al (1970, 1975), Bladon & Fant (1978) and Paliwal et al (1983). Hermansky (1990) has presented a perceptual linear prediction algorithm which has incorporated both 3.5 Bark spectral integration and the concept of effective F2.

Espinoza-Varas (1987) suggested that several different bandwidths of spectral integration are simultaneously allowed by the auditory system and are utilised as the task requires.

  • Spiegel (1979), for example, reported spectral integration as wide as 3 kHz. This degree of spectral integration would only preserve the spectral tilt in speech.
  • The next level of spectral integration is a multiple critical band integration of the kind suggested by Karnickaya et al (1975) and Chistovich & Lublinskaja (1979). They suggested an integration bandwidth of 3.5 Bark, which removes all harmonic information but also fuses formants of that order of separation (eg. closely spaced F2 and F3). This level of integration would allow peak picking for low frequency formants without the distraction of harmonic peaks, especially for female and young speakers with wide harmonic spacing.
  • The next level of integration discussed by Espinoza-Varas (op.cit.) is critical band integration.
  • Finally there is the possibility of sub-critical bandwidth integration, either as a result of lateral inhibition (Houtgast, 1974; Karnickaya et al, 1975) or possibly extraction from auditory nerve temporal patterns (see above). This narrow bandwidth integration seemed necessary (Karnickaya et al, 1975) as the difference limens for formant frequency are reported by Flanagan (1955) to be about 30 Hz below 1000 Hz, which is much finer than critical bandwidths and even ERBs for that frequency range (although ERBs approach that bandwidth at lower frequencies).
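The bandwidth comparisons in the list above can be made concrete with standard closed-form approximations from the psychoacoustic literature (Traunmüller's Bark conversion and the Zwicker & Terhardt critical-bandwidth formula). Neither formula appears in the studies cited above; they are offered only as a sketch:

```python
def hz_to_bark(f):
    """Traunmueller's closed-form approximation to the Bark scale."""
    return 26.81 * f / (1960.0 + f) - 0.53

def critical_bandwidth_hz(f):
    """Zwicker & Terhardt's approximation of critical bandwidth (Hz)."""
    return 25.0 + 75.0 * (1.0 + 1.4 * (f / 1000.0) ** 2) ** 0.69

# Hypothetical F2 and F3 of a high front vowel: their separation is
# well under 3.5 Bark, so integration of that width would fuse them.
separation = hz_to_bark(2900.0) - hz_to_bark(2200.0)
```

At 1000 Hz the critical bandwidth evaluates to roughly 160 Hz, far wider than the 30 Hz formant-frequency difference limen cited above.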

Espinoza-Varas (1987, p87) concluded

"... changes of phonetic identity require a magnitude of stimulus change approaching the broader integration bandwidths, and the magnitude of stimulus change required to just notice a difference approaches the narrower integration bandwidths."

2.3.4 Formant versus Other Representations of Speech

The work carried out on large-scale spectral integration brings into question the traditional view of formant based vowel perception with its emphasis on F1, F2 and F3. There is evidence to support the idea that large scale integration, perhaps with bandwidths up to 3.5 Bark, is involved at some point in the processing of vowel phonetic identity. This scale of integration will cause the perceptual merging of many pairs of formants. Even 1 Bark integration involves the merging of some closely spaced formants (especially F2 and F3 for high front vowels). The spectral integration experiments of Chistovich and colleagues (eg. Chistovich, 1985, see discussion above), which demonstrated local but not global spectral integration, were claimed by Zahorian & Jagharghi (1993) to be consistent with the hypothesis that total spectral shape is important in vowel classification.

Pols and colleagues (Pols, 1970; Pols et al, 1969; Plomp et al, 1969; Klein et al, 1970; Pols, 1983) performed a principal-components analysis of a spectral-shape representation which produced a two component space which, when appropriately rotated, gave a vowel plot similar to a two formant vowel plot. They were also able to demonstrate automatic classification of vowels from this principal-components representation with an accuracy similar to formant-based classification.
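The Pols-style principal-components procedure can be sketched as follows. The data here are random stand-ins for real filter-band levels (Pols and colleagues used 1/3-octave band spectra), so only the procedure, not the output, is meaningful:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in data: 20 vowel tokens, each an 18-value vector of
# filter-band levels in dB.
spectra = rng.normal(50.0, 10.0, size=(20, 18))

# Principal-components analysis via eigendecomposition of the
# covariance matrix of the mean-centred spectra.
centred = spectra - spectra.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(centred, rowvar=False))
order = np.argsort(eigvals)[::-1]           # largest variance first
scores = centred @ eigvecs[:, order[:2]]    # 2-D "vowel space" coordinates
explained = eigvals[order[:2]].sum() / eigvals.sum()
```

With real vowel spectra, an appropriate rotation of the two-component `scores` space yields a plot resembling an F1/F2 vowel chart, as Pols and colleagues reported.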

Zahorian & Jagharghi (1993) showed speech recognition based on formants to be inferior to identification based on Bark / log-amplitude-scaled spectral features. With the addition of F0 to both lists of features the difference in recognition between formant-based and spectrum-based representations narrowed, with the spectrum-based (plus F0) representation maintaining a barely significant lead over recognition based on formants (plus F0). Further, it was also shown that the superior performance of the spectrum-based representation only occurred when ten or more features were used, whilst no three spectrum-based features performed as well as the three formants. This is not a problem for the hypothesis that spectrum-based representations are superior, since their claimed advantage is precisely that more information is available for classification. Jagharghi & Zahorian (1990) found that when human listeners were presented with conflicting spectral shape and formant cues they followed the spectral shape cues more closely. Zahorian & Jagharghi (1993) examined the correlations between listening experiments and speech recognition results and found that static spectral or formant cues correlated less well with human perception than did formant or spectral trajectory cues, and that spectral cues correlated better with human perception than did formant cues.

Formant representations are generally considered to be the basis of the perception of vowels and vowel-like consonants with the perception of many consonants being based instead on whole spectral representations. Ohala (1980) questioned the desirability of a model that requires two different processes for vowel and consonant systems. Lindblom (1986) suggested that the two systems should be based on the same system of perceptual distance and that the discrepancy might only be due to our current state of knowledge of the nature of perceptual space.

Bladon (1982) presented four arguments against the representation of speech by formants as opposed to representations based on total spectral shape.

  • Changes in formant representations also result in changes in spectral-shape and so spectral-shape representations encompass formant descriptions.
  • A formant representation is a reduction of the total pattern.
  • Formants are extremely difficult to determine in some contexts.
  • Formant models are perceptually inadequate in that perceptual distance for vowels with widely spaced formants is much better predicted by spectral-shape models than by formant models.

2.4 Auditory Models of Speech Processing

Two-formant models of the kind described above are simple auditory models of speech perception that are based on the large scale auditory integration hypothesis. They represent one of three main approaches to speech perception (or recognition) that have weakened the classic formant-based model of vowel perception. The second approach (Zahorian & Jagharghi, 1993) involves auditorily-scaled spectral distance measures, which have been shown to be superior to formant-based distances in speech recognition performance. The third approach (Pols and colleagues, see section 2.3.4) involves principal components analyses, which can be applied uniformly to vowel and consonant spectral parameters, which extract a two component space for vowels analogous to F1/F2 space, and which perform automatic vowel classification tasks as well as a formant model does. Modelling of speech based on auditorily-scaled whole spectrum distance approaches has produced better matches with human vowel perception or better speech recognition than models based on formants. This is, of course, even more the case for stops and fricatives, which are not characterised by strong formant patterns (ignoring formant transitions from or to adjacent vowels).

The use of auditory models in speech recognition is of interest in the context of the current study only to the extent that it can help determine which frequency, time or intensity scaling approaches provide the best performance, and so only representative speech recognition studies that address issues relevant to the present study will be examined here (and then only briefly). It must be remembered, however, that time warping and other procedures such as dynamic programming utilised in speech recognisers may have effects that are convolved in some way with the frequency and intensity scaling effects of the auditory front-end and so obscure trends that might otherwise be observed (see Ghitza, 1993). Zahorian & Jagharghi (1993), for example, found that Bark-scaled spectral distances provided significantly superior speech recognition to Hertz-scaled distances. This was only true for the simpler classifiers (Euclidean and Mahalanobis distances); there was no difference for the more complex classifiers (Bayesian maximum likelihood (BML) and artificial neural networks (ANN)). They concluded that "... BML and ANN classifiers are able to form complex decision regions that compensate for the lack of nonlinear scaling of the original features." (ibid, p1973) These results suggest that the most informative speech recognition approaches in the examination of auditory scales may not necessarily be the approaches that provide the best recognition (eg. hidden Markov models, ANNs, etc) as these approaches may well obscure the advantages of different auditory and non-auditory scales. Single word or nonsense syllable template based recognition utilising the simplest classifiers (especially Euclidean distance) may be the most suitable approach for this purpose. Klatt (1986) suggests, however, that whilst simple Euclidean metrics perform well for consonant noise spectra, they probably "pay too much attention to relative peak heights in the spectrum". (ibid, p308)
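A template-based recogniser with the simplest (Euclidean distance) classifier, of the kind suggested above as most suitable for examining auditory scales, can be sketched in a few lines. The band values and labels here are hypothetical:

```python
import numpy as np

def classify(token, templates):
    """Nearest-template classification under a plain Euclidean
    spectral distance."""
    labels = list(templates)
    dists = [np.linalg.norm(token - templates[lab]) for lab in labels]
    return labels[int(np.argmin(dists))]

# Hypothetical 3-band spectra (dB) serving as vowel templates.
templates = {"i": np.array([60.0, 20.0, 55.0]),
             "a": np.array([65.0, 50.0, 25.0])}
label = classify(np.array([62.0, 22.0, 50.0]), templates)  # -> "i"
```

Because such a classifier forms no complex decision regions, any advantage conferred by a non-linear (e.g. Bark) front-end scaling shows up directly in its results rather than being compensated away.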

Zahorian & Jagharghi (1993) also examined intensity scaling: amplitude was scaled either linearly or logarithmically (dB, truncated 50 dB below peaks). For all classifiers, and for both Hertz and Bark-scaled frequency, log-scaled amplitude produced significantly superior identification, with Bark/log-amplitude scaling performing best of all. Blomberg et al (1986) also compared auditory and non-auditory models of frequency and intensity scaling as front ends to speech recognition. They examined Hertz versus Bark frequency scaling and dB versus phon versus sone intensity scaling, and utilised a Euclidean distance metric for single word (nonsense syllable: /hV/ and /aCa/) recognition. Surprisingly, Hertz-scaling performed slightly better than Bark-scaling (97% & 95% respectively), but this may not be a significant difference between the two scales. For Bark-scaled spectra, dB-scaling performed slightly better than phon-scaling, which in turn performed considerably better than sone-scaling (95%, 92% & 87% respectively for all segments; 97%, 94% & 92% respectively for vowels; 94%, 90% & 83% respectively for consonants). There may again be no significant difference between the dB-scaling and phon-scaling results, but it does seem that sone-scaling is actually detrimental to consonant modelling although not to vowel modelling.
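The phon-to-sone step underlying a sone-scaled front end follows the standard loudness convention (40 phon = 1 sone; each additional 10 phon doubles loudness, a good approximation above about 40 phon). The full dB-to-phon conversion additionally requires equal-loudness contours, which this sketch omits:

```python
def phon_to_sone(phon):
    """Standard loudness relation: 40 phon = 1 sone, and every
    +10 phon doubles loudness (valid above roughly 40 phon)."""
    return 2.0 ** ((phon - 40.0) / 10.0)

# A 60 phon band maps to 4 sones; a 40 phon band to 1 sone. The
# exponential mapping compresses low-level detail towards zero, one
# possible reason for sone-scaling's poorer consonant results.
```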

The utilisation of auditory physiological models as front ends to speech recognition systems has been suggested by some authors. For example, Klatt (1986) and Carlson & Granström (1979) suggested that neural interspike interval data could be utilised. Blackwood et al (1990) and Meyer et al (1991) have developed a speech recognition auditory front-end that is based on a model of the behaviour of auditory nerve cells and five types of cochlear nucleus cells of varying temporal properties. Some of these cells are particularly sensitive to onsets only, some are sensitive to onsets and then, after a short refractory period, respond to the following steady-state signal, whilst some are similar to many auditory nerve fibres and respond to an onset with a high excitation level and then drop to about half that level for the remainder of the signal. This model is clearly especially sensitive to speech sounds with highly salient temporal transient cues, such as stops. They report 100% accuracy for single speaker recognition of voiced stops for one of their models and reasonable accuracy for multi-speaker identification.

Delgutte (1986) presented "... a functional model of the peripheral auditory system that simulates selected properties of the discharge rates of auditory nerve fibers" (ibid., p163). He warned that peripheral auditory processing is much more complex than implied by simple linear filter-bank models. He used a 28 bandpass filter bank with shapes derived from auditory nerve fibre tuning curves, followed by an envelope detector (1 ms smoother), and a non-linear function that simulated the rate-level functions of auditory nerve fibres and could be varied to simulate nerve fibres with different sensitivities. This was followed by a module which simulated the short-term adaptation of auditory nerve fibres. The filters were separated according to a model of characteristic frequency (CF) and cochlear place. This model was used to analyse voiced and voiceless French stop consonants and especially VOT (Lisker & Abramson, 1964), short-time spectrum at release (Blumstein & Stevens, 1979) and spectral changes between release and voice onset (Kewley-Port, 1983). In 507 out of 542 CV syllables both the stop release and the onset of voicing were accurately detected. He also compared his model with the same model without adaptation but having instead the 10-20 ms time resolution of a typical channel vocoder. There were some important differences in the spectra. For example, /ka/ when processed with the channel vocoder-like model had prominent 2 kHz peaks at both release and voicing onset, but when adaptation was introduced the 2 kHz peak at voicing onset disappeared because those channels with CF close to 2 kHz were strongly adapted to the intense burst. Such effects were found to enhance the contrast between /pa/, /ta/ and /ka/ at voice onset.

Ghitza (1993) compared the performance of human subjects on the Diagnostic Rhyme Test (DRT: Voiers, 1980, see discussion in section 3.2) with a "simulated DRT" utilising his speech recognition system. The human subjects and the recogniser were presented the DRT list in three levels of noise (10, 20 and 30 dB S/N) and the error rates were compared as a function of six phonetic features (Jakobson, Fant & Halle, 1952). When the recogniser had a front-end that simulated the behaviour of auditory nerve frequency and temporal responses, the error pattern was somewhat closer to (but nevertheless significantly different from) that of human subjects than when a Fourier power spectrum front-end was utilised. The error patterns of the two recognisers were very different from each other and both produced error rates higher than those of human subjects. Ghitza argues that it is not necessarily valid to evaluate the validity of a speech recogniser front-end as a model of human internal representations by examining the performance of the whole recogniser, as this combines the effects of the front-end being tested with the performance of the back-end recogniser.

Auditory models have also been applied to other areas of speech technology, such as speech enhancement (Cheng & O'Shaughnessy, 1991) and speech analysis/synthesis (Ghitza, 1987). Cheng & O'Shaughnessy (1991) successfully combined Bark scaling with a model of lateral inhibition to enhance noisy speech by sharpening peaks and reducing dips between the peaks. Ghitza (1987) designed an analysis/synthesis system that utilised a simplified model of the auditory nerve. The model has overlapping filters that have frequency responses similar to auditory nerve tuning curves. This is followed by a processor that models the temporal characteristics of band firing patterns to determine the extent to which adjacent bands are firing in synchrony with a stimulus periodicity and then suppresses those bands that don't have the appropriate CF to produce an in-synchrony-bands spectrum. This system was reported to produce speech "informally" assessed to be natural and highly intelligible.
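Peak sharpening by lateral inhibition, of the general kind Cheng & O'Shaughnessy combined with Bark scaling, can be sketched crudely as an on-centre, off-surround operation on band levels. The kernel and strength below are illustrative assumptions, not their algorithm:

```python
import numpy as np

def sharpen(bands, strength=0.5):
    """Subtract a fraction of each band's neighbours, raising peaks
    relative to the dips between them (crude lateral inhibition)."""
    s = np.asarray(bands, dtype=float)
    left = np.roll(s, 1)
    right = np.roll(s, -1)
    left[0] = s[0]          # replicate values at the edges
    right[-1] = s[-1]
    return np.maximum(s - strength * 0.5 * (left + right), 0.0)

# A central peak survives while the flanking dips are suppressed.
peaky = sharpen([1.0, 5.0, 1.0])
```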

There have also been a number of proposals for the incorporation of auditory models into general speech analysis tools. For example, Plomp (1970) presented a distance metric based on sones/Bark vs Bark, and Karnickaya et al (1975) utilised loudness density (sones/Bark) calculations (after Zwicker & Feldkeller, 1967) in their auditory processing model of vowels. Carlson & Granström (1982) proposed a number of auditory transformations which could be included in an auditory spectrogram, including Bark, phon and sone scaling, modelling of auditory filter shapes, and examining adjacent bands for common dominant frequencies (analogous to spread of excitation and auditory nerve period histograms). Also, Bladon et al (1987) describe a speech analysis software package which includes a number of psychoacoustic and auditory-physiological transformations, including a neural adaptation model. It is important, when utilising such tools, to be very cautious about combining physiological and psychophysical scales simultaneously. For example, the Bladon software contains a middle ear transformation function as well as Bark, phon and sone scaling; the phon and sone scales must already include some of the effects of the middle ear transfer function, so combining the two in a single analysis session may result in a single effect being factored in twice.

Klatt (1982b) cautioned against over-optimism regarding the usefulness of auditory tools in the study of speech perception. He argues that

"Constraints imposed by the auditory periphery have little to say about strategies for phonetic processing of speech. There remain many candidate strategies to choose from." (ibid., p191)

He also questioned the use of Bark-scaled spectrograms in some types of perceptual studies because they actually increase male-female spectral differences by resolving low frequency harmonics. He likewise cautioned against the use of broader band Bark-scaling until there is clear evidence for such a scaling in the auditory system, referring to previous research (Klatt et al, 1982) into multi-Bark scaling in which critical band vocoders began to lose intelligibility when the bandwidth exceeded 2 Bark. Further, he pointed out flaws in most current distance metrics (including auditory metrics), which register increased distances with changing spectral slope that are not accompanied by similar phonetic changes.

Pols (1970) examined the "perceptual space" of a number of Dutch vowels by rescaling the frequency dimension into 18 1/3-octave bands and then performing a principal components analysis to derive 3 perceptual dimensions which accounted for 81.7% of the variance. He was able to determine that F1 and F2 were the independent variables which correlated best with the perceptual space. This view of perceptual space is much more abstract than perceptual spaces that are really only models of auditory transduction. Auditorily scaled spaces are not strictly perceptual spaces but are instead the auditory representation of the speech signal prior to the central processes of speech perception. These auditory spaces provide limiting conditions for resulting perceptual spaces, and auditory models of speech scaling that provide the best predictions of speech perceptual spaces are likely to be better models of auditory representations of speech. Lindblom (1975) proposed the notion that vowel systems originate through the maximum utilisation of perceptual contrast, which implies that there is a tendency for vowels to be equally spaced in perceptual space. A consequence of this idea is that there must be some distance metric which, on the one hand, can be justified by studies of the auditory system and speech perception and, on the other hand, can be used to preferentially generate vowel systems which actually occur in the world's languages without generating systems which do not occur (Lindblom, 1986). Liljencrants & Lindblom (1972) utilised the mel scale in order to approximate more closely the perceptual space of vowels. Lindblom (1986) took this process further by utilising a mathematical model of the auditory system originated by Schroeder et al (1979) and described in detail by Bladon & Lindblom (1981). He had reasonable success in predicting existing vowel systems, but felt that his model was unable to predict some vowel systems.
Ohala (1980) applied the same notion of maximal perceptual distance to consonant perceptual space and showed that this resulted in very unnatural systems. He concluded that maximal perceptual distance as predicted by auditory models was not appropriate for consonants, and that the maximal utilisation of distinctive features might be more appropriate for them. There is a degree of circularity in this argument, however, as distinctive features were usually selected in a way that produced the minimal system necessary to allow all possible contrasts that occur in languages.

A final issue should also be mentioned. That issue involves the two processes of auditory segregation (the separation of signals from separate sources) and streaming (the combination of interrupted, masked, etc. spectral features into a single signal hypothesised to belong to the same source). These processes are dealt with in Bregman (1990) including a review of research in the field that he calls "auditory scene analysis". These processes involve the identification and separation of the harmonics of different fundamentals in the interspike period histograms of auditory nerve cells (nb. intelligibility of a voice increases as its fundamental varies from that of a competing voice), the use of phase and amplitude differences relating to binaural audition, and various central processes that examine differences in voice quality, spectral peak trajectory constraints, linguistic context, etc.

It seems clear that some progress has been made in the auditory modelling of speech. One of the problems with most of the above models, however, is that they don't allow for parallel auditory representations, such as can arise from auditory nerve fibres or (for example) auditory nucleus cells which have different temporal response, intensity ranges, etc. It is clear from the preceding sections that there are still numerous competing models of auditory transduction at each of the currently measurable points in the auditory system. The representations of speech in the auditory nerve, for example, appear to be potentially analysable in a number of ways (mean-rate, inter-spike interval period, etc). The mechanisms for such central processing have still not been revealed but it seems most likely at the present that the auditory cortex would make optimal use of all the representations that are presented to it. The history of speech perception research has often involved analyses of how we process representations of speech. These representations were originally external to the ear and characterised by the patterns displayed on the speech spectrograph. The black box containing the processes of speech perception started at the outer ear. Now we have pushed the black box up into the auditory cortex (but with many remaining questions regarding the auditory nucleus, etc.). We are now able to base our theorising about what happens in that black box on more perceptually realistic representations.

It is difficult to determine which of the potential representations of speech are the most valid perceptually. If, for example, more than one bandwidth of spectral integration is required, most methods of auditory scale testing will only provide information about the narrowest bandwidths necessary for adequate perception. If both 3-3.5 Bark and 1 Bark representations are used in speech perception, then performance will deteriorate when the bandwidth exceeds 1 Bark, as one of the required representations becomes more impoverished. The models examined above seem generally to confirm that the Bark-scale is superior to the Hertz-scale in at least some applications and that intelligibility of speech degrades as the bandwidth exceeds 1 Bark (Klatt et al, 1982), but this does not rule out the possibility that broader integration is also used centrally. These results are based on phonetic classification tests (ie. intelligibility or speech recognition) and so these experiments do not rule out the possibility that narrower integration bandwidths play a part in the perception of speech quality. The verdict on the intensity scales (dB versus phon and sone) is much less clear. There appears to be evidence, for example, that the sone scale may not be an appropriate scale for the representation of speech. This conclusion is again based on phonetic judgement tests and not on speech quality or speech jnd tests, which may result in different conclusions. Temporal scaling in speech is even less clear. It now seems that the temporal resolution of the auditory filters is not directly related to the inverse of their bandwidths because of higher level limitations. A model which assumes more or less similar temporal resolution at different frequencies may be no less inadequate than a model that assumes time resolution to relate to inverse bandwidths.
Both approaches may be inadequate as it is now clear that various neurons in the auditory nerves and higher have different temporal responses and this behaviour may be more important than gross average temporal resolution. Gross average temporal resolution may still be worth examining, however, as it should provide some information of limiting conditions beyond which speech intelligibility begins to deteriorate.

2.5 Perception of Parametrically Scaled Speech

Pattisaul and Hammett (1975) conducted an experiment which examined the time-frequency resolution of a cepstrum or homomorphic vocoder and the effects of variations in time and frequency resolution on vocoder speech quality as measured by a 9 point judgement scale. This study concluded that a certain amount of time-frequency trading occurs with different vocoder configurations. A vocoder with poor frequency resolution and good time resolution was judged to have the same quality as another configuration with poor time and good frequency resolution. This relation only occurred within certain limits of time and frequency resolution. An attempt was made to produce vocoder configurations with adaptive time resolution, in which a better time resolution would operate during voiceless segments, voiced-unvoiced transitions and unvoiced-voiced transitions than during voiced segments. It was found that a nonadaptive time resolution of 20 msecs produced speech of the same quality as a nonadaptive 10 msec vocoder, and as an adaptive vocoder with 10 and 20 msec alternative time resolutions. Poorer time resolution than 20 msecs was shown to decrease perceived quality, whilst improved time resolution had no effect measurable by this test.

Klatt (1982) reports a study (Klatt et al, 1982) which examined the intelligibility of speech processed through a critical-band vocoder. They found a reduction in intelligibility as the bandwidth of the "lower frequencies" was made to exceed 200 Hz.

Summerfield et al (1985) examined the effect of narrowing and broadening formant bandwidths in a serial formant synthesiser on the identification of stop place of articulation in speech presented to impaired and normal listeners. Narrow bandwidths did not produce improvements in the performance of the impaired subjects but provided some improvement for syllable-final stops heard by normally hearing subjects. Broadening the bandwidths caused a decline in performance for all subjects.

Ter Keurs et al (1992, 1993) examined 1/8, 1/4, 1/3, 1/2, 1, 2 and 4 octave spectral envelope smearing in an attempt to model the broader auditory bandwidths associated with hearing loss. This was achieved utilising a Gaussian-shaped filter followed by overlap-adding, and maintained the signal's harmonic and phase structure. Intelligibility of speech (Dutch sentences) deteriorated as bandwidths exceeded 1/3 octave, with vowels and consonant place being the first phonetic categories to suffer. Performance in noise also deteriorated as the bandwidth was increased. Baer & Moore (1993) simulated hearing-impaired frequency selectivity by utilising 3 "broadening" conditions, 1 ERB, 3 ERB and 6 ERB, with both symmetric (upper and lower filter skirts identical) and asymmetric (one condition broader on the upper skirt and the other two broader on the lower skirts) broadening. Normal subjects were tested on speech smeared in quiet and speech with noise added prior to smearing. Unmasked smearing had no effect on sentence intelligibility even at 6 ERB, whilst there was a significant reduction in intelligibility in noise at 3 ERB even for low S/N (-3 dB). Asymmetrical broadening of the lower filter skirts relative to the upper skirts (similar to the majority of hearing impaired listeners) had a greater effect on intelligibility than broader upper skirts. Clearly (and they also suggest this), the intelligibility of low or zero context speech such as nonsense syllables is much more likely to be degraded by poor frequency resolution even in silence.
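Spectral envelope smearing of the general kind used in these studies can be approximated by Gaussian smoothing of a log-magnitude spectrum. The actual processing of ter Keurs et al operated on the waveform (Gaussian-shaped filtering plus overlap-adding), so the following is only a schematic analogue:

```python
import numpy as np

def smear(log_spectrum, width_bins):
    """Convolve a log-magnitude spectrum with a unit-area Gaussian
    kernel whose standard deviation is width_bins bins."""
    half = int(3 * width_bins)
    x = np.arange(-half, half + 1, dtype=float)
    kernel = np.exp(-0.5 * (x / width_bins) ** 2)
    kernel /= kernel.sum()
    return np.convolve(log_spectrum, kernel, mode="same")

spectrum = np.zeros(64)
spectrum[32] = 1.0                  # a single sharp spectral peak
broadened = smear(spectrum, 4.0)    # peak height falls, energy spreads
```

Widening `width_bins` mimics increasing the smearing bandwidth: spectral peaks flatten while total energy is preserved, which is the degradation to which consonant place and vowel identity proved most vulnerable.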

The use of speech synthesis systems in speech perception research allows us to simplify the experimental task by varying only one variable at a time. Repp (1987) warns that apparently simple representations may actually be perceived as complex since

"...perceptual complexity is defined not absolutely but in terms of deviations from expectancies. In the case of synthetic or degraded speech, an acoustically simpler signal may pose a perceptual problem." (ibid, p8)

This is because

"It is not the stimulus as such (or its auditory transform) that is perceived, but rather its relationship to the phonetic knowledge base; perception is thus a relational process, a two-valued function." (ibid, p13)

None of the above studies have systematically compared frequency, intensity and time scaling as a function of their effect on fine phonetic distinctions as measured by the speech intelligibility of minimally contrasting speech tokens. The studies that examine the intelligibility of parametrically encoded sentences (ter Keurs et al, 1992, 1993; Baer & Moore, 1993) conflate the fine phonetic consequences of the scaling with the effects of context and other aspects of top-down linguistic processing. Minimally contrastive nonsense syllables, on the other hand, are very little affected by linguistic context and so provide a better test of the effects of the scaling (see section 2.5.5, below). Studies that examine the effect of parametric scaling on speech quality (eg. Pattisaul & Hammett, 1975), on the other hand, are really testing its effect on the non-phonetic processing of the speech signal. The study that examined formant bandwidths produced by a serial formant synthesiser (Summerfield et al, 1985) is difficult to interpret in terms of parametric auditory models because of the difficulty in relating formant synthesiser bandwidth control values to actual signal spectral bandwidths.

The present study examines the effects of systematic manipulation of frequency, time and intensity scales (both auditory and non-auditory models) on detailed phonetic contrasts as measured by the intelligibility of minimally contrastive /h_d/ and CV tokens. Further, it utilises a channel vocoder system which, unlike a formant vocoder, is not a simplification of the speech signal but maximally represents the entire signal spectrum. This, hopefully, avoids the effects of signal over-simplification that Repp (1987) warned of and yet allows the systematic manipulation of one dimension at a time with minimal effects on the remaining dimensions of the signal (see the section 3.1 on vocoder design, for further information).

2.6 Spectral Distance Measures of Speech

One way of examining the parametric representations involved in speech perception is to observe the perceptual response of human subjects to speech distorted according to the representational model being tested. An alternative method is to utilise models of acoustic or spectral distance (from some standard spectrum or token) that incorporate the representations being tested and examine their ability to predict variations in perceptual response.

Algorithms that measure the acoustic difference between two signals are of interest in the areas of speech processing and speech recognition. The comparison may be between the smoothed spectral envelopes of the test and reference items (Gray & Markel, 1976) or may be made directly on other parameters such as LPC coefficients or LPC-derived cepstral coefficients (Atal, 1974; Itakura, 1975; Gray & Markel, 1976; Tohkura, 1987).

Most attempts at the comparison of the performance of spectral distance measures have entailed comparing the success rates of the speech recognition systems containing them (see section 2.4 above). Klatt (1982), on the other hand, examined the ability of auditorily-weighted (critical band) spectral distance measures to predict differences in the human perception of various spectrally distorted synthetic speech tokens. He varied such parameters as formant amplitudes, spectral tilt and formant centre frequencies and then examined the effect on listener identifications. This enabled him to rank the perceptual effects of the spectral distortions. He concluded (as did an earlier study by Carlson & Granström (1979)) that the global difference between two spectra are not particularly good predictors of phonetic quality, which instead seems to be predicted by peak location and spectral slope around prominent peaks.

In automatic computational methods of speech quality measurement it is typically necessary that the speech coder or vocoder being tested be able to be modelled as "an additive noise source" (Makhoul et al., 1976), and that any noise so produced be "uncorrelated with the input signal" (ibid). The notion of noise is central to this type of evaluation method and refers to the measured difference between the input and output signals. Clearly, only those systems which involve some sort of coding of an input speech signal can be evaluated in this way. Of the various synthesis systems of interest to phoneticians, only vocoders would be amenable to this sort of evaluation. Unfortunately, the noise introduced by a vocoder system is very difficult to define and measure (ibid). For example, although the input and output signals of a vocoder may appear quite different, the actual difference perceived by a listener may be judged to be insignificant. Makhoul et al. (ibid) considered it necessary to relate any method of "objective" vocoder evaluation to the processes of vocoding and human perception. They identified "analysis, encoding, transmission and synthesis" as those components of a vocoder system which "contribute to the degradation of vocoded speech quality" (ibid).
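The additive-noise style of evaluation described above can be sketched as a segmental signal-to-noise measure, in which the difference between the time-aligned input and output signals is treated as the coder's "noise". This is an illustrative sketch only, not Makhoul et al.'s actual procedure; the function name and frame length are assumptions:

```python
import numpy as np

def segmental_snr(x, y, frame_len=160):
    """Treat (output - input) as additive noise and average the
    per-frame signal-to-noise ratio in dB over the utterance."""
    snrs = []
    for i in range(0, len(x) - frame_len + 1, frame_len):
        s = x[i:i + frame_len]
        n = y[i:i + frame_len] - s          # the "noise" the coder introduced
        p_s, p_n = float(np.sum(s ** 2)), float(np.sum(n ** 2))
        if p_s > 0 and p_n > 0:
            snrs.append(10.0 * np.log10(p_s / p_n))
    return float(np.mean(snrs))
```

As the text notes, such a measure is only meaningful when the "noise" tracks perceived degradation: a vocoder output can differ greatly from its input in waveform terms while sounding nearly identical.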

Barnwell (1980a, 1980b) and Barnwell and Quackenbush (1982) compared the success of various "objective measures" with the Diagnostic Acceptability Measure (Voiers, 1977a), a measure of perceived speech quality, using a distorted and an undistorted speech database. The distorted database was created by

i) running the speech through various coding algorithms (e.g. APCM, ADPCM, LP coding, vocoder etc.)

ii) filtering, additive noise, interruption, clipping etc., or

iii) frequency specific distortion and masking,

giving a total of 264 distorted tokens.

Four classes of objective quality measures were compared for their ability to predict the subjective results.

i) "spectral distance measures" (log and linear frequency domain distortion measures, using "spectral envelopes estimated using 10th order LPC analysis")

ii) "parametric distance measures" (comparison of measures extracted from the LPC analysis of distorted and undistorted speech, including area ratios, feedback coefficients, PARCOR coefficients, energy ratio, and also log area ratios, log feedback ratios, and log PARCOR values)

iii) "noise measures" (S/N measurements)

iv) "Composite measures" (combinations of the above)

The tests were further divided into frequency variant and invariant tests, and unframed and framed tests. The frequency variant tests divided the spectrum into six bands and weighted these results according to frequency before computing a combined result. The framed, or short-time, tests divided the signal into 10-30 ms segments and weighted each segment's results according to its energy before computing a combined result. In general, it was found that frequency-variant tests performed better than frequency-invariant tests, and that framed tests performed better than unframed tests. Frequency variance had the greater effect. The log-area-ratio distance and the energy ratio measures were the only frequency invariant parametric distance measures which performed well, and both performed better than the frequency invariant spectral distance measures. Frequency variant methods greatly improved the linear spectral distance scores to a level comparable with the moderately improved log spectral distance scores. Unframed S/N tests performed poorly, whilst framed S/N tests performed well. The best result of all was obtained by the frequency variant framed S/N test. A further result was that "often more improvement was obtained by combining a good measure with a bad measure of a vastly different type than from combining two or more similar good measures." (Barnwell and Quackenbush, 1982, p. 998)
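The frequency-variant, framed design described above can be sketched as follows. This is a schematic reconstruction, not Barnwell's implementation: the six band edges and the band weights are hypothetical placeholders, and log-spectral distortion stands in for whichever base measure is being made frequency-variant and framed.

```python
import numpy as np

def freq_variant_framed_distance(ref, test, band_edges, band_weights):
    """Frequency-variant, framed log-spectral distortion.
    ref, test: (n_frames, n_bins) power spectra of time-aligned speech.
    Bands are weighted by the supplied frequency weights; frames are
    weighted by their energy before combining."""
    frame_energy = ref.sum(axis=1)
    total = 0.0
    for lo, hi, w in zip(band_edges[:-1], band_edges[1:], band_weights):
        ref_db = 10.0 * np.log10(ref[:, lo:hi] + 1e-12)
        test_db = 10.0 * np.log10(test[:, lo:hi] + 1e-12)
        per_frame = np.mean((ref_db - test_db) ** 2, axis=1)  # band distortion per frame
        total += w * np.sum(frame_energy * per_frame)
    return total / frame_energy.sum()
```

The energy weighting of frames captures the finding that framed tests outperform unframed ones: low-energy frames, where distortion is perceptually less important, contribute less to the combined score.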

Zahorian & Jagharghi (1993) examined the use of global Bark-scaled spectral shape features contrasted with formant-based features in speech recognition performance. Several types of classifiers were used: Euclidean distance (EUC), Mahalanobis distance (MAH), Bayesian maximum likelihood (BML) and an artificial neural network (ANN). Bark-scaled spectral distances were shown to be significantly superior to linear (Hertz) scaled distances for the simpler classifiers (EUC and MAH), but there was no difference for the more complex classifiers (BML and ANN) (see discussion in section 2.4 above). Amplitude was also scaled, either linearly or logarithmically (dB, truncated 50 dB below peaks). For all classifiers and for both Hertz- and Bark-scaled frequency the log-scaled amplitude produced significantly superior identification, with Bark/log-amplitude scaling performing best of all. They also compared vowel recognition based on just the steady-state portion of the vowel (SV) with recognition based on the entire vowel from initial transition to final transition (IT-FT). Recognition was significantly higher for IT-FT based recognition than for SV alone and was substantially higher than static feature based recognition. Also, dynamic spectral cues produced significantly better recognition performance than dynamic formant cues for IT-FT but only slightly better performance for SV alone.
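Bark-scaled distances of the kind compared above require a mapping from Hertz to Bark. One commonly used closed-form approximation is that of Zwicker & Terhardt (1980); it is offered here as an illustration, since several published approximations to the Bark scale exist and the one Zahorian & Jagharghi used is not specified here:

```python
import math

def hz_to_bark(f_hz: float) -> float:
    """Zwicker & Terhardt (1980) approximation of the Bark scale:
    z = 13 arctan(0.00076 f) + 3.5 arctan((f/7500)^2)."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)
```

Spectral features computed at equal Bark intervals are thus densely spaced at low frequencies and sparsely spaced at high frequencies, which is what distinguishes them from linear (Hertz) scaled features for the simpler distance-based classifiers.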

Repp (1987) warns, however, that in relying upon auditory scaling in spectral distance measures and experiments on the perception of parametrically scaled speech, we may actually gain little in the way of predicting normal processes of human speech perception.

"Since a clear, unambiguous stimulus poses no challenge to the perceptual system and therefore cannot reveal its workings ...the principal question is how phonetic ambiguities created by realistic signal degradation or by deliberate signal manipulation are resolved (explicitly) by the perceiver in the absence of lexical, syntactic, or other higher order constraints. In such a situation, the perceiver must make a decision based on the perceptual distances of the input from possible phonetic alternatives (prototypes) stored in his or her permanent knowledge base. ... what is the phonetic distance metric, what are the dimensions of the perceptual space in which it operates, and what are the perceptual weights of these dimensions? There are opportunities for the useful application of psychophysical methods here, since the distance metric may be, in part, a function of auditory parameters... . However, ... it makes relatively little difference whether we think of the input as sequences of raw spectra and of the mental categories as prototypical spectral sequences..., or whether we consider both in terms of some auditory transform or collection of discrete cues. It is the relation between the two that matters, and that relation is likely to remain topologically invariant under transformations. Only nonlinear transformations will have some influence on phonetic distances." (Repp, 1987, p16)

The present study examines several auditorily and non-auditorily scaled spectral distance metrics. The distance measures compare the input natural speech with tokens from various digitally simulated channel vocoder configurations which have been designed to distort the input speech in various ways. The input natural speech and the output synthetic speech are precisely time aligned and so the distance measures can be obtained without the confounding influence of other aspects of a speech recognition system such as time warping. It also has the advantage over Klatt's experiment in that distorted synthetic speech can be directly compared to natural speech rather than to reference synthetic speech.

In the present study spectral distances are computed in 10 ms frames utilising Hertz and Bark scales and also dB, phon ("logsone"), sone, Pascal and intensity-jnd scales. Vowel distances are computed for the entire /h_d/ token and for the vowel only. Consonant distances are computed across the entire CV token, for the consonant target, and for the consonant target plus transition. Also examined are dynamic spectral differences based on the differentiation of frequency bands over time.

The advantage of using a vocoder in these experiments is that the spectral difference measures can be computed against a perfectly time-aligned natural speech source. The spectral difference measures, however, still require validation against some measure of resultant perceptual distance. In this study perceptual distance is defined as the difference in intelligibility of each vocoder configuration (either globally or phonetic token by token) compared to natural intelligibility. For example, if a particular phonetic class scored 90% correct intelligibility for natural speech and 81% correct for a particular vocoder configuration then the perceptual distance would be 10% ((90-81)/90). This kind of perceptual difference can be computed globally (all tokens), by phonetic class or feature, or by phoneme. An alternative kind of difference measure would be a more psychoacoustic measure which examined perceived differences between tokens (similar to jnds) or measures of preference. These measures are not as phonetically valid as measures based on intelligibility, which are a more direct measure of phonetic rather than psychoacoustic processing. The spectral distance measures can then be examined in terms of their degree of correlation with the perceptual distance measure. It is assumed that the best correlations will be evidence of the spectral scales and transformations most closely representing those utilised by human perception. The combination of scales so selected is then used to produce spectral representations (e.g. auditory spectrograms) which are utilised in further acoustic analysis of the results.
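The perceptual distance defined above, and its use in validating a spectral distance measure, can be sketched as follows. The per-configuration scores are invented for illustration; only the (90, 81) → 0.10 worked example comes from the text:

```python
import numpy as np

def perceptual_distance(natural_pct, vocoder_pct):
    """Relative intelligibility loss: e.g. natural 90%, vocoder 81% -> 0.10."""
    return (natural_pct - vocoder_pct) / natural_pct

# Hypothetical results for five vocoder configurations (illustrative only):
spectral = np.array([1.2, 2.5, 3.1, 4.8, 6.0])      # mean spectral distance per config
percept = np.array([perceptual_distance(90.0, v)
                    for v in (88.0, 81.0, 78.0, 70.0, 62.0)])

# Validation: how well does the spectral measure track perceptual distance?
r = float(np.corrcoef(spectral, percept)[0, 1])
```

Under the study's assumption, the spectral scale whose distance measure yields the highest such correlation is taken as the best candidate for the representation used in human phonetic processing.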

This component of the present study is similar, in some ways, to a study by Bladon and Lindblom (1981); however, they examined the correlation between a different (but overlapping) set of spectral distance measures and measures of perceived vowel quality differences, whilst this study examines the correlation between spectral distance and vowel and consonant intelligibility.

2.7 Summary

This chapter commenced (section 2.1) with an examination of auditory physiology and theories of hearing. An examination of the relative merits of the place and rate (or time) models of hearing in the context of the various models of auditory nerve representations of sound is necessary as it raises certain issues that tend to complicate the modelling of the auditory processing and representation of speech. A pure place model of hearing implies that information is extracted from the mean firing rate of fibres of the same characteristic frequency whilst a pure rate model of hearing assumes that information in the auditory nerve is extracted by some process analogous to an inter-spike interval histogram. Hybrid rate-place models of hearing imply that both auditory nerve representations are utilised, although this is only possible below 3-5 kHz where phase locking still operates. The consequence of these issues to studies of the auditory representation of speech is that there exists the possibility of multiple representations of speech at the level of the auditory nerve and thus the further possibility that phonetic processing either needs to deal with these multiple representations or indeed is dependent upon them for the effective processing of speech sounds. Modelling of the auditory representation of speech is further complicated by the possibility of peripheral attentional processes which would probably result in non-linearities which would be difficult to model.

Section 2.2 examined psychoacoustic studies of the perception of sound, including speech. The results of such studies indicate a number of psychoacoustic scales based upon linear and non-linear transformations in each of the dimensions of acoustic space. The aim of the present study is to determine which psychoacoustic scales are related to phonetically-relevant representations of speech. It is a premise of the present study that the central psychoacoustic and phonetic processing of speech are largely separate processes which, although probably based on overlapping auditory representations, have very different goals. Psychoacoustic processing of speech largely involves the discrimination between speech signals whilst the phonetic processing of speech is a classificatory process.

Psychoacoustic studies of frequency perception were examined in section 2.2.1. Zwicker's (e.g. Zwicker & Fastl, 1990) model that equates frequency selectivity (Bark), frequency discrimination (frequency-jnds) and pitch (mels) to uniform divisions of cochlear place was discussed in some detail. The model assumes that all three processes are based on a pure place model and that they represent different degrees of fineness of frequency representation. Close examination of the data led to the conclusion that the model is an over-simplification of the actual relationship between these three types of frequency perception. This conclusion is reinforced by the results of more recent work on auditory filter bandwidth that has demonstrated that the critical bandwidths are too broad, particularly at low frequencies, and that the frequency selectivity of the auditory system is more appropriately represented by the ERB scale (see section 2.2.1). The present study assumes that there are three independent psychoacoustic scales of frequency selectivity (Bark or ERB), frequency discrimination (frequency-jnd) and pitch (mels), but only examines frequency representations based on a number of different multiples of the Hertz scale (100, 200, 400 and 800 Hz) and the Bark scale (0.75, 1, 1.5, 2 and 3 Bark) with the 0.75 Bark representation being a very close approximation to the ERB scale above 500-600 Hz (but somewhat broader at lower frequencies). Both the pitch and frequency-jnd scales (as well as formant frequency jnds) are much finer than the Bark and ERB scales and are assumed to be much finer than the representations important to the phonetic processing of speech. This hypothesis is also testable in the present study in that if maximum intelligibility can be demonstrated for one of the much broader representations then it must be assumed that the finer mel and frequency-jnd representations are redundantly detailed.
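The contrast drawn above between critical bandwidth and ERB can be illustrated with two widely used closed-form approximations (Zwicker & Terhardt, 1980, for the critical band; Glasberg & Moore, 1990, for the ERB). These particular formulas are offered only as an illustration of the relationship the text describes:

```python
def critical_bandwidth(f_hz: float) -> float:
    """Zwicker & Terhardt (1980) critical-band width in Hz."""
    return 25.0 + 75.0 * (1.0 + 1.4 * (f_hz / 1000.0) ** 2) ** 0.69

def erb(f_hz: float) -> float:
    """Glasberg & Moore (1990) equivalent rectangular bandwidth in Hz."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)
```

At 100 Hz the critical bandwidth (about 100 Hz) is roughly three times the ERB (about 35 Hz), illustrating that critical bands are too broad at low frequencies; around 1 kHz, 0.75 of a critical band comes close to one ERB, consistent with the 0.75 Bark representation approximating the ERB scale above 500-600 Hz but being broader below.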

Temporal psychoacoustics is examined in section 2.2.2. The process of temporal integration, which is effective over durations of up to 200 ms, is related to the processes of both gap detection and non-simultaneous masking. The most recent studies show that across most frequencies of phonetic interest the auditory system has a time resolution, as measured by gap detection thresholds, that is uniformly 6-8 ms. The differences between forward and backward masking show that the temporal response of the auditory system is asymmetric, with faster responses to stimulus onsets (20 dB in 10 ms) than to offsets (20 dB in 20 ms). Experiments that attempt to determine the temporal separation required before signal components are perceived as separated rather than simultaneous suggest time resolution values of 20-30 ms. Taken together, all of these experiments suggest that the phonetically significant time resolution will lie somewhere between 10 and 30 ms and will be close to uniform across all frequencies. Auditory nerve response and adaptational non-linearities may, however, complicate the process of determining a model of the temporal representation(s) that are phonetically meaningful. The present study examines a number of channel vocoder time resolutions which are uniform across all frequencies and that are defined in terms of both a model of gap detection and also in terms of the filter step response.

Human auditory perception of intensity (section 2.2.3) is also complex, being dependent upon stimulus frequency and duration. There are three main scales that might be considered in the parametric intensity scaling of speech. Those scales are the dB scale (Fechner's law), a scale derived by integrating intensity-jnds (intensity-jnd-rate) and the loudness or sone scale. The parametric scaling of speech is achieved in this study by quantising a speech signal according to varying degrees of separation of one of the scales. The 1 Bark channel vocoder is ideally suited for this purpose with the smoothed (LP filtered) output of each 1 Bark channel filter being quantised according to one of the intensity scales. The smoothed output of each channel filter is effectively the time varying intensity of each of the Bark filters and so the quantisation is effectively determined across a series of auditory filters (n.b. sones need to be calculated from the integrated intensity of an auditory filter). In the present study, the quantisation levels were 1, 2, 4, 8, 16, 32 dB; 0.5, 1, 2, 4, 8, 16 jnds; and 0.2, 0.4, 0.8, 1.6, 3.2, 6.4, 12.8, 25.6 sones (per Bark). Each of the quantisation levels was repeated at 4 presentation levels: 40, 50, 70, 90 dB SPL. The jnd and sone quantised tokens had to be recalculated for each presentation level as jnd or sone step sizes are intensity dependent. It was hypothesised that the appropriate scale for speech perception would be the one that had a constant intelligibility effect for a constant quantisation step size across the presentation levels. The sone values were calculated from the formula related to the Stevens (1972) Mark VII curve. The jnd values were taken from Riesz (1928). The deviations from the more recent jnd determinations of Jesteadt et al (1977) and Florentine et al (1987) are only really significant below 20 dB SL over the vocoder frequency range and are in any case below all of the signal peaks even for the lowest presentation level, except for the two lowest frequency channels.
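The quantisation step described above can be sketched as follows. `quantise` is a hypothetical helper, not the study's actual code: it rounds each sample of a smoothed channel envelope (expressed in dB, jnds or sones) to the nearest multiple of the chosen step size, so the resolution of the scale being tested is degraded in a controlled way:

```python
import numpy as np

def quantise(envelope, step):
    """Round each value of a channel envelope to the nearest multiple of
    `step` (e.g. a dB envelope quantised in 4 dB steps)."""
    return np.round(np.asarray(envelope, dtype=float) / step) * step
```

The maximum error this introduces is half a step, so coarser steps (e.g. 32 dB rather than 1 dB) progressively remove intensity detail; for the jnd and sone scales the step sizes vary with intensity, which is why those tokens had to be recalculated at each presentation level.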

The issue of the perception of vowels using formant or whole spectrum representations has also been raised in this chapter. This question is not directly addressed in this study. What is addressed is the related issue of the adequacy of a formant representation of speech in speech synthesis. This issue is examined by comparing the performance of the JSRU formant vocoder against the Hz- and Bark-scaled channel vocoders. It is expected that the formant vocoder will perform well with respect to vowel intelligibility but not with respect to consonant intelligibility. It is hypothesised that a formant representation will prove to be inadequate for consonants but not for vowels upon which the formant model is based.

Auditory perception has been shown to be sensitive to phase changes and, further, the "edges" (onsets and offsets) of the speech signal have been shown to be strongly encoded in the speech phase spectrum. Auditory phase psychoacoustics will not be examined directly. Rather, this study will examine the phase representation of the channel vocoder. It will be pointed out in section 3.1 that one of the major design assumptions of the channel vocoder is that the phase of the speech spectrum is not necessary and so is discarded. This produces a synthesised waveform that is the result of the addition of a number of channels in cosine phase. This can be corrected in various ways, including the extreme approach of maintaining the natural phase spectrum and reinserting it later. A number of approaches to modifying channel vocoder phase representations will be examined together with their effect on speech intelligibility.

In chapter 5 the relationship between spectral distance measures and a measure of perceptual distance is examined. The spectral difference measures are based on all of the perceptually tested frequency and amplitude scales, plus two additional scales, the logsone and the Pascal scales. A further distance measure based on spectral differentiation (or spectral velocity) is also examined. Finally a distance measure which attempts to model the temporal adaptation of auditory nerve fibres will also be examined. All of these distance measures will be assessed by the extent of their correlation with a perceptual distance measure which is based on the percentage error rate relative to natural intelligibility.

All of the experiments are assessed in terms of intelligibility rather than just perceived quality changes. The intelligibility tests are performed on minimally contrastive nonsense /h_d/ and CV syllables. The aim of this whole study is to examine the phonetic effect of parametric scaling and it seems reasonable to use intelligibility as a measure of that effect.

"...while many experiments in auditory perception, and sensory psychophysics have commonly focused on experimental tasks involving discrimination of both spectral and temporal properties of auditory signals, such tasks are often inappropriate for the study of more complex signals including speech. Indeed, in the case of speech perception and probably the perception of other complex auditory patterns, the relevant task for the observer is more nearly one of absolute identification rather than differential discrimination. Listeners almost always try to identify, on an absolute basis, a particular stretch of speech or try to assign some label or sequence of labels to a complex auditory pattern. Rarely, if ever, are listeners required to make fine discriminations that approach the limits of their sensory capacities." (Pisoni & Luce, 1987)


1. This is not a true amplification in that there is no overall increase in acoustic power, i.e. there is no mechanism for inserting extra energy.

2. The scala vestibuli and the scala tympani are filled with perilymph, a fluid resembling extracellular fluid. The scala media or cochlear duct is filled with potassium ion (K+) rich endolymph, a fluid more similar to intracellular fluid. (Pickles, 1988)

3. "Rippled noise, whose long-term spectrum varies sinusoidally on a linear frequency scale, is produced by adding a white noise to a copy of itself which has been delayed by T sec." (Patterson & Moore, 1986, p 147) Depending upon the phase of the copy relative to the original noise, this results in either peaks or dips in the spectrum at multiples of 1/T Hz.

