Department of Linguistics


Perception of Parametrically Rescaled Speech

This project examines auditory processing of speech, utilising speech processed by a channel vocoder specially designed to modify the frequency, phase, intensity and temporal properties of speech signals. These modifications are made to the filter characteristics of the analysis (input) and resynthesis (output) filterbanks or to the low-pass filter that defines the temporal response of the system. Figure 1 illustrates the design of the channel vocoder used in these experiments.

Figure 1: Channel vocoder used to manipulate frequency, time,
intensity and phase properties of speech signals.
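The signal path just described can be illustrated with a minimal sketch of a channel vocoder. This is not the project's actual implementation (whose filter designs are described in the papers below); it is an illustrative envelope vocoder in which bandpass filtering is done by FFT masking, envelope smoothing stands in for the low-pass temporal-response filter, and noise carriers stand in for the resynthesis filterbank. All function names and parameter values are assumptions for the sketch.

```python
import numpy as np

def bandpass_fft(x, fs, lo, hi):
    """Isolate one analysis band by zeroing FFT bins outside [lo, hi) Hz."""
    X = np.fft.rfft(x)
    f = np.fft.rfftfreq(len(x), 1.0 / fs)
    X[(f < lo) | (f >= hi)] = 0.0
    return np.fft.irfft(X, n=len(x))

def envelope(band, fs, smooth_ms):
    """Full-wave rectify, then smooth with a moving average; the averaging
    window plays the role of the vocoder's low-pass (temporal) filter."""
    n = max(1, int(fs * smooth_ms / 1000.0))
    return np.convolve(np.abs(band), np.ones(n) / n, mode="same")

def channel_vocoder(x, fs, edges, smooth_ms=10.0):
    """Analyse x into contiguous bands, extract each band's smoothed
    envelope, and resynthesise by modulating band-limited noise carriers."""
    rng = np.random.default_rng(0)
    out = np.zeros_like(x)
    for lo, hi in zip(edges[:-1], edges[1:]):
        env = envelope(bandpass_fft(x, fs, lo, hi), fs, smooth_ms)
        carrier = bandpass_fft(rng.standard_normal(len(x)), fs, lo, hi)
        out += env * carrier
    return out
```

Changing `edges` alters the frequency resolution of the system and changing `smooth_ms` alters its temporal resolution, which is exactly the pair of manipulations explored in the experiments below.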

Test Design Issues

Test design has been a key issue in this project. In early experiments, subjects were presented with each condition without any preceding training condition. This produced much poorer intelligibility results for the more distorted frequency or time conditions, as subjects had no opportunity to become accustomed to the task, the tokens, or the appropriate orthographic responses. These design issues interacted with the actual degree of distortion to produce a steeper reduction in intelligibility than is found in later experiments. The earlier results are not invalid, but they are not readily comparable with later experiments, in which subjects were first presented with an undistorted natural speech condition and then with a series of conditions of increasing distortion. This approach yields much more conservative measurements of the effects of distortion on intelligibility. In all of the preliminary reports listed under "Relevant Papers" below, results for Hertz-scaled or Bark-scaled frequency distortion and for time distortion (or time "smearing") are based on the original experimental method.

The only paper in which the two methodologies were mixed is Mannell (2002), where the older Bark-scaled results (for non-overlapping filters) are compared with new results for overlapping Bark filters obtained using the new, more conservative methodology. As a consequence, the comparisons made in that conference paper, and the conclusions drawn from them, are not valid.

Because of this, all experiments carried out in the earlier parts of this project have been repeated using the new methodology. Both the trained and untrained conditions have been retained (for the Bark-scaled conditions only), as both methodologies produce interesting results, but great care is taken to ensure that appropriate conditions are compared during data analysis.

Another issue is whether subjects should be given a closed list to select from (i.e. a list of acceptable CD and h_d orthographic responses) or whether they should be permitted to make any response, without prior assistance regarding acceptable responses. After some experimentation it was decided to provide a list of acceptable orthographic responses and to require that responses be drawn from it. This approach overcomes a number of lexical access issues. For example, real words are more easily perceived than non-words. Some untrained responses were ambiguous (e.g. some responses failed to distinguish unambiguously between certain vowel pairs). Some non-words or rare words were heard as neighbouring (non-h_d) real words (e.g. "hod" was heard as "hog" by several subjects). The provision of a list of acceptable responses greatly improves the intelligibility of non-words (e.g. "hud"), rare words (e.g. "hod") and complex orthographic forms (e.g. "who'd").

Frequency Distortion

Frequency distortion experiments manipulate the number and bandwidth of the filters in the bandpass filterbanks (the input and output filterbanks are always identical). In a typical frequency distortion experiment, a frequency scale is first selected (e.g. Bark) and then the various degrees of distortion are determined. For example, in the Bark experiments, filterbanks with 1, 2 and 3 Bark bandwidths are constructed. Each filterbank is then used in turn to generate a set of vocoded speech tokens. As 1 Bark filterbanks are similar to auditory filterbanks, we might expect 1 Bark filtered speech to have intelligibility scores approaching that of natural (undistorted) speech, whilst we might expect 3 Bark filtered speech to result in significantly degraded intelligibility.
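The construction of such filterbanks can be sketched as follows. This uses Traunmüller's (1990) approximation of the Bark scale, which may differ from the conversion actually used in the project; the frequency span and function names are assumptions for the sketch. Filter edges are spaced uniformly on the Bark scale and converted back to Hertz, so each filter is the same width in Bark but increasingly wide in Hertz.

```python
import numpy as np

def hz_to_bark(f):
    """Traunmueller's (1990) approximation of the Bark scale."""
    return 26.81 * f / (1960.0 + f) - 0.53

def bark_to_hz(z):
    """Exact inverse of the approximation above."""
    return 1960.0 * (z + 0.53) / (26.28 - z)

def bark_filterbank_edges(width_bark, f_lo=100.0, f_hi=5000.0):
    """Edge frequencies (Hz) of contiguous, non-overlapping filters,
    each `width_bark` Bark wide, spanning [f_lo, f_hi]."""
    z_lo, z_hi = hz_to_bark(f_lo), hz_to_bark(f_hi)
    n = int(np.floor((z_hi - z_lo) / width_bark))  # number of filters
    z_edges = z_lo + width_bark * np.arange(n + 1)
    return bark_to_hz(z_edges)
```

A 3 Bark filterbank therefore has roughly a third as many channels as a 1 Bark filterbank over the same frequency range, which is how a single vocoder design yields graded degrees of frequency distortion.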

The following frequency distortion conditions have been tested:-

  1. Non-overlapping Bark (1, 2 and 3 Bark) - untrained condition - open set responses
  2. Non-overlapping Bark (1, 2 and 3 Bark) - untrained condition - closed set responses
  3. Non-overlapping Bark (1, 2 and 3 Bark) - trained (conservative) condition - open set responses
  4. Non-overlapping Bark (1, 2 and 3 Bark) - trained (conservative) condition - closed set responses
  5. Overlapping Bark (1, 2 and 3 Bark) - trained (conservative) condition - closed set responses
  6. Non-overlapping ERB (1, 2, 3, 4 and 5 ERB) - trained (conservative) condition - closed set responses
  7. Overlapping ERB (1, 2, 3, 4 and 5 ERB) - trained (conservative) condition - closed set responses
  8. Non-overlapping Hertz (100, 200, 400 and 800 Hz) - trained (conservative) condition - closed set responses

Time Distortion

Time distortion experiments manipulate the impulse response time of the low-pass filter that smooths the output of each of the input bandpass filter channels. Five low-pass filters were designed with half impulse response times of 10, 20, 40, 60 and 80 ms. The time domain characteristics of these filters have also been used to create five series of white noise tokens with "filtered" gaps in them. These tokens, as well as standard white noise and filtered noise (and tone complex) gap tokens, have been used to determine the gap detection thresholds of a number of subjects with good hearing. In this way it has been possible to correlate the responses of these filters with actual auditory gap detection thresholds, so that these experiments can be more closely related to existing measures of auditory temporal behaviour in people with hearing loss.

The time distortion experiments have been carried out with 1, 2 and 3 Bark bandpass filter channels, so that the interaction between the effects of time and frequency distortion can be determined.

The following time distortion conditions have been tested (all conditions use the trained closed set methodology):-

  1. Time resolution 10, 20, 40, 60, and 80 ms (half impulse response times) with 1 Bark bandpass filters
  2. Time resolution 10, 20, 40, 60, and 80 ms (half impulse response times) with 2 Bark bandpass filters
  3. Time resolution 10, 20, 40, 60, and 80 ms (half impulse response times) with 3 Bark bandpass filters
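The temporal smearing step can be sketched as below. The actual low-pass filter designs used in the project are not specified here, so this sketch substitutes a Hann-window FIR kernel whose rise (and decay) time equals the stated half impulse response time; the function names and sampling rate are assumptions.

```python
import numpy as np

def smoothing_filter(half_ir_ms, fs):
    """FIR low-pass (envelope smoothing) kernel built from a Hann window.
    The kernel rises from zero to its peak over `half_ir_ms` milliseconds
    and decays symmetrically, so its half impulse response time is
    half_ir_ms. (Illustrative stand-in for the project's filters.)"""
    half_n = int(round(fs * half_ir_ms / 1000.0))
    h = np.hanning(2 * half_n + 1)
    return h / h.sum()                 # unity DC gain preserves envelope level

def smear(env, half_ir_ms, fs):
    """Temporally smear a channel envelope with the chosen filter."""
    return np.convolve(env, smoothing_filter(half_ir_ms, fs), mode="same")
```

A gap in a channel envelope that a 10 ms filter leaves intact is partly filled in by an 80 ms filter, which is the mechanism that the gap detection experiments above tie back to auditory temporal resolution.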

Phase Distortion

This experiment examines the intelligibility of vocoded speech with good frequency (1 Bark) and time (10 ms) characteristics. In channel vocoded speech the phase spectrum is normally set to zero (the "zero phase" condition). This results in a large proportion of the energy of a single glottal cycle being concentrated in the first few samples of that cycle, which often results in overloads (peak clipping). As a compromise, a pseudo-phase approach was adopted whereby the output of each channel (from low to high frequency) was delayed by one sample relative to the channel immediately below it. For systems with large numbers of channels this procedure was modified to limit the delay to a maximum of about 18 samples (i.e. 1.8 ms). This is referred to below as the "delay phase" condition. In a third condition the phase spectrum of the original natural speech was extracted and then added (following careful temporal alignment) to the vocoded spectrum (the "natural phase" condition). In two further conditions a synthetic phase spectrum was applied in which phase decreased linearly with increasing frequency (0° to -360° for one condition and 0° to -810° for the other). To summarise:-

  1. zero phase
  2. delay phase
  3. natural phase (reinserted)
  4. 0° to -360° phase
  5. 0° to -810° phase
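The delay phase scheme described above amounts to a small, capped per-channel time shift, which can be sketched as follows (the array layout and cap value follow the description above; the function name is an assumption):

```python
import numpy as np

def delay_phase(channels, max_delay=18):
    """Apply the 'delay phase' scheme: channel k is delayed by one sample
    relative to channel k-1, capped at max_delay samples, so that the
    energy of each glottal cycle is spread over the first couple of
    milliseconds rather than piled into its first few samples."""
    out = np.zeros_like(channels)
    for k in range(channels.shape[0]):          # low to high frequency
        d = min(k, max_delay)
        out[k, d:] = channels[k, : channels.shape[1] - d]
    return out
```

Summing the rows of the result then gives the resynthesised waveform, with the peak-clipping risk of the zero phase condition reduced at negligible cost in intelligibility.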

The first three conditions were quite similar to each other in intelligibility. The final two conditions showed significant degradation in intelligibility relative to the first three. These results justified the continued use of the delay phase algorithm in all other parts of this project. An initial report of these results appears in Mannell (1990).

Intensity Distortion

These experiments use intensity quantisation of the output of each of the analysis channel filters. The procedure is first to select an intensity scale (e.g. dB), then to decide upon the presentation level for the current condition (e.g. 70 dB SPL - ref 20µPa), and then to choose the size of the quantisation steps for that scale (e.g. 2 dB steps). Each output sample from each channel is then set to the nearest quantisation step value (converted back to a 16 bit number). These quantisation step values may differ according to the scale, the step size on that scale, the presentation level and the centre frequency of each channel. The internal RMS intensity value for the calibration tone is known, so the actual presentation level for each data point from each channel is also known. As a consequence it is possible to set each channel filter output value accurately to the nearest quantisation value for that condition.
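For the dB scale the quantisation step is straightforward to sketch: convert each envelope sample to dB, snap it to the nearest multiple of the step size, and convert back. This is an illustrative version only (the project's actual code works per channel on 16 bit values, and jnd, Sone and log2Sone quantisation require their own scale conversions); the function name and reference value are assumptions.

```python
import numpy as np

def quantise_db(env, step_db, ref=1.0):
    """Quantise envelope samples to the nearest step on a dB scale.
    Zero samples (silence) are left at zero, since they have no dB value."""
    env = np.asarray(env, dtype=float)
    out = np.zeros_like(env)
    nz = env > 0
    db = 20.0 * np.log10(env[nz] / ref)                  # linear -> dB
    out[nz] = ref * 10.0 ** (np.round(db / step_db) * step_db / 20.0)
    return out
```

The same snap-to-nearest-step logic applies on the other scales; only the forward and inverse conversions change, which is why the choice of scale is the experimental variable of interest.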

The intensity/loudness scales chosen were intensity jnds (i.e. just noticeable differences), Sones (measures of perceived relative loudness), log2Sones (related linearly to the phon loudness level scale), and deciBels. Quantisation was carried out on tokens destined to be presented at 40, 50, 70 and 90 dB (SPL - ref 20µPa). To summarise, the conditions were:-

  1. intensity jnd @ 40dB
  2. intensity jnd @ 50dB
  3. intensity jnd @ 70dB
  4. intensity jnd @ 90dB
  5. Sone @ 40dB
  6. Sone @ 50dB
  7. Sone @ 70dB
  8. Sone @ 90dB
  9. log2Sone @ 40dB
  10. log2Sone @ 50dB
  11. log2Sone @ 70dB
  12. log2Sone @ 90dB
  13. deciBel @ 40dB
  14. deciBel @ 50dB
  15. deciBel @ 70dB
  16. deciBel @ 90dB
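The relationship between the Sone and log2Sone scales mentioned above follows from Stevens' loudness relation, in which loudness in sones doubles for every 10 phon increase above 40 phons; taking log2 of the sone value therefore gives a quantity linear in phons. A minimal sketch (function names are assumptions):

```python
import math

def phon_to_sone(phon):
    """Stevens' relation: 40 phons = 1 sone, +10 phons doubles loudness."""
    return 2.0 ** ((phon - 40.0) / 10.0)

def sone_to_phon(sone):
    """Inverse: log2(sones) is linear in phons, as noted above."""
    return 40.0 + 10.0 * math.log2(sone)
```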

It was predicted that the intensity/loudness scale that most closely matched the neural encoding of signal intensity would result in the most similar intelligibility curves for all presentation levels.

Preliminary results for these experiments are reported in Mannell (1991a, 1992, 1994a). It should be noted that the deciBel quantisation step sizes reported in these papers were mis-scaled by a factor of two, so that, for example, a step size of 32 dB was misreported as a step size of 16 dB. This has been verified by replicating the 70 dB condition (i.e. by producing new quantised vocoded samples) and testing against a new set of subjects.

Spectral and Perceptual Distance

The experimental design of this project has produced a very large number of time-aligned but acoustically differentiated tokens, associated with an equally large amount of perceptual (i.e. intelligibility) data. This permits an analysis of the spectral distance between each vocoded token and its time-aligned natural equivalent, and a comparison of these spectral distances with the intelligibility of each vocoded token (relative to the intelligibility of the original natural token). This provides a useful tool for probing intelligibility data, and particularly confusion matrix data, and relating them to the spectral distance data. As the vocoded and natural tokens are time-aligned, it is possible to examine changes in various spectral features (i.e. acoustic features localised at particular positions on a spectrogram) and relate these changes to the intelligibility and confusion data. Such features can be identified by looking for regions of the spectrograms whose changes in spectral distance from natural speech best correlate with changes in intelligibility. Preliminary results for these experiments are reported in Mannell (1994b).
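The core of such an analysis can be sketched with a simple frame-wise distance measure and a correlation against intelligibility scores. The specific distance measure used in the project is discussed in Mannell (1988, 1994b); this sketch substitutes a plain Euclidean distance between aligned spectrogram frames, and the function names are assumptions.

```python
import numpy as np

def spectral_distance(spec_a, spec_b):
    """Mean Euclidean distance between two time-aligned spectrograms
    (frames in rows, frequency bins in columns), one frame at a time."""
    return float(np.mean(np.linalg.norm(spec_a - spec_b, axis=1)))

def pearson(distances, intelligibility):
    """Correlation between per-token spectral distances and per-token
    intelligibility scores (expected to be negative: greater distance
    from natural speech, lower intelligibility)."""
    x = np.asarray(distances, float)
    y = np.asarray(intelligibility, float)
    return float(np.corrcoef(x, y)[0, 1])
```

Restricting `spectral_distance` to sub-regions of the spectrogram, and finding the regions whose distances correlate best with intelligibility, is the feature-localisation idea described above.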

Auditorily Scaled Spectrograms

Related to this project is an attempt to produce auditorily scaled spectrograms. Some of these spectrograms are used in the part of the project that deals with spectral and perceptual distance.

A comparison of Hertz, Bark and ERB-scaled spectrograms is presented on a separate page.

A more general overview of auditory representations of speech is also available.

Project Status

Over the course of this project about 600 subjects have participated in the speech perception and related psychoacoustics experiments. Data collection for this project is now complete. All data entry is complete. Statistical analyses and detailed confusion analyses are in progress. Currently only preliminary reports have been produced (see below). Several papers are being prepared for submission to refereed journals.

Relevant Papers

Mannell R.H., (2002), "The Perception of speech processed with non-overlapping and overlapping filters in a bark-scaled channel vocoder", Proceedings of the Ninth Australian International Conference on Speech Science and Technology, Melbourne, December 2002.

Mannell R.H. (1994b), "The prediction of 'perceptual distance' from spectral distance measures based upon auditory and non-auditory models of intensity scaling", Proceedings of the Fifth Australian International Conference on Speech Science and Technology, Perth, Dec. 1994.

Mannell R.H. (1994a), The Perceptual and Auditory Implications of Parametric Scaling in Synthetic Speech, Unpublished Doctoral Dissertation, Macquarie University, Sydney, Australia (chapter 2, chapter 3, bibliography)

Mannell R.H. (1992), "The effects of presentation level on sone, amplitude-j.n.d. and deciBel quantisation of channel vocoded speech", Proceedings of the Fourth Australian International Conference on Speech Science and Technology, Brisbane, Nov. 1992.

Mannell R.H. & Clark J.E., (1991b), "A comparison of the intelligibility scores of consonants and vowels using channel and formant vocoded speech", Proceedings of the Twelfth International Congress of Phonetic Sciences, Aix-en-Provence, France, 19-24 August, 1991.

Mannell R.H., (1991a), "Sone-scaled and intensity-j.n.d.-scaled spectral quantisation of channel vocoded speech", Proceedings of the Twelfth International Congress of Phonetic Sciences, Aix-en-Provence, France, 19-24 August, 1991.

Mannell R.H., (1990), "The effects of Phase information on the intelligibility of channel vocoded speech", Proceedings of the Third Australian International Conference on Speech Science and Technology, Melbourne, Nov. 1990.

Mannell, R.H. and Clark, J.E., (1990), "The Perceptual consequences of frequency and time domain parametric encoding in automatic analysis and resynthesis of speech", a paper presented at the International Conference on Tactile Aids, Hearing Aids and Cochlear Implants, National Acoustics Laboratories, Sydney, May 1990.

Clark J.E., & Mannell R.H., (1989) "Frequency resolution effects on phonetic level perception of synthesised speech", Proceedings of ESCA Tutorial Day and Workshop on Speech Input/Output Assessment and Speech Databases, Noordwijkerhout, the Netherlands, 20-23 September, 1989, pp 1.8.1 - 1.8.4

Clark J.E., & Mannell R.H., (1988) "Some comparative characteristics of uniform auditorily scaled channel synthesis", Proceedings of the Second Australian International Conference on Speech Science and Technology, Sydney, Nov. 1988. pp 282-287

Mannell R.H., (1988) "Spectral distortion and spectral distance measures", Proceedings of the Second Australian International Conference on Speech Science and Technology, Sydney, Nov. 1988. pp 158-163

Clark J.E., Mannell R.H., & Ostry D., (1987) "Time and frequency resolution constraints on synthetic speech intelligibility", Proceedings of the Eleventh International Congress of Phonetic Sciences, Tallinn, Estonia, Aug. 1987.

Mannell R.H., Ostry D., & Clark J.E., (1985) "Channel vocoder performance", Working Papers, 1985, Speech Hearing and Language Research Centre, Macquarie University.