Department of Linguistics
Channel Vocoders: Design and Parametric Scaling
Original Article: Mannell, R.H., (1994), The perceptual and auditory implications of parametric scaling in synthetic speech, Unpublished Ph.D. dissertation, Macquarie University (slightly modified parts of chapter 3)
The first channel vocoder was invented by Dudley (1939) and the basic principles established then remain valid today. During the intervening years, a great variety of speech coding devices have developed from those beginnings. However, the channel vocoder remains a viable and still commonly used system (Schroeder, 1966; Gold & Rader, 1967; Gold et al, 1981). Originally, vocoders were designed to permit efficient transmission of speech over telephone lines. Today, however, speech coding devices are used in a variety of "...applications in the transmission, storage and encryption of speech signals" (Schroeder, 1966). Modern vocoders are also used in the improvement of some types of degraded speech (eg. diver's "helium speech") and as aids to the handicapped (eg. tactile vocoders which permit the deaf to feel sounds).
Apart from channel vocoders, two other types of vocoders are particularly familiar to speech researchers. The first of these is the Formant vocoder in which the frequency and amplitude of the first three or four formants are extracted and transmitted. The resynthesis can be done either serially or in parallel, and different classes of speech sounds respond best to different configurations. (see eg. Schroeder, 1966; Holmes, 1982). The second of these common vocoder types is the LPC (linear prediction coefficient) vocoder, which predicts the position of the N major poles in the band being analysed. This all-pole model is adequate for non-nasal voiced sounds, but "...for nasals and fricative sounds, the detailed acoustic theory calls for both poles and zeros in the vocal tract transfer function." (Rabiner and Schafer, 1978, p398) They noted further, however, that if the number of predictor coefficients is high enough, "...the all-pole model provides a good representation for almost all the sounds of speech." (ibid) ( see also, eg. Atal et al, 1970,1971)
The basic notion behind all vocoders is that there is a great deal of redundancy in the natural speech signal and that transmission efficiency (and thus transmission rate) can be increased by ignoring redundancies. Clearly, the different types of vocoders make different a priori assumptions about the nature of speech and thus about what to code and what not to code. Nearly all vocoder systems assume that phase information is not required since the ear is considered by most workers to be largely insensitive to phase. (but see Flanagan et al, 1966; Oppenheim et al, 1979,1981; Mathes et al, 1947; Schroeder, 1959). LPC vocoders also make assumptions about the number of poles required to model the speech adequately, whilst formant vocoders make further assumptions about the nature, behaviour and relative importance of the various poles. Apart from the phase assumption, the only other assumption that channel vocoders make about speech is that the smoothed envelope of short-term speech varies slowly and so this short-term power spectrum can be sampled at a rate proportional to the assumed rate of variation in the spectral envelope. Because they make fewer assumptions about the information bearing parameters of the speech signal, channel vocoders are generally capable of less signal compression than the LPC and formant vocoders. On the other hand, they are also less likely to impose artificial, theoretically derived structural constraints on the output speech. Since one of the purposes of this study is to examine the interactions between time and frequency resolution and speech intelligibility it is desirable to limit the possibility of such interfering constraints. At the same time, the present project has no interest in the maximisation of speech compression.
To summarise, the reasons for selecting the channel vocoder for the present study are:-
- "It avoids any major inbuilt apriori assumptions about the primacy or otherwise of particular spectral features such as energy peaks, as bearers of phonological information. Rather all forms of the channel vocoder provide a comprehensive encoding of the input signal spectral information which has an accuracy predominantly determined by its filterbank and frame rate transmission characteristics. ...
- It allows a simple and very robust means of extracting a parametric description of the input speech spectral energy distribution.
- It does not discriminate against idiosyncratic spectral properties of any input speech signal (because of its comprehensive spectral encoding noted in (a) above).
- The frequency and time resolution characteristics of this class of vocoder are explicit in their actual structure."
1 Channel Vocoder Design
Basically, the channel vocoder separates a combination of the vocal tract filter function and source slope (or spectral envelope) from the residual source or excitation function (spectral fine structure) in the frequency domain, and voiced from unvoiced segments in the time domain. The general design of a channel vocoder is well known (see Dudley, 1939; Schroeder, 1966; Gold & Rader, 1967; Gold et al, 1981) and readers requiring an overview of channel vocoder design are referred to this literature. The following sub-sections deal with the details of the design of the channel band pass filters, the low pass demodulation filter and the vocal source and pitch algorithms utilised in the channel vocoder that was used in this study (see figure 1). For the present project a digital software vocoder was implemented because there was no need for real-time output and so hardware implementations designed to maximise speed were not required. Further, software simulation allows great design flexibility and ease of modification, and the availability of digital filter design software made the production of a large variety of filters relatively easy.
Figure 1: The design of the channel vocoder used in the present study
1.1 Digital filters
1.1.1 Channel Filters
Opinion as to the ideal filter type for channel vocoders has varied greatly and there have been many types of filters used in actual designs over the years. Gold et al (1981, p13) mention "Butterworth, Bessel, Lerner, and Chebyshev, as well as finite impulse response filters of the optimal equiripple or the frequency sampling variety." Digital filters can be divided into two classes, finite impulse response (FIR) and infinite impulse response (IIR) filters, each with their advantages and disadvantages. FIR filters "...have no non-zero poles, only zeros. Also, FIR systems can have exactly linear phase." (Rabiner and Schafer, 1978, pp20-21). IIR filters on the other hand have "...poles as well as zeros" (ibid). Rabiner and Schafer (ibid) explain that the possibility of exactly linear phase in FIR filters greatly facilitates precise time alignment, as well as simplifying the problem of approximating filter design specifications, since "...it is only necessary to be concerned with approximating a desired magnitude response..." (ibid, p20) because "...the criterion of linear phase for the composite filter bank response is trivially met if the individual filters have identical linear phase characteristics" (ibid, p295). On the other hand, IIR filter specifications are often easier to implement, especially when "...sharp cutoff frequency selective filters..." (ibid, p22) are required, because "...the IIR filter is often orders of magnitude more efficient in realizing sharp cutoff filters than FIR filters." (ibid, p23). IIR filters are generally designed using easily implemented "...transformations of classical analog design procedures" (Rabiner and Schafer, 1978, but also see Rader and Gold, 1967). These procedures produce digital versions of Butterworth (maximally flat amplitude), Bessel (maximally flat group delay), Chebyshev (equiripple in either passband or stopband) and Elliptic (equiripple in both passband and stopband) filter designs.
Some well known methods for the design of FIR filters have been listed by Rabiner and Schafer (ibid, p20), including :-
- Window design (see Rabiner & Gold, 1975)
- Frequency sampling design (see Rabiner & Gold, 1975; Rader & Gold, 1967)
Some authors (eg. Golden, 1968; Holmes, 1980) have recommended that although the filters in the analysis filter bank should all be kept in the same phase, adjacent filters in the synthesis filter bank should be in opposite phase. By this, they are referring to the multiplication of every second filter's output by -1 before summation. This is necessitated by the use of IIR filters with their lack of exact linear phase, and the possibility of the phase relationships between adjacent filters causing subtraction rather than addition of signals at frequencies between the adjacent channel centres. Gold et al (1981) noted that FIR filters are to be preferred "...because their phase can be kept linear at the transitions..."(p16) although they have the disadvantage of requiring a higher order than an IIR filter with a similar transition. Zahorian and Gordy (1983) compared listener preference of speech synthesised by FIR and IIR filters and found that FIR filters were "distinctly preferred" over IIR filters.
For the present study, FIR filters have been designed using a software package that utilises the window design method. The bandpass filter banks have been produced by first producing a lowpass prototype which is then converted into the bandpass filters by applying a "frequency-band transformation" (Golden, 1968) for each bandpass filter. This amounts to modulation of the LP prototype with the centre frequency of the required BP filter. The analysis and synthesis bandpass channel filter banks were identical in these experiments except for the experiments on phase, in which the phase relations of the synthesis filters were varied. For the remainder of the conditions there was, however, a slight difference in the synthesis bank (cf. the analysis bank). If the outputs of the synthesis banks are simply added together, this results in cosine phase addition with a resultant waveform that has an extremely large initial peak followed by only very small fluctuations to the end of each period. This was found to result in dynamic range problems with the initial peak sometimes being clipped. This could be avoided by reducing the intensity of the internal representation, but this would result in an increase in quantisation noise. This was considered to be undesirable, especially considering the potential sensitivity of the intensity perception experiments to such effects. The possibility of setting adjacent channels to opposite phase was considered, but it was found that for these FIR filters this resulted in non-linearities at filter transitions. Instead, it was found that delaying each channel by one sample relative to the next lower channel produced the best results and did not result in non-linearities (see section 2.4 for further details on phase methodology). It should be noted that this results in a delay of the highest channel relative to the lowest channel of 4.8 ms in one condition, 2.4 ms in a second condition and less than 2 ms in all other conditions. Delays of this order are not resolved by the auditory system and the high and low channels would be perceived as simultaneous.
Two sets of channel filter banks were designed for use in the present project. One set of filters was uniformly spaced on the Hz scale whilst the second set of filters was uniformly spaced on the Bark scale. In order to simulate varying frequency resolution, each of these scales was divided uniformly by varying numbers of channel bandpass filters. The actual spacing of these filters and their relationship to models of frequency resolution will be examined in section 2.1.
The channel band pass filters are derived by multiplying a prototype low pass filter with an appropriate -6 dB cutoff frequency (Fc) by the centre frequency of the desired band pass filter. The filters are designed to overlap at the -6dB point (half pressure), rather than the -3 dB point (half intensity) because during resynthesis the remodulated channels are added together in the time domain (ie. addition of time varying pressure waves) and so a -6 dB crossover results in a flat frequency response across the whole 5 kHz range. Each filter is attenuated to -60 dB by the centre of the adjacent filter. In this study, channel filter bandwidth is defined as the bandwidth at the -6 dB cross-over point with the adjacent filters (see figure 2) and this value is equivalent to 2Fc where Fc is the -6dB cutoff frequency of the prototype lowpass filter. The time resolution of each channel filter is derived using the sampling theorem (t = 1 / 2Fc, see section 2.2). All filters are symmetrical in either the Hz or the Bark scale. Figure 2 displays the design characteristics of three adjacent BP filters.
Figure 2: Channel filter design showing crossover at -6 dB, and -60 dB attenuation at the centre of the adjacent filter. Also, all sidelobes are < -60 dB. The frequency table can be either Hertz or Bark.
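The design constraints described above (a window-designed lowpass prototype, modulation to each centre frequency, crossover at -6 dB and at least 60 dB of attenuation at the adjacent channel centre) can be illustrated with a minimal sketch. The 10 kHz sampling rate, the Blackman window, the 401-tap length and the two example centre frequencies are all assumptions made for illustration; the source does not state which window or filter length its design package used.

```python
import math

FS = 10_000.0   # assumed sampling rate (Hz) for a 0-5 kHz band
N_TAPS = 401    # assumed filter length (odd, so there is a centre tap)

def lowpass_prototype(fc, fs=FS, n_taps=N_TAPS):
    """Blackman-windowed sinc lowpass; its gain at fc is close to 0.5 (-6 dB)."""
    mid = (n_taps - 1) // 2
    taps = []
    for n in range(n_taps):
        k = n - mid
        x = 2.0 * fc / fs * k
        sinc = 1.0 if k == 0 else math.sin(math.pi * x) / (math.pi * x)
        window = (0.42 - 0.5 * math.cos(2 * math.pi * n / (n_taps - 1))
                  + 0.08 * math.cos(4 * math.pi * n / (n_taps - 1)))
        taps.append(2.0 * fc / fs * sinc * window)
    return taps

def bandpass_from_prototype(lp, centre, fs=FS):
    """Shift the lowpass prototype to `centre` by cosine modulation."""
    mid = (len(lp) - 1) // 2
    return [2.0 * h * math.cos(2 * math.pi * centre * (n - mid) / fs)
            for n, h in enumerate(lp)]

def response(taps, freq, fs=FS):
    """Zero-phase frequency response at `freq` (taps symmetric about the centre)."""
    mid = (len(taps) - 1) // 2
    return sum(h * math.cos(2 * math.pi * freq * (n - mid) / fs)
               for n, h in enumerate(taps))

FC = 100.0                                   # prototype -6 dB cutoff; bandwidth = 2 * FC
lp = lowpass_prototype(FC)
bp_a = bandpass_from_prototype(lp, 1000.0)   # two adjacent 200 Hz channels
bp_b = bandpass_from_prototype(lp, 1200.0)

# At the crossover the two channels each contribute about half pressure.
crossover_sum = response(bp_a, 1100.0) + response(bp_b, 1100.0)
# Leakage of channel A at the centre of channel B, in dB.
leak_db = 20 * math.log10(abs(response(bp_a, 1200.0)))
```

Because both channels share the same linear phase, their responses add coherently, which is why a -6 dB (half pressure) crossover rather than a -3 dB crossover gives a flat sum.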
1.1.2 Demodulation Low Pass Filters
Demodulation LP filter design has generally been considered to be relatively straightforward. Rader (1963) demonstrated, using averaged spectra from different channels, that most speech information is found in the 0-25 Hz band of the channel output. Further, the lowest fundamental expected in male speech is about 50 Hz. This led most channel vocoder designers to specify demodulation LP filters with a cutoff at about 25 Hz and with good attenuation at 50 Hz (see Rabiner and Gold, 1975; Gold and Rader, 1967). Gold and Rader (1967) conceded that the transient nature of certain speech sounds may require an LP filter cutoff wider than 25 Hz. An averaged spectral analysis of the type carried out by Rader (1963), however, clearly favours the longer speech segments such as vowels and under-represents the often very short transients of stops.
In the present study, for any individual vocoder configuration an identical LP filter is used on the output of all analysis channels ensuring that all frequency components have the same time resolution. In a practical speech transmission system, such a filter would act as a sampling filter and would determine the lowest sampling rate possible before aliasing errors occur.
With the exception of the vocoders utilised in the time resolution perception experiment (see section 2.2), all channel vocoder configurations had the same demodulation LP filter. The frequency response of that filter (TR101) is shown in figure 3 and its impulse (time) response is shown in figure 4. The time resolution details of this demodulation filter and of the demodulation filters used in the time resolution experiments are dealt with in section 2.2 below.
Figure 3: Frequency response of the demodulation LP filter used in all configurations except those designed to test time resolution.
Figure 4: Time (impulse) response of the demodulation LP filter used in all configurations except those designed to test time resolution.
A problem that arises whenever a long sequence is processed is that the FFT operations must be done in a series of segments (sequences of equal length). It is not simply a matter of stringing these transformed segments together, because an input string with a length M becomes a string of length M + 2L (L is the filter half-length) following the FFT. This is because when the centre sample of the filter is lined up with the first sample of the signal during convolution, half of the filter is adjacent to signal and half is effectively adjacent to zeros. At least some part of the filter will be adjacent to zeros until the centre value of the filter has been slid a half-length into the signal. The same applies to the other end of the transform. Therefore, the completed transform of that segment will only contain M - 2L samples which have not had zeros contributing to their value. Two algorithms cope with the problem of how to add together such segments (see Cappellini et al, 1978, Ch. 6). The select-save algorithm simply discards all but the M - 2L uncorrupted samples and slides the segment back L samples from the last end-point before doing the next transform. These saved segments are then strung together. The overlap-add algorithm simply adds the overlaps together after each transform. It does not need to slide back L samples before the next FFT. The overlap-add algorithm is used in this vocoder.
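The overlap-add idea can be sketched as follows. For self-containment this sketch convolves each block directly in the time domain, where the vocoder described here would use an FFT; the block length, toy filter and test signal are arbitrary illustrative choices.

```python
def convolve(signal, taps):
    """Direct linear convolution; output length is len(signal) + len(taps) - 1."""
    out = [0.0] * (len(signal) + len(taps) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(taps):
            out[i + j] += s * h
    return out

def overlap_add(signal, taps, block_len):
    """Filter `signal` block by block, adding each block's tail into the next."""
    out = [0.0] * (len(signal) + len(taps) - 1)
    for start in range(0, len(signal), block_len):
        block = signal[start:start + block_len]
        # An FFT-based implementation would transform, multiply and invert here.
        for k, y in enumerate(convolve(block, taps)):
            out[start + k] += y
    return out

taps = [0.25, 0.5, 0.25]                            # toy filter, half-length L = 1
signal = [float((7 * n) % 11) for n in range(50)]   # arbitrary test signal, M = 50
blockwise = overlap_add(signal, taps, block_len=16)
direct = convolve(signal, taps)                     # reference: one long convolution
```

Each filtered block is 2L samples longer than its input, and it is exactly these extra samples that overlap the neighbouring block and are summed, so the block-wise result matches a single long convolution.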
1.3 Source and Pitch Detection
Although the design and testing of pitch algorithms is not a central concern of this study, it is "extremely important to have good F0 measurement and voicing decisions." (Holmes, 1980, p54). Gold and Tierney (1963) noted that although the short-time spectrum of a perfectly periodic narrow pulse train has a flat spectral envelope, a narrow pulse train with a non-constant period, as occurs in normal speech, has "pitch-induced spectral distortion". They showed that the effect of a 10% periodic perturbation of pitch from one pulse to the next can reduce the power of some harmonics (in what should be a flat spectrum) by as much as 90%. "Since such perturbations are not unusual, one is led to suspect that appreciable spectral distortion of the synthetic speech may result from nonconstant pitch effects" (ibid, p730).
There are a vast number of pitch algorithms (see Hess, 1983) which deal with the problem of pitch extraction. Perhaps one of the simplest involves the use of a LP filter which rejects higher harmonics, and then some sort of peak detector or threshold crossing detector (see Hess, 1983, pp162-166, or Schroeder, 1966, p729). Although this method has the advantage of simplicity, its implementation is often very complex since the human pitch ranges from "...below 50 Hz for low-pitched adult male speakers to above 500 Hz for children" (Schroeder, ibid). No single LP filter can select the fundamental and reject all higher harmonics over the entire range, and so automatic fundamental tracking and filter selection has been attempted.(ibid) For the present study, all speech tokens are spoken by one male speaker, and the pitch of all the tokens ranges from 80 Hz to 140 Hz.
The algorithm simply requires the speech to be passed through an LP filter with a cutoff of 140 Hz and reasonable attenuation at 160 Hz (to avoid passing any higher harmonics). This is then passed through Schmitt-triggered threshold crossing logic to obtain the positive-going zero crossing of each pitch cycle. Because of the slight distortion of the fundamental wave (probably the result of some of the second harmonic being passed), the resultant pitch contour is very rough, with up to a 10% periodic waver in it. This gives output speech with a very rough quality, and so a simple averaging smoothing algorithm has been added to give an acceptable and reasonably accurate smooth pitch contour.
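A minimal sketch of this kind of pitch tracker is given below. It assumes the lowpass filtering has already been done, and the hysteresis level, the three-point moving average and the synthetic test wave are illustrative assumptions rather than the settings used in the study.

```python
import math

def schmitt_crossings(samples, fs, hysteresis):
    """Times (s) of positive-going crossings. The trigger is armed once the
    signal drops below -hysteresis and fires when it next exceeds +hysteresis."""
    times, armed = [], False
    for n, x in enumerate(samples):
        if x < -hysteresis:
            armed = True
        elif armed and x > hysteresis:
            times.append(n / fs)
            armed = False
    return times

def pitch_contour(times):
    """One F0 estimate (Hz) per pitch period between successive crossings."""
    return [1.0 / (b - a) for a, b in zip(times, times[1:])]

def smooth(values, width=3):
    """Simple moving average; the window is shortened at the ends."""
    half = width // 2
    out = []
    for i in range(len(values)):
        window = values[max(0, i - half):i + half + 1]
        out.append(sum(window) / len(window))
    return out

FS = 10_000.0  # assumed sampling rate (Hz)
# Toy input: a 100 Hz fundamental with some second harmonic leaking through.
wave = [math.sin(2 * math.pi * 100 * n / FS)
        + 0.2 * math.sin(2 * math.pi * 200 * n / FS + 1.0)
        for n in range(5000)]
f0 = smooth(pitch_contour(schmitt_crossings(wave, FS, hysteresis=0.1)))
```

The hysteresis prevents small ripples near zero from producing spurious extra crossings within one pitch cycle, and the averaging step plays the role of the smoothing described above.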
The voiced/voiceless decision algorithm used in this vocoder outputs one of four decision values.
- 0 no speech
- 1 voiceless
- 2 mixed
- 3 voiced
Three input analysis signals are used in this algorithm.
- summed energy of all demodulated channels. (E:tot)
- summed energy of the demodulated channels above 3750 Hz. (E:high)
- demodulated pitch filter output. (E:low)
These signals are updated every 0.1 msecs.
First, a speech/no-speech decision is made by comparing the value E:tot to a heuristic value (it takes a higher level to turn the speech on, than it does to turn it off). If speech is detected, then a similar heuristic voicing/no-voicing decision is made. This decision is based on the level of E:low. If voicing is detected, then the ratio of E:high / E:tot is compared to a fixed level (determined for this speaker by experiment) and if it exceeds this level, then it is assumed that the excitation is mixed. Otherwise, it is assumed that the excitation is voiced only.
This algorithm has proven to be very reliable when used for a single subject on the limited set of tokens that are utilised in this study.
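The decision logic described above might be sketched as follows. All threshold values are hypothetical placeholders (the study determined its levels experimentally for one speaker), and the use of on/off hysteresis for the voicing decision is an assumption based on its description as "similar" to the speech decision.

```python
NO_SPEECH, VOICELESS, MIXED, VOICED = 0, 1, 2, 3

# Hypothetical thresholds, for illustration only.
SPEECH_ON, SPEECH_OFF = 0.02, 0.01         # E:tot levels (on > off gives hysteresis)
VOICING_ON, VOICING_OFF = 0.005, 0.0025    # E:low levels
MIXED_RATIO = 0.3                          # E:high / E:tot above this means mixed

class VoicingDetector:
    """Keeps the speech and voicing states between successive updates."""

    def __init__(self):
        self.speech = False
        self.voicing = False

    def step(self, e_tot, e_high, e_low):
        """Classify one update from the three analysis signals."""
        # A higher level is needed to switch speech on than to keep it on.
        self.speech = e_tot > (SPEECH_OFF if self.speech else SPEECH_ON)
        if not self.speech:
            self.voicing = False
            return NO_SPEECH
        self.voicing = e_low > (VOICING_OFF if self.voicing else VOICING_ON)
        if not self.voicing:
            return VOICELESS
        return MIXED if e_high / e_tot > MIXED_RATIO else VOICED

detector = VoicingDetector()
frames = [(0.0, 0.0, 0.0), (0.015, 0.0, 0.0), (0.05, 0.001, 0.01),
          (0.015, 0.001, 0.003), (0.05, 0.03, 0.01), (0.05, 0.04, 0.0),
          (0.005, 0.0, 0.0)]
decisions = [detector.step(*frame) for frame in frames]
```

In this toy sequence the second frame stays "no speech" because it is below the switch-on level, whereas the fourth frame, at the same total energy, stays "voiced" because speech is already on.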
2 Parametric Scaling
Mannell (1994) conducted a study which simulated various effects of parametric distortion (frequency, time, intensity and phase) on the perception of speech. This study had two goals. One goal was to examine the extent to which a channel vocoder's parameters could be distorted before a degradation in speech intelligibility occurred. Another goal was to examine this degradation from an auditory perspective, by carrying out many of these distortions using auditory scales. The results of this study have been reported in Clark and Mannell (1988), Mannell et al (1985), Mannell and Clark (1991) and Mannell (1994).
The details of how the channel vocoder design was manipulated to produce these effects are included here to illustrate the flexibility of a channel vocoder in simulating auditory processing of speech.
One of the aims of this study was to examine the perceptual effects of varying frequency representations. This is achieved by manipulating the frequency characteristics of the channel band pass filters of the channel vocoder (see section 1.1.1). This scaling resulted in a number of vocoder configurations of varying frequency resolution on two frequency scales, the Hertz and the Bark scales.
The first set of filters uniformly divided the 0 - 5 kHz band into 6, 12, 24 and 48 equal Hz-scale bandwidth bands. These uniform Hz-scaled filters were as follows:-
| Number of Filters | Bandwidth (Hz) | Time Resolution (ms) |
Table 1 Frequency and time specifications of the Hz-scaled channel filter banks.
For convenience the four sets of filters will henceforth be referred to as 100, 200, 400 and 800 Hz filters.
The second series of filters is modelled on auditory critical bandwidths. The critical band filter banks are as follows:-
| Number of Filters | Bandwidth (Bark) | Max Time Resolution (ms) |
Table 2 Frequency and time specifications of the Bark-scaled channel filter banks
The base bandwidth in a one Bark filter bank is 100 Hz. This is the bandwidth of the lowest frequency bands of an auditory filter (see figure 3). The bandwidth of a 0.75, 1.5, 2.0 and 3.0 Bark baseband is therefore 75, 150, 200 and 300 Hz respectively. The maximum (ie. worst case) time resolution is therefore calculated using this bandwidth value, as all higher frequency filters, having a wider bandwidth, will have a lower (better) time resolution. It must be noted that the worst case time resolution of the 0.75 Bark filter bank is actually worse than that of the sampling (time resolution) filter utilised in all the critical band vocoder studies. This means that the time resolution of this system (for the lower frequency channels) is defined by the channel filter and not the LP sampling filter. The error should be minimal, however, as a difference in time resolution of 3.3 msecs will be shown in the intelligibility study to be insignificant. These filters are also designed to cross the adjacent filters at -6 dB and the next-to-adjacent filters at -60 dB, giving a flat summation at 0 dB. Because the adjacent filters are not of the same bandwidth (in Hz), these filters are not symmetrical on the Hz scale although they are perfectly symmetrical on the Bark scale. Each filter is produced from a separate LP filter which is multiplied by the appropriate centre frequency.
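The time resolution figures for both filter bank series follow from the relation given in section 1.1.1 (t = 1 / 2Fc, with bandwidth = 2Fc), and can be worked through in a short sketch. The values below are derived from that relation, not copied from the tables, and the 10 ms figure for the sampling filter is an inference from the 3.3 ms difference quoted above.

```python
def time_resolution_ms(bandwidth_hz):
    """t = 1 / (2 * Fc) with bandwidth = 2 * Fc, expressed in milliseconds."""
    return 1000.0 / bandwidth_hz

# Hz-scaled banks: 100, 200, 400 and 800 Hz channel bandwidths.
hz_banks = {bw: time_resolution_ms(bw) for bw in (100, 200, 400, 800)}

# Bark-scaled banks: worst case is set by the narrowest (lowest) band.
BASE_BARK_HZ = 100.0   # bandwidth of the lowest 1 Bark band, as stated above
bark_banks = {width: time_resolution_ms(width * BASE_BARK_HZ)
              for width in (0.75, 1.0, 1.5, 2.0, 3.0)}

# Excess of the 0.75 Bark worst case over an assumed 10 ms sampling filter.
excess_ms = bark_banks[0.75] - 10.0
```

Under these assumptions the 0.75 Bark bank has a worst-case resolution of about 13.3 ms, which exceeds 10 ms by the 3.3 ms mentioned in the text.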
The following approximation (Ostry, pers. comm.) of the critical band curve was used to produce these filters and is described by the formula
where
BWµ = bandwidth at µ
x0 = 2.4491723 (log10 281.3 Hz)
B = 2 (log10 100 Hz)
Figure 5: Comparison of a 1 Bark BP filter with an auditory filter shape (3 parameter rounded exponential: ROEX(p,t,w)) as defined by Patterson et al (1982).
As can be seen from figure 5, the bandpass filters utilised in this study are much sharper than the Patterson et al (1982) model of auditory filter shape as defined by a 3 parameter rounded exponential function (ROEX(p,t,w)). In that model, the filter response flattens out at about -40 dB. Also, the bandwidths of the auditory filters are narrower than 1 Bark, as defined by determination of auditory filter equivalent rectangular bandwidths (ERBs). However, for this pair of filters the bandwidths as defined by the -6dB point are almost identical. It is anticipated that although these differences in filter shape might have readily measurable psychoacoustic consequences, the phonetic consequences of the differences between these two sets of filter shapes will be small.
The effect of changes in the time resolution of the vocoder simulation is a major interest of this study. The time resolution is set by passing each channel output through a low pass filter (the demodulation filter). The definition of the effective time resolution enforced by a low pass filter is to some extent arbitrary and different definitions may be appropriate in different applications. Three common definitions, based on different but related aspects of the filter characteristics might be considered:
(i) Sampling theorem: the maximum data update interval which is able to fully specify a signal limited to a -3 dB LP cutoff frequency Fc-3 is:-
t = 1 / (2 Fc-3)
(ii) Step response: the rise time between two arbitrary points on the filter step response (typically 10% to 90% of the final value). For an ideal lowpass filter with -3 dB LP cutoff frequency Fc-3 this is approximately:-
t = 1 / (2.2 Fc-3)
(iii) Resolvability measure: the time between two input impulses which result in just distinguishable peaks in the filter output. For an ideal lowpass filter with -3 dB LP cutoff frequency Fc-3 this time is approximately:-
t = 1 / (1.5 Fc-3)
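The three definitions give noticeably different values for the same filter, as a small sketch shows. The 25 Hz cutoff is simply the conventional demodulation cutoff mentioned in section 1.1.2, used here as an example input.

```python
def time_resolution_estimates(fc_3db_hz):
    """The three estimates above, in ms, for an ideal lowpass filter."""
    return {
        "sampling_theorem": 1000.0 / (2.0 * fc_3db_hz),
        "step_response": 1000.0 / (2.2 * fc_3db_hz),
        "resolvability": 1000.0 / (1.5 * fc_3db_hz),
    }

estimates = time_resolution_estimates(25.0)
```

For a 25 Hz cutoff the estimates are 20 ms, about 18.2 ms and about 26.7 ms respectively, which illustrates why the choice of definition is to some extent arbitrary.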
For channel vocoders, the sampling theorem definition would probably be the most appropriate, as it specifies the lowest data rate which can be used to transmit the channel amplitude data without loss of information. In any case, the LP filters utilised in this study are not ideal lowpass filters but are digital filters designed to have a flat pass band (ie. pass band ripple < 0.1 dB) and also frequency domain main side lobes below -60 dB, and which therefore vary in terms of the number of poles that would be required to realise them. Any measure of time resolution based on the frequency response would (and does) result in values inconsistent with the actual relative impulse response times of the filters. For the purposes of the present study a combination of all three approaches has been taken. Firstly, a set of filters was defined as a function of their bandwidth at -60 dB (Fc-60), rather than at -3dB (Fc-3). A notional temporal response time t-60 was then given to each based on the formula:-
t-60 = 1 / (2 Fc-60)
Figure 6: Two auditory temporal responses as determined by forward and backward masking thresholds (Fastl, 1976) separated by a typical gap detection threshold of 8 ms.
This produced a set of filters with t-60 = 10, 20, 40, 60 and 80 ms. These values, however, don't necessarily relate closely to any auditorily modelled notion of time resolution. What seems to be most relevant is some sort of resolvability measure, or the time between two input impulses or two bands of noise that result in just distinguishable separate events after being passed through the filter. Such a definition of resolvability is analogous to a model of auditory gap detection. One approach to determining such a resolvability measure for each filter would involve the filtering of gaps of varying duration in noise bands and examining the extent to which such gaps are resolved by normally hearing subjects. The gap threshold would be the pre-filtered gap duration just perceived after passing the signal through both the LP filter and the human auditory system. Filters with the same or better resolution as the auditory system should receive gap thresholds equivalent to that of the auditory system.
An alternative approach to defining the temporal resolution of the LP filters, and one of the approaches taken for this study, is based on a model of gap detection and its relationship to forward and backward masking. Fastl (1976) measured forward and backward masking functions and suggested that both are simultaneously involved in the perception of tones situated in gaps in critical band noise, with the preceding noise band forward masking and the following noise band backward masking the tone. Fastl's (ibid) model (which didn't add together the combined effects of forward and backward masking) predicted tone thresholds that were systematically lower than measured thresholds for gaps of 10 ms and he suggested that this would be due to interactions between forward and backward masking. It is equally likely that, for small gap durations, forward and backward masking would interact to mask the gap to a greater extent than either could alone. Figure 6 shows the temporal (impulse) response of two idealised auditory filters as defined by Fastl's forward and backward masking results separated by a typical gap detection threshold of 8 ms. It can be seen that the two impulse responses overlap at -4 dB. If the two responses are added together, this results in a dip of 1 dB. Zwicker's (1970) model suggests that if a pair of auditorily processed signals differ by 1 dB at any point in the spectrum then the two signals are discriminable. If that model can be extended to this type of temporal sequence then it may provide the basis for a convenient model of auditory gap detection.
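The relationship between a -4 dB crossover and a 1 dB dip can be checked with a one-line calculation, on the assumption (not stated explicitly in the text) that the two responses add in power rather than in pressure.

```python
import math

def power_sum_db(level_a_db, level_b_db):
    """Level (dB) of two components added in power (intensity)."""
    return 10.0 * math.log10(10 ** (level_a_db / 10.0) + 10 ** (level_b_db / 10.0))

dip_db = power_sum_db(-4.0, -4.0)   # about -0.99 dB relative to either peak
```

Two components at -4 dB sum to just under -1 dB on this assumption, which matches the dip described above; coherent pressure addition would instead give a value above 0 dB.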
For the present study it is assumed that the -4 dB width of the impulse response of a LP filter is a reasonable prediction of that filter system's time resolution and is analogous to auditory filter gap detection thresholds. The -4 dB width is exactly the peak separation between two such identical impulse responses that would result in response cross-over at -4 dB. This hypothesis will be tested perceptually, in the manner outlined above, in an extension of the present study. For the present study, however, each of the LP filters used in the time resolution experiments was characterised by both the notional temporal response, t-60, as well as by this model of gap detection, tGap. A third measure of time resolution was also used which measures the 10% to 89% (-20 dB to -1 dB) temporal step response, tStep. Figure 7 shows the impulse responses of the five LP filters utilised in this study.
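Both impulse-response-derived measures can be computed directly from a sampled response, as in the sketch below. The Gaussian response is a stand-in chosen because its widths are known analytically; it is not one of the study's filters, and the 0.1 ms sample interval and 5 ms standard deviation are arbitrary assumptions.

```python
import math

def width_at_level(response, dt, level_db=-4.0):
    """Width (s) of a single-peaked impulse response between the points where
    it falls to `level_db` relative to its peak (20*log10 amplitude scale)."""
    peak = max(range(len(response)), key=lambda i: abs(response[i]))
    target = abs(response[peak]) * 10 ** (level_db / 20.0)

    def crossing(step):
        i = peak
        while 0 <= i + step < len(response) and abs(response[i + step]) >= target:
            i += step
        j = i + step
        if not 0 <= j < len(response):
            raise ValueError("response never falls below the requested level")
        # Linear interpolation between the samples straddling the target.
        frac = (abs(response[i]) - target) / (abs(response[i]) - abs(response[j]))
        return i + step * frac

    return (crossing(+1) - crossing(-1)) * dt

def step_rise_time(response, dt, low=0.10, high=0.89):
    """Time (s) for the running sum of the response (its step response) to go
    from `low` to `high` of its final value."""
    total, running, t_low, t_high = sum(response), 0.0, None, None
    for n, h in enumerate(response):
        running += h
        if t_low is None and running >= low * total:
            t_low = n * dt
        if t_high is None and running >= high * total:
            t_high = n * dt
    return t_high - t_low

DT = 1e-4       # assumed sample interval (s)
SIGMA = 5e-3    # assumed width parameter of the stand-in Gaussian response (s)
gauss = [math.exp(-0.5 * ((n * DT - 0.04) / SIGMA) ** 2) for n in range(801)]
t_gap = width_at_level(gauss, DT)
t_step = step_rise_time(gauss, DT)
```

For a symmetric response, shifting a copy by `t_gap` places the crossover of the two responses at the -4 dB point of each, which is the property the gap detection model relies on.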
Figure 7: The dB-scaled impulse response of the five LP demodulation filters utilised in the time resolution experiments. The filters are, working outwards from the centre, t-60 = 10, 20, 40, 60, 80 ms. The two innermost filters have appreciable side lobes.
| Fc-60 (Hz) | t-60 (ms) | tGap (ms) | tStep (ms) |
Table 3 The -60 dB LP cutoff frequency, the derived notional time resolution (t-60) and two impulse-response-derived time resolution measures, tGap (gap detection model) and tStep (10 - 89% step response).
Close examination of table 3 shows that the tGap and tStep are in close proportional relationship to each other, whilst there is no such close relationship between t-60 and the other two measures of time resolution.
The time resolution experiments were all carried out on a channel vocoder configuration that consisted of 24 uniform 200 Hz bandwidth analysis and synthesis BP channel filters.
The experiments on the perception of intensity scaled speech utilised speech quantised according to three intensity scales, the decibel scale and two auditory intensity scales. The channel vocoder illustrated in figure 1 was modified to incorporate a quantisation module after the analysis BP and LP filters. The intensity quantisation was carried out on the smoothed (demodulated) output of each filter of the 1 Bark analysis filter bank. This output signal is a time function of the changing energy level in each of the 18 x 1 Bark filters. For each quantisation scale, a number of degrees of quantisation coarseness were utilised. There were four stimulus presentation levels (40, 50, 70 and 90 dB SPL ref: 20 µPa).
For the dB scale, the signal was quantised in steps of 0.5, 1, 2, 4, 8 and 16 dB. Because of the linear relationship between dB quantisation and presentation level in dB SPL, the quantisation needed to be performed only once for each degree of quantisation coarseness, and the resulting signals were then presented at each of the experimental presentation levels (eg. a 1 dB quantisation calculated for a 70 dB presentation level is still a 1 dB quantisation when the same signal is presented at 40 dB).
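The level-independence of dB quantisation can be shown with a short sketch. The channel levels below are hypothetical values chosen for illustration, and `quantise_db` is an assumed helper name, not part of the original vocoder software.

```python
def quantise_db(levels_db, step_db):
    """Round each level (in dB) to the nearest multiple of step_db."""
    return [step_db * round(x / step_db) for x in levels_db]

# Hypothetical smoothed channel levels at a 70 dB SPL presentation level.
envelope_70 = [70.0, 66.3, 61.8, 55.2, 48.9]

q70 = quantise_db(envelope_70, 2.0)   # 2 dB quantisation at 70 dB
q40 = [x - 30.0 for x in q70]         # the same signal presented 30 dB lower

# Successive differences are unchanged by the level shift.
diffs_70 = [a - b for a, b in zip(q70, q70[1:])]
diffs_40 = [a - b for a, b in zip(q40, q40[1:])]
```

Attenuating the quantised signal shifts every value by the same number of dB, so the spacing between quantised levels remains a multiple of the original step size.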
The first auditory quantisation scale utilised was based on Reisz's (1928) determination of intensity jnds (ΔI). Reisz provided values for 7 frequencies. Each of these curves (expressed in Pascals) was linearly extrapolated to obtain threshold values (very close in dB to the 10 dB SL values) and higher values (above 80 or 90 dB SPL). This extrapolation appeared to be a reasonable approximation as the Pascal-scaled curves were linear for high intensities and ΔI seemed to have reached asymptote at the lowest intensities for all seven frequencies. Each of these curves was then integrated to obtain a jnd-rate curve representing numbers of jnds above threshold. Each 1 jnd step was then expressed as dB above threshold for that frequency and these values were interpolated to obtain jnds for other frequencies (eg. the 1 Bark filter bank centre frequencies). This provided interpolated jnd step values in terms of dB above threshold, which could easily be converted to dB SPL by adding the values to the threshold for each frequency expressed in dB. The intensity jnd-rate curve for 1000 Hz is shown in figure 8 and a complete set of intensity-jnd-rate contours is shown in figure 9.
Figure 8: Comparison of sone and intensity-jnd-rate curves for 1000 Hz.
Figure 9: Intensity-jnd-rate contours in 1 jnd steps. Derived from Reisz (1928).
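The construction of a jnd-rate curve can be sketched as a step-by-step accumulation of jnds from threshold, which is a discrete stand-in for the integration described above. The jnd function used here (a constant 10% of the current pressure) is purely hypothetical and is not the 1928 data; `jnd_steps_db` is likewise an assumed name.

```python
import math

def jnd_steps_db(delta_p, p_threshold, p_max):
    """Accumulate 1 jnd steps from threshold up to p_max.

    delta_p(p) returns the just noticeable increment (Pa) at pressure p (Pa).
    Returns the level of each successive jnd step in dB above threshold.
    """
    steps = [0.0]
    p = p_threshold
    while True:
        p = p + delta_p(p)
        if p > p_max:
            break
        steps.append(20.0 * math.log10(p / p_threshold))
    return steps

# Hypothetical jnd function: a constant 10% of the current pressure.
toy_delta_p = lambda p: 0.1 * p

P0 = 20e-6                                      # Pa, threshold assumed for this toy example
steps = jnd_steps_db(toy_delta_p, P0, 10 * P0)  # first 20 dB above threshold
```

With a constant proportional jnd every step has the same size in dB (about 0.83 dB here). With measured jnd curves the steps would vary with level and frequency, which is why the jnd scale is not linearly related to dB.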
The speech tokens were quantised according to 6 degrees of coarseness on the intensity-jnd-rate scale, being 1, 2, 4, 8, 16 and 32 jnd steps. The jnd values for each filter were derived from the curves in figure 9 as appropriate to the centre frequencies of each 1 Bark channel filter. For each of the four presentation levels (40, 50, 70, 90 dB SPL) the jnd quantisation needed to be recalculated as there is a non-linear relationship between intensity-jnds and presentation level in dB SPL. For example, a 1 jnd quantisation step calculated for a presentation level of 70 dB would not remain a 1 jnd quantisation step if that speech token were to be presented at 50 or 90 dB SPL. This meant that there should be 24 sets of jnd-quantised tokens (6 quantisation step sizes X 4 presentation levels). There were, however, actually 21 sets of jnd-quantised tokens as it was found that the coarsest quantisations resulted in 1 bit quantisations at the two lowest presentation levels. The actual sets of tokens produced are summarised in table 4.
|Quantisation step size (Intensity jnds)|
Table 4 Intensity-jnd quantised token sets as a function of quantisation step size and presentation level.
The second auditory quantisation scale utilised was the loudness or sone scale. This scale was derived from the loudness level or phon scale curves. The sone scale can be derived from the phon scale using formula VII. In light of the most recent loudness model of Stevens (Mark VII, Stevens, 1972), this formula was considered sufficiently accurate to model loudness down to 0.1 sones or 10 phons. Figure 8 shows the intensity (dB SPL) versus sone curve at 1000 Hz, contrasted with the intensity-jnd-rate curve at the same frequency. This procedure produced the set of sone contours shown in figure 10.
Figure 10: Sone contours in 1 sone steps.
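The formula used to convert phons to sones is not reproduced in this text. As an illustration only, the sketch below uses the commonly quoted relation in which loudness doubles for every 10 phon increase and 40 phons corresponds to 1 sone; that this matches the formula used in the study is an assumption. The function names are hypothetical.

```python
import math

def phon_to_sone(phon):
    """Commonly quoted relation: 1 sone at 40 phons, doubling every 10 phons."""
    return 2.0 ** ((phon - 40.0) / 10.0)

def sone_to_phon(sone):
    """Inverse of phon_to_sone."""
    return 40.0 + 10.0 * math.log2(sone)

loudness_at_10_phons = phon_to_sone(10.0)   # 0.125 sones
```

Under this relation 10 phons corresponds to 0.125 sones, which is close to the lower limit of "0.1 sones or 10 phons" mentioned above.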
Since the quantisation was carried out on the smoothed (demodulated) time-varying intensity output of 18 Bark filters, the calculation of sones was a relatively simple procedure. The vocoding methodology produced an intensity integration over critical bandwidths during the filtering process and this is a prerequisite to sone calculation. The speech tokens were quantised according to 9 degrees of coarseness on the sone scale, being 0.2, 0.4, 0.8, 1.6, 3.2, 6.4, 12.8, 25.6 and 51.2 sones. The sone values were determined from the curves shown in figure 10 as appropriate to the centre frequencies of each 1 Bark channel. As with the intensity-jnd-rate scale, there is a non-linear relationship between sones and presentation level in dB SPL. As a consequence, for each of the four presentation levels the sone quantisation needed to be recalculated. This would have resulted in 36 sets of tokens, but for the same reasons outlined for the jnd quantisation some of the coarser quantisation steps were missing for the lower presentation levels and there were actually 25 sets of tokens as shown in table 5.
|Quantisation step size (sones)|
Table 5 Sone quantised token sets as a function of quantisation step size and presentation level.
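The dependence of sone quantisation on presentation level can be illustrated with a small sketch. It is restricted to 1000 Hz, where phons equal dB SPL by definition, and assumes the commonly quoted phon-to-sone relation (1 sone at 40 phons, doubling every 10 phons), which may differ from the formula used in the study. The envelope values and the name `quantise_sones` are hypothetical.

```python
def quantise_sones(levels_db_spl, step_sones):
    """Quantisation step index of each 1000 Hz level after rounding on the sone scale."""
    return [round(2.0 ** ((x - 40.0) / 10.0) / step_sones) for x in levels_db_spl]

# Hypothetical channel envelope, in dB relative to its peak.
relative_db = [0.0, -5.0, -10.0, -20.0]

# The same envelope quantised in 1.6 sone steps at two presentation levels.
idx40 = quantise_sones([40.0 + r for r in relative_db], 1.6)
idx90 = quantise_sones([90.0 + r for r in relative_db], 1.6)

distinct_40 = len(set(idx40))   # number of quantisation levels actually used at 40 dB
distinct_90 = len(set(idx90))   # number of quantisation levels actually used at 90 dB
```

In this toy case the envelope occupies four distinct quantisation levels at 90 dB SPL but only two at 40 dB SPL, which is the kind of 1 bit quantisation that led to the coarser step sizes being omitted at the lower presentation levels.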
This experiment compared the intelligibility of natural speech with that of vocoded speech under five different channel phase relationships, and was carried out utilising the 1 Bark configuration of the channel vocoder described above. The only difference between the five vocoder configurations was the treatment of phase.
One vocoder ("PZERO") was a typical channel vocoder in that all of its filters had zero cosine phase. The waveform of /i:/ when passed through this vocoder configuration is shown in figure 11D.
A second set of data ("PNAT") was derived from the output of this same vocoder but its synthetic phase spectrum was replaced with the saved original natural phase spectrum. This was possible as the natural input speech and the synthetic output speech of PZERO were precisely time aligned. In figure 11, D represents the speech output by PZERO whilst B represents that same speech with natural phase restored (PNAT). With the exception of the extra high frequency components caused by the width of the higher frequency filters, the waveform produced by the condition PNAT is very similar to that in the original natural signal (A).
The third vocoder condition ("PZDELAY") used filters identical to those of PZERO; however, upon remodulation of the synthetic source with the channel amplitude information, the impulses of the voiced source were delayed by one sample for each filter with increasing centre frequency, so that the 18th filter was delayed by 18 samples or 1.8 msecs. This progressive delay is equivalent to an increasingly negative phase delay as frequency increases. This arrangement conforms approximately to Fant's requirement that "if the phase shift is not linearly related to frequency, there will be separate time delays for separate frequency intervals of the spectrum" (Fant, 1960, p235). Note that only the voiced components of the spectrum can be encoded for phase in this way. This is reasonable, however, since in "...random white noise ... phase is distributed at random throughout the spectrum" (ibid, p235). The waveform shape for the vowel /i:/ is shown in figure 11C and appears closer to natural speech (A) and PNAT (B) than are any of the vocoder outputs shown in figures 11D, 11E or 11F.
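The progressive delay can be sketched as follows. The 10 kHz sampling rate is inferred from the statement that 18 samples correspond to 1.8 msecs; the function names and the example frequency are hypothetical and are not taken from the original vocoder software.

```python
FS = 10_000   # Hz, inferred from "18 samples or 1.8 msecs"

def channel_delay_ms(channel, fs=FS):
    """Delay (ms) applied to a channel: one sample per channel number."""
    return channel * 1000.0 / fs

def delay_signal(x, n):
    """Delay a sampled signal by n samples, padding with zeros and keeping its length."""
    n = min(n, len(x))
    return [0.0] * n + list(x[:len(x) - n])

def delay_phase_deg(f_hz, n, fs=FS):
    """Phase shift (degrees) at frequency f_hz that is equivalent to an n sample delay."""
    return -360.0 * f_hz * n / fs

impulse = [1.0, 0.0, 0.0, 0.0]
delayed = delay_signal(impulse, 2)        # impulse moved two samples later
last_delay = channel_delay_ms(18)         # delay of the 18th channel, ms
```

Because a fixed delay corresponds to a phase shift that grows with frequency, and the delay itself grows with channel number, the resulting phase becomes increasingly negative towards the higher channels.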
The fourth vocoder condition utilised the same filters as before except that their phase was changed. The phase values were chosen on the basis of a statement by Fant (1968) to the effect that "...the phase shifts by the amount -π per formant with increasing frequency" (ibid, p195). A "theoretical" phase spectrum was calculated using this relationship, for an "ideal" neutral vowel with formants at odd multiples of 500 Hz. This produced a linear relationship between phase and frequency and the determined phase angles varied from 0° for the lowest frequency Bark filter to -810° for the highest frequency Bark filter. These values would be the ideal choice for the filter phase values; however, such values are restricted to the range 0° to -360° and so it was necessary to wrap the ideal phase spectrum into this range. The waveform of /i:/ for this vocoder ("P810") is shown in figure 11E.
The fifth vocoder was produced in a similar way to vocoder P810 except that the phase spectrum declined linearly from 0° at 0 Hz to -360° at 5 kHz. The waveform of /i:/ for this vocoder ("P360") is shown in figure 11F.
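The two linear phase spectra can be sketched as follows. The P810 slope is one reading of the description above: formants at odd multiples of 500 Hz are 1000 Hz apart, and a shift of -π (-180 degrees) per formant then gives -180 degrees per 1000 Hz, which reaches -810 degrees at 4500 Hz. The filter centre frequencies are not given in this text, so the frequencies used below are assumptions, as are the function names.

```python
def p810_phase_deg(f_hz):
    """Assumed P810 spectrum: -180 degrees per 1000 Hz (unwrapped)."""
    return -180.0 * f_hz / 1000.0

def p360_phase_deg(f_hz):
    """P360 spectrum: linear from 0 degrees at 0 Hz to -360 degrees at 5 kHz."""
    return -360.0 * f_hz / 5000.0

def wrap_phase_deg(phase):
    """Wrap a non-positive phase angle (degrees) into the range 0 to -360."""
    return -((-phase) % 360.0)

unwrapped = p810_phase_deg(4500.0)     # -810 degrees
wrapped = wrap_phase_deg(unwrapped)    # the same angle wrapped into 0 to -360
```

Wrapping leaves the phase of each filter unchanged as an angle but introduces discontinuities in the phase spectrum wherever the unwrapped value passes a multiple of -360 degrees.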
It is clear, upon examining the waveforms in figure 11, that the PZERO, P810 and P360 vocoders result in waveforms which deviate markedly from the natural speech waveform, whilst the PNAT and PZDELAY more closely resemble the natural speech waveform.
The results of perceptual experiments based on these channel vocoder phase configurations are briefly reviewed by Mannell (1990).
Figure 11: Comparison of (A) the natural waveform of /i:/ with its time-aligned vocoded counterparts with (B) reinstated natural phase, (C) zero phase with sample delay, (D) zero phase, (E) wrapped 0° to -810° phase, (F) 0° to -360° phase.
- Atal B.S., and Schroeder M.R. (1970) "Adaptive predictive coding of speech signals", Bell Sys. Tech. J., 1973-1986.
- Atal B.S., and Hanauer S.L. (1971) "Speech Analysis and Synthesis by linear prediction of the speech wave", JASA 50, 637-655.
- Cappellini V., Constantinides A. & Emiliani P. (1978) Digital Filters and their Applications, Academic Press
- Clark, J.E. & Mannell R.H. (1988) "Some comparative characteristics of uniform and auditorily scaled channel synthesis", Proc. SST-88, 282-287.
- Dudley H.W. (1939) "Remaking speech", JASA 11, 169-177.
- Fant, G. (1960) Acoustic Theory of Speech Production, (Mouton: The Hague, second printing 1970)
- Fant, G. (1968) "Analysis and synthesis of speech processes", in Malmberg, B. (ed.) Manual of Phonetics (North Holland: Amsterdam)
- Fastl, H. (1976) "Temporal masking effects: II. Critical band noise masker", Acustica 36, 317-331
- Flanagan, J.L. & Golden, R.M. (1966) "Phase vocoder", Bell Sys. Tech. J., 1493-1509.
- Gold, B., Blankenship, P.E. & McAulay, R.J. (1981) "New applications of channel vocoders", IEEE Trans. ASSP-29, 13-23
- Gold B. & Tierney J. (1963) "Pitch-induced spectral distortion in channel vocoders", JASA 35, 730-731
- Gold B. & Rader C.M. (1967) "The channel vocoder", IEEE Trans. Audio & Electroacoustics, AU-15, 369-382
- Golden R.M. (1968) "Vocoder filter design: Practical considerations", JASA 43(4), 803-810.
- Hess W. (1983) Pitch Determination of Speech Signals, Springer-Verlag
- Holmes, J.N. (1980) "The JSRU channel vocoder", Proc. IEE, 127, 53-60
- Holmes J.N. (1982) "Formant synthesisers: Cascade or parallel?", JSRU Research Report No 1017, Dec 1982.
- Mannell, R.H., (1990), "The effects of phase information on the intelligibility of channel vocoded speech", Proc. Third Australian International Conference on Speech Science and Technology, Melbourne, November 1990
- Mannell, R.H., (1994), The perceptual and auditory implications of parametric scaling in synthetic speech, Unpublished PhD dissertation, Macquarie University
- Mannell, R.H., Clark J.E., and Ostry D., (1985) "Channel vocoder performance", Working Papers 1985, Speech, Hearing and Language Research Centre, Macquarie University, pp 75-133
- Mannell, R.H. and Clark, J.E., (1990), "The Perceptual consequences of frequency and time domain parametric encoding in automatic analysis and resynthesis of speech", an unpublished paper presented at the International Conference on Tactile Aids, Hearing Aids and Cochlear Implants, National Acoustics Laboratories, Sydney, May 1990.
- Mannell, R.H., & Clark, J.E. (1991), "A comparison of the intelligibility scores of consonants and vowels using channel and formant vocoded speech", in Proc. XII ICPhS
- Mathes R.C., and Miller R.L. (1947) "Phase effects in monaural perception", JASA 19, 780-797.
- Oppenheim A.V., and Lim J.S. (1981) "The importance of phase in signals", Proc IEEE 69, 529-541.
- Oppenheim, A.V., Lim, J.S., Kopec, G. & Pohlig, S.C. (1979) "Phase in speech and pictures", IEEE, ICASSP 1979, 632-637
- Patterson, R.D., Nimmo-Smith, I., Weber, D.L. & Milroy, R. (1982) "The deterioration of hearing with age: Frequency selectivity, the critical ratio, the audiogram, and speech threshold", JASA 72, 1788-1803
- Rabiner L.R., and Gold B. (1975) Theory and Application of Digital Signal Processing, Prentice-Hall
- Rabiner L.R., and Schafer R.W. (1978) Digital Processing of Speech Signals, Prentice-Hall
- Rader C.M. (1963) "Spectra of vocoder-channel signals", JASA 35, p805(A)
- Rader C.M., and Gold B. (1967) "Digital filter design techniques in the frequency domain", Proc. IEEE 55, 149-171
- Reisz, R.R. (1928) "Differential sensitivity of the ear for pure tones", Phys.Rev. 31, 867-875
- Schroeder M.R. (1959) "New results concerning monaural phase sensitivity", JASA 31, p1597
- Schroeder, M.R. (1966) "Vocoders: Analysis and Synthesis of speech", Proc. IEEE, 54, 720-734
- Stevens, S.S. (1972) "Perceived level of noise by Mark VII and decibels (E)", JASA 51, 575-601
- Zahorian S.A. & Gordy P.E. (1983) "Finite impulse response (FIR) filters for speech analysis and synthesis", IEEE ICASSP-83, 808-811
- Zwicker, E. (1970) "Masking and psychological excitation as consequences of the ear's frequency analysis", In R. Plomp & G.F. Smoorenburg (eds) Frequency Analysis and Periodicity Detection in Hearing, Sijthoff: Leiden (Cited in Moore & Glasberg, 1986)