Department of Linguistics

MU-Talk: Channel Vocoder-based Diphone Concatenation

Robert Mannell

In the mid-1990s, work commenced on the integration of channel-vocoder-based diphone concatenation into the SHLRC TTS system. The techniques adopted were based on prior research by Clark and Mannell, which had produced a substantial amount of quantitative evidence on the relationship between human speech perception and synthesiser design. The most intelligible speech was produced by channel synthesis utilising a model of the human auditory periphery (the Bark frequency scale). The speech produced by such a system was significantly more intelligible (for consonants) than formant-based synthesis. It was anticipated that the quality of the speech produced by such a Bark-scaled channel synthesis system would be superior to that of existing synthesisers based on formant or LPC methods.

Various studies (Holmes 1972, 1982; Klatt 1980) suggested that serial formant processing, whilst an adequate model of vowels (Fant 1960), is intrinsically limited in its ability to model consonants, resulting in a lower consonant intelligibility relative to natural speech than is the case for vowels (Clark 1983). Some of these studies (Holmes 1972, 1982; Klatt 1980) suggested the use of parallel formant filters for at least the consonants, because in such systems it is possible to independently adjust the various formant filter gains and so approximate the complex spectral shapes of the consonants.

Prior to the commencement of this work on a channel diphone concatenation synthesiser, Clark and Mannell carried out an extensive series of perceptual tests (Clark, Mannell & Ostry 1987; Clark & Mannell 1988, 1989; Mannell & Clark 1990, 1991) designed to test the acoustic modelling criteria for speech synthesis systems with a view to better quality and intelligibility. These studies, which involved several hundred listeners, investigated numerous synthesiser configurations implemented as vocoders. We varied synthesiser data rates, frequency resolution (as defined by the number of BP filters in a channel vocoder) and filter bank shapes (uniform vs Bark-scaled), and compared them all with natural speech and with formant vocoded speech produced by the highly regarded JSRU formant vocoder (Rye & Holmes 1982). The best uniform channel filter configuration had 48 channels (100 Hz bandwidth) whilst the best Bark-scaled channel filter configuration had 18 channels (1 Bark bandwidth). It was found that the best channel filter configurations (uniform Hz-scaled and Bark-scaled) and the formant vocoder performed very well for vowels, with intelligibility scores closely approaching natural vowel intelligibility. The situation for consonants was, however, quite different. Both uniform and Bark-scaled channel synthesisers produced consonants which were significantly more intelligible than formant synthesiser consonants. Furthermore, for some phonetic classes (ie. nasals and liquids heard mixed with noise) the best uniform channel vocoder configuration was inferior to the best Bark-scaled vocoder. Not only was the performance of the best Bark-scaled vocoder tested superior to that of the best uniform vocoder tested, but it also used fewer channels, making it even more desirable in a synthesis system for reasons of computational, data storage and data transmission efficiency. The improved performance of auditorily modelled filter systems in synthesis parallels promising results for the use of auditory models in speech recognition (Ghitza 1986, Seneff 1988, Samouelian & Summerfield 1988, Hamada et al 1989).

The advantage of using vocoder systems in examining the intrinsic behaviour of spectrum versus formant synthesis is that the effect of synthesis rules is minimised. There is, of course, a difficulty in basing such claims on the performance of one formant synthesiser (even one as well regarded as the JSRU system). There is always the possibility that formant tracking algorithms are not as accurate as they might be. Anyone who has attempted to devise a foolproof formant tracker will realise how difficult this is and how much logic is required to handle difficult cases. There have, however, been no reports of formant synthesisers that have been able to produce consonants that reasonably approach natural speech in intelligibility. The attraction of channel or spectrum synthesis is that it makes no a priori assumptions about the number or importance of spectral peaks in either vowels or consonants, and so there is no need to develop complex algorithms designed to find the most important peaks. Further, such systems can be shown to have the intrinsic ability to produce both vowels and consonants which approach natural intelligibility. Such intrinsic ability remains to be demonstrated for formant systems.
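
As an illustration of the filter bank geometry implied by these results, the sketch below (not the original MU-Talk code) maps frequency in Hz onto the Bark scale using the Zwicker & Terhardt (1980) analytic approximation and derives the band edges of an 18-channel filter bank with one channel per Bark. The particular Bark formula and the overall frequency range are assumptions made here for the example only.

```python
import numpy as np

def hz_to_bark(f):
    """Zwicker & Terhardt (1980) approximation of the Bark scale."""
    f = np.asarray(f, dtype=float)
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def bark_to_hz(z):
    """Invert hz_to_bark numerically on a dense grid (adequate for band edges)."""
    grid = np.linspace(0.0, 16000.0, 160001)        # 0.1 Hz resolution
    return np.interp(z, hz_to_bark(grid), grid)

# Band edges of an 18-channel filter bank, one Bark per channel.
edges_bark = np.arange(19)                           # 0, 1, ..., 18 Bark
edges_hz = bark_to_hz(edges_bark)

for ch, (lo, hi) in enumerate(zip(edges_hz[:-1], edges_hz[1:]), start=1):
    print(f"channel {ch:2d}: {lo:6.0f} - {hi:6.0f} Hz")
```

With this approximation the 18 one-Bark channels span roughly 0 to 4.5 kHz, of the same order as the 48 x 100 Hz uniform configuration.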

Word, syllable and diphone or demi-syllable concatenation synthesis systems are not new (Olive & Nakatani 1974, Olive 1977, Shadle & Atal 1978). Most such systems utilise formant synthesis methods (eg. ten Bosch et al 1989), LPC methods (eg. Rodet & Depalle 1985, Stella & Charpentier 1985) or waveform concatenation (Charpentier et al 1986, 1989). Unless LPC systems utilise a large number of coefficients they suffer from one of the drawbacks of formant systems: that is, they make a priori assumptions about the number of poles required to model the speech adequately. Further, LPC analysis is limited in its ability to model anti-resonances (but see Markel & Gray 1976, pp 271-275). LPC analysis also makes no assumptions about the perceptual importance of peaks in various parts of the spectrum: two peaks which are prominent in the 0-4 kHz band will be treated identically to two equally prominent and equally separated peaks in the 0-1 kHz band. In other words, LPC systems are not normally modelled on the characteristics of the auditory system (although conceivably they could be weighted to simulate auditory models). The diphone concatenation system of Charpentier et al (1986, 1989) utilises pitch-synchronous overlapping frames stored as 512-point FFTs which are added together using the overlap-add algorithm (also used in our channel vocoder). Charpentier's method is analogous to a uniform filterbank method consisting of 256 uniformly spaced BP filters and, in a similar fashion to the LPC methods, makes no attempt to utilise an auditory model. In consequence, the size of Charpentier's diphone library is approximately 7 Mbytes, much of which consists of auditorily redundant data.

A synthesis system based on a 1 Bark BP filter bank and a 10 msec update rate (see Clark, Mannell & Ostry 1987 for data on system time constraints) appears to offer the best intrinsic potential performance available. Clearly, the big problem in such a system is to devise data digitisation procedures and concatenation and prosody rules which would permit such a system to achieve its potential performance. Fortunately, a large proportion of the prosody rules developed for our existing text-to-speech systems was transportable to this new system. The main difficulty with the prosody would be the implementation of timing rules which take into account the non-uniform compression and expansion which occurs in speech of different overall rates, and in syllables which attract tonic stress compared with those which do not.

The original aim of this project was the investigation of auditory-model-based word concatenation. The possibility of developing this system by utilising the concatenation of non-uniform synthesis units (eg. ten Bosch et al 1989, Sagisaki 1988) was subsequently investigated. It was intended that the coverage of the system be expanded by including a diphone or demi-syllable database plus diphone / demi-syllable concatenation rules. Several possible approaches were considered here. One approach was to attempt to model all possible diphones or demi-syllables. Another possible solution was to use the most common diphones or demi-syllables and solve the remaining problem sequences by utilising phoneme concatenation rules (see ten Bosch et al op.cit.). In the end only word and diphone concatenation was attempted. It quickly became apparent that the word concatenation model was more complex than originally realised. This was due to the need to provide for additional morphemes in inflected or derivational forms of the words in the word database. It was further complicated by the need to deal with word-to-word transitions and with connected speech processes (assimilation, deletion, reduction, insertion) across word boundaries. In the end the non-uniform concatenation approach was shelved in favour of a diphone concatenation system, as this represented the simplest system to implement.

By about 1996 a nearly complete set of diphones had been collected. An initial test version of this system was connected to the TTS system and was found to produce variable speech quality. Much of this variability seems to be related to the effects of the dynamic time warping of the data which is needed to produce the phoneme durations determined by the TTS system. Tests showed that this accounted for about two thirds of the loss of quality, and that the variability in speech quality was related to the amount of time warping. The other third of the quality loss seems to be related to the concatenation algorithm, which simply abutted adjacent diphones and then smoothed the transition between them.

In order for natural speech rhythm to be achieved, dynamic time warping is required, but for best results this should not involve the uniform stretching or compression of a phoneme. Some parts of most phonemes are highly expandable or compressible (eg. vowel and consonant targets, stop occlusions) whilst other parts are not (eg. transitions, stop bursts). To achieve non-uniform time warping within each phoneme it was considered necessary to further segment the data to indicate transitions, stop bursts, etc., which require different time-warping rules from vowel targets, stop occlusions, etc.
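
The following sketch (hypothetical code, not part of MU-Talk) shows one simple way such non-uniform warping could be implemented: the overall change in phoneme duration is distributed only over segments labelled as elastic (targets, occlusions), while segments labelled as inelastic (transitions, bursts) keep their original lengths, and each segment's channel-parameter frames are then linearly resampled. The function name, the segment representation and the proportional allocation rule are all assumptions for illustration.

```python
import numpy as np

def warp_phoneme(frames, segments, target_len):
    """Non-uniformly time-warp one phoneme's channel-parameter frames.

    frames     : (n_frames, n_channels) array of channel parameters (10 ms frames)
    segments   : list of (start, end, elastic) tuples in frame indices; 'elastic'
                 is True for stretchable regions (targets, occlusions) and False
                 for regions kept near their original length (bursts, transitions)
    target_len : desired total number of output frames
    Assumes at least one elastic segment.
    """
    orig_len = frames.shape[0]
    lengths = np.array([end - start for start, end, _ in segments], dtype=float)
    elastic = np.array([e for _, _, e in segments], dtype=bool)

    # Distribute the overall duration change over the elastic segments only,
    # in proportion to their original lengths.
    delta = target_len - orig_len
    weights = np.where(elastic, lengths, 0.0)
    new_lengths = lengths + delta * weights / weights.sum()
    new_lengths = np.maximum(np.round(new_lengths).astype(int), 1)

    # Linearly resample each segment to its new length and reassemble.
    pieces = []
    for (start, end, _), n_new in zip(segments, new_lengths):
        seg = frames[start:end]
        src = np.linspace(0.0, seg.shape[0] - 1, n_new)   # fractional frame indices
        idx = np.arange(seg.shape[0])
        pieces.append(np.stack([np.interp(src, idx, seg[:, c])
                                for c in range(seg.shape[1])], axis=1))
    return np.vstack(pieces)
```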

Once concatenation and compression or expansion have been achieved and the data restored to the system's 10 msec update rate, the remainder of the system behaves in an identical fashion to the resynthesis end of the original channel vocoder.

No work has been done on the channel concatenation component of MU-Talk for a few years now, owing to lack of time. The channel concatenation module forms part of the MU-Talk system, but has been deactivated for now.

Structure of the Channel Diphone Module

The structure of the channel diphone module is effectively a channel vocoder broken into two halves.

The first half analyses input speech into 18 band-pass channel parameter streams, passes each of these through a low-pass filter to remove source information and then segments the channel parameter stream into diphones which are stored for later use. This procedure is summarised in figure 1.
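
A minimal sketch of this analysis stage is given below, assuming standard Butterworth band-pass filters, full-wave rectification and a low-pass envelope filter followed by decimation to a 10 msec frame rate. The actual filter orders, envelope cutoff and the diphone segmentation step used in the original analysis software are not specified above, so the values and function name here are illustrative only.

```python
import numpy as np
from scipy.signal import butter, lfilter

def analyse_channels(speech, fs, edges_hz, frame_ms=10, env_cutoff_hz=50.0):
    """Analysis half of a channel vocoder (sketch).

    Band-pass filter the speech into channels, rectify and low-pass filter each
    channel to obtain a slowly varying amplitude envelope (removing source
    detail), then sample the envelopes once every frame_ms milliseconds.
    Returns an array of shape (n_frames, n_channels).
    """
    hop = int(fs * frame_ms / 1000)
    lp_b, lp_a = butter(2, env_cutoff_hz / (fs / 2), btype='low')
    channels = []
    for lo, hi in zip(edges_hz[:-1], edges_hz[1:]):
        lo = max(lo, 1.0)                      # avoid a zero lower edge
        hi = min(hi, 0.99 * fs / 2)            # keep below the Nyquist frequency
        bp_b, bp_a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype='band')
        band = lfilter(bp_b, bp_a, speech)
        envelope = lfilter(lp_b, lp_a, np.abs(band))   # rectify + smooth
        channels.append(envelope[::hop])
    return np.stack(channels, axis=1)
```

The diphones themselves would then be cut from the resulting parameter streams at labelled phoneme-centre boundaries and stored; that segmentation step is not shown.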

Figure 1: Extraction of channel diphones

The re-synthesis component of the channel diphone module is part of the TTS system. It takes channel diphones as its input, dynamically time warps them to match each phoneme's intended duration, concatenates the diphones together (most likely using an overlap-add methodology), re-modulates each channel with a TTS-generated source function, passes this through the re-synthesis band-pass channel filterbank, and then adds the output of the 18 channels to produce a speech waveform. This procedure is summarised in figure 2.
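
The corresponding re-synthesis stage could look something like the sketch below: the 10 msec channel parameters are interpolated back up to the audio sampling rate, used to modulate the TTS-generated source signal, band-limited by the re-synthesis filter for each channel, and summed. Again, the filter design and interpolation choices are assumptions for the example; only the overall signal flow follows the description above.

```python
import numpy as np
from scipy.signal import butter, lfilter

def resynthesize(params, source, fs, edges_hz, frame_ms=10):
    """Re-synthesis half of the channel vocoder (sketch).

    params : (n_frames, n_channels) channel parameters at the 10 ms frame rate
    source : TTS-generated excitation at the audio rate (eg. a pulse train for
             voiced speech, noise for voiceless), at least n_frames * hop samples
    Returns the summed speech waveform.
    """
    hop = int(fs * frame_ms / 1000)
    n_samples = params.shape[0] * hop
    frame_times = np.arange(params.shape[0]) * hop
    sample_times = np.arange(n_samples)
    speech = np.zeros(n_samples)
    for ch, (lo, hi) in enumerate(zip(edges_hz[:-1], edges_hz[1:])):
        lo = max(lo, 1.0)
        hi = min(hi, 0.99 * fs / 2)
        envelope = np.interp(sample_times, frame_times, params[:, ch])
        bp_b, bp_a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype='band')
        # Modulate the source by the channel envelope, confine the result to the
        # channel's pass band, and add it to the output waveform.
        speech += lfilter(bp_b, bp_a, envelope * source[:n_samples])
    return speech
```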

Figure 2: Generation of speech using channel diphones

Plans for the Future

  1. Collection of the remainder of the diphone set.
  2. Determination of the optimal level of sub-phonemic labelling required to perform non-uniform time warping of each phoneme. A prerequisite for this is the determination of the time-warping methodology to be utilised. Sub-phonemic labelling is very time-intensive, so it is necessary to ensure that only the features required are labelled. Further, once phoneme labelling has been carried out (as is the case with the existing diphone set) it should be possible to establish some automatic procedures for sub-phonemic labelling (this would still require human confirmation, however).
  3. Examination of different concatenation algorithms. The currently favoured model is an overlap-add approach to channel diphone concatenation, but with initial careful alignment of certain key sub-phonemic features (eg. stop bursts). Both sigmoidal and linear overlap-adding of diphones are being considered (a sketch of the two cross-fades follows this list).
  4. Modelling of the various intensity characteristics of the speech signal, including relative channel intensity, phoneme intensity profiles (eg. peaks, dips in occlusions, sudden changes in bursts, etc.), intensity correlates of prosody, and intensity correlates of vocal affect.
  5. Once these goals have been achieved, then the system can be used to simulate the effects of hearing loss on the perception of speech.
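
The sketch below (hypothetical code) illustrates the two cross-fades mentioned in item 3, applied to channel-parameter frame sequences: a linear ramp and a logistic (sigmoidal) ramp over a short overlap region at the diphone join. The prior alignment of key sub-phonemic features (eg. stop bursts) is not shown, and the overlap length and weighting function are illustrative assumptions.

```python
import numpy as np

def concatenate_diphones(a, b, overlap=3, mode='sigmoid'):
    """Overlap-add concatenation of two diphones' channel-parameter frames.

    a, b    : arrays of shape (n_frames, n_channels)
    overlap : number of frames blended across the join
    mode    : 'linear' or 'sigmoid' cross-fade weighting
    """
    if mode == 'linear':
        w = np.linspace(0.0, 1.0, overlap)
    else:
        # Logistic (sigmoidal) fade rising from ~0 to ~1 over the overlap region.
        w = 1.0 / (1.0 + np.exp(-np.linspace(-6.0, 6.0, overlap)))
    w = w[:, None]                                    # broadcast over channels
    blended = (1.0 - w) * a[-overlap:] + w * b[:overlap]
    return np.vstack([a[:-overlap], blended, b[overlap:]])
```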

References

Bosch, L.F.M. ten, Collier, R., & Boves, L., (1989) "From diphones to allophones: from data to rules", Eurospeech 89, Vol 1, 129-131, Paris, Sept. 1989.

Charpentier, F.J., & Stella, M.G., (1986) "Diphone synthesis using an overlap-add technique for speech waveforms concatenation", ICASSP 86, Tokyo.

Charpentier, F., & Moulines, E., (1989) "Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones", Eurospeech 89, Vol 1, 129-131, Paris, Sept. 1989.

Clark, J.E. (1983) Intelligibility comparisons for two synthetic and one natural speech source, Journal of Phonetics 11, 80-93.

Clark, J.E., Mannell, R.H., & Ostry, D. (1987) Time & frequency constraints on synthetic speech intelligibility Proceedings of 11th International Congress of Phonetic Sciences, 28.2.1-28.2.4, Tallinn, Estonia.

Clark, J.E. & Mannell, R.H., (1988) Some comparative characteristics of uniform and auditorily scaled channel synthesis, Proceedings of SST-88, 22-27, Sydney.

Clark, J.E. & Mannell, R.H., (1989) Frequency resolution effects on phonetic level perception of synthesised speech, Proceedings of the European Speech Communication Association Workshop on Speech Input/Output Assessment and Speech Databases, 1.8.1-4, Noordwijkerhout, the Netherlands, 20-23 September 1989.

Fant, G., (1960) Acoustic Theory of Speech Production, Mouton, The Hague.

Ghitza, O., (1986) "Auditory nerve representation as a front-end for speech recognition in noisy environments", Computer Speech and Language, 1, 109-130.

Hamada, H., Hirahara, T., Imamura, A., Matsoka, T., & Nakatsu, R., (1989) "Auditory-based filterbank analysis as a front-end processor for speech recognition", Eurospeech 89, Vol 2, 396-399, Paris, Sept. 1989.

Holmes, J.N., (1972) Speech Synthesis, Mills & Boon, London.

Holmes, J.N., (1982) "Formant synthesis: Cascade or parallel?" JSRU Research Report No 1017.

Klatt, D., (1980) "Software for a cascade/parallel formant synthesiser", JASA, 67, 971-995.

Mannell, R.H., & Clark, J.E., (1990) The perceptual consequences of frequency and time domain parametric coding in automatic analysis and resynthesis of speech, a paper presented at the International Conference on Tactile Aids, Hearing Aids and Cochlear Implants, National Acoustics Laboratories, Sydney, May 1-3, 1990.

Mannell, R.H., & Clark, J.E., (1991) A comparison of the intelligibility scores of consonants and vowels using channel and formant vocoded speech, Proceedings of the 12th International Congress of Phonetic Sciences, Aix-en-Provence, France, 19-24 August, 1991.

Markel, J.D., & Gray, A.H., (1976) Linear Prediction of Speech, Springer-Verlag, Berlin.

Olive, J.P., & Nakatani, L.H., (1974) "Rule-synthesis of speech by word concatenation: a first step", JASA, 55(3), 660-666.

Olive, J.P., (1977) "Rule synthesis of speech from dyadic units", ICASSP77, 568-570, Hartford.

Rodet, X., & Depalle, P., (1985) "Synthesis by rule: LPC diphones and calculation of formant trajectories", ICASSP 85, 736-739, Tampa.

Rye, J.M. & Holmes, J.N., (1982) "A versatile software parallel-formant speech synthesiser", JSRU Research Report, No 1016.

Sagisaki, Y., (1988) "Speech synthesis by rule using an optimal selection of non-uniform synthesis units", ICASSP 88, 679-682, New York.

Samouelian, A., & Summerfield, C.D., (1988) "Computational model of the peripheral auditory system for speech recognition: Initial results", Proc. SST-88, 234-239, Sydney, Nov. 1988.

Seneff, S., (1988) "A joint synchrony/nerve-rate model of auditory speech processing", J. Phon., 16, 55-76.

Shadle, C.H., & Atal, B.S., (1978) "Speech synthesis by linear interpolation of spectral parameters between dyad boundaries", ICASSP 78, 577-580, Tulsa.

Stella, M.G., & Charpentier, F.J., (1985) "Diphone synthesis using multipulse linear predictive coding and a phase vocoder", ICASSP 85, 740-743, Tampa.