Department of Linguistics

Parallel Formant Synthesis-by-Rule and Diphone-Concatenation utilising data extracted from a Speech Database

Robert Mannell

Speech synthesis and text-to-speech (TTS) development has had a long history at SHLRC, commencing with the work of John Clark in the mid-1970s and continuing through the 1980s with Summerfield and Mannell under ARC and Telecom funding. From mid-1990 to early 1993, under Telstra (formerly OTC) funding, we (Mannell, Clark, Fletcher, Harrington, McVeigh et al.) converted our existing TTS system so that it could be interfaced with the VLSI parallel formant synthesiser developed by Clive Summerfield. It was originally intended that the new system should, like the previous system, be based on parallel formant synthesis-by-rule (SBR) techniques, which would be refined in the new system to produce higher quality output speech.

A very common feature of SBR development programmes is a great reliance on ad-hoc trial-and-error approaches to rule development. It was determined at the outset of the project that we would instead rely upon detailed analysis of a segmented and labelled speech database. The first subject (John Clark) recorded, segmented and labelled as part of the ANDOSL project was selected as our vocal model. One reason for the selection of that voice was that it was the same voice utilised as the vocal model for all of our preceding TTS systems. This facilitated comparison of our new system with our preceding systems. It also allowed the gradual incorporation of upgraded TTS sub-systems alongside older sub-systems without creating a mismatch of vocal characteristics.

It should be pointed out here that, unlike speech recognition systems which attempt to model speech communities, SBR speech synthesis generally has as its goal the modelling of a single speaker. This is an obvious requirement for speech concatenation synthesis. It is possible, however, that the use of single-speaker models for SBR synthesis is a historical legacy arising from the extremely labour-intensive methods of detailed context-sensitive analysis that were utilised in the past. It may be possible to base an SBR model on, for example, a generalised average male speaker of General Australian English through the use of a database such as ANDOSL and a database query system such as MU+. For the present project it was decided to conform to the traditional approach, leaving this question for a possible future project. There were a number of reasons for this. Firstly, the traditional approach is known to work, whereas the alternative multi-speaker generalised-model approach may not prove viable for SBR development, and the goal of the project was to produce a practical TTS system. Secondly, the ANDOSL database was in its infancy and only a very small number of male subjects had been segmented and labelled even at the conclusion of the project. Thirdly, it was desirable to model the same speaker for the SBR system as the one being utilised in a separate diphone concatenation system under development, so that comparisons between the two systems could be carried out more readily.

During the early phase of the project we concentrated on prosodic development (Fletcher and McVeigh), on database development (Harrington et al.) and on software redesign (McVeigh and Mannell). The database was an essential prerequisite for the completion of the prosody modules, which were highly dependent upon analysis of the segmented and labelled connected speech data. Detailed SBR development could also not proceed in a principled way until the database was ready. When that point was reached, about mid-way through the project, we were faced with the need to decide how best to proceed. It had originally been intended that we would utilise the database to extract sufficient context-sensitive data to allow a principled development of the SBR rules. Work by Mannell on another project, however, led us to consider an alternative.

Clark and Mannell had for a number of years been interested in the fundamental ability (or lack of it) of the formant model to produce high quality consonants with intelligibility approaching that of channel vocoded and natural speech. It was necessary to be able to compare our channel vocoder data to the highest possible quality formant-modelled data, and we had previously been forced to rely on the formant vocoders produced by other groups (viz. the JSRU formant vocoder). My intention was to produce a system capable of producing speech utilising an analysis-by-synthesis process whereby formant parameters would be iteratively specified until the least possible frame-by-frame spectral distance between natural and synthetic speech was achieved. The parameters so determined would be the gain and bandwidth parameters (formant frequencies being obtainable much more readily using other methods). The resultant speech was of such high quality that it seemed likely that we could use the method in the TTS system. This approach could be applied to the speech database to provide a detailed set of statistics not only on formant frequencies (as first envisaged) but also on formant gains and bandwidths tuned to the target synthesiser (not previously considered possible). This development therefore extended our options for developing SBR rules from the database to include all necessary synthesiser parameters.

It also seemed desirable to consider the use of this parameter extraction methodology in the production of formant-based diphones (a diphone is a speech token which extends from part way through one phoneme to part way through the next, and so stores the difficult-to-model phoneme-to-phoneme transitions). The development of a formant diphone system is not normally considered practical because of the need for, and extreme difficulty in obtaining, gain and bandwidth data which meaningfully relate to the needs of the target synthesiser. The new parameter extraction procedure made such a process possible. Given that we now had a segmented database, and given the existence of the new extraction procedure, it suddenly became possible to consider formant-based diphone concatenation as an alternative to an SBR system. We felt that we would be able to produce high-quality speech much faster by this approach than by statistically extracting SBR rules from the database (although the latter process remains a very interesting possibility). So, for reasons of the likely much greater speed in the production of high quality TTS speech, we opted for the formant-based diphone-concatenation approach.

The formant parameter extraction procedure for a parallel formant synthesiser system needs to take account of a fairly large number of parameters. For the target synthesiser utilised in the present project the following parameters need to be extracted:-

  1. Formant centre frequencies for F1 to F5 as well as Fn (nasal formant)
  2. Formant bandwidths for F1 to F5 and Fn (B1 to B5 and Bn)
  3. Formant intensities (or gains) for F1 to F5 and Fn (G1 to G5 and Gn)
  4. Excitation characteristics for F1 to F5 (i.e. degree of voicing and aperiodicity) (M1 to M5). These source "mix" values allow for differing degrees of voiced and voiceless excitation for each of the formants. Fn is assumed to be always fully voiced in this system.

The determination of each of these sets of parameters was carried out in different ways.
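To make the parameter inventory concrete, the following is a minimal Python sketch of how one frame of synthesiser input might be represented. All field names, units and default values are illustrative assumptions and do not correspond to the project's actual data format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class FormantFrame:
    """One frame of input parameters for a parallel formant synthesiser.

    Lists are indexed 0..4 for F1..F5; the nasal formant Fn is held
    separately.  All names, units and defaults are illustrative only.
    """
    freqs_hz: List[float]       # F1..F5 centre frequencies (Hz)
    bandwidths_hz: List[float]  # B1..B5 bandwidths (Hz)
    gains_db: List[float]       # G1..G5 input gains (dB)
    mix: List[float]            # M1..M5 source mix (0 = voiceless, 1 = voiced)
    fn_hz: float = 250.0        # nasal formant centre frequency (Hz)
    bn_hz: float = 100.0        # nasal formant bandwidth (Hz)
    gn_db: float = -60.0        # nasal formant gain (dB); Fn is always fully voiced

    def __post_init__(self):
        lengths = {len(self.freqs_hz), len(self.bandwidths_hz),
                   len(self.gains_db), len(self.mix)}
        assert lengths == {5}, "expected exactly five oral formants (F1-F5)"
```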

i) Formant centre frequency (Fx) extraction

Fx extraction required the development of a formant tracking algorithm that would be capable of tracking formants continuously during vowels, consonants, stop occlusions, etc. This is needed as the synthesiser requires continuous formant parameter tracks analogous to pole continuity in human speech. The Waves™ formant tracker was not suitable for this task as it produces discontinuous formant tracks across (especially voiceless) consonants and has a high error rate even in vowels and vowel-like consonants. The formant tracker developed for this project was a hybrid automatic/hand-analysis, expert-systems-based algorithm. It involved an initial "rough" formant analysis followed by hand selection of the acceptable formant values for each phoneme target using a graphical interface. This selection procedure resulted in a six-dimensional ellipsoid (based on F1-F5 and Fn) that encompassed the acceptable values of each formant for each phoneme or allophone target. From these ellipsoids a formant probability space was determined. When a phoneme target spectrum was analysed, a number of candidate peaks would be identified and combinations of 5 or 6 of those peaks would be hypothesised as being F1-F5 (with Fn optional). Each such combination of hypothesised formant values defines a point in 5- or 6-dimensional formant space. If that point was close to the centre of the appropriate ellipsoid then that combination of selected peaks was assigned a higher probability of being correct than combinations of peaks defining a point further from the centre of the ellipsoid.

Vowel-like consonants (semi-vowels and approximants) were dealt with in a manner very similar to the vowels. Non-vowel-like consonant formant probability spaces were much more problematic and were initially approached from a traditional locus frequency perspective. In the present case, however, we did not assume a single locus frequency for each consonant but rather a locus space analogous to the vowel formant probability spaces. This process was quite difficult as "formants" were not always clearly visible in the spectra of certain consonants (e.g. some voiceless fricatives, stop occlusions) and were sometimes merely derived from interpolations between measurable formant values on either side of the consonant. As with the vowels, this process eventually resulted in a set of formant probability ellipsoids for each consonant phoneme. (It is now clear that this procedure is inadequate and that probability ellipsoids should instead be derived at a more allophonic level for the consonants.) Transition probability corridors, consisting of simple interpolations between the pairs of ellipsoids, were also determined for each pair of phonemes. Details of this algorithm can be seen in figure 1.
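The core of the peak-selection step can be illustrated with a short sketch: every combination of candidate spectral peaks is scored against a phoneme's formant probability ellipsoid (simplified here to an axis-aligned ellipsoid with one radius per formant) and the combination closest to the ellipsoid centre is kept. The function names, example peak values and ellipsoid parameters are all assumptions for illustration, not values from the project.

```python
from itertools import combinations
import numpy as np

def ellipsoid_distance(point, centre, radii):
    """Normalised distance of a hypothesised formant set from a phoneme's
    formant probability ellipsoid centre (axis-aligned ellipsoid assumed).
    Values <= 1 lie inside the ellipsoid; smaller means more probable."""
    point, centre, radii = map(np.asarray, (point, centre, radii))
    return float(np.sqrt(np.sum(((point - centre) / radii) ** 2)))

def best_formant_hypothesis(candidate_peaks_hz, centre_hz, radii_hz, n_formants=5):
    """Try every ascending combination of n_formants candidate spectral peaks
    as F1..F5 and keep the combination closest to the ellipsoid centre."""
    best_combo, best_dist = None, float("inf")
    for combo in combinations(sorted(candidate_peaks_hz), n_formants):
        dist = ellipsoid_distance(combo, centre_hz, radii_hz)
        if dist < best_dist:
            best_combo, best_dist = combo, dist
    return best_combo, best_dist

# Hypothetical example: candidate peaks from a rough analysis of one vowel target.
peaks_hz  = [310, 650, 1220, 1850, 2480, 3400, 4100]
centre_hz = [600, 1200, 2400, 3300, 4000]   # assumed ellipsoid centre for this vowel
radii_hz  = [150, 300, 400, 500, 600]       # assumed acceptable spread per formant
print(best_formant_hypothesis(peaks_hz, centre_hz, radii_hz))
```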

ii) Bandwidth extraction

These parameters were determined by rule. As the database was segmented and labelled, the voicing of each segment could be assumed from the phonemic labels. A simple bandwidth calculation based on formant frequency and voicing (Fant, 1960) was used.
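A hedged sketch of what such a rule might look like is shown below. The constants are placeholders; the actual rule used in the project follows Fant (1960) and is not reproduced here.

```python
def bandwidth_by_rule(formant_hz: float, voiced: bool) -> float:
    """Very rough bandwidth rule: bandwidth grows with formant frequency and
    voiceless excitation is given wider bandwidths.  The constants below are
    placeholders, not the values used in the project."""
    base_hz = 50.0 + 0.02 * formant_hz        # assumed frequency-dependent term
    return base_hz if voiced else 1.5 * base_hz

# e.g. a 500 Hz F1 in a voiced segment -> a bandwidth of about 60 Hz
print(bandwidth_by_rule(500.0, voiced=True))
```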

iii) Formant intensity or Gain

Serial formant synthesiser architectures have a great advantage in that they do not require different input gain values for each of the formants during vowel synthesis, as these cascade architectures automatically result in the appropriate relative vowel formant intensities. For many of the consonants, however, this architecture results in very inadequate formant intensity profiles, which can only be overcome by placing separate fricative and nasal filters in parallel to the formant channels. Parallel formant architectures, on the other hand, allow the detailed shaping of the spectrum by the selection of different gain parameters for each formant as appropriate to both vowel and consonant spectra. This creates a greatly increased complexity for vowel gain specification but allows the specification of vowel and consonant spectral shapes utilising the same set of resonators. This allows for greater flexibility and is better able to avoid the perceptual disintegration of the fricative and vowel branches which occurs because of the inherent lack of pole continuity between the two branches.

The precise specification of appropriate formant gains is, however, an extremely difficult problem. Formant gain values measured from LPC or FFT spectra are in general quite unsuitable as input gain parameters for a parallel formant synthesiser, because there is only an indirect (and individual-synthesiser-specific) relationship between synthesiser input gain parameters and relative output formant intensities. This problem is usually solved by iterative hand modification of the input parameters until the desired spectral shape is achieved, and it is exacerbated by the interaction between the gain parameters: the modification of one gain parameter will often alter the output intensities of one or more of the other formants. The problem was solved in the current project by an iterative analysis-by-synthesis approach which utilised a software simulation of the target synthesiser. Each gain was varied iteratively for each analysis frame until the synthetic formant peak intensities matched the natural formant peak intensities as closely as possible. (See figure 2 for details of this algorithm.) When the formant frequencies are accurate, this procedure results in very natural speech. Relatively small deviations of extracted formant frequency from actual formant frequency can, however, result in poor gain specification, as the peak intensity of the synthetic frame may be matched against a part of the natural spectrum remote from the actual formant peak. Heuristics have been added to reduce the effects of this problem and a revised procedure utilising Euclidean spectral distances is being considered.
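The following is a minimal sketch of the kind of iterative gain-matching loop described above. The function `synthesise_frame` stands in for a software simulation of the target synthesiser and is purely hypothetical; the convergence threshold and step size are likewise illustrative.

```python
import numpy as np

def match_gains(natural_peak_db, freqs_hz, bandwidths_hz,
                synthesise_frame, n_iter=20, step_db=1.0, tol_db=0.5):
    """Iteratively adjust per-formant input gains until the simulated
    synthesiser's output peak levels at the formant frequencies match the
    measured natural peak levels.

    `synthesise_frame(freqs_hz, bandwidths_hz, gains_db)` is assumed to return
    the simulated output level (dB) at each formant frequency; it is a
    hypothetical placeholder, not a real API.
    """
    natural_peak_db = np.asarray(natural_peak_db, dtype=float)
    gains_db = natural_peak_db.copy()            # start from the measured levels
    for _ in range(n_iter):
        synth_peak_db = np.asarray(synthesise_frame(freqs_hz, bandwidths_hz, gains_db))
        error_db = natural_peak_db - synth_peak_db
        if np.max(np.abs(error_db)) < tol_db:    # all formants close enough
            break
        # Nudge each input gain towards reducing its own output error.  Because
        # the gains interact, several passes are needed before the frame settles.
        gains_db += np.clip(error_db, -step_db, step_db)
    return gains_db
```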

iv) Formant source characteristics (voiced/voiceless/mixed)

This is determined by rule, based on the phonemic identity of each segment as defined by the database labels.
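A rule of this kind might look something like the sketch below. The phoneme classes and mix values shown are assumptions for illustration only; the project's actual label set and rules are not reproduced here.

```python
# Assumed phoneme classes and mix values (1.0 = fully voiced, 0.0 = fully
# voiceless); the project's actual label set and rules are not shown here.
VOICELESS = {"p", "t", "k", "f", "s", "S", "T", "h"}
MIXED = {"b", "d", "g", "v", "z", "Z", "D"}   # voiced obstruents: partly aperiodic

def source_mix(phoneme_label: str) -> float:
    """Return an illustrative voiced/voiceless mix value from a phoneme label."""
    if phoneme_label in VOICELESS:
        return 0.0
    if phoneme_label in MIXED:
        return 0.6        # placeholder degree of voicing for voiced obstruents
    return 1.0            # vowels, nasals and approximants: fully voiced
```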

The formant parameter (frequency, bandwidth, and gain) extraction system proved to be more complex than first anticipated and required modification of various details of its design. Particular problems were encountered in the expert-systems approach to formant frequency extraction. This required a considerable re-evaluation of the way in which pole-formant probability spaces were determined and a re-examination of how best to allocate LPC poles to formants when candidate pole trajectories converged. The determination of formant gains also needed to be re-evaluated for the special case of converging pole trajectories. This especially affected the nasal formant, and in particular what to do with that formant when it disappeared or merged with the first formant (similar, but much less common, problems occurred with other pairs of formants). For example, when two formants merge, both are sometimes given the same gain; when both are excited at that level, the resulting peak has twice the required intensity. This intensity-doubling effect often has extreme effects on speech quality, particularly when it occurs at the normally more intense lower frequencies (i.e. F1 and Fn). Various heuristics have now been added to the algorithm to deal with these problems; however, many of them will be dealt with in a much more principled manner when the gain extraction procedures are enhanced. The algorithm, with its heuristic enhancements, now works well for the vast majority of diphones.
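One simple way to express the kind of heuristic described above is sketched below: when two adjacent formant tracks fall within some merge threshold, both gains are reduced by about 6 dB so that their summed output roughly matches the intended single-peak intensity. The threshold and reduction values are illustrative assumptions, not the project's actual heuristics.

```python
def reduce_merged_formant_gains(freqs_hz, gains_db,
                                merge_threshold_hz=80.0, reduction_db=6.0):
    """If two adjacent formant tracks converge to within an assumed merge
    threshold their outputs add, roughly doubling the peak amplitude; lowering
    both gains by about 6 dB restores the intended single-peak level.  All
    values are illustrative, not the project's actual heuristics."""
    gains_db = list(gains_db)
    for i in range(len(freqs_hz) - 1):
        if abs(freqs_hz[i + 1] - freqs_hz[i]) < merge_threshold_hz:
            gains_db[i] -= reduction_db
            gains_db[i + 1] -= reduction_db
    return gains_db

# e.g. Fn and F1 tracks that have merged at around 300 Hz
print(reduce_merged_formant_gains([300, 310, 1200, 2400, 3300, 4000],
                                  [0, 0, -6, -12, -18, -24]))
```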

The speech so produced was not of acceptable quality. This was for a number of reasons:-

i) The parameter extraction algorithms: The formant and gain extraction algorithms were both found to be unstable in a small but significant number of cases. These problems were dealt with in the short term by numerous heuristics. For the long term, development of the parameter extraction algorithms has continued.

ii) The database: There were a number of problems caused by attempting to extract a complete set of diphones from a connected speech database. Many of the diphones were in a much more reduced state than would be the case for diphones extracted from isolated words. This resulted in poorly modelled stressed speech sounds (especially tonic stress). It was necessary to record a large number of real and nonsense words uttered in isolation or in tonic-stressed context in a carrier phrase. Although a large number of diphones are now extracted from this subsidiary database, numerous diphones (especially cross-junctural diphones) are still selected from the connected speech database. Not all required diphones can be extracted from the existing databases, and more work is required to complete the diphone set.

iii) Diphone boundary placement: The placing of diphone boundaries was not accurate, as the algorithm relied upon the automatic (and often erroneous) identification of phoneme targets. If the diphone boundary is not located accurately at a consistent point in the phoneme target then concatenation will result in very poor quality speech. Labelling of the isolated word database therefore needed to be extended to include the identification of phoneme targets and transitions. A particular problem occurs when formant frequency targets (defined in the usual way) do not align with gain targets (intensity peaks in vowels, for example, sometimes align with the later part of the transition preceding the target). If the cut-point in a left-context diphone is at the intensity peak of the phoneme and the cut-point in the matching right-context diphone aligns with a point some distance down the intensity peak, an intensity discontinuity often occurs, with audible consequences.
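One way to reduce such an intensity discontinuity at a diphone join is to bridge the level jump over a few frames on either side of the cut-point. The sketch below does this for a single formant's gain track by linear interpolation; the window length and the choice of linear ramps are assumptions, not the project's actual smoothing rule.

```python
import numpy as np

def smooth_join(left_gains_db, right_gains_db, n_smooth=4):
    """Concatenate two per-frame gain tracks (one formant's gains, in dB) and
    bridge any level discontinuity at the cut-point: the end of the left
    diphone is ramped up by half the jump and the start of the right diphone
    is ramped down by the other half, so the tracks meet at the midpoint."""
    left = np.asarray(left_gains_db, dtype=float).copy()
    right = np.asarray(right_gains_db, dtype=float).copy()
    jump_db = right[0] - left[-1]                 # discontinuity at the join
    left[-n_smooth:] += jump_db * np.linspace(0.0, 0.5, n_smooth)
    right[:n_smooth] -= jump_db * np.linspace(0.5, 0.0, n_smooth)
    return np.concatenate([left, right])

# e.g. a left-context diphone ending near a vowel's intensity peak and a
# right-context diphone starting a few dB below it
print(smooth_join([-10, -8, -6, -5, -4], [-9, -10, -11, -12, -14]))
```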

Future Directions

The success of the alternative Bark-scaled channel-parameter diphone concatenation system has temporarily delayed further development of the formant synthesis system. The processes of parameter extraction and diphone concatenation are trivial for that system when compared to the formant system.

The formant system could be further developed in a number of ways.

i) Further work will be done on the formant parameter extraction algorithms. A new algorithm is being developed which does not rely on the extremely time-consuming analysis-by-synthesis procedure currently in use (and under continuing development). This algorithm requires a detailed analysis of the relationship between measured natural peak intensities, synthesiser gain parameters and the resulting synthetic peak intensities. It is hoped that the proposed method will extract formant frequency, bandwidth and gain values simultaneously via an iterative peak/curve-fitting procedure.
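One possible form for such a peak/curve-fitting procedure is sketched below: the measured frame spectrum is fitted with a sum of second-order resonance curves so that each formant's frequency, bandwidth and amplitude are estimated simultaneously. The resonance shape, the use of scipy's curve_fit and the starting values are all assumptions for illustration; this is not the algorithm under development.

```python
import numpy as np
from scipy.optimize import curve_fit

def resonance_amp(f, fc, bw, amp_db):
    """Linear-amplitude response of one second-order resonance: a rough
    stand-in for a single formant channel (illustrative only)."""
    amp = 10.0 ** (amp_db / 20.0)
    return amp / np.sqrt((1.0 - (f / fc) ** 2) ** 2 + (f * bw / fc ** 2) ** 2)

def spectrum_db(f, *params):
    """Sum of five resonances (summed in the amplitude domain), returned in dB.
    params = (fc1, bw1, a1_db, ..., fc5, bw5, a5_db)."""
    total = np.zeros_like(np.asarray(f, dtype=float))
    for i in range(0, len(params), 3):
        fc, bw, amp_db = params[i:i + 3]
        total += resonance_amp(np.asarray(f, dtype=float), fc, bw, amp_db)
    return 20.0 * np.log10(total + 1e-12)

# Hypothetical usage on one analysis frame:
#   freqs_hz, measured_db = ...   # FFT bin frequencies and levels for the frame
#   p0 = [500, 80, 0,  1500, 100, -6,  2500, 120, -12,
#         3500, 150, -18,  4500, 200, -24]   # assumed starting values
#   popt, _ = curve_fit(spectrum_db, freqs_hz, measured_db, p0=p0, maxfev=20000)
#   # popt then holds fitted (frequency, bandwidth, amplitude) triples for F1-F5.
```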

ii) Further attempts will be made to overcome the existing diphone concatenation problems with formant gain and frequency parameter concatenation and smoothing.

iii) It should be possible to produce a detailed formant parameter version of the database (both connected speech and word data) using the parameter extraction procedure. It is anticipated that such a parameterised database could then be interrogated using MU+ to produce a set of single-speaker-specific context sensitive rules (or representations) which could form the basis of a new SBR system.

iv) Procedure iii) could be repeated for a number of similar (gender, dialect, etc.) speakers to produce a set of SBR rules for a generalised speaker. It seems possible that averaging across speakers will be less harmful to speech quality than SBR rules which are derived from averages across too many segmental contexts.

v) The results of procedure iii) might be used in conjunction with a word-level concatenation system (also under development) so that word concatenation would account for the most frequent words and SBR would fill in the gaps between them and would deal with junctural transitions between those words and adjacent words or morphemes.

vi) Alternatively, word-level concatenation might potentially be combined with diphone concatenation.