Department of Linguistics

A Brief Historical Introduction to Speech Synthesis:
A Macquarie Perspective

Robert Mannell

The reader is advised that this introduction is written particularly from the perspective of speech synthesis research that has occurred at Macquarie University since the late 1960s. This bias is intentional, as one of the goals of this topic is to provide an overview of the history of speech synthesis research in our lab. It is stressed, however, that this lab is only one of many that have carried out research in this field.

i) Early attempts at speech synthesis.

Prior to the 18th century there were no recorded serious attempts at speech synthesis, although it was not uncommon for attempts to be made to impress people and influence events by making voices emanate from statues (etc.) via speaking tubes. One of the earliest successful attempts at speech synthesis occurred in Russia in 1779 when Kratzenstein constructed a mechanical model of the human vocal tract which was capable of reproducing a few steady state vowels. The first recorded success in synthesising connected speech was achieved by von Kempelen in 1791 when he completed the construction of an ingenious pneumatic synthesiser (figure 1) that was driven by a bellows, with the air being forced past a whistle and into an adjustable leather 'vocal tract'. The leather 'vocal tract' was a simple leather bag whose dimensions could be varied by external manipulation by hand. There were also a couple of hiss whistles to allow simulation of fricatives and a pair of openings to simulate the nostrils. This device must have required considerable practice before an operator could produce a sequence of sounds anything like speech. Interest in mechanical synthesisers persisted into the 20th century, with an apparently successful synthesiser being built by Riesz as late as 1937 (see Flanagan, 1972).

Figure 1: Pneumatic speech synthesiser developed by von Kempelen in 1791.

ii) Vocoders.

With the 20th century came the development of electronics and later of electronic resonators. Electronic resonators are analogs of acoustic resonators and can potentially be made to simulate them with considerable accuracy. Further, they can be much more usefully and readily controlled than can mechanical or pneumatic systems. There were a few attempts early in the century to utilise electronic resonators in such a way that they could produce steady state vowels; however, it was not until the late 1930's that work by Dudley at the Bell Laboratories produced the first electrical connected speech synthesisers.

Dudley developed two devices in the late thirties that are generally accepted to be the earliest electronic speech synthesisers. One of them, the 'Voder' (figure 2), was demonstrated at the 1939 and 1940 world fairs. It was essentially a parallel array of ten electronic resonators arranged as contiguous band-pass filters spanning the important frequencies of the speech spectrum (such a system is sometimes referred to as a spectrum synthesiser). The device was controlled via a keyboard (ie. played like a piano). Ten finger keys controlled the output gain of each of the filters, a wrist bar controlled the selection of aperiodic hiss or periodic buzz, whilst a foot pedal controlled the pitch of the buzz. Three additional keys supplied appropriate stop-like transient excitation. The Voder required operators who had trained for months and who were capable, at most, of producing speech which was only reasonably intelligible when context was supplied by the asking of leading questions. For these reasons the Voder came to be considered of little practical value.

Figure 2: A schematic diagram of Dudley's Voder speech synthesiser. (after Dudley, Riesz and Watkins, 1939)

Dudley's other device also came into being in 1939 and was called a 'channel vocoder' (Dudley, 1939). This device remains the basis of many devices still in use today. The channel vocoder, and all subsequent vocoders, are essentially analysis/synthesis devices. That is, they are divided into two halves, an analysis half and a synthesis half. The analysis half analyses an incoming speech signal and derives certain parameters from that natural signal. These parameters are passed as codes to the synthesis half of the device and are there used to resynthesise a synthetic version of the original speech. The channel vocoder is the simplest of the vocoders and is basically divided into two branches. One branch determines whether the signal is voiced or voiceless and, if voiced, determines the pitch. This information is used to produce a synthetic source (either a periodic buzz with the determined pitch or an aperiodic hiss). The other branch of the channel vocoder is a bank of band-pass filters (electronic resonators) which measure the level of the signal in each frequency band at each point in time. It is only these levels (and not the complete detailed signal), together with the pitch and voiced/voiceless information, which are passed to the synthesis half of the device. The synthetic source is then produced (as described above) and is mixed with a spectral envelope reconstituted from the filter level values to produce a synthetic version of the original signal.

Vocoders were originally developed at the Bell Telephone Laboratories as devices which allowed a speech signal to be coded more efficiently (with fewer bits of information) than natural speech and thus allowed more conversations to be passed simultaneously over the telephone network. Various other vocoder configurations have been developed which dispense with simple filter banks and rely on complex mathematical transformations of the data (eg. Linear Predictive Coding (LPC) vocoders) or on the detection of the formants in the speech signal (formant vocoders). The main motivation behind the development of these devices remains the more efficient coding of speech signals for transmission over various communications lines. They are also valuable as tools in basic research into the fundamental limitations of various synthesis configurations and of time and frequency resolution conditions. Figure 3 shows the research channel vocoder developed at Macquarie University.

Figure 3: SHLRC channel vocoder. Note that the circles containing a cross indicate that the source and channel data are multiplied (convolved) together. The circle containing a sigma indicates that the source-excited channel data from all channels are added together to obtain output speech. (after Clark, Mannell and Ostry, 1987)
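
The channel vocoder principle is compact enough to sketch in a few lines of code. The following Python (NumPy/SciPy) fragment is an illustration only, not the SHLRC design: the number of channels, the log-spaced band edges, the Butterworth filters and the single whole-utterance voiced/voiceless flag (a real vocoder makes that decision, and tracks pitch, frame by frame in the analysis branch) are all assumptions made for brevity.

    import numpy as np
    from scipy.signal import butter, lfilter

    def bandpass(lo, hi, fs, order=4):
        return butter(order, [lo / (fs / 2), hi / (fs / 2)], btype="band")

    def channel_vocoder(speech, fs, voiced, f0=120.0, n_channels=16):
        """Analyse `speech` into per-channel levels, then resynthesise it
        from a synthetic source (periodic buzz if `voiced`, else hiss)."""
        # -- Analysis half: filter bank + level (envelope) detection --------
        edges = np.logspace(np.log10(100), np.log10(min(8000, fs / 2 - 1)),
                            n_channels + 1)
        env_lp = butter(2, 50 / (fs / 2))          # 50 Hz envelope smoother
        envelopes, bands = [], []
        for lo, hi in zip(edges[:-1], edges[1:]):
            b, a = bandpass(lo, hi, fs)
            band = lfilter(b, a, speech)
            env = lfilter(*env_lp, np.abs(band))   # rectified, smoothed level
            envelopes.append(env)
            bands.append((b, a))
        # -- Synthesis half: excite each channel by a common source ---------
        n = len(speech)
        if voiced:                                 # periodic buzz at f0
            source = np.zeros(n)
            source[::int(fs / f0)] = 1.0
        else:                                      # aperiodic hiss
            source = np.random.randn(n)
        out = np.zeros(n)
        for (b, a), env in zip(bands, envelopes):
            out += lfilter(b, a, source) * env     # impose the channel level
        return out / (np.max(np.abs(out)) + 1e-9)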

iii) Spectrogram Readers

Various devices have been developed (mainly in the 1950's) which are able to read either a spectrogram or a simplified hand-painted version of a spectrogram. The most famous of these was developed by Cooper, Liberman and Borst (1951, also Cooper, 1953) and is called the pattern playback machine (figure 4). The spectrogram is placed on a sheet of thick plastic connected end-to-end to form a continuous roll which runs over two rollers not unlike those that used to be found on mangle or wringer washing machines. The device works by illuminating the spectrogram with 50 contiguous light spots. This light can be either reflected from the spectrogram or can be made to pass through a special transparent (usually hand-painted) spectrogram. The amount of light reflected or transmitted depends on the density of the spectrogram pattern (the darker the pattern, the less the amount of light). The intensity of the light is then detected by 50 photocells, and each photocell's output is converted to an electrical current in direct proportion to the intensity of that frequency band in the original signal. These electrical signals are finally added together and amplified, and a monotone version of the original speech sound is produced.

Figure 4: The pattern-playback synthesiser which reads stylised spectrograms hand-painted onto a moving transparent sheet (after Cooper, Liberman and Borst, 1951)
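
In modern terms the device computes a sum of 50 harmonically spaced sinusoids, each amplitude-modulated by the darkness of one painted track. The following Python sketch illustrates that principle; the 120 Hz fundamental (which is why the output is monotone), the 10 ms time columns and the linear interpolation between columns are illustrative assumptions, not specifications of the original machine.

    import numpy as np

    def pattern_playback(pattern, fs=16000, col_dur=0.01, f0=120.0):
        """pattern: array of shape (50, n_cols), darkness values 0..1."""
        n_tracks, n_cols = pattern.shape
        t = np.arange(int(n_cols * col_dur * fs)) / fs
        col_times = (np.arange(n_cols) + 0.5) * col_dur
        out = np.zeros_like(t)
        for k in range(n_tracks):
            f = f0 * (k + 1)                        # k-th harmonic "light spot"
            if f >= fs / 2:
                break                               # stay below Nyquist
            # Interpolate the track's darkness to a per-sample amplitude.
            amp = np.interp(t, col_times, pattern[k])
            out += amp * np.sin(2 * np.pi * f * t)  # photocell -> current -> sum
        return out / (np.max(np.abs(out)) + 1e-9)

    # eg. a steady two-"formant" vowel: darken tracks near 500 Hz and 1500 Hz
    pat = np.zeros((50, 30))
    pat[3:5, :] = 1.0                               # harmonics 4-5: 480, 600 Hz
    pat[11:13, :] = 1.0                             # harmonics 12-13: ~1500 Hz
    audio = pattern_playback(pat)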

iv) Parametric Synthesis (Formant Synthesisers)

a) Serial ("Analog") Formant Synthesis

Serial formant synthesisers were "analog" in more than one way. Firstly, they were almost always analog electronic hardware devices. Secondly, the way that they processed formants (ie. in series) was considered to be analogous to the way that formants are produced in the vocal tract (ie. each resonance acted upon the acoustic result of the other resonances).

"Pat" and "Mac"

In 1948 Walter Lawrence (Lawrence, 1953) created a device ("Pat") which looked superficially similar to the pattern playback in that it also consisted of spectrogram-like patterns painted onto a continuous sheet of plastic which was rotated around a mangle (this device was later copied by John Bernard at Macquarie University, beginning in 1968, to become the first Australian-made speech synthesiser, known as "Mac"). Whilst similar in appearance, however, Pat differed drastically in principle from the pattern playback, as it was one of the earliest parametric synthesisers. The pattern on the plastic was divided up into six (later eight) concentric bands, with each band containing the changing pattern of a single parameter painted in electrically conductive silver paint. A current would be passed through each of these continuous silver lines and a set of electrical contacts (one for each parameter) would sense the position of each of the lines (acting as a variable resistor) and the values would be translated into continuously varying voltages. These varying voltages would then be interpreted by the electronics as varying parameters capable of controlling the synthesiser's various filters etc. These parameters were selected as they were considered by Lawrence to be the minimum number of variables which needed to be controlled in order to produce intelligible speech. The parameters controlled by the later version of Pat were:-

  • F0 fundamental frequency
  • A0 amplitude of the periodic component
  • AH amplitude of the aperiodic (hiss) component
  • F1, F2, F3 the first three formant frequencies
  • FLP hiss low pass (ie. upper edge of hiss band)
  • FHP hiss high pass (ie. lower edge of hiss band)

The formants were implemented as a series of three electrical resonators. The signal source (random hiss or periodic buzz) would first pass through F1 where it would be filtered according to the input parameter (F1). The output of this filter would then be filtered by the F2 filter, and the output of the F2 filter would be filtered by the F3 filter. Because of this serial (or cascade) configuration, the amplitudes of the formants could not be individually controlled, as a change in the amplitude of the output of one filter would affect the relative weightings of all filters. Fortunately, when the formant filters are cascaded in this way with a single controllable amplitude, the relative formant amplitudes come out acceptably for vowels, as do the formant bandwidths. If this were not the case then it would be essential to also control amplitude and bandwidth parameters for each formant (ie. A1, A2, A3, B1, B2, B3, adding another six parameters). The serial configuration thus simplifies the amount of detailed control required for vowels.

Unfortunately, the picture is not so simple for consonants which are characterised by high frequency energy peaks (eg. certain fricatives). To produce a reasonable [s], for example, it is not enough simply to specify a sufficiently high hiss amplitude (AH) with appropriate formant filter centre frequencies (F1, F2, F3), as this would only produce some sort of whispered vowel (albeit one with rather strange formant values) with the typical vowel spectral tilt of about -6 dB/octave. The amplitudes and bandwidths would be wrong, as the high frequency band has a bandwidth much greater than a typical vowel formant and also an amplitude much higher than any vowel formant would have at such a frequency. For this reason, a separate hiss filter needs to be connected in parallel to the formant filters, with a bandwidth adjustable by setting the FHP and FLP parameters and a completely independent amplitude control AH. Since this filter is in parallel it is in no way constrained by the output spectral slope of a preceding cascaded formant filter and can instead be simply added to the spectrum produced by the formant filters. In this way high amplitude, variable bandwidth hiss can be superimposed over the signal to allow the modelling of reasonable fricatives.
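
The cascade-plus-parallel-hiss arrangement can be sketched as follows in Python. This illustrates the principle only, not Pat's circuitry: the second-order digital resonator recipe, the fixed formant bandwidths and the Butterworth hiss band filter are assumptions, while the parameter names (F0, A0, AH, F1-F3, FLP, FHP) follow the list above.

    import numpy as np
    from scipy.signal import butter, lfilter

    def resonator(x, fc, bw, fs):
        """One formant: y[n] = A*x[n] + B*y[n-1] + C*y[n-2] (unity gain at 0 Hz)."""
        C = -np.exp(-2 * np.pi * bw / fs)
        B = 2 * np.exp(-np.pi * bw / fs) * np.cos(2 * np.pi * fc / fs)
        return lfilter([1 - B - C], [1, -B, -C], x)

    def serial_synth(dur=0.5, fs=16000, F0=120, A0=1.0, AH=0.0,
                     F=(700, 1100, 2400), FLP=6000, FHP=3500):
        n = int(dur * fs)
        buzz = np.zeros(n)
        buzz[::fs // F0] = 1.0                     # periodic source
        hiss = np.random.randn(n)                  # aperiodic source
        # Cascade branch: each resonator filters the previous one's output,
        # so the relative formant amplitudes fall out of the configuration.
        y = A0 * buzz
        for fc, bw in zip(F, (90, 110, 170)):      # assumed fixed bandwidths
            y = resonator(y, fc, bw, fs)
        # Parallel hiss branch: independent level and band edges (fricatives).
        b, a = butter(2, [FHP / (fs / 2), FLP / (fs / 2)], btype="band")
        y += AH * lfilter(b, a, hiss)
        return y / (np.max(np.abs(y)) + 1e-9)

    vowel = serial_synth()                         # buzz-excited vowel
    s_like = serial_synth(A0=0.0, AH=1.0)          # hiss band alone, [s]-like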

Figure 5: The Macquarie MAC speech synthesiser being adjusted by technical officer Trevor Blum in 1971. A 1971 Macquarie "University News" report by John Bernard (Bernard, 1971) describes this system. Note that the reference to the "miniaturisation of modern components" on page 9 of that report refers to the large box of electronics on top of the synthesiser, the right half of which is being adjusted by Mr Blum.

Two recorded samples of speech produced on Mac are the phrases:-

    Where were you?
    How are you?

These examples are optimal choices of phrases for Mac as they consist mostly of vowels and vowel-like consonants (in this case the semi-vowels /w/ and /j/). Only /h/ in "how" in the second phrase does not fit these criteria. /h/ is synthesised in the second example by exciting a vowel spectrum (identical to the spectrum of the first target of the following diphthong "ow") with a hiss noise (you may need to listen carefully to hear this). Even though the choice of phrases is "optimal" for Mac and the speech is fairly intelligible (especially when helped by the text), the speech is of rather poor quality.

Generating parameter tracks that resulted in good quality speech was extremely time consuming on such a system, and a great deal of time would have been required to produce a well-known, and much higher quality, demonstration of the "North Wind and Sun" passage produced on Pat in Edinburgh in 1962. Whilst a few words are not highly intelligible (eg. some instances of "traveler") the speech quality is more natural than was the case for the Mac examples. This passage would have been produced a phrase at a time. Each phrase would have been recorded onto reel-to-reel tape and the phrases would then have been edited together to produce the full passage.

"Sid"

In 1972, John Clark (Macquarie University) began the development of a serial formant speech synthesiser ("Sid", see figure 6) which is in some ways similar to the above synthesiser but which has several more parameters and is computer controlled rather than mangle controlled. The extra parameters include a nasal resonance filter and a variable time base, as well as an extra "hiss through formants" parameter to allow for better voiced fricatives. There are a total of 12 parameters in this system. Computer control has the advantage over the mangle system of avoiding the messy and extremely time consuming painting of the silver lines. Once the parameters are typed in they can also be saved for future use, or can be easily modified (modification of the mangle system involved tedious scraping off of the offending pattern). Although data entry is digital (a series of numbers representing discrete points in time) the actual synthesiser is analog and the data needs to be converted into continuous analog form before it can control the synthesiser. Data entry can be done by hand or the parameters can be generated by rules from either high or low level phonetic parameters or even from orthographic text (text-to-speech).

Figure 6: The Macquarie SID serial formant speech synthesiser (after Clark, 1976)

b) Parallel Formant Synthesis

Although the addition of a hiss filter in parallel to the formant filters facilitates the production of reasonable fricatives, it can never really fully model them. Such systems rarely produce consonants with better than 70% of the intelligibility of natural consonants, although the vowels often achieve intelligibility equivalent to that of natural vowels. Much greater control of the amplitudes and bandwidths is required in order to allow both better consonants and also more control over voice quality parameters. Further, in order to fully model certain consonants (eg. nasals, fricatives) it may be desirable to add an anti-resonance filter to the system. Such control can only be achieved with a parallel formant filter system, such as the system designed by John Clark et al (amongst many others). The Macquarie system (figure 7) has 5 formant filters, each with independently controllable centre frequency (Fx), bandwidth (Bx), amplitude (Gx) and periodic/aperiodic mixing level (Mx, where 1 = fully aperiodic, 0 = fully periodic, and intervening values represent a mixed source). There is also a nasal resonance filter with controllable centre frequency (Fn) and gain (Gn; bandwidth is fixed) and an anti-resonance filter with adjustable centre frequency (Fz) and bandwidth (Bz; gain is fixed). This gives a total of 24 filter parameters. In addition to these, there are several parameters that control the type of source spectrum (eg. glottal waveform shape), pitch, and various voice quality settings. In all there are in excess of 30 parameters, which would be hopelessly complex for anything but computer control. This system is designed to be driven by a synthesis-by-rule software package which is presently under development.

Figure 7: The Macquarie parallel formant synthesiser. This is the configuration used from the mid-1980's to the early-1990's. In more recent versions the "zero" (or anti-resonance) filter has been removed. (after Clark, Summerfield and Mannell, 1986)
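
For comparison with the serial sketch above, the following Python fragment illustrates the parallel principle: five resonators are fed side by side, each with its own Fx, Bx, Gx and source mix Mx, and their outputs are summed, with a nasal resonance and a notch-style anti-resonance standing in for the Fn/Gn and Fz/Bz filters. The resonator recipe, the default values and the notch implementation are illustrative assumptions, not the Macquarie hardware.

    import numpy as np
    from scipy.signal import lfilter, iirnotch

    def resonator(x, fc, bw, fs):
        C = -np.exp(-2 * np.pi * bw / fs)
        B = 2 * np.exp(-np.pi * bw / fs) * np.cos(2 * np.pi * fc / fs)
        return lfilter([1 - B - C], [1, -B, -C], x)

    def parallel_synth(dur=0.5, fs=16000, F0=120,
                       Fx=(500, 1500, 2500, 3500, 4500),
                       Bx=(80, 100, 150, 200, 250),
                       Gx=(1.0, 0.6, 0.3, 0.15, 0.1),
                       Mx=(0.0, 0.0, 0.0, 0.0, 0.0),  # 0=buzz ... 1=hiss
                       Fn=None, Gn=0.5, Fz=None, Bz=100):
        n = int(dur * fs)
        buzz = np.zeros(n)
        buzz[::fs // F0] = 1.0
        hiss = np.random.randn(n)
        out = np.zeros(n)
        for fc, bw, g, m in zip(Fx, Bx, Gx, Mx):
            src = (1 - m) * buzz + m * hiss        # per-formant source mix
            out += g * resonator(src, fc, bw, fs)  # parallel: outputs summed
        if Fn is not None:                         # nasal resonance (fixed bw)
            out += Gn * resonator(buzz, Fn, 100, fs)
        if Fz is not None:                         # anti-resonance as a notch
            b, a = iirnotch(Fz / (fs / 2), Fz / Bz)
            out = lfilter(b, a, out)
        return out / (np.max(np.abs(out)) + 1e-9)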

v) Synthesis-by-rule (SBR)

A synthesis-by-rule system (figure 8) is one which accepts as input a string of characters which refer, at some level of abstraction, to the desired output utterance. The first system of this kind, utilising segmental phonetic input, was proposed by Liberman et al in 1959. Internally, a set of algorithms computes a set of quasi-continuous values for each of the required physical parameters to be output. These algorithms usually have access to an internal database which may comprise various tables of range values (etc.) for each phoneme in all of its potential segmental environments, or data which helps in the derivation of appropriate suprasegmental (intonation, stress, rhythm) contours.

Figure 8: A simplified schematic diagram of a synthesis-by-rule system. (after Clark, 1986)

"Because of the complexity of context sensitive effects it is not sufficient merely to concatenate a set of static parameter values along an appropriate time course with simple transitions connecting them. Establishing appropriate rules or algorithms for generating the context sensitive formant patterns which model those observable in corresponding sequences of natural speech, particularly target and transition properties, is a major component of all synthesis by rule systems." (Clark, 1986)

The input string may vary from ordinary orthographic text (in which case it is referred to as a text-to-speech system), through orthographic text with lexical stress marks, or simple phonetic script (with or without prosodic diacritics), to a very detailed low-level phonemic code including specifications for both segmental and prosodic features (eg. Clark, 1979).

The actual rules included, as well as the degree of complexity of the rules and of the input string, vary according to the particular phonetic/phonological model being followed and the overall objectives of the particular project.

Most systems between the late 1950's and the mid-1990's focused on formant coding methods, although there have been attempts, with varying degrees of success, at various other approaches, including some systems based on time-varying articulatory parameters (eg. Flanagan et al, 1970).

The Macquarie SHLRC SBR synthesiser (described above) is a formant-coded system which consists of several distinct and separate levels of rules (figure 9). The output of any one level is typically a string of codes or parameters and the system can be entered at any level by simply supplying the appropriate code sequences.

Figure 9: The Macquarie synthesis-by-rule (SBR) system. This system was designed in the mid-1980's and was in use until the early-1990's. (after Mannell and Clark, 1986, 1987)

Entry at the highest level consists of simply entering an ordinary text string. The grapheme-to-phoneme rules produce a high level phonetic transcription of this input in ordinary IPA with stress markers (see the section on text-to-speech below).

The next level of the system consists of a set of context sensitive rules which take into account the phonetic context of each segment as well as its prosodic features. These rules produce a low level phonetic representation which consists of a line of code for each phonetic segment. This line of code specifies each vowel relative to the 16 cardinal vowels. For example, a certain vowel when adjacent to a particular consonant is determined (with the help of the database) to be half way between cardinal vowels 1 and 2, and further to be retracted by a certain amount relative to those cardinals. A consonant, on the other hand, will have its various component lengths (eg. occlusion, VOT, onset transition, offset transition) specified (again with reference to another database) differently for each segmental context. Both vowel and consonant segments also have a number of prosodic features specified (derived relative to the total sentence prosody). These include intonation contour type (eg. rising, falling, linear, curved), a start and end pitch value, and intensity information.
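
One way to picture such a "line of code" is as a record with fields for the cardinal-relative vowel position (or the consonant component durations) together with the prosodic values. The following Python record is purely hypothetical, a guess at a plausible layout rather than the actual Macquarie code format:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class SegmentCode:
        """Hypothetical low-level code line for one phonetic segment."""
        label: str                            # eg. "i:" or "t"
        # Vowels: a point on the cardinal vowel plane (see next section).
        backness: Optional[float] = None      # 0 = front ... 1 = back
        openness: Optional[float] = None      # 0 = close ... 1 = open
        # Consonants: component durations in milliseconds.
        occlusion_ms: Optional[float] = None
        vot_ms: Optional[float] = None
        onset_transition_ms: Optional[float] = None
        offset_transition_ms: Optional[float] = None
        # Prosodic features, derived from the sentence-level prosody.
        contour: str = "linear"               # "rising", "falling", "curved"
        pitch_start_hz: float = 120.0
        pitch_end_hz: float = 110.0
        intensity_db: float = 0.0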

This low level phonetic representation is then passed to the next level of the system, the synthesis-by-rule module. This level also has its own databases which permit the respecification of each line of low level code into a time-varying sequence of formant frequencies, bandwidths, gains, aperiodic/periodic source mix ratios, pitch, glottal waveform shape, etc. For example, a vowel upon entry to this module is specified as a point on either the rounded or unrounded cardinal vowel plane. A vowel formant database specifies each of the cardinal vowels in terms of the centre frequencies and gains of each of the five formants. A mathematical algorithm allows the conversion of the vowel specification from a point on a cardinal plane to a point in five dimensional formant space (ie. a point specified in terms of the five formants). A difficulty at this level is the accurate specification of the transitions between two segments. Some parameters require slow transitions and some require fast transitions, some require linear transitions and some require curved transitions. The selection of the appropriate transition slopes for each parameter is made by a combination of rules and database entries.
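
The conversion from a point on the cardinal plane to formant values can be illustrated by simple bilinear interpolation between stored cardinal vowels. In this Python sketch only four corner cardinals are used and the formant values are rough textbook figures; the actual system interpolates with reference to all 16 cardinals and stores formant gains as well as frequencies.

    import numpy as np

    # (backness, openness) -> (F1..F5) in Hz; illustrative values only.
    CARDINALS = {
        (0.0, 0.0): np.array([240, 2400, 3000, 3500, 4500]),  # CV1 [i]
        (0.0, 1.0): np.array([850, 1610, 2500, 3500, 4500]),  # CV4 [a]
        (1.0, 1.0): np.array([750,  940, 2500, 3500, 4500]),  # CV5 [ɑ]
        (1.0, 0.0): np.array([250,  595, 2400, 3500, 4500]),  # CV8 [u]
    }

    def vowel_formants(backness, openness):
        """Bilinear interpolation over the four corner cardinals."""
        f_i, f_a = CARDINALS[(0.0, 0.0)], CARDINALS[(0.0, 1.0)]
        f_u, f_A = CARDINALS[(1.0, 0.0)], CARDINALS[(1.0, 1.0)]
        front = (1 - openness) * f_i + openness * f_a   # along the front edge
        back = (1 - openness) * f_u + openness * f_A    # along the back edge
        return (1 - backness) * front + backness * back

    # eg. "half way between cardinals 1 and 2, retracted a little":
    print(vowel_formants(backness=0.15, openness=0.17))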

The output of the synthesis-by-rule module is a sequence of acoustic parameters which become, finally, the input to the synthesiser, which then translates them into a time varying waveform (again using certain rules and a database). This waveform is (in the MU-TALK synthesiser) in a digital form and so must first be converted into a continuously varying voltage (ie. digital-to-analog conversion). This voltage drives the electromagnet of a loudspeaker, producing a continuously varying magnetic field which causes the speaker cone, and in turn the air, to vibrate in the same pattern, producing sound which hopefully resembles the intended speech.

vi) Concatenation Synthesis

Concatenation of small units of natural speech (or of parameters derived from natural speech) is now the basis of the majority of commercial speech synthesis systems. Such concatenation systems very often utilise diphones (which extend from the steady-state portion of one phoneme to the steady-state portion of the next phoneme). Other sized units, including triphones, demi-syllables, words, etc have also been used and in some systems a mix of different sized units is used. Details of concatenation synthesis will form the focus of another topic.
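
As a preview of that topic, a bare-bones diphone concatenator can be sketched as follows in Python; the waveform dictionary, the diphone naming scheme and the 10 ms linear cross-fade at each join are illustrative assumptions (real systems, eg. those using PSOLA, also modify pitch and duration at this point).

    import numpy as np

    def concatenate_diphones(diphones, units, fs=16000, fade_ms=10):
        """Join the waveforms units[d] for each diphone name d in `diphones`.
        Each unit is assumed cut from steady state to steady state, so the
        joins land in steady-state regions and a short cross-fade hides them."""
        fade = int(fs * fade_ms / 1000)
        ramp = np.linspace(0.0, 1.0, fade)
        out = units[diphones[0]].copy()
        for name in diphones[1:]:
            nxt = units[name]
            out[-fade:] = out[-fade:] * (1 - ramp) + nxt[:fade] * ramp
            out = np.concatenate([out, nxt[fade:]])
        return out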

The newer diphone-concatenation version of the Macquarie University TTS system, known as "MU-TALK", will also be described in a separate topic.

Other concatenation techniques and systems, such as PSOLA and MBROLA, will also be examined in another topic.

vii) References

  1. Bernard, J.R., 1971, "MAC the speech synthesiser", University News, No. 44, Nov. 1971, Macquarie University.
  2. Clark, J.E., 1979, Synthesis by Rule of Australian English Speech, PhD Thesis, Macquarie University
  3. Clark J.E., Summerfield C.D. & Mannell R.H., 1986, "A high performance digital hardware synthesiser", Proceedings of the First Australian Conference on Speech Science and Technology, Canberra, Nov. 1986. pp 342-347
  4. Clark, J.E., Mannell, R.H., & Ostry, D., 1987, "Time and frequency resolution constraints on synthetic speech intelligibility", Proceedings of the Eleventh International Congress of Phonetic Sciences, Tallinn, Estonia, Aug. 1987.
  5. Cooper, F.S., Liberman, A.M., and Borst, J.M., 1951, "The intraconversion of audible and visible patterns as a basis for research in the perception of speech", Proc. National Academy of Science, 37, 318-325
  6. Cooper, F. S., 1953, "Some Instrumental Aids to Research on Speech", Report on the Fourth Annual Round Table Meeting on Linguistics and Language Teaching, Georgetown University Press, pp. 46-53.
  7. Dudley, H.W., 1939, "The vocoder", Bell Labs Record, vol. 18, pp. 122-126.
  8. Dudley, H., Riesz, R.R., and Watkins, S.S.A., 1939, "A synthetic speaker", Journal of the Franklin Institute, 227, 739-764
  9. Flanagan, J. L., 1972, Speech Analysis, Synthesis, and Perception, Springer-Verlag, New York
  10. Kempelen, W. von, 1791, Mechanismus der menschlichen Sprache nebst Beschreibung einer sprechenden Maschine, J.V. Degen, Vienna
  11. Kratzenstein, C.G., 1782, "Sur la formation et la naissance des voyelles", Journal de Physique, 21, 358-380 (reporting on work carried out in 1779)
  12. Lawrence, W., 1953, "The synthesis of speech from signals which have a low information rate", in Communication Theory, Butterworth, London, 460-469
  13. Mannell R.H., & Clark J.E., (1986) "Text-to-speech rule and dictionary development", Proceedings of the First Australian Conference on Speech Science and Technology, Canberra, Nov. 1986. pp 14-19
  14. Mannell R.H., & Clark J.E., (1987) "Text-to-speech rule and dictionary development", Speech Communication 6, 1987, pp 317- 324.
  15. Riesz, R. R., 1930, "Description and Demonstration of an Artificial Larynx," J. Acous. Soc. Amer., 1, 27