
Department of Linguistics

Evaluation of Speech Synthesis Systems

Robert Mannell

TTS and other speech synthesis systems have been with us for quite a long time now, and many claims have been made in the literature about the quality of particular systems. Fortunately, the days when the following type of claim was made have now passed:

"Informal listening tests were carried out and the system was found to produce high quality speech"
Paraphrased from numerous sources

The most reliable methods of speech synthesis evaluation rely on measurements of the perceptual performance of human listeners (often referred to as "subjective" tests of speech quality and intelligibility). There is also a strong demand today for so-called "objective" evaluation of text-to-speech systems, driven by a desire to determine the quality and intelligibility of a speech synthesis system, or other speech technology, without needing to use significant numbers of human listeners. Interest in the evaluation of speech synthesisers is not new, however, as can be seen in papers from the early 1980s by David Pisoni and John Clark (see the bibliography below), as well as a number of papers that pre-date them. See Mannell (1984) for a bibliography and a review of this early material.
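To make the "subjective" side of this distinction concrete, the following Python sketch shows one simple way that responses from a listening test might be scored: a word-identification score as a rough intelligibility measure, and a mean opinion score as a rough quality measure. The sentence, listener transcriptions, rating scale and scoring scheme are hypothetical illustrations, not data or procedures from any of the studies cited on this page.

```python
"""Toy scoring of a subjective synthesis evaluation (illustrative only)."""

from statistics import mean


def word_accuracy(reference: str, response: str) -> float:
    """Proportion of reference words the listener reported correctly,
    scored position by position (a deliberately simple scheme)."""
    ref = reference.lower().split()
    resp = response.lower().split()
    correct = sum(1 for r, h in zip(ref, resp) if r == h)
    return correct / len(ref)


# Hypothetical listener transcriptions of one synthesised sentence.
reference_sentence = "the ship sailed into the harbour at dawn"
transcriptions = [
    "the ship sailed into the harbour at dawn",
    "the sheep sailed into the harbour at dawn",
    "the ship failed into the harbour at dawn",
]

# Hypothetical opinion ratings on a 1 (bad) to 5 (excellent) scale.
quality_ratings = [3, 4, 3, 2, 4]

intelligibility = mean(word_accuracy(reference_sentence, t) for t in transcriptions)
mos = mean(quality_ratings)

print(f"Mean word accuracy (intelligibility): {intelligibility:.2%}")
print(f"Mean opinion score (quality): {mos:.2f} / 5")
```

An "objective" evaluation, by contrast, would try to predict scores like these from the speech signal itself, without running the listening test at all.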

In the early 1990s an international organisation, COCOSDA (The International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques for Speech Input/Output), was established. One of COCOSDA's initiatives is in the area of speech synthesis assessment. Speech synthesis evaluation was a major focus of the 3rd ESCA/COCOSDA Speech Synthesis Workshop, a satellite conference of ICSLP'98, and it continues to be a topic of interest at many conferences, including the annual Interspeech conference sponsored by the International Speech Communication Association (ISCA). ISCA also sponsors a group known as SynSIG (the Synthesis Special Interest Group), which has taken over the role of COCOSDA in recent years.

Bibliography

  1. Benoit C. and Pols L.C.W. (1992) "On the assessment of synthetic speech", In Bailly G. and Benoit C. (eds.) Talking Machines: Theories, Models and Designs, North-Holland, Amsterdam.
  2. Benoit C. (1997) "Section Introduction: Evaluation inside or assessment outside?", In van Santen J.P.H., Sproat R.W., Olive J.P., and Hirschberg J. (eds.) Progress in Speech Synthesis, Springer, New York.
  3. Carlson R., Granstrom B., and Nord L. (1992) "Segmental evaluation using the Esprit/SAM test procedures and monosyllabic words", In Bailly G. and Benoit C. (eds.) Talking Machines: Theories, Models and Designs, North-Holland, Amsterdam.
  4. Clark J.E. (1983) "Intelligibility comparisons for two synthetic and one natural speech source", J. Phonetics 11, 37-49.
  5. Falaschi A. (1992) "Segmental quality assessment by pseudo-words", In Bailly G. and Benoit C. (eds.) Talking Machines: Theories, Models and Designs, North-Holland, Amsterdam.
  6. Fourcin A. (1992) "Assessment of synthetic speech", In Bailly G. and Benoit C. (eds.) Talking Machines: Theories, Models and Designs, North-Holland, Amsterdam.
  7. Mannell R.H. (1984) Aspects of Speech Synthesis Performance, Honours dissertation, Macquarie University.
  8. Pisoni D.B., Nusbaum H.C., Luce P.A., and Schwab E.C. (1983) "Perceptual evaluation of synthetic speech: Some considerations of the user/system", IEEE ICASSP-83, 535-538.
  9. Pols L.C.W. and Jekosch U. (1997) "A structured way of looking at the performance of text-to-speech systems", In van Santen J.P.H., Sproat R.W., Olive J.P., and Hirschberg J. (eds.) Progress in Speech Synthesis, Springer, New York.