Department of Linguistics

Speech and Language Technology: An Introduction

Robert Mannell

Speech and language technologies have been around for a very long time, and have had an enormous impact upon human society.

Early language technology could be said to include the earliest writing materials and, much later, the printing press (Gutenberg, ca. 1450). Language technology of this genre eventually evolved into modern printing technology, including typewriters (Mill, 1714, but the first practical one, Sholes, 1867) and word processors. The telegraph (Morse, 1844) can also be described as a form of language technology, as its function was the transmission of text over great distances, greatly influencing the conduct of government, business and war. This was followed by the radio (Hertz, 1887; Marconi, 1896), which was restricted to non-speech signals such as Morse code until 1915, when AT&T broadcast speech coast-to-coast (USA) for the first time.

Primitive examples of speech technology are less obvious and undoubtedly less influential, but include early megaphones and speaking tubes of similar vintage to the earliest writing technology. More recently, in the 18th and 19th centuries, the earliest mechanical and pneumatic speech synthesis devices came into being, but they had no impact on everyday life. The first form of speech technology to exert a fundamental influence on society was the telephone (Alexander Graham Bell, 1876), followed in 1915 by the first successful long distance transmissions of speech over the radio. In parallel with these developments was the evolution of the phonograph (Edison, 1877), which by the 1920s had become a mass medium. The tape recorder was invented by Poulsen in 1898 and allowed speech to be recorded onto steel piano wire. This was not replaced by magnetic tape until 1927 (O'Neill, using magnetised paper), and commercially viable tape recorders based on magnetised plastic tape did not appear until the 1930s. Electronic speech synthesis appeared in the 1920s with simple electrical resonators simulating sustained vowels. By the late 1930s complete electronic speech synthesis systems had been devised which produced speech from data extracted from natural speech (the "Vocoder", Dudley, 1939) or which worked via complex manipulations performed by a human operator (the "Voder", Dudley, 1939). Of these two devices, the Voder was little more than a side-show device which was very popular at the 1939 New York World's Fair. The Vocoder, on the other hand, found practical use in telephony as a means of increasing the amount of speech that could be transmitted through the telephone network.

It should be readily seen that the practical forms of language technology referred to above are all tools which assist in the reproduction of human-composed text. Such tools have continued to evolve with the rapid development of computer-based word processing and desktop publishing over the past 20 years. The practical forms of speech technology listed above were all speech transmission systems or speech recording and replay systems which, in both cases, converted human speech into a form that could be stored or transmitted with the intention that the speech be heard at a later time or at a distant location. Such speech transmission and recording systems have continued to develop rapidly, and this process has accelerated recently as commercial computer-based sound systems have become increasingly available. These types of speech and language technology are referred to here as "passive" forms of the technology, as the technology plays no part in the interpretation or original generation of the speech or text. The "passive" forms of speech and language technology are outlined in figure 1.

Figure 1: "Passive" speech and language technology.

The remainder of this paper will concentrate specifically on:

  1. Computer-based speech and language technology which has as its goal the generation of speech or text by the computer rather than the transmission or replay of human-generated speech or text.
  2. The computer recognition of speech (ie. the conversion of human speech into text) or the computer understanding of speech or text, rather than the simple transmission of human-generated speech or text.

In other words, the paper will concentrate upon systems in which at least some of the linguistic processing of speech or text is carried out by the computer, and which might be conveniently referred to as "active" forms of speech and language technology (see figure 2). All modern forms of "active" speech and language technology are computer-based, existing either on general purpose computers or on specialised digital signal processing chips (although sometimes digital technology is coupled with analog devices). Recording, transmission and reproduction systems will only be considered when they interface with such systems (eg. when speech recognition is interfaced to a word processor to produce a dictation system).

Figure 2: "Active" speech and language technology.

The field of speech technology currently has three major foci: speech synthesis (including text-to-speech), speech recognition (commonly called voice recognition) and speaker recognition. Considerable research also continues in the fields of speech transmission, speech coding (to allow more speech to pass through the telecommunications network) and speech encryption (for the secure transmission of speech).

Speaker recognition or speaker verification systems are quite well developed at present and, as their title suggests, permit the identification of individuals by analysing their voices. The most reliable systems work well in quiet conditions but tend to be unreliable in noisy conditions (eg. voice identification rather than PINs at street-side ATMs). They must be trained on each speaker for whom identification is required (this requirement is not likely to change in the future, given the nature of the task). Training needs to be carried out under a variety of voice conditions, otherwise a speaker may not be identified if he or she has a cold. They work most reliably when the person to be identified is required to utter a set phrase. They are best at identifying a speaker as a particular member of a pre-trained set of speakers, but tend to be unreliable at determining that a speaker does not belong to that pre-trained set. In other words, employees may be reliably discriminated from one another, but an outsider is likely to be erroneously identified as the employee with the most similar vocal characteristics. Such systems are more accurately called speaker discrimination systems rather than speaker identification systems. It must be stressed that the popular notion of a "voice print" is misleading, as it implies that vocal patterns are as reliable as fingerprints, which are supposed to be unique to their owners. In a country the size of Australia there may well be many people (probably hundreds) with indistinguishable "voice prints". What must be considered is the low probability that a person with unauthorised access to another's password will also possess an indistinguishable "voice print" and thus be able to break into the system illegally. Much current work is being carried out on the questions of speaker recognition in noise and of the rejection of speakers who do not belong to the set of trained speakers.
Speaker identification systems are already available but work best as security systems when they are coupled with some other form of security. Spoken passwords may be a workable option as they combine passwords with speaker recognition. Of course, spoken passwords can be overheard, but an unauthorised user's voice must then also match the authorised user's vocal patterns to a very high degree before a security breach can occur. Spoken passwords also require the additional complexity of speech recognition, or the conversion of speech into text. It is likely that speaker recognition will routinely be incorporated into computer security systems in the near future.
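The closed-set discrimination behaviour described above can be sketched with a toy verifier. Everything here is invented for illustration: real systems compare spectral feature summaries, not three-number vectors, and the names and threshold are arbitrary.

```python
import math

def cosine(a, b):
    # Cosine similarity between two feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical enrolled "voice templates" for a pre-trained set of speakers.
enrolled = {
    "alice": [0.9, 0.1, 0.3],
    "bob":   [0.2, 0.8, 0.5],
}

def identify(sample, threshold=0.95):
    # Closed-set discrimination: always pick the most similar enrolled speaker.
    best = max(enrolled, key=lambda name: cosine(sample, enrolled[name]))
    # Open-set rejection relies on a threshold -- the unreliable part in practice.
    return best if cosine(sample, enrolled[best]) >= threshold else None
```

With a lax threshold an outsider is simply assigned the most similar enrolled speaker, which is precisely the mis-identification risk noted above.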

Speech recognition (most commonly, but inaccurately, called voice recognition) is the process of recognising strings of words (ie. text) in the continuous flow of speech sounds. This process is extremely complicated as there are no clear acoustic boundaries between words. Further, the acoustic separation of individual speech sounds ("phonemes") is in most cases completely non-existent. The acoustic cues to individual speech sounds overlap with those of similar sounds to such an extent that usually numerous candidate sounds of varying probability are identified, rather than a single sound. This process, when repeated for the many sounds in continuous speech, often results in an extremely large number of possible combinations. The combinations can be reduced by determining which sequences of phonemes do not make up sequences of words. This can still result in errors, as in normal speech we constantly delete sounds, insert sounds and modify (or "assimilate") sounds so that they become more like other sounds. Further, since the actual boundaries between words are not clear, it is very difficult to decide whether two adjacent phonemes occur in the same word or in adjacent words.
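The combination explosion and its pruning by a lexicon can be sketched in miniature. The candidate phonemes, probabilities and two-word lexicon below are all invented for illustration:

```python
import math
from itertools import product

# Hypothetical candidate phonemes (with probabilities) at each position,
# as a recogniser might produce for one ambiguous stretch of speech.
candidates = [
    [("r", 0.6), ("l", 0.4)],
    [("ei", 0.9)],
    [("n", 0.7), ("m", 0.3)],
]

# A tiny lexicon used to reject phoneme sequences that form no known word.
lexicon = {("r", "ei", "n"): "rain", ("l", "ei", "n"): "lane"}

# Enumerate every combination -- the count grows multiplicatively with length...
sequences = [
    (tuple(p for p, _ in combo), math.prod(pr for _, pr in combo))
    for combo in product(*candidates)
]

# ...then keep only the sequences that correspond to real words.
words = [(lexicon[seq], pr) for seq, pr in sequences if seq in lexicon]
```

Even this three-position fragment yields four candidate sequences, of which only two survive lexical pruning; in continuous speech the unpruned space is astronomically larger.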

One way of improving the success rate of a speech recognition system is to require the speaker to pause between each word so that the boundaries between words are made explicit. Such systems typically attempt to match spoken words with an internal library of words, and can be either speaker-dependent or speaker-independent. Speaker-dependent systems are "trained" for each speaker that needs to communicate with the system. This typically involves the speaker speaking the word list, or some other training list, into the recogniser. Speaker-independent systems have been pre-trained on a large number of speakers, so new speakers don't need to train the system to their voices. Such systems work best on a dialectally homogeneous population of speakers. For example, American English systems have a higher error rate with speakers of Australian English than with speakers of American English. This is a major problem for a multicultural society such as Australia and will probably be solved over the next ten years or so by sub-systems that first identify the dialect and then switch to the appropriate dialect-oriented speech recognition sub-system. Systems can be limited to restricted word lists (closed sets), limited to specific semantic "domains" (eg. student enquiries, weather, etc.), or designed to identify an unrestricted range of words (open sets). Restricted word list systems are more accurate than restricted domain systems, which are, in turn, more accurate than unrestricted systems.
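Isolated-word matching against an internal library can be sketched as nearest-template classification. The word list, feature vectors and distance measure below are invented for illustration; real systems use richer acoustic features and alignment techniques:

```python
def distance(a, b):
    # Euclidean distance between two feature vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Hypothetical speaker-dependent library: during training, each word in the
# closed list is stored as a feature template for this one speaker.
templates = {
    "yes":  [1.0, 0.2],
    "no":   [0.1, 0.9],
    "stop": [0.8, 0.8],
}

def recognise(utterance):
    # Isolated-word recognition: the pause-delimited utterance is matched
    # whole against every stored template; the closest wins.
    return min(templates, key=lambda word: distance(utterance, templates[word]))
```

A speaker-independent system would replace the per-speaker templates with averages over many training speakers, which is why it degrades on dialects unlike those it was trained on.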

A number of speech recognition systems are currently on the market. Many are isolated-word, speaker-dependent, restricted word list systems, and such systems are typically very accurate (eg. 98-99%). Some speaker-independent systems are also commercially available. They are typically restricted to very small word lists (eg. the digits 0-9) and can be very reliable for this small word list if the words are uttered in isolation (eg. 8-5-0-7-1-1-1), but the error rate increases for dialect groups quite distinct from the group that the system was trained on. (NB: some work has already been carried out in Australia on Chinese, Vietnamese, Italian and Greek varieties of English.) A number of continuous speech systems are now becoming available. Such systems are most reliable when they are speaker-dependent, and these typically require training sessions of considerable duration. Typical performance on current systems is around 90-95% accuracy, and the speaker is often prompted by the system to resolve confused identifications. All existing systems only work well in quiet conditions and so are not particularly suitable for open-plan offices.

There is a great deal of research being carried out in Australia (including at Macquarie University) on the resolution of these many issues. It is now becoming clear that the ultimate feasibility of speaker-independent, unrestricted, continuous speech systems with success rates in at least the mid-90% range is dependent upon much more detailed consideration of the phonetics and linguistics of the speech perception and speech understanding processes that occur in humans. It seems that even the cleverest computational systems (eg. statistical models, neural networks, etc.) are not capable of providing accuracy much greater than around 95% for these unrestricted systems (even when the target population is dialectally homogeneous). All commercial and most research speech recognition systems make no attempt to "understand" what is being said, but only attempt to convert the speech into strings of words (text), and so are really speech-to-text systems. It is now clear that many of the problems that occur in the final selection of words and word strings from the many competing possibilities can only be overcome by some degree of "speech understanding" being incorporated into the systems. For example, the correct solution often occurs in the set of possibilities but is rated as having a lower probability than incorrect strings of words which are given high probabilities on acoustic-phonetic grounds. If the system is able to "understand" the semantic context or "domain" of the immediately preceding speech, then it may be able to reject strings with higher acoustic-phonetic probability but inappropriate semantic focus in favour of a more semantically appropriate alternative. This can be achieved by incorporating sophisticated, and currently embryonic, "speech understanding" (artificial intelligence) systems. Alternatively, one can "cheat" by restricting the system to a limited domain of discourse. It then becomes much more feasible to determine whether individual words or word strings are appropriate to the expected domain. For example, if the domain is limited to weather reports then the system would, when confronted with the two possibilities "rain" and "lane", select the first word as being more appropriate to the domain. Such systems are currently under development in a number of laboratories.
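The "rain"/"lane" weather-report example can be sketched as a simple rescoring step. The acoustic scores and the domain prior below are invented numbers chosen purely to illustrate the mechanism:

```python
# Hypothetical acoustic scores: "lane" narrowly wins on sound alone.
acoustic_scores = {"rain": 0.45, "lane": 0.55}

# An assumed domain prior for a weather-report system.
weather_prior = {"rain": 0.9, "lane": 0.1}

def rescore(acoustic, prior, floor=0.01):
    # Weight each candidate's acoustic-phonetic probability by its
    # appropriateness to the restricted domain, then pick the best.
    combined = {w: p * prior.get(w, floor) for w, p in acoustic.items()}
    return max(combined, key=combined.get)
```

Here the domain weighting overturns the purely acoustic ranking, exactly as described above.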

A few restricted-domain continuous speech systems requiring at least some speaker training are now available. Speaker-independent versions (no individual training) of such systems are also beginning to emerge.

Currently, the least developed area of speech and language technology is language understanding. Such systems can be divided into speech understanding systems, where the input is speech, and text understanding systems, where the input is text. Speech understanding systems are very complex, as the "understanding" component is used to assist the process of speech recognition while at the same time having only a tentative text at its disposal in its quest for the meaning of that text. There are currently only very early, experimental and incomplete speech understanding systems. It is unlikely that systems able to extract the meaning of continuous, unrestricted, natural spoken language will become available in the next 10 or more years. Text understanding systems are not much more advanced. There has, however, been considerable success with limited systems which concentrate on only certain aspects of the semantics of language or which deal only with language from limited domains. Some systems are fairly successful at understanding simple declarative statements which consist of single simple clauses and which are restricted to simple domains. Isolated command word "understanding" is, in contrast, almost trivial if the system is limited to a closed set of commands (in much the same way that computer programming languages are limited to small lists of commands and simple syntax). The challenge lies in the "understanding" of unrestricted natural text or speech. The compromise solution, for the foreseeable future, is to limit the human speaker or writer to semi-natural language consisting of a simplified and very consistent syntax (a sub-set of that of natural language) used within a limited domain of knowledge. This will permit semi-natural interaction with a computer or other computerised systems. Alternatively, a system can retain natural language input but be restricted to a limited domain.
There is a lot of current work on spoken dialogue systems which focus on a single semantic domain (eg. sports events, airline ticketing, etc.) and considerable progress is being made in the development of commercial versions of such domain-limited systems.
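The almost-trivial, closed-set command "understanding" described above reduces to table lookup plus argument slots, much like a programming language's command parser. The command and action names below are hypothetical:

```python
# A closed command set with fixed syntax: "understanding" here is nothing
# more than mapping a known verb to an action (names are illustrative).
COMMANDS = {"call": "dial_number", "play": "start_playback", "stop": "halt"}

def understand(utterance):
    # Simple fixed syntax: first word is the command, the rest are arguments.
    words = utterance.lower().split()
    verb, args = words[0], words[1:]
    if verb not in COMMANDS:
        raise ValueError("unknown command: " + verb)
    return {"action": COMMANDS[verb], "arguments": args}
```

The contrast with unrestricted language is stark: nothing here copes with paraphrase, ellipsis or any utterance outside the closed set.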

Text generation systems are a better-developed area of natural language processing. The goal is the production of text from some internal representation of meaning. When coupled with a text-to-speech system, a computer is able to produce speech from internal meaning or knowledge representations. There has been considerable success in the development of experimental text generation systems which are able to produce apparently natural text within a limited knowledge domain. Work on these issues is being carried out at Macquarie by two groups, one in the Linguistics Department and one in the Computing Department. In contrast, systems with unlimited domain have generally been avoided in most experimental work to date. Currently, a great deal of attention is being given to the use of text generation systems as the output front-end to computer database query systems. A computer database can be considered a limited domain of knowledge from which meaning can be extracted in response to user queries. A text generation system allows for flexible natural language output from such a system in response to novel and possibly unexpected queries. When coupled with text-to-speech this results in a system which would enable spoken database query responses over the telephone (in response to touch-phone codes or isolated word command recognition). Such systems are currently in commercial development and are beginning to emerge on the marketplace.

Machine translation research has been carried out for about 40 years now and has proven far more difficult than many early researchers predicted. Essentially, the ideal machine translation system consists of a speech or text understanding system which translates the incoming language into a language-independent representation of meaning. This can then form the input to a text generation system which produces text (and potentially speech) in the target language. The definition of representations which are independent of the idiosyncrasies of the source or target language forms a major research focus of laboratories in Europe, where an international collaborative team is developing a multi-language translation system. Commercial translating systems are much simpler than this and are generally specific to a particular pair of languages. They tend to translate word-by-word and are not very successful with complex sentences or with the many sentences containing idiomatic phrases. More complex systems may start appearing commercially in the next few years, although they are likely to be domain dependent initially and restricted to sentences with simple clause structures.
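The interlingua idea (understand into a language-independent frame, then generate) can be sketched in miniature. The toy parser handles only a bare subject-verb-object clause, and the French verb glosses are illustrative:

```python
def parse_english(sentence):
    # "Understanding": map a simple subject-verb-object clause onto a
    # language-neutral meaning frame (a toy stand-in for an interlingua).
    agent, action, patient = sentence.rstrip(".").split()
    return {"agent": agent, "action": action, "patient": patient}

# Illustrative French verb glosses; a real system needs vastly more than this.
FRENCH_VERBS = {"sees": "voit", "loves": "aime"}

def generate_french(frame):
    # Generation: realise the same meaning frame in the target language.
    return frame["agent"] + " " + FRENCH_VERBS[frame["action"]] + " " + frame["patient"]
```

The frame in the middle is what makes the approach multi-lingual: adding a new target language means adding only a new generator, not a new parser per language pair. Idioms defeat the word-by-word commercial alternative precisely because they carry meaning only at the phrase level.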

Text-to-speech (TTS) and speech synthesis systems have formed a central part of research in the Linguistics Department at Macquarie for about 30 years. TTS is the most mature of the "active" speech and language technologies. It is now possible to produce speech reliably (99% accuracy) from unrestricted text. The main issue in TTS research is the naturalness of the speech. Most systems fall down in the area of prosody, which entails the accurate modelling of rhythm and intonation. This is particularly problematic because accurate prosody requires an understanding of the grammar of the text and its intended meaning, including which words are intended to be the focus of the sentence. To achieve this with complete success, it is necessary to incorporate subsystems which examine the grammar and the meaning (ie. text understanding systems). Such systems are now being examined experimentally but are still far from commercial realisation. Most systems, including the Macquarie systems, usually adopt a much simpler model of text grammar and simply assume that the model is met when producing sentence rhythm and intonation. This produces prosody that is acceptable for most simple statements. Such an approach is perfectly acceptable for database query response systems, which would usually be expected to respond using simple statements. When TTS is added to the "front-end" of a text-generation system it potentially has the added advantage of an already known grammatical structure and semantic intent. TTS systems are now abundantly available commercially. The quality of such systems varies enormously, from almost unintelligible systems to the very intelligible American DECtalk system. Most available systems utilise American English, but it is hoped that the Macquarie system may eventually become commercially available once some remaining problems are resolved.
This will provide the Australian market with an Australian English TTS system and it is intended that such a system will find application in a large number of telephone information services (amongst others). In the near future we will be looking at expanding the range of voices available. At least one other Australian English synthesiser is also under development and is (I believe) based on the MBROLA approach which will be dealt with elsewhere in this course. A number of laboratories are now looking at the issue of vocal affect (emotion). It is possible that by the end of this decade it will be difficult to distinguish synthetic voices from natural ones. Such synthetic voices will be male or female, will possibly simulate a number of dialect varieties and will be capable of sounding happy or sad. TTS systems will not only be of use in database query response systems but will also be useful in the teaching of first and second languages (as well as other computer-aided learning tasks) and in the important area of assisting the visually impaired in the use of computers.
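The simpler front-end model described above (pronunciation lookup plus an assumed simple-statement prosody) can be sketched as follows. The toy transcriptions, the letter-by-letter fallback and the "[fall]" contour marker are all invented for illustration:

```python
# Toy pronouncing dictionary (transcriptions invented for illustration).
LEXICON = {"hello": "h @ l ou", "world": "w @@ l d"}

def to_phonemes(text):
    # Dictionary lookup with a crude letter-by-letter fallback for
    # out-of-dictionary words (real systems use letter-to-sound rules).
    return [LEXICON.get(w, " ".join(w)) for w in text.lower().split()]

def add_default_prosody(phoneme_words):
    # Assume every input is a simple declarative statement: mark a falling
    # intonation nucleus on the final word ("[fall]" is an invented marker).
    return phoneme_words[:-1] + [phoneme_words[-1] + " [fall]"]
```

The assumption that every sentence fits the simple declarative model is exactly why such systems sound acceptable for database responses but unnatural on questions or contrastive focus.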

In summary, it is clear that "passive" speech and language technologies are currently in an advanced state of development. Of the "active" forms of speech and language technology, the generation side (synthesis, TTS and text generation) is much more advanced than the recognition and understanding side of the technology. Some of this "active" technology is currently available commercially, and limited but useful versions of the remaining aspects of the technology will emerge over the next few years. Such systems will have a growing effect on the way we access computerised knowledge, and as the more fully realised systems become available early next century we will begin to interact with computers in that most human and natural of all media, natural speech and language. This technology will also have a significant impact on certain areas of computer assisted learning and teaching, particularly in the area of language teaching. Finally, this technology will have an increasingly liberating effect on the visually impaired and also on those with vocal motor impairments.