Department of Linguistics

Robert Mannell

Vowel Perception in Australian English

Please Note: For an explanation of the phonetic symbols used to represent the Australian English vowel phonemes, go to the Australian English Vowel Symbols page

The perception of vowels depends primarily on frequency domain cues. Vowels have stable configurations over rather long periods and are typically treated as fairly steady-state combinations of fine spectral detail, derived from the glottal source function, and a spectral envelope, derived from the vocal tract filter function, which shapes the spectrum into a series of peaks or formants (Fant, 1968). According to this theory, the pattern of these formants relative to each other (on the frequency axis, at a fixed point in time) gives us most of the cues we require to identify vowels and various voiced consonants such as the glides. Peterson and Barney (1952) examined the vowels of American English and the distribution of individual speakers' vowels in formant space. Although the various tokens of each vowel clustered together, there was nevertheless a certain amount of overlap between vowels with adjacent formant values. They found that when vowels are confused, they are usually confused with an adjacent vowel on the F1/F2 plane.

Australian English (Aus.E.) is a "non-rhotic" (no syllable-final /r/) dialect of English and is very similar at a phonological level to South Eastern Urban British English. The main differences between these two dialects (at least as far as vowels are concerned) are in the areas of phoneme realisation and phoneme selection. In other words, the vowel phoneme repertoire is more or less identical, but speakers of Australian English and British English pronounce the phonemes differently and sometimes select different phonemes when pronouncing the same words. Australian English has traditionally been described (Mitchell, 1946; Mitchell & Delbridge, 1965) as consisting of a continuum of varieties: "Cultivated", "General", and "Broad". The Broad end of the continuum is the most marked Australian form whilst the Cultivated end of the continuum tends towards the British English Received Pronunciation (RP) form (although Bernard (1970) claims it is nevertheless quite distinctly Australian). About two-thirds of the Australian population speak the General variety.

Like British English, Aus.E. has 11 monophthongs (ignoring schwa) and 5 closing diphthongs (see table 1, below). Both dialects, being non-rhotic, have a set of centring diphthongs (replacing /Vr/ sequences in rhotic dialects). The exact number and identity of the centring diphthongs are difficult to determine as Aus.E. is currently in the process of losing some of them. /ɔə/ has already been lost from Aus.E. (some time between Mitchell (1946) and Mitchell and Delbridge (1965a)). /ʊə/ is pronounced by some speakers as [oː] (this process seems to be happening word-by-word rather than globally). /eː/ is pronounced by many General and Cultivated speakers as [eə] and by many Broad and General speakers as [eː] (and they all hear [eə] and [eː] as /eː/, see below). /ɪə/ is also pronounced by some speakers as [ɪː] (made possible because /iː/ is usually pronounced [əi] by the same speakers) and [ɪː] is also heard by most Aus.E. listeners as /ɪə/.

Figures 1 and 2 are plots of Bernard's (Bernard, 1970; Bernard and Mannell, 1986) vowel production data for all 11 Australian English (Aus.E.) vowel monophthongs (excluding schwa) as produced by speakers of all three major varieties of Aus.E. There is no significant difference in the vowel formant values for each monophthong vowel phoneme across the three varieties of Aus.E., and so the three varieties are presented as single pooled two-standard-deviation ellipses on the two diagrams (the two-standard-deviation ellipses closely follow the actual data variation). It can be readily seen that /ʉː/ is very fronted in Aus.E. Further, /ʉː/ is pronounced with a more lip-neutral gesture than in British English and this is reflected in the fairly high F3 for this vowel. /ʉː/ and /ɜː/ overlap much less in F1/F2/F3 space than they do on either the F1/F2 or the F2/F3 planes and so the rounding distinction, although reduced, seems to be at least partly maintained. /ʊ/ and /oː/ are close to being minimally contrasted by the length feature. /e/ and /æ/ are higher than the equivalent vowels in British English. What is not evident in the diagrams is that /iː/ has a distinct central onglide for most speakers of Aus.E. (for example, /iː/ realised as [əi]), which is probably an important perceptual cue to this vowel.

Figure 1: F2/F3 plot (2 standard deviation ellipses) of the monophthongs produced by adult male speakers of Australian English (after Bernard & Mannell, 1986)

Figure 2: F1/F2 plot (2 standard deviation ellipses) of the monophthongs produced by adult male speakers of Australian English (after Bernard & Mannell, 1986)
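The two-standard-deviation ellipses used in Figures 1 and 2 can be illustrated with a minimal sketch. It assumes the ellipse is derived from the sample covariance of one vowel's formant pairs, with semi-axes equal to two standard deviations along the principal axes. This is a common construction, but the source does not state exactly how the published ellipses were computed. The function name and the example values are hypothetical and are not Bernard's data.

```python
import math

def two_sd_ellipse(tokens):
    """Return the centre, semi-axes and tilt of a 2 SD ellipse.

    tokens: list of (x, y) formant pairs in Hz for one vowel,
    for example (F2, F1). Assumes a covariance-based ellipse.
    """
    n = len(tokens)
    if n < 2:
        raise ValueError("need at least two tokens")
    mean_x = sum(x for x, _ in tokens) / n
    mean_y = sum(y for _, y in tokens) / n
    # Sample variances and covariance (n - 1 in the denominator).
    var_x = sum((x - mean_x) ** 2 for x, _ in tokens) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for _, y in tokens) / (n - 1)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in tokens) / (n - 1)
    # Closed-form eigenvalues of the 2x2 covariance matrix give the
    # variances along the major and minor axes.
    mid = (var_x + var_y) / 2
    spread = math.hypot((var_x - var_y) / 2, cov_xy)
    major_var = mid + spread
    minor_var = max(mid - spread, 0.0)
    # Angle of the major axis measured from the x axis.
    tilt = 0.5 * math.atan2(2 * cov_xy, var_x - var_y)
    return {
        "centre": (mean_x, mean_y),
        "semi_major": 2 * math.sqrt(major_var),
        "semi_minor": 2 * math.sqrt(minor_var),
        "tilt_deg": math.degrees(tilt),
    }

# Hypothetical (F2, F1) values in Hz, for illustration only.
example_tokens = [(2250, 300), (2300, 320), (2180, 290), (2350, 310), (2220, 330)]
ellipse = two_sd_ellipse(example_tokens)
```

Two vowels whose ellipses intersect on one plane may still be well separated once a third formant is taken into account, which is the situation described above for /ʉː/ and /ɜː/.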

This document, when examining vowel perception, focuses on the monophthongs. This is because the monophthongs suffice to illustrate most of the effects that frequency, intensity and temporal distortion can cause in either the impaired auditory system, or in a speech synthesis system. In Mannell (1994), the study was restricted to monophthongs for the following reasons:

i) The diphthongs are the main source of variation between the varieties of Aus.E. and so listener idiolect effects would be very likely to interfere with the perceptual effects of parametric manipulation.

ii) Most of the diphthongs form non-lexical items in /h_d/ frames and some of the orthographic forms of these nonsense words create difficulties for naive subjects that interfere with interpretation of what was perceived. (N.B. some of the monophthongs also form nonsense words in an /h_d/ frame, but their orthographic representations are quite straightforward.)

iii) The monophthongs are sufficient to demonstrate most of the types of frequency domain effects of parametric manipulation on vowel intelligibility. It is possible, however, that some conditions of extreme time resolution distortion may have significant effects on target 1 to target 2 transitions in diphthongs. Similar effects should also occur, however, in approximant-to-vowel transitions and examination of such tokens should be sufficient to deduce general effects that would also affect diphthongs.

Peterson and Barney's (1952) study of American English showed an overlap on the F1/F2 plane of adjacent vowel productions. As has been shown above, the same order of overlap is evident in Australian English vowel production. The Peterson and Barney (ibid.) study also showed that when perceptual problems occurred, vowels tended to be confused with adjacent vowels on the F1/F2 plane. It would be useful, for the purposes of the present study, to have a clear picture of how vowels are perceived on the F1/F2 plane in Aus.E. The present author, in another study (Mannell, 1988, 1995), examined the vowel perception of Aus.E. listeners. The subjects were asked, in a forced-choice experiment, to indicate which /h_d/ word or nonsense word they heard when listening to a list of tokens produced by a synthetic voice. The study examined a number of simulated voices ranging from a male to a female voice. Each subject, however, only listened to one voice. The aim of that study was to characterise Aus.E. vowel perception and also to examine the shift in vowel boundaries (vowel normalisation) that occurred when various aspects of the vocal quality (simulating a physiological continuum) were modified. The condition examined here has been found to have a typical pattern of vowel boundaries for simulated male voices and so should be of relevance to the present study which examines the parametric manipulation of a male voice. This voice had F3, F4 and F5 appropriate to an average male speaker of Aus.E. (using values from Bernard & Mannell, 1986).

Figures 3 and 4 summarise the results for the relevant condition (there were 30 subjects for this condition). The results are divided into two planes, short and long, as the vowels in the /h_d/ tokens were presented either short or long (150 and 300 msecs respectively, which are typical durations for Aus.E. vowels in /h_d/ frames uttered in citation form (Bernard & Mannell, 1986)). The actual tokens produced can be found at the grid intersections in the diagrams (i.e. 100 Hz spacing in F1 and F2). The diagrams show up to 3 contours for each perceived vowel. The area inside the inner contour (dark red fill) contains tokens perceived as that particular vowel by 75% or more of the subjects. The area between the next contour and the inner contour (light grey fill) contains tokens perceived as that vowel by 50-75% of the subjects. This 50% contour is the "predominance boundary" of Nearey (1977) within which a particular vowel percept predominates. The next contour is the 25% contour and the area it encloses is not shaded as this area is typically the area of perceptual overlap with adjacent phonemes.

Figure 3: Perceptual (F1/F2) space of Australian English subjects listening to a synthetic male voice producing SHORT vowels. 75% (dark red fill), 50% (light grey fill) and 25% identification contours are shown. (Mannell, 1988)

Figure 4: Perceptual (F1/F2) space of Australian English subjects listening to a synthetic male voice producing LONG vowels. 75% (dark red fill), 50% (light grey fill) and 25% identification contours are shown. (Mannell, 1988)
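The contour bands in Figures 3 and 4 amount to a simple tally per synthetic token. The following is a minimal sketch, assuming each token's forced-choice responses are available as a list of /h_d/ word labels. The function names and the example responses are hypothetical and are not the data of Mannell (1988).

```python
from collections import Counter

def identification_shares(responses):
    """Map each response label to the proportion of listeners who chose it."""
    if not responses:
        raise ValueError("no responses for this token")
    counts = Counter(responses)
    return {label: count / len(responses) for label, count in counts.items()}

def contour_band(share):
    """Name the contour band that a proportion (0.0 to 1.0) falls into."""
    if share >= 0.75:
        return "75% or more"      # inside the inner contour
    if share >= 0.50:
        return "50-75%"           # inside the predominance boundary
    if share >= 0.25:
        return "25-50%"           # typical region of perceptual overlap
    return "below 25%"

# Hypothetical responses from 30 listeners to one synthetic token.
token_responses = ["heed"] * 24 + ["hid"] * 6
bands = {label: contour_band(share)
         for label, share in identification_shares(token_responses).items()}
```

Applying this to every grid intersection (100 Hz steps in F1 and F2) gives, for each vowel, the set of tokens lying inside each contour.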

It should be mentioned that these results are for phonetically-trained (senior undergraduate) listeners. Results for phonetically-untrained subjects vary from the above in a number of ways. Firstly, perception of the data by untrained subjects resulted in a much weaker phoneme space for /iː/, /ʉː/ and /ɐ/. The /iː/ space was weaker in the sense that proportionally fewer subjects perceived each data point as /iː/ than was the case for the trained subjects, even though the actual boundary positions were similar. It seems likely that the untrained subjects were somewhat confused by the lack of the common onglide ([əi]). The /ɐ/ vowel space was much weaker as many of the short tokens were perceived as /ɐː/. This is probably due to the process of lexical access favouring common lexical items ("hard") over uncommon items and nonsense words ("hud"). The /ʉː/ vowel space is probably weak for similar reasons (subjects were prepared for the response "who'd" but nevertheless many naive subjects had problems with that orthographic form and used "hood" instead).

In general, confusions occurred within 100 Hz of the phoneme boundaries, and then with the adjacent phoneme. Also, sometimes a long vowel token was heard as a short vowel phoneme (/e/, /ɔ/ and /ʊ/ can be seen weakly in the long vowel space) and sometimes a short vowel token was heard as a long vowel phoneme (/oː/, /ɜː/ and /ʉː/ can be seen weakly in the short vowel space). Note that /æ/ is perceived strongly regardless of whether the token is long or short. This is consistent with Bernard's (Bernard, 1970; Bernard & Mannell, 1986) data, which indicate that /æ/ is the longest "short" vowel in Aus.E. and could be considered to be of intermediate (or neutral) duration. In the long vowel space there is a region of overlap between /e/ and /eː/ perceptions (both vowels averaging about 40% of identifications in that region, and each occasionally fluctuating above 50%). This is consistent with the observation (Mitchell & Delbridge, 1967; Bernard, 1970; Clark, 1989) that /eː/ is produced by many Australians as [eː] and that there is a trend towards this becoming the most common pattern (Cox, pers.comm.) amongst adolescents.

It is interesting to note that when a gap occurs between two vowels of a particular length, the tokens in that region are often perceived as belonging to a vowel phoneme of the inappropriate length. This is usually not a strong tendency, however, as can be seen in the central region of the short vowel space. Even though there is no short vowel appropriate to this region (except perhaps schwa, which ~25% of trained subjects indicated), the /ɜː/ percept never rose above 50%. For naive subjects, there were a few tokens with /ɜː/ percepts a little above 50% (presumably due to the absence of competition from a possible /ə/ response) but in general the pattern was similar. These results suggest a three-dimensional vowel perceptual space for Aus.E. vowels consisting of F1, F2 and duration (with a possible fourth dimension, F3), which should be mappable onto an acoustic/auditory space of time, frequency and intensity.

The most important temporal cue in vowels is the [short] vs [long] length opposition mentioned above. Bernard (1970) lists the nucleus durations for 19 Aus.E. vowels in nonsense syllables presented in both citation form and in a short sentence carrier. The short vowels have durations in the range of about 140 to 200 msecs, whilst the long vowels and diphthongs have durations of between about 260 and 320 msecs. Three of the four shortest vowels (160 msecs or less) are the three paired with a long vowel, and the differences between the lengths of the short and the long vowel of a pair are greater than 120 msecs. It is unlikely that a speech synthesiser would have a gross time resolution poor enough to cause confusion between the short and the long members of these pairs. It is important to note, however, that the lengths of these vowels are quoted for one environment only, and that vowel length is very much affected by environment. In continuous speech, phoneme duration would be significantly shorter for both long and short vowels than that which occurs for citation forms.

The lengths of the component portions of the Aus.E. vowels are also examined by Bernard (1970). He lists values for onglide, target and offglide for monophthongs, and onglide, target 1, medial glide, target 2, and offglide for diphthongs. The targets of the monophthongs range between 40 and 170 msecs and so will be resistant to all but the most extreme cases of time resolution degradation. The durations of the first target and between-target transitions of the diphthongs have similar values, but the second targets of these vowels range from 0 to 44 msecs in length and so might conceivably be affected by time resolutions of the order of 20 msecs. Some studies (e.g. Strange et al., 1976), on the other hand, have indicated that formant transitions to or from adjacent consonants are important for vowel perception, and as many such transitions are very fast, it is possible that a large reduction in synthesiser time resolution, or impaired auditory resolution, may obscure these cues and thus reduce vowel as well as consonant intelligibility.
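The relationship between segment duration and time resolution can be made concrete with a little arithmetic. The sketch below assumes, purely for illustration, that time resolution behaves like a sequence of non-overlapping analysis frames and that a segment is only reliably represented if at least one complete frame fits inside it. The function name is hypothetical and this framing is a simplification, not the model used in Mannell (1994).

```python
def full_frames(segment_ms, resolution_ms):
    """Count complete frames of length resolution_ms that fit in a segment.

    Assumes non-overlapping frames aligned with the start of the segment.
    """
    if resolution_ms <= 0:
        raise ValueError("resolution must be positive")
    if segment_ms < 0:
        raise ValueError("segment duration cannot be negative")
    return int(segment_ms // resolution_ms)

# Durations quoted in the text (msecs).
monophthong_targets = (40, 170)       # shortest and longest monophthong targets
diphthong_second_targets = (0, 44)    # shortest and longest second targets

# Frame counts at a time resolution of the order of 20 msecs.
frames_at_20 = {
    "monophthong targets": [full_frames(d, 20) for d in monophthong_targets],
    "diphthong second targets": [full_frames(d, 20) for d in diphthong_second_targets],
}
```

Under these assumptions even the shortest monophthong target spans two complete 20 msec frames, whereas a diphthong second target shorter than 20 msecs spans none, which is why the second targets are the components most likely to be affected.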

References

Bernard, J.R. (1970) "Toward the acoustic specification of Australian English", Zeitschrift für Phonetik, Sprachwissenschaft und Kommunikationsforschung, Band 23, Heft 2/3

Bernard, J.R. and Mannell, R.H. (1986) "A study of /h_d/ words in Australian English", Working Papers of the Speech, Hearing and Language Research Centre, Macquarie University

Harrington, J., Cox, F., & Evans, Z. (1996) "An acoustic study of broad, general and cultivated Australian English vowels", Australian Journal of Linguistics.

Mannell, R.H. (1988) "Perceptual space of male and female Australian English vowels", Proceedings of the Second Australian International Conference on Speech Science and Technology, Sydney, Nov. 1988, pp 22-27

Mannell, R.H. (1994) The Perceptual and Auditory Implications of Parametric Scaling in Synthetic Speech, Ph.D. dissertation, Macquarie University

Mannell, R.H. (1995) "Perceptual mapping and vowel normalisation", Proceedings of the XIIIth International Congress of Phonetic Sciences, Stockholm, Sweden, August 13-19, 1995

Mitchell, A.G. (1946) The Pronunciation of English in Australia, Angus and Robertson.

Mitchell, A.G., & Delbridge, A. (revised edition, 1965(a)) The Pronunciation of English in Australia, Angus and Robertson.

Mitchell, A.G., & Delbridge, A. (1965(b)) The Speech of Australian Adolescents, Angus and Robertson.

Nearey, T.M. (1977) Phonetic Feature Systems for Vowels, Ph.D. Dissertation, Univ. of Connecticut

Peterson, G.E. & Barney, H.L. (1952) "Control methods used in a study of vowels", JASA, 24, 175-184

Strange, W., Verbrugge, R.R., Shankweiler, D.P. & Edman, T.R. (1976) "Consonant environment specifies vowel identity", JASA, 60(1), 213-224