Department of Linguistics

ROBERT MANNELL - RESEARCH PAGES

Vowel Perceptual Space Mapping

Australian English Subjects

This project examines the perception, by native speakers of Australian English, of a large array of synthesised speech tokens. In a typical condition the speech tokens consist of "long" and "short" vowels in an /h_d/ context (although consonantal context is varied for some conditions). A large array of tokens is generated by a specially modified formant synthesiser. The tokens are uniformly spaced in two "vowel spaces". The two vowel spaces differ only in vowel duration ("long" versus "short"). Listening subjects are asked to identify each token (the details of exactly what they are asked to do vary from condition to condition). From the responses, a contour map is developed for each vowel phoneme that indicates, for each point in the two vowel spaces, the percentage of responses that specified that vowel phoneme. A composite contour map is then produced that displays the responses for all vowel phonemes.
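The percentage calculation behind each contour map can be illustrated with a minimal sketch. It assumes each response is recorded as a hypothetical (F1, F2, vowel label) tuple and that every response at a grid point counts towards that point's total; the function names, data layout and example values are illustrative only and are not taken from the project's own software.

```python
from collections import Counter, defaultdict


def identification_percentages(responses):
    """Return {vowel: {(f1_hz, f2_hz): percent}} from raw responses.

    `responses` is an iterable of (f1_hz, f2_hz, vowel_label) tuples,
    one per subject response (an assumed, hypothetical layout).
    """
    # Tally the vowel labels given at each grid point.
    counts = defaultdict(Counter)
    for f1_hz, f2_hz, vowel in responses:
        counts[(f1_hz, f2_hz)][vowel] += 1

    # Convert the tallies into one percentage map per vowel phoneme.
    maps = defaultdict(dict)
    for point, tally in counts.items():
        total = sum(tally.values())
        for vowel, n in tally.items():
            maps[vowel][point] = 100.0 * n / total
    return {vowel: dict(points) for vowel, points in maps.items()}


def contour_level(percent, levels=(75, 50, 25)):
    """Return the highest contour level (75, 50 or 25) reached, or None."""
    for level in levels:
        if percent >= level:
            return level
    return None


# Hypothetical data: four responses to a single token.
example = [(300, 2200, "i:")] * 3 + [(300, 2200, "I")]
example_maps = identification_percentages(example)
```

In this example the token at F1 = 300 Hz, F2 = 2200 Hz falls inside the 75% contour for "i:" and on the 25% contour for "I". Drawing the contours themselves would additionally require interpolation between grid points, which is not shown here.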

A typical pair of vowel space contour maps is displayed in figures 1 and 2. These two diagrams display the responses of phonetically trained subjects responding with Australian English vowel phoneme symbols. These particular vowel spaces have tokens located at each grid intersection point within the defined vowel space boundary. The tokens in this example are in an /h_d/ context. Long vowels are defined here as being 300 ms in length and short vowels as being 150 ms in length (for some conditions, these durations are varied). The tokens are confined to a frequency range typical of an adult male speaker. Higher formants in this condition model those of a male speaker. The F0 is 160 Hz, which was defined, for the purpose of this project, as being ambiguous with respect to speaker sex.

Figure 1: Perceptual (F1/F2) space of Australian English subjects listening to a synthetic male voice producing SHORT vowels. 75% (dark red fill), 50% (light grey fill) and 25% identification contours are shown. (Mannell, 1988)

Figure 2: Perceptual (F1/F2) space of Australian English subjects listening to a synthetic male voice producing LONG vowels. 75% (dark red fill), 50% (light grey fill) and 25% identification contours are shown. (Mannell, 1988)

This project has examined the effect of 10 independent variables in various combinations. The project commenced in 1988 and data collection was completed in 2008. Over 750 subjects have participated. Table 1 summarises most of the conditions in this project.

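| VWL | Frame | Contx | Phon | Len | Hi Fx | F0 | BW | Lo F3 | F4/F5 | Phon. Naive | #Pnts | #Subj | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 01 | M | h_d | phon | 300 | M | 160 | OK | NO | YES | NO | 224 | 20 | 1988 |
| 02 | F | h_d | phon | 300 | F | 160 | OK | NO | YES | NO | 360 | 20 | 1988 |
| 06 | M | h_d | ortho | 300 | M | 160 | OK | YES | YES | YES | 280 | 30 | 89-91 |
| 07 | F | h_d | ortho | 300 | F | 160 | OK | YES | YES | YES | 289 | 20 | 89-91 |
| 08 | M | h_d | ortho | 300 | M | 110 | OK | YES | YES | YES | 280 | 20 | 89-91 |
| 09 | F | h_d | ortho | 300 | F | 220 | OK | YES | YES | YES | 289 | 20 | 89-91 |
| 10 | M | h_d | ortho | 300 | none | 160 | OK | NO | NO | YES | 224 | 18 | 1991 |
| 11 | F | h_d | ortho | 300 | none | 160 | OK | NO | NO | YES | 230 | 18 | 1991 |
| 12 | M | h_d | ortho | 300 | none | 110 | OK | NO | NO | YES | 224 | 20 | 1991 |
| 13 | F | h_d | ortho | 300 | none | 220 | OK | NO | NO | YES | 230 | 19 | 1991 |
| 14 | red.F | h_d | ortho | 300 | F | 160 | OK | NO | YES | YES | 136 | 20 | 90-91 |
| 15 | red.F | h_d | ortho | 300 | F | 220 | OK | NO | YES | YES | 136 | 20 | 90-91 |
| 16 | red.F | h_d | ortho | 300 | F | 110 | OK | NO | YES | YES | 136 | 20 | 1991 |
| 17 | F | h_d | ortho | 300 | none | 110 | OK | NO | NO | YES | 230 | 19 | 1991 |
| 18 | M | h_d | ortho | 300 | none | 220 | OK | NO | NO | YES | 224 | 19 | 1991 |
| 19 | M | h_d | ortho | 300 | M | 110 | 400 | YES | YES | YES | 280 | 19 | 90-91 |
| 20 | M | h_d | ortho | 300 | M | 110 | 800 | YES | YES | YES | 280 | 20 | 1991 |
| 21 | M | h_d | ortho | 300 | M | 220 | OK | NO | YES | YES | 224 | 18 | 1991 |
| 22 | F | h_d | ortho | 300 | F | 110 | OK | NO | YES | YES | 230 | 18 | 1991 |
| 23 | red.M | h_d | ortho | 300 | M | 160 | OK | NO | YES | YES | 112 | 20 | 1991 |
| 24 | M | h_t | ortho | 180 | M | 110 | OK | NO | YES | YES | 224 | 20 | 2001 |
| 25 | M | h_t | ortho | 300 | M | 110 | OK | NO | YES | YES | 224 | 20 | 2001 |
| 26 | M | s_t | ortho | 180 | M | 110 | OK | NO | YES | YES | 224 | 20 | 2001 |
| 28 | M | h_d | phon | 300 | M | 110 | OK | NO | YES | NO | 224 | 20 | 2004 |
| 29 | M | h_d | ortho | 300 | M | 110 | OK | NO | YES | NO | 224 | 20 | 2004 |
| 30 | F | h_d | phon | 300 | F | 160 | OK | NO | YES | NO | 360 | 20 | 2005 |
| 31A | M(+F) | h_d | phon | 300 | M(+F) | 160 | OK | YES | YES | NO | 309 | 20 | 2005 |
| 31B | M(+F) | h_d | phon | 300 | M(+F) | 160 | OK | YES | YES | NO | 300 | 20 | 2006 |
| 32 | F | h_d | ortho | 300 | F | 220 | OK | YES | YES | YES | 289 | 19 | 2006 |
| 33 | M | h_d | ortho | 300 | M | 110 | OK | YES | YES | YES | 224 | 21 | 2006 |
| 34 | F(+M) | h_d | phon | 300 | F(+M) | 220 | OK | YES | YES | NO | 309 | 20 | 2006 |
| 35 | red.F | h_d | ortho | 300 | F | 110 | OK | NO | YES | YES | 136 | 20 | 2006 |
| 36 | red.F | h_d | ortho | 300 | F | 160 | OK | NO | YES | YES | 136 | 20 | 2006 |
| 37 | red.F | h_d | ortho | 300 | F | 220 | OK | NO | YES | YES | 136 | 20 | 2006 |
| 38 | F | h_d | ortho | 300 | F | 220 | OK | YES | YES | NO | 289 | 19 | 2007 |
| 39 | M | h_d | ortho | 300 | M | 160 | OK | YES | YES | YES | 280 | 20 | 2008 |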

Table 1: This table summarises the Australian English conditions tested in this project. The conditions have been given codes such as "VWL01" and an abbreviated version of these codes is shown in column 1. A typical condition consisted of 20 subjects, although this number varied between 18 and 30 (see the 2nd right-most column "#Subj"). The third right-most column ("#Pnts") indicates the number of tokens presented to the subjects in each condition. The remaining columns summarise the treatment of the independent variables in each of the conditions and these will be described in the text below. The blue shading of certain cells highlights less common independent variable settings. Missing numbers under the VWL column represent discarded conditions.

Frame   A "frame" defines the size of the vowel space for each condition. A "male" frame ("M") can be seen in figures 1 and 2, above, and is intended to represent an average male vowel production space (i.e. the range of possible vowel articulations for an average male vocal tract). A "female" frame ("F") similarly attempts to model the range of possible vowel articulations for an average female speaker. This space has a similar shape to the male frame but is significantly larger, with a maximum F2 of 3120 Hz and a maximum F1 of 1080 Hz. A "reduced female" frame ("red.F") is similar in size to a male frame but the token spacing is the same as for the female frame. A "reduced male" frame ("red.M") is reduced in size in the F1 and F2 dimensions relative to the male frame. The "red.F" conditions were repeated in 2006 utilising a better model of female F3. M(+F) and F(+M) represent conditions utilising a male or a female frame (and Hi Fx), respectively, followed by a number of tokens with the opposite vocal gender specification (to test a point normalisation hypothesis).
Contx   This variable indicates the consonantal context for each condition. The most common context is the often-used /h_d/ frame, but /h_t/ and /s_t/ frames have also been examined.
Phon   This variable indicates whether the subjects were instructed to provide an orthographic response ("ortho") or a phonetic response ("phon"). Obviously, phonetically untrained subjects responded only orthographically.
Len   This variable indicates vowel length. A value of "300" indicates that the long vowels had a duration of 300 ms and that the short vowels for such a condition had a duration of 150 ms. A value of "180" indicates that the long vowels had a duration of 180 ms and that the short vowels for such a condition had a duration of 90 ms.
Hi Fx   This variable indicates whether formants above F3 were present and whether they had values which modeled a male (lower F4 and F5) or a female (higher F4) vocal tract. As the synthesiser was limited to a 5000 Hz bandwidth, the female F5 was omitted from the model as it is greater than 5000 Hz.
F0   Three F0 values are used: 110 Hz (male), 160 Hz (neutral) and 220 Hz (female).
BW   For most conditions vowel formant bandwidths are calculated using the formula BW=A*(1.+FX/B) where A=50 and B=2000 and FX is the formant frequency. This results, for example, in a bandwidth of 100 Hz for a formant with a frequency of 2000 Hz. For two conditions (VWL19 and VWL20) the bandwidth was fixed to the much wider values of 400 Hz and 800 Hz for all formants at all frequencies.
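The bandwidth rule can be written as a short function. This is only a sketch of the formula as stated in the text: the function name and the `fixed_bw_hz` parameter (used here to represent the fixed-bandwidth conditions VWL19 and VWL20) are hypothetical and do not come from the synthesiser's own code.

```python
def formant_bandwidth(fx_hz, a=50.0, b=2000.0, fixed_bw_hz=None):
    """Return the bandwidth in Hz for a formant at frequency fx_hz.

    Implements BW = A * (1 + FX / B) with A = 50 and B = 2000 by default.
    If fixed_bw_hz is given (e.g. 400 or 800), that value is returned
    for every formant regardless of its frequency.
    """
    if fixed_bw_hz is not None:
        return float(fixed_bw_hz)
    return a * (1.0 + fx_hz / b)


# The worked example from the text: a 2000 Hz formant.
bw_2000 = formant_bandwidth(2000.0)  # 100.0 Hz
# A lower formant receives a narrower bandwidth.
bw_500 = formant_bandwidth(500.0)  # 62.5 Hz
```

Under this rule the bandwidth grows linearly with formant frequency, from 50 Hz at 0 Hz upwards, whereas the fixed settings apply one wide bandwidth to every formant.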
Lo F3   F3 was determined for most vowel tokens as a simple function of F2. That is, for each value of F2 there is a single value of F3. This approach is a reasonable model of F3 for all Australian English vowels except for /u:/ which, being a rounded central vowel has a lower F3 than the other central vowels, which are all unrounded. For some conditions, this is modeled by the provision of a second partial long vowel plane which is distinguished from the main plane by its consistently lower F3 values.
F4/F5   This variable indicates whether F4 and F5 are present. When not present they are modeled with a low gain and a broad bandwidth.
Phon. Naive   Some conditions use phonetically trained subjects whilst other conditions use untrained subjects. Phonetic training is defined minimally as a person who has at least completed our second year phonetics course (or equivalent) and who obtained a good result in the transcription assessments.
Forced Choice   Originally there was an additional independent variable "forced choice" but, regardless of the instructions, most participants left no blank responses or at most a very small number of them, so there was effectively no distinction between forced choice and non-forced choice conditions.
Year   The year(s) in which the data for this particular condition was collected. The time span of this project permits controlled examination of change in monophthong vowel perception in Australian English over a 19-year period.

Non-native Speakers of English and Native Speakers of Other Dialects of English

Non-native speakers of English (135 subjects), native speakers of other dialects of English (34 subjects) and male and female native speakers of Australian English (30 female and 30 male) were asked to perform a task identical to condition VWL06. They were then asked to record a list of /h_d/ tokens as elicited by orthographic prompts. Individual subject vowel spaces are being examined as well as composite vowel spaces for subjects with the same L1. This latter analysis is only being done on L1 groups with a reasonable number of subjects, such as Cantonese, Japanese and Korean as well as British and US English, and separately for each of male and female Australian English speakers. Perceptual centroids for each vowel are being compared to the equivalent produced vowel (following normalisation to an equivalent male vowel, where necessary). Data was also collected on L2 English speakers' experience with English.
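The text does not specify how a perceptual centroid is computed. One plausible reading, sketched below purely as an assumption, is a mean of the token positions in the F1/F2 plane weighted by the identification percentage each token received for the vowel in question. The function name, the data layout and the choice to average in Hz (rather than on an auditory scale) are all hypothetical.

```python
def perceptual_centroid(vowel_map):
    """Return the (F1, F2) centroid in Hz of one vowel's response map.

    `vowel_map` maps (f1_hz, f2_hz) grid points to the percentage of
    responses that named this vowel at that point (assumed layout).
    Returns None if the vowel received no responses anywhere.
    """
    total_weight = sum(vowel_map.values())
    if total_weight == 0:
        return None
    # Weight each grid point by its identification percentage.
    f1 = sum(p[0] * w for p, w in vowel_map.items()) / total_weight
    f2 = sum(p[1] * w for p, w in vowel_map.items()) / total_weight
    return (f1, f2)


# Hypothetical map: two neighbouring tokens, each identified 50% of the time.
example_centroid = perceptual_centroid({(300, 2000): 50.0, (400, 2200): 50.0})
```

With equal weights the centroid lies midway between the two tokens, at F1 = 350 Hz and F2 = 2100 Hz. A centroid of this kind could then be compared with the mean formant values of the same subject's (normalised) productions of that vowel.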

Research Questions

The main research questions are:

  1. To what extent does each of vowel frame size, fundamental frequency and higher formant frequencies contribute to vowel perceptual normalisation?
  2. Which model of normalisation best accounts for the differences between the male and female vowel spaces?
  3. To what extent and in what way do lexical access effects (lexicality, lexical frequency, neighbourhood density) influence these vowel perceptual patterns? That is, how do the processes of lexical access and normalisation interact?
  4. What evidence is there for changes in vowel perceptual patterns over the course of this project (1988 to 2008) and are these changes related to observed changes in the production of Australian English vowels over the same period?
  5. What differences can be found in the perceptual patterns of phonetically trained and phonetically naive subjects?
  6. What differences can be found in the perceptual patterns of phonetically trained subjects when they respond with orthographic (and therefore lexical) responses and when they respond with vowel phoneme symbols?
  7. What is the interaction between L1 and vowel perceptual patterns for learners of English?
  8. What is the interaction between the extent of English experience and vowel perceptual patterns for learners of English?
  9. To what extent do perceived vowel centroids relate to (normalised) productions of the same vowels for speakers of diverse L1 or other dialects of English?

Project Status

A very large amount of data has been collected from more than 800 subjects and this has all undergone initial data entry, processing and analysis. All planned data collection for this project is now complete. So far, only preliminary results have been presented at conferences. Several papers are currently being prepared for submission to refereed journals.

No doubt, numerous follow-up experiments might be anticipated for the future. The synthesiser will be modified to permit it to produce a larger range of tokens (ie. more consonantal contexts) and this will also probably result in the potential for further experiments. These modifications could expand its applicability to other dialects of English and perhaps even to other languages.

Relevant Papers

Mannell, R. (2009) "Production and perception of Australian English /Iə/ and /e:/ in CV and CVd context", Proceedings of The Australian Language and Speech Conference 2009, 3-4 December 2009, Sydney (part of HCSNet SummerFest 2009, at University of New South Wales)

Mannell, R. (2008), "Perception and Production of /i:/, /Iə/ and /e:/ in Australian English", Proceedings of the 9th Annual Conference of the International Speech Communication Association (Interspeech 2008), 22-26 September 2008, Brisbane, pp.351-354

Mannell, R., (2006), "Perception and modelling of vowels and vocal gender in synthetic speech", Australian Journal of Psychology, Vol. 58, Supplement 2006, p.9. (abstract only) (presentation slides)

Mannell, R., (2005), "Perception and modelling of vowels and vocal gender in synthetic speech", 15th Australian Language and Speech Conference, Macquarie University, 15-16 December, 2005.

Mannell, R.H. (2004), "Perceptual vowel space for Australian English lax vowels: 1988 and 2004", Proceedings of 10th Australian International Conference on Speech Science and Technology, Sydney, Australia, pp 221-226.

Mannell R.H. (2001), "The influence of lexical access on the perception of vowel phoneme boundaries in formant space", 13th Australian Language and Speech Conference, Sydney: Australia (Abstract) (presentation slides)

Mannell R.H. (1995), "Perceptual mapping and vowel normalisation", Proceedings of the XIIIth International Congress of Phonetic Sciences, Stockholm, Sweden, August 13-19, 1995

Mannell R.H., (1988) "Perceptual space of male and female Australian English vowels", Proceedings of the Second Australian International Conference on Speech Science and Technology, Sydney, Nov. 1988. pp 22-27

Bernard J.R., & Mannell R.H., (1988) "A study of /h_d/ words in Australian English", Working Papers, 1986, Speech Hearing and Language Research Centre, Macquarie University.