
Department of Linguistics

Formant Parameter Extraction Procedure

Robert Mannell

There are many approaches to formant extraction which are based on various (typically LPC-based) signal processing techniques. Such approaches to formant extraction are normally fully-automated although provision is often made for intervention after the process has been completed (eg. mouse-controlled editing of formant tracks). It is normally not feasible to apply expert system approaches to formant tracking as most applications (eg. formant vocoding) require the algorithm to operate on raw speech data.

A further limitation of most formant tracking procedures is that formant gain and bandwidth extraction is either absent or very rudimentary. It is extremely difficult, for example, to extract meaningful bandwidth information as the extracted values are dependent both on the actual formant bandwidth and on the extraction process (eg. variations in the number of LPC coefficients result in variations in bandwidth). It is also difficult to define gain in a useful way. Again, in LPC extraction procedures, the relative peak height of the formants can vary with the number of coefficients used. Further, it is unclear whether raw formant peak height should be measured or whether it is more meaningful to measure formant gain after removing the effect of source spectral slope.

For many applications it may not be necessary to accurately measure formant bandwidth and gain. For example, many speech analysis tasks of relevance to acoustic phonetic research only require accurate formant frequency information. Such tasks most often examine formants in vowels and (to a lesser extent) vowel-like consonants. In vowels, gains and bandwidths are fairly predictable (Fant, 1960). This predictability is utilised in serial formant synthesis, where vowel formant gains are a straightforward consequence of the cascading of formant filters of equal input gain. Vowel formant bandwidth is also readily calculated utilising a simple relationship between formant centre frequency and formant bandwidth. If the purpose of the formant extraction procedure is to produce parameters for a parallel formant synthesiser, and especially if the intention is to model consonants, then it becomes much more critical to accurately model formant gains and bandwidths. Exactly how gain and bandwidth are modelled depends upon the use intended for those parameters. It is likely that different definitions of adequate formant gain and bandwidth parameters might be necessary for different applications.

The purpose of the present paper is to outline a formant frequency, gain and bandwidth extraction procedure that would be suitable for use with a parallel formant synthesiser. The parameters extracted by this procedure are specifically optimised for a particular purpose. That purpose is to drive a particular parallel formant synthesiser with well defined filter and source characteristics. The formant frequency parameters would be relevant to many purposes but it is likely that the gain and bandwidth parameters would only be of use on the particular synthesiser utilised in the extraction process. The parameters so extracted would be useful in either diphone formant synthesis or in the extraction of formant synthesis rules for use on the target synthesiser.

The formant parameter extraction procedure outlined in this paper does not represent a traditional approach to formant extraction but rather attempts to model the way a phonetician might go about formant detection based upon knowledge of where each formant would be expected for the various speech segments. This procedure is not, however, a completely automatic expert system but instead relies upon two important phases of expert human intervention. The first phase of intervention is external to this procedure but is an essential prerequisite to it. This is the collection of a fully segmented and annotated speech database. Each formant extraction process would only be applied to a single speaker, but the process could be used to model many voices by extracting parameters for each of those speakers separately. The importance of such a database is that the formant extraction process requires prior knowledge of the phonemic boundary and identity of each speech segment as well as various aspects of subsegmental division (eg. the sub-division of stops into occlusion, aspiration, etc.).

The extraction procedure is divided into three passes. The first pass establishes the vowel five-formant space for the speaker through a combination of automatic and non-automatic procedures. The formant space is then used in the second pass to probabilistically constrain the selection of formant centre frequencies. The formant frequencies so selected are then examined and if necessary are edited by mouse. Only when the formant centre frequencies are satisfactory does the third pass commence. The third pass is a fully automatic extraction of formant gain and bandwidth. This pass is based on an analysis-by-synthesis methodology. A first approximation of the gain and bandwidth of each formant is extracted from a 24 coefficient LPC. A binary search strategy is then used to select a series of candidate bandwidths and gains which are then utilised by the synthesiser to synthesise a frame of speech whose spectrum is then compared with the spectrum of the target natural speech. This procedure is repeated until the shortest Euclidean spectral distance between the natural and synthetic spectra is obtained. The parameters which produce this shortest spectral distance are then saved as the appropriate parameters for that frame of speech.

The attraction of this process is that it produces the parameters which give the closest possible spectral match to the original natural speech for the particular formant synthesiser used in the analysis-by-synthesis procedure. It seems most likely that this would result in the best performance possible with that speech synthesiser. This will be tested by running this procedure over a set of /h_d/ and CV syllables and extracting the parameters which will then be used to resynthesise the syllables. The intelligibility of these syllables will then be compared with the intelligibility of the same natural, channel vocoded and JSRU formant vocoded syllables. Any shortfall in intelligibility relative to either natural or channel vocoded speech would reasonably be assumed to indicate inherent limitations in the formant model itself (or at least the current version of the model as realised in the synthesiser algorithm) rather than the degree to which the model has been accurately utilised at the parametric level.

This algorithm does not represent a classical formant analysis system for several reasons.

i) This algorithm will only work on a segmented and annotated speech database in which all phonemes have been identified (as well as certain sub-segmental features such as stop occlusions and aspiration). It is not simply a signal processing system but is an expert system which makes its decisions based on knowledge of the phonetic nature of the annotated phonemes.

ii) This system does not accomplish its task in one pass but depends upon an initial pass of the targets of all non-nasalised non-schwa monophthongal vowels to establish the "formant space" of the speaker. This formant space can then be used by the system to establish acceptable bounds for the formant values of this speaker on a phoneme by phoneme basis in a process somewhat analogous to formant normalisation as it may occur in human speech perception. A vowel target is here defined as simply that part of a labelled vowel from L*0.2 to L*0.8 where L is the total length of the vowel. This avoids including the extreme values of consonant transitions and restricts the first pass to that part of the signal in which formants can be most reliably determined.
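The target definition in (ii) above can be sketched in a few lines (the function name is illustrative, not from the original system):

```python
# Select the central 60% of a labelled vowel as its "target" region,
# i.e. the span from L*0.2 to L*0.8 where L is the vowel's length.

def vowel_target_region(start, end):
    """Return (target_start, target_end) sample indices for a vowel
    labelled as running from `start` to `end` (exclusive)."""
    length = end - start
    return (start + int(0.2 * length), start + int(0.8 * length))
```

For example, a vowel labelled from sample 1000 to sample 1500 would yield a target region of samples 1100 to 1400, excluding the consonant transitions at both edges.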

iii) The formant frequency tracker will depend upon two LPC analyses, a 14 coefficient LPC and a 24 coefficient LPC. The 14 coefficient LPC will be used to determine the major spectral peaks whilst the 24 coefficient LPC will be used to determine accurate peak positions, especially for closely spaced formants which may appear fused in the 14 coefficient analysis. Once the formant frequency is determined by the combined use of the 14 and 24 coefficient LPCs, the 14 coefficient analysis will be discarded and gains and bandwidths will be estimated utilising the 24 coefficient LPC.

iv) An attempt is only made to determine formant centre frequencies for certain phonetic classes. For example, no attempt will be made to determine the formant frequencies in voiceless fricatives (excluding /h/). For such classes formant frequencies will be determined by interpolation for medial cases and extrapolation in initial or final cases.

v) All decisions will be constrained by phonetic expectations.

vi) In the second pass formant frequencies will be determined as accurately as possible, but bandwidths and gains will only be estimated from the 24 coefficient LPC spectrum. A third pass will be required to accurately determine bandwidths and gains and that pass will be an analysis-by-synthesis process which utilises a software simulation of a formant synthesiser and an iterative LPC-based spectral distance matching procedure until the best possible spectral match is obtained for each 10 msec frame. For the third pass it will be assumed that formant centre frequencies are accurate.

vii) Formant frequency decisions will be reviewed by a phonetician before the final centre frequencies are accepted and provision will be made for mouse based correction.

The algorithm

General Considerations

The analysis will be carried out on overlapping frames 512 points long and stepped in 10 msec steps.

All of the following assumes 10 kHz sampling rates and thus a 5 kHz bandwidth. It is desirable to decimate the signal to 10 kHz if it is presently at 20 kHz so that 14 coefficients only attempt to model F1-5 and are not confused by the spectrum above 5 kHz.
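The framing regime just described can be sketched as follows (a minimal illustration, assuming the signal is already at 10 kHz; names are illustrative):

```python
import numpy as np

# 512-point overlapping analysis frames, stepped in 10 msec
# increments at a 10 kHz sampling rate (step = 100 samples).

SAMPLE_RATE = 10_000          # Hz (after any decimation from 20 kHz)
FRAME_LEN = 512               # points per analysis frame
STEP = SAMPLE_RATE // 100     # 10 msec -> 100 samples

def frames(signal):
    """Yield successive overlapping 512-point analysis frames."""
    for start in range(0, len(signal) - FRAME_LEN + 1, STEP):
        yield signal[start:start + FRAME_LEN]
```

Note that adjacent frames overlap by 412 samples, so each 10 msec of speech is analysed in the context of roughly 51 msec of signal.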

First Pass

Stage One

Identify the targets of all non-nasalised /3:/ vowels.

Utilising the 14 coefficient LPC and initially only the vowels annotated as /3:/, determine 4 or 5 major peaks (it is assumed that all peaks will be separated for this vowel). The knowledge that these peaks should be approximately evenly spaced assists in this process. When utilising only 14 coefficients, only the five formants should separate (although frequently F5 is missing). F5 will often only appear when the 24 coefficient LPC is used and so, for F5 only, the 24 rather than the 14 coefficient LPC is utilised. Ignore any peaks below 250 Hz (this avoids any nasal formant peaks that might occur). If there are only four evenly separated peaks and their spacing (averaged for all /3:/ vowels for this speaker) would predict a fifth formant above 5 kHz then this subject will be henceforth assumed to have a short vocal tract and only 4 formants will be determined in all subsequent analyses. It is not anticipated, at present, that small children's voices will be examined and so it will be assumed that all speakers will have four or five formants.

From the mean formant frequencies of the /3:/ vowels, approximately neutral formant frequencies (referred to as N1 to N5 below) will have been determined as will have been the average spacing between the formants (referred to as Ns). These values will be utilised to determine probabilities of formant status for all vowel peaks. The formant status probabilities will be assigned as follows (remember, only the central 60% of each vowel is being analysed, and that no diphthongs are being analysed):-

In all of the following, probability is linearly interpolated between the p = 0 and p = 1 positions.

F5 p = 1 if F5 = N5
  p = 0 if F5 <= N4
  p = 0 if F5 >= N5 + Ns
F4 p = 1 if F4 = N4
  p = 0 if F4 <= N3
  p = 0 if F4 >= N5 (N4 + Ns for small vocal tracts)
F3 for /i/ and /I/ (also /i@/ T1, /ai/ T2, /ei/ T2, /oi/ T2)
  p = 1 if F3 = N3 * 1.15
  p = 0 if F3 >= N4 * 1.15
  p = 0 if F3 <= N2 * 1.15
F3 for /E/ and /A/ (also /e@/ T1, /ei/ T1)
  p = 1 if F3 = N3 * 1.05
  p = 0 if F3 >= N4 * 1.05
  p = 0 if F3 <= N2 * 1.05
F3 for /u:/ /o:/ /o/ and /U/ (also /oi/ T1, /au/ T2, /@u/ T1 &T2, /u@/ T1)
  p = 1 if F3 = N3 * 0.95
  p = 0 if F3 >= N4 * 0.95
  p = 0 if F3 <= N2 * 0.95
F3 for all other vowels (also all remaining diphthong targets)
  p = 1 if F3 = N3
  p = 0 if F3 >= N4
  p = 0 if F3 <= N2
F2 for front vowels /i, I, E, A/ (also /ei/ T1&T2, /i@, e@/ T1, /ai, oi/ T2)
  p = 1 if N2 <= F2 <= N3
  p = 0 if F2 >= N3 + Ns/2
  p = 0 if F2 <= N2 - Ns/2
F2 for central vowels /u:, @:, a:, v/ (also /i@, e@, u@/ T2, /ai, au/ T1)
  p = 1 if N2 - Ns/2 <= F2 <= N2 + Ns/2
  p = 0 if F2 >= N3
  p = 0 if F2 <= N1
F2 for back vowels /o, o:, U/ (also all remaining diphthong targets)
  p = 1 if N1 + Ns/4 <= F2 <= N2
  p = 0 if F2 >= N2 + Ns/2
  p = 0 if F2 <= N1 - Ns/4
F1 for high vowels /i, I, u:, U/ (also /oi/ T1 & T2, /i@/ T1, /ei, ai, @u/ T2)
  p = 1 if 200 Hz <= F1 <= N1
  p = 0 if F1 >= N1 + Ns/2
  p = 0 if F1 <= 100 Hz
F1 for mid vowels /E, @:, o:, A/ (also /e@/ T1 & T2, /ei/ T1, /@u/ T2, /i@, u@/ T2)
  p = 1 if N1 - Ns/2 <= F1 <= N1 + Ns/2
  p = 0 if F1 >= N2
  p = 0 if F1 <= 100 Hz
F1 for low vowels /v, a:, o/ (also all remaining diphthong targets)
  p = 1 if N1 <= F1 <= N2
  p = 0 if F1 >= N2 + Ns/2
  p = 0 if F1 <= N1 - Ns/2
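Each rule in the table above has the same shape: a p = 1 point or plateau, p = 0 bounds, and linear interpolation between them. A single helper captures this (a sketch; the function name and argument order are illustrative):

```python
# Piecewise-linear formant probability: p = 1 on the plateau
# [peak_lo, peak_hi], p = 0 at or beyond `lo` and `hi`, and
# linearly interpolated in between. A single-point p = 1 rule
# (e.g. "p = 1 if F4 = N4") uses peak_lo == peak_hi.

def formant_probability(f, lo, peak_lo, peak_hi, hi):
    if f <= lo or f >= hi:
        return 0.0
    if peak_lo <= f <= peak_hi:
        return 1.0
    if f < peak_lo:
        return (f - lo) / (peak_lo - lo)
    return (hi - f) / (hi - peak_hi)

# e.g. the F4 rule above: p = 1 at N4, p = 0 at or below N3,
# p = 0 at or above N5:
#   formant_probability(F4, N3, N4, N4, N5)
```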

Stage Two

Identify all non-schwa vowel monophthong phonemes that are not adjacent to nasal consonants. This is done utilising database labelling. Nasalised vowels are avoided because of the difficulty in separating nasal formants and F1 especially in high (low F1) vowels.

Identify the central 60% of each vowel to avoid excessive consonantal effects on formant frequencies.

In 10 msec steps carry out an LPC analysis on a series of overlapping 512 point frames.

Carry out a 14 and a 24 coefficient LPC on each of these vowel targets and extract the spectrum by carrying out a 512 point FFT on the result of each LPC analysis. Any single peak can (for the purposes of the first pass) be identified as more than one formant.

Determine all peaks for the 14 coefficient LPCs. (nb. the highest frequency component and the lowest frequency component are considered peaks if higher than the second highest and second lowest frequency components respectively).
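The peak-picking rule, including the endpoint condition just noted, can be sketched as (an illustrative implementation, not the original code):

```python
import numpy as np

# Peak picking on an LPC-derived magnitude spectrum. Interior bins
# count as peaks when they exceed both neighbours; the first and
# last bins also count as peaks when they exceed the adjacent bin.

def spectral_peaks(spectrum):
    """Return the indices of all peaks in a 1-D magnitude spectrum."""
    s = np.asarray(spectrum, dtype=float)
    peaks = [i for i in range(1, len(s) - 1)
             if s[i] > s[i - 1] and s[i] > s[i + 1]]
    if s[0] > s[1]:          # lowest frequency component as a peak
        peaks.insert(0, 0)
    if s[-1] > s[-2]:        # highest frequency component as a peak
        peaks.append(len(s) - 1)
    return peaks
```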

Each peak (for both 14 and 24 coefficient spectra) will be given a probability of being each of the four or five possible formants.

No context or local formant trajectory trends will be considered.

For each frame, the highest probability peak will be allocated as the formant value for each of the formants. If more than one peak shares the highest probability value, then the value closest to the neutral formant value will be allocated as the formant value for that frame. If the two closest and highest probability peaks are equidistant from the neutral value then the neutral value is allocated as the formant value for that frame.

A single peak can be allocated as belonging to zero, one or more formants.
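The per-frame allocation rule above (highest probability wins; ties broken by proximity to the neutral value; equidistant ties fall back to the neutral value itself) can be sketched as:

```python
# Allocate one formant for one frame. `peaks` is a list of
# (frequency_hz, probability) candidates; `neutral` is the neutral
# formant frequency (N1..N5). Names are illustrative.

def allocate_formant(peaks, neutral):
    best_p = max(p for _, p in peaks)
    tied = [f for f, p in peaks if p == best_p]
    tied.sort(key=lambda f: abs(f - neutral))
    # Two highest-probability peaks equidistant from neutral:
    # fall back to the neutral value for this frame.
    if len(tied) > 1 and abs(tied[0] - neutral) == abs(tied[1] - neutral):
        return neutral
    return tied[0]
```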

The values selected for the five formants will be plotted on F1/F2, F2/F3, and F4/F5 scatter plots for each vowel separately. A phonetician examines the plots and determines the major and minor axes of an ellipse which will be drawn by the algorithm around the points selected as reasonably valid values for that vowel. The ellipses so selected will represent the 0.9 to 1.0 probability for that vowel in the second pass.

Utilising a sliding triangular window, Bark channel power differences will be calculated with the intention of defining dips in the resultant difference curves as vowel targets. The monophthong targets so defined will be utilised in the second pass. This process can also be used to identify with reasonable accuracy at least the first and very often the second target of the diphthongs.

All defined targets for all non-nasalised diphthongs in the database (for the current speaker only) will then have the above procedure applied to them. The probabilities for these vowels are defined in the table above (ie. with the monophthongs).

Second Pass

Utilising the above procedure, a five (or for smaller vocal tracts, four) dimensional space is defined. This vowel space is represented as a total vowel space as well as an individual vowel target space for each vowel (monophthong and diphthong) target. It will be assumed in the following section of the algorithm that all vowels will be found within the bounds of that space and all vowel targets within the bounds of the appropriate individual vowel spaces, with the following exceptions. Vowel transitions to or from consonants which have measurable formants or "loci" defined as being outside the vowel formant space are permitted, with the constraint that the trajectory from the consonant backward or forward to the vowel target moves towards the allowed formant space for vowels in general, and for that vowel in particular.

Consonants defined as having measurable formants will have their allowable formant space defined in relation to the vowel formant space and such consonant formant spaces may be defined to lie partially or even fully outside the vowel formant space. As a general rule of thumb, consonants (excluding /h/) can be generally considered to possess greater constriction than vowels and this will often be realised as a formant space with lower F1 bounds than that which occurs for the vowels.

The second pass algorithm is as follows:-

In the second pass, only the 24 coefficient LPC will be used.

The 24 coefficient LPC is used to obtain the most likely peaks for each formant. The obvious formant ordering constraints (ie. F1 < F2 etc.) must be obeyed.

The following procedure is repeated for each utterance (typically a single sentence) in the database for the target speaker.

The first phase of the second pass requires that all identifiable vowel targets be selected and F1-F5 and Fn be determined for those targets only. In this pass the targets of all vowels (non-nasalised and nasalised) will be determined. It will be expected that Fn will occur in vowels in a nasal context (adjacent to a nasal consonant); however, it will be assumed that a nasal formant can also be found in vowels in a non-nasal context.

Vowel target formant probabilities will be allocated by utilising the target ellipses determined in the first pass. All peaks that occur within the appropriate ellipse will be given a probability of between 0.9 and 1.0 (linearly interpolated from p=0.9 at the boundary to p=1.0 at the centre). All peaks that occur outside the ellipse will be given a probability in the following way. A second ellipse will be created with the same centre and rotation as the first. The outer boundary of that ellipse will be allocated p=0. for that vowel and probabilities will be linearly interpolated between the outer ellipse and the inner ellipse.
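The two-ellipse probability scheme can be sketched as follows, for the simplified case of an axis-aligned ellipse (the actual procedure also allows rotated ellipses, and the scale of the outer p = 0 ellipse relative to the inner one is an assumed value here):

```python
import math

# Probability for a candidate peak given a vowel-target ellipse.
# A normalised radial distance r is computed: r <= 1 means inside
# the inner ellipse (p interpolated from 1.0 at the centre to 0.9
# at the boundary); between the inner and outer ellipse, p falls
# linearly from 0.9 to 0.0.

def ellipse_probability(f_a, f_b, centre, axes, outer_scale=2.0):
    """f_a, f_b: the two formant values (e.g. F1 and F2); `centre`
    and `axes`: ellipse centre and semi-axes in that plane;
    `outer_scale`: assumed ratio of outer to inner ellipse."""
    r = math.hypot((f_a - centre[0]) / axes[0],
                   (f_b - centre[1]) / axes[1])
    if r <= 1.0:
        return 1.0 - 0.1 * r
    if r >= outer_scale:
        return 0.0
    return 0.9 * (outer_scale - r) / (outer_scale - 1.0)
```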

It is also necessary to take into account the intensity of the peaks when more than one candidate peak is found within the pair of ellipses. If two peaks have equal probability then the more intense peak is selected. Otherwise, the probability factor will need to be modified by a factor determined from the ratio of each candidate peak to the highest peak. The probability is modified according to the following formula:

p' = p - 0.25 * (1 - Apeak/Ahighpeak)

where:-
Ahighpeak is the linear amplitude of the highest peak in the current frequency range
Apeak is the amplitude of the peak whose probability is being determined

In other words the highest peak's probability value remains unchanged whilst a lower intensity peak's probability is reduced by up to 0.25. If, after this process, two peaks still have equal probability then select the higher amplitude peak or, if they have equal amplitude, the peak closest to the centre of the ellipse. If they are identical in all ways then make a random selection between them (trajectory constraints will later correct any invalid selections in any case).
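This adjustment, which leaves the highest peak unchanged and reduces a weaker peak's probability by up to 0.25 in proportion to its amplitude ratio to the highest peak, can be sketched as (the linear form and the function name are assumptions):

```python
# Reduce a candidate peak's probability according to its linear
# amplitude relative to the highest peak in the frequency range.
# At a_peak == a_highpeak the reduction is 0; as a_peak -> 0 the
# reduction approaches the maximum of 0.25.

def amplitude_adjusted_probability(p, a_peak, a_highpeak):
    return p - 0.25 * (1.0 - a_peak / a_highpeak)
```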

It is assumed that F1-4 will be allocated for all vowel targets fairly reliably. Fn will be selected occasionally and is defined (initially) as the highest amplitude peak between the selected F1 and 0Hz (see below). F5 will sometimes not be selected as it will not appear as a peak in the LPC. When either of these formants does not appear then label the formant frequencies for these formants as some unrealistic value (eg. -999Hz) to flag the formant as unselected for the present frame. Interpolation will supply values for these frames later.

Diphthongs that have had two targets identified have their target-to-target transitions determined at this point. The transition formant values are calculated in the following way. Firstly, a line interpolated between the centres of the two target ellipses is drawn. Then, two lines are drawn to join the two target ellipses (the two lines that are tangential to both ellipses, without crossing the interpolated centre line). This is repeated for the outer zero probability ellipse. The probability at these interpolated lines is identical to that of the lines or points that they are joining, with probability being linearly interpolated between them. Formant peaks are selected exactly as they were for the targets.

Figure 1: Diphthong two-target probability ellipses and interpolated transition space.

Diphthongs without a defined second target have the 85-95% duration position defined as the second target and the second target and inter-target transition are determined as above.

Possible nasal formants will be any peak between the selected F1 and 0Hz (but not including 0Hz: peaks at the first sample in a spectrum represent DC offset and not meaningful speech poles). The selection of nasal peaks (where more than one occurs below F1) will be determined by trajectory considerations in the same way as consonant-to-vowel and vowel-to-consonant transitions are determined below. Nasal formants can only be selected after F1 selection is confirmed to avoid reselecting the F1 peak as Fn also. If no peaks appear below F1 then the nasal formant is flagged as absent (Fn=-999) and interpolated at a later stage of processing.

For the following phonemes the total vowel space for any formant will be defined as ranging from 0.0 to 1.0, where 0.0 represents the minimum frequency and 1.0 the maximum frequency for that formant.

Schwa /@/ often does not have a defined target and even when it does the values vary considerably with context. An individual vowel space has not therefore been determined for schwa. The vowel space for schwa is defined as the total vowel space of the speaker and probabilities are set to reflect the tendency for schwa to be centralised. Set the centre of the vowel space to p = 1.0. Define a boundary at 0.2 and 0.8 and set the probability at those values to 0.9. The outer boundary of the total vowel space is set to p = 0.5. The probabilities are adjusted according to amplitude in the same fashion as were the other vowel targets.

Vowel-to-vowel transition probabilities are defined in exactly the same way as were target-to-target transitions in diphthongs.

Before determining vowel-to-consonant and consonant-to-vowel transitions it is necessary to confirm the vowel target and inter-target transition selections. This is done by applying trajectory constraints to the data and ignoring any missing values. For each vowel the trajectory is determined outwards from a stable value in the centre of the vowel target (or target one in a diphthong). That is, the value which varies least from the surrounding values is selected. For each formant, a three point triangular sliding average of the formant difference function:-

ΔFx(n-1) = Fx(n-1) - Fx(n-2)

ΔFx(n) = Fx(n) - Fx(n-1)

ΔFx(n+1) = Fx(n+1) - Fx(n)

ΔFx(ave) = (A*ΔFx(n-1) + B*ΔFx(n) + A*ΔFx(n+1)) / (2*A + B)

where: x = 1,2,3,4,5,N and A,B = triangular weights

is determined for the entire set of data frames currently analysed. Any local maxima (peaks of formant difference surrounded by lower difference values) are treated as suspect. Such suspect frames are examined for any other peaks with a non-zero probability which will, if selected, reduce the difference maximum. The peak which most reduces the local difference maximum is selected as representing the formant value.
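The trajectory check above can be sketched as follows. Absolute frame-to-frame differences and the weights A = 1, B = 2 are assumptions, as the text does not fix them:

```python
import numpy as np

# Three-point triangular sliding average of the frame-to-frame
# formant difference function; local maxima of the averaged
# difference mark suspect frames for re-examination.

def smoothed_difference(track, A=1.0, B=2.0):
    """track: one formant's frequencies, one value per 10 msec frame."""
    d = np.abs(np.diff(np.asarray(track, dtype=float)))
    out = np.empty_like(d)
    out[0], out[-1] = d[0], d[-1]     # edges: no full window available
    for n in range(1, len(d) - 1):
        out[n] = (A * d[n - 1] + B * d[n] + A * d[n + 1]) / (2 * A + B)
    return out

def suspect_frames(track):
    """Indices of local maxima in the smoothed difference function."""
    s = smoothed_difference(track)
    return [n for n in range(1, len(s) - 1)
            if s[n] > s[n - 1] and s[n] > s[n + 1]]
```

A single outlying frame in an otherwise smooth track produces a local difference maximum, and the frame is flagged so that an alternative non-zero-probability peak can be tried in its place.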

/h/ between two vowels is treated as a voiceless vowel-to-vowel transition and probabilities will be determined for /h/ peaks with final selection being made on the basis of trajectory constraints in the same manner as the vowel targets and vowel-to-vowel transitions above.

The consonant-to-vowel and vowel-to-consonant transitions must lie within the individual vowel's space or lie between that space and the consonantal formant space. No attempt will be made to determine the peak probabilities in terms of the position in the vowel space as onset values can lie outside the vowel formant space. Transition formant peaks are instead selected on the basis of formant trajectory constraints which limit formant movement from one frame to the next (10 msecs). These trajectories are determined from the vowel target backwards to the vowel onset or forwards to the vowel offset (as defined by the database segmentation and labels). Each frame step will examine the difference between each peak (within the total vowel space for that formant) from the current data frame and the last selected peak. The peak selected will be the peak which has a difference value closest to the difference for the previous two frames.

An /h/ with only a following vowel is treated as a voiceless extension of the vowel target and is processed in the same way as consonant-to-vowel transitions on the basis of trajectory constraints.

The nasals /m,n,N/ and approximants /w,j,l,r/ will be referred to in the following discussion as "vowel-like" consonants. Formant analyses will be carried out on these consonants. Formant values will be assumed to lie within the vowel formant space, with the exception that F1 values can also be lower than those allowed in the vowel space. A further exception to this constraint is that F3 in /r/ occurs at values lower than the acceptable vowel values. Formant values are always determined relative to vowel targets (which have well defined formant peaks) and peak selection is constrained by trajectory considerations. It is necessary to select which vowel target is the reference spectrum from which formant trajectories are determined. This is often determined on the basis of syllable and word boundary criteria. The location of word boundaries is reasonably straightforward in a segmented database but can be complicated by cross-junctural assimilation, deletion and gemination. The location of syllable boundaries has long been considered a non-trivial problem and is complicated by ambisyllabic consonants. The above considerations can be avoided in the present algorithm by allocating a single vowel-like consonant between two vowels to both vowels and tracking formants backwards and forwards to the centre of that consonant. If two vowel-like consonants occur between a pair of vowels then the first is assumed to "belong" to the preceding vowel and the second to the following vowel. Similarly, when three vowel-like consonants occur between two vowels the "syllable boundary" is placed in the middle of the central consonant.

Formant trajectory tracking is impeded by voiced (/v,D,z,Z/) and voiceless fricatives (/f,T,s,S/ but not /h/), affricates (/dZ, tS/), voiced (/b,d,g/) and voiceless stops (/p,t,k/). Formant trajectories can only be tracked from the vowel until one of these trajectory "boundaries" is encountered. Formant values are interpolated or extrapolated across these segments. It is assumed that sensible formant values cannot be determined for these consonants. There are two exceptions to this and they will be dealt with in the next two paragraphs.

In voiceless stops /p,t,k/ and both affricates /tS,dZ/ followed by a vowel or a vowel-like consonant, the aspiration will be examined for formants. This analysis will assume that formant values lie in the vowel space, with the exception that F1 can also occur at values lower than those allowed in the vowel space.

In voiced stops /b,d,g/ and the affricate /dZ/, F1 to F5 are not tracked but are interpolated or extrapolated in accordance with the procedure above. The nasal formant is tracked however. The nasal formant is used to model the voicing murmur ("voice bar") that often occurs in voiced stop occlusions. The peak selected as the voice bar is the highest intensity peak found between 0Hz and the interpolated F1. Such a peak will not always occur as the voice bar often does not cross the entire occlusion. The voice bar values will be further constrained by the same trajectory considerations as above (ie. if > 1 peak below the interpolated F1 then the one that causes the smallest averaged differential is selected). Fn is interpolated across missing values and also interpolated backwards and forwards to the next reliable nasal value.

Syllabic consonants present a problem for this procedure when they are both preceded and followed by a consonant defined as being a trajectory boundary (eg. /l/ and /n/ in /botlz/, /bvtnz/). Such consonants need to have their formants tracked but do not occur in the context of a vowel, and so it is not possible to rely on vowel-consonant trajectory constraints to determine formant locations. Such consonants will need to be treated as vowels. Their formant space will be considered to be the total vowel space, but the F1 range is permitted to drop to values lower than those permitted in the vowel space. No probabilities will be determined. The highest amplitude peak in each formant space range will be selected and then these values will be corrected on the basis of the trajectory constraints outlined above.

All interpolated or extrapolated formant values will be clearly labelled as artificial formant values so that they can be excluded from any statistical study of formant values across the database. They are needed, however, to maintain pole continuity through the consonant and so allow formant modelling using a formant synthesiser.

Provision is also made for hand correction using a mouse after this phase is completed. It is anticipated that only a small number of hand corrections will be necessary. When hand correction occurs the closest available pole is selected (if one is available) by the program. This allows for the sensible determination of gain and bandwidth in the next pass.

Third Pass

This pass is an analysis-by-synthesis approach to the extraction of formant gain and bandwidth parameters from natural speech.

This pass determines formant gain and bandwidth values which produce the closest possible frame-by-frame spectral characteristics to the original natural speech, when used with the same synthesiser that was used in the analysis-synthesis data extraction procedure. Such values may have some application beyond their utilisation on the analysis formant synthesiser but it is expected that their main use would be in the statistical development of low-level synthesis rules targeted at that synthesiser or for the extraction of formant-based diphones for use in a speech concatenation system realised on that synthesiser.

The third pass assumes as input a set of accurate formant centre frequency values. The third pass outputs gain and bandwidth values with the spectral effects of a specific voice or noise source removed. An effect of this would be for /@:/ to produce gain values for the five formants which are approximately equal rather than reflecting the -6 dB/octave slope imposed by a typical voice source.

The third pass is carried out on a frame-by-frame basis. Adjacent frames are not examined when a frame is being processed and so no contextual or trend constraints are applied to the extraction of bandwidths and gains. Once a frame has been processed the extracted values will be stored and the processing of that frame will be completed.
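The per-frame search can be sketched as follows. The cost is the Euclidean distance between the natural and synthetic magnitude spectra; the parameter search is shown here as an interval-narrowing (ternary) search over an assumed unimodal cost, a close relative of the binary search strategy described earlier. The synthesiser itself is not modelled: `cost` would wrap a call to the software formant synthesiser for one candidate gain or bandwidth value.

```python
import numpy as np

# Euclidean spectral distance between a natural and a synthetic
# magnitude spectrum (equal-length 1-D arrays).
def spectral_distance(nat_spectrum, syn_spectrum):
    nat = np.asarray(nat_spectrum, dtype=float)
    syn = np.asarray(syn_spectrum, dtype=float)
    return float(np.sqrt(np.sum((nat - syn) ** 2)))

# Narrow [lo, hi] around the value of one parameter (a gain or a
# bandwidth) minimising cost(x), assuming cost is unimodal on the
# interval. Each iteration discards a third of the interval.
def refine_parameter(lo, hi, cost, iterations=20):
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if cost(m1) < cost(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)
```

In the full procedure this refinement would be repeated for each formant's gain and bandwidth in turn, each candidate value being evaluated by synthesising the frame and measuring its spectral distance from the natural frame.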