
Question 10: Pitch period estimation

One of the most important parameters in speech analysis, synthesis, and coding applications is the fundamental frequency, or pitch, of voiced speech. Pitch frequency is directly related to the speaker and is a distinguishing characteristic of a person's voice. Voicing is generated when the airflow from the lungs is periodically interrupted by movements of the vocal cords. The time between successive vocal cord openings is called the fundamental period, or pitch period.

For men, the possible pitch frequency range is usually found somewhere between 50 and 250 Hz, while for women the range usually falls between 120 and 500 Hz. In terms of period, the range for a male is 4 to 20 ms, while for a female it is 2 to 8 ms.

Pitch period must be estimated at every frame. By comparing a frame with past samples, it is possible to identify the period in which the signal repeats itself, resulting in an estimate of the actual pitch period. Note that the estimation procedure makes sense only for voiced frames. Meaningless results are obtained for unvoiced frames due to their random nature.

Design of a pitch period estimation algorithm is a complex undertaking due to the lack of perfect periodicity, interference with formants of the vocal tract, uncertainty about the starting instant of a voiced segment, and other real-world elements such as noise and echo. In practice, pitch period estimation is implemented as a trade-off between computational complexity and performance. Many techniques have been proposed for the estimation of pitch period, and only a few are included here.

The Autocorrelation Method

Assume we want to perform the estimation on the signal s[n], with n being the time index. We consider the frame that ends at time instant m, where the length of the frame is equal to N (i.e., from n = m - N + 1 to m). Then the autocorrelation value

$$
R[l, m] = \sum_{n=m-N+1}^{m} s[n]\, s[n-l] \tag{2.1}
$$

reflects the similarity between the frame s[n], n = m - N + 1 to m, and the time-shifted version s[n - l], where l is a positive integer representing a time lag. The range of lag is selected so that it covers a wide range of pitch period values.

For instance, for l = 20 to 147 (2.5 to 18.4 ms), the possible pitch frequency values range from 54.4 to 400 Hz at an 8 kHz sampling rate. This range of l is applicable for most speakers and can be encoded using 7 bits, since there are 2^7 = 128 values of pitch period.
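The arithmetic behind these figures can be checked with a few lines of Python. This is only an illustrative sketch; the helper names `lag_to_ms` and `lag_to_hz` are made up for this example.

```python
import math

FS_HZ = 8000             # sampling rate quoted in the text
L_MIN, L_MAX = 20, 147   # lag range quoted in the text, in samples

def lag_to_ms(lag, fs_hz=FS_HZ):
    """Pitch period in milliseconds for a lag given in samples."""
    return 1000.0 * lag / fs_hz

def lag_to_hz(lag, fs_hz=FS_HZ):
    """Pitch frequency in hertz for a lag given in samples."""
    return fs_hz / lag

num_lags = L_MAX - L_MIN + 1               # both end points are included
bits = math.ceil(math.log2(num_lags))      # bits needed to index every lag

print(lag_to_ms(L_MIN), lag_to_ms(L_MAX))  # shortest and longest period, ms
print(lag_to_hz(L_MAX), lag_to_hz(L_MIN))  # lowest and highest pitch, Hz
print(num_lags, bits)
```

The longest lag corresponds to 147/8000 s = 18.375 ms, and the count of 128 lags relies on both 20 and 147 being included.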

By calculating the autocorrelation values for the entire range of lag, it is possible to find the value of lag associated with the highest autocorrelation representing the pitch period estimate, since, in theory, autocorrelation is maximized when the lag is equal to the pitch period. The method is summarized with the following pseudocode:
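A minimal Python sketch of this search is given here for illustration. It is not the original listing: the function names, the default lag range of 20 to 147, and the synthetic test signal (period of 80 samples, i.e., 100 Hz at 8 kHz) are all assumptions of this example.

```python
import math

def autocorrelation(s, lag, m, frame_len):
    """R[l, m] of (2.1): sum of s[n] * s[n - l] for n = m - N + 1 .. m."""
    return sum(s[n] * s[n - lag] for n in range(m - frame_len + 1, m + 1))

def pitch_by_autocorrelation(s, m, frame_len, lag_min=20, lag_max=147):
    """Return the lag in [lag_min, lag_max] with the highest R[l, m]."""
    if m - frame_len + 1 - lag_max < 0:
        raise ValueError("not enough past samples for the largest lag")
    best_lag, best_r = lag_min, -math.inf
    for lag in range(lag_min, lag_max + 1):
        r = autocorrelation(s, lag, m, frame_len)
        if r > best_r:                      # keep the first maximum found
            best_lag, best_r = lag, r
    return best_lag

# Synthetic "voiced" signal: fundamental plus second harmonic, period 80.
period = 80
s = [math.sin(2 * math.pi * n / period) + 0.5 * math.sin(4 * math.pi * n / period)
     for n in range(400)]
estimate = pitch_by_autocorrelation(s, m=399, frame_len=240)
print(estimate)
```

On real speech, a multiple of the true period can score almost as high as the period itself, so a practical estimator would need additional checks that this sketch leaves out.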

It is important to mention that, in practice, the speech signal is often lowpass filtered before being used as input for pitch period estimation. Since the fundamental frequency associated with voicing is located in the low-frequency region (<500 Hz), lowpass filtering eliminates the interfering high-frequency components as well as out-of-band noise, leading to a more accurate estimate.
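The prefiltering step might look as follows. The text does not say which filter is used, so this sketch assumes a Hamming-windowed sinc FIR design with a hypothetical 600 Hz cutoff and 101 taps; all function names are illustrative.

```python
import math

def lowpass_fir(cutoff_hz, fs_hz=8000, num_taps=101):
    """Hamming-windowed sinc lowpass taps, normalized to unit gain at DC."""
    fc = cutoff_hz / fs_hz                  # cutoff in cycles per sample
    mid = (num_taps - 1) / 2
    taps = []
    for k in range(num_taps):
        x = k - mid
        ideal = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * k / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]

def fir_filter(taps, s):
    """Causal FIR filtering; samples before the start of s count as zero."""
    out = []
    for n in range(len(s)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * s[n - k]
        out.append(acc)
    return out

def rms(x):
    return math.sqrt(sum(v * v for v in x) / len(x))

fs = 8000
taps = lowpass_fir(cutoff_hz=600, fs_hz=fs)   # 600 Hz is an assumed cutoff
low = [math.sin(2 * math.pi * 200 * n / fs) for n in range(800)]    # in band
high = [math.sin(2 * math.pi * 3000 * n / fs) for n in range(800)]  # out of band
skip = len(taps)                              # ignore the start-up transient
gain_low = rms(fir_filter(taps, low)[skip:]) / rms(low[skip:])
gain_high = rms(fir_filter(taps, high)[skip:]) / rms(high[skip:])
print(gain_low, gain_high)
```

The 200 Hz tone should pass nearly unchanged while the 3 kHz tone is strongly attenuated, which is the behaviour the pitch estimator benefits from.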

Magnitude Difference Function

One drawback of the autocorrelation method is the need for multiplication, which is relatively expensive to implement, especially on processors with limited functionality. To overcome this problem, the magnitude difference function was introduced. This function is defined by

$$
\mathrm{MDF}[l, m] = \sum_{n=m-N+1}^{m} \bigl|\, s[n] - s[n-l] \,\bigr|. \tag{2.2}
$$

For short segments of voiced speech it is reasonable to expect that s[n] - s[n - l] is small for l = 0, ±T, ±2T, ..., with T being the signal's period. Thus, by computing the magnitude difference function for the lag range of interest, one can estimate the period by locating the lag value associated with the minimum magnitude difference. Note that no products are needed for the implementation of the present method.
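The minimum search over (2.2) can be sketched in Python as follows. The function names, the default lag range of 20 to 147, and the synthetic signal with a period of 80 samples are assumptions made for this illustration.

```python
import math

def magnitude_difference(s, lag, m, frame_len):
    """MDF[l, m] of (2.2): sum of |s[n] - s[n - l]| for n = m - N + 1 .. m."""
    return sum(abs(s[n] - s[n - lag]) for n in range(m - frame_len + 1, m + 1))

def pitch_by_mdf(s, m, frame_len, lag_min=20, lag_max=147):
    """Return the lag in [lag_min, lag_max] with the smallest MDF[l, m]."""
    if m - frame_len + 1 - lag_max < 0:
        raise ValueError("not enough past samples for the largest lag")
    best_lag, best_d = lag_min, math.inf
    for lag in range(lag_min, lag_max + 1):
        d = magnitude_difference(s, lag, m, frame_len)
        if d < best_d:                      # keep the first minimum found
            best_lag, best_d = lag, d
    return best_lag

# Synthetic periodic signal with a period of 80 samples.
period = 80
s = [math.sin(2 * math.pi * n / period) + 0.5 * math.sin(4 * math.pi * n / period)
     for n in range(400)]
estimate = pitch_by_mdf(s, m=399, frame_len=240)
print(estimate)
```

Only subtractions, absolute values, and additions appear in the inner loop, which is the point of the method.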

Fractional Pitch Period

The methods discussed earlier can only find integer-valued pitch periods. That is, the resultant period values are multiples of the sampling period 1/(8 kHz) = 0.125 ms. In many applications, higher resolution is necessary to achieve good performance. In fact, the pitch period of the original continuous-time signal (before sampling) is a real number; thus, integer periods are only approximations, introducing errors that might have a negative impact on system performance.

Multirate signal processing techniques can be introduced to extend the resolution beyond the limits set by fixed sampling rate. Interpolation, for instance, is a widely used method, where the actual sampling rate is increased. Medan, Yair, and Chazan (1991) published an algorithm for pitch period determination, which is based on a simple linear interpolation technique. The method allows the finding of a real-valued pitch period and can be implemented efficiently in practice. This method is explained in detail as follows.

Optimal Integer-Valued Pitch Period

Consider a speech frame that ends at time instant n = m, with a length of N (Figure 2.4). The frame can be expressed by

$$
s[n] = b \cdot s[n-N] + e[n], \qquad m-N+1 \le n \le m. \tag{2.3}
$$

Figure 2.4 Signal frames in pitch period estimation.

Equation (2.3) expresses {s[n], m - N + 1 ≤ n ≤ m} as the sum of the product of a coefficient b with the frame {s[n - N], m - 2N + 1 ≤ n ≤ m - N} and the error signal* e[n]. Note from Figure 2.4 that two consecutive frames of length N are involved. The optimal pitch period at time instant n = m can be defined as the particular value of N, denoted by No, that minimizes the normalized sum of squared error

$$
J[m, N] = \frac{\displaystyle\sum_{n=m-N+1}^{m} \bigl( s[n] - b\, s[n-N] \bigr)^2}{\displaystyle\sum_{n=m-N+1}^{m} s^2[n]}. \tag{2.4}
$$

The normalization term (denominator) is required to compensate for the variable size of the speech segments involved and the uneven energy distribution over the pitch interval. Denoting No as our optimal pitch period, we have

$$
N_o = \bigl\{\, N \mid J[m, N] \le J[m, M];\ N_{\min} \le N, M < N_{\max} \,\bigr\}, \tag{2.5}
$$

where Nmin and Nmax are the minimum and maximum limits for the pitch period, respectively. These two are parameters of the algorithm that can be set according to the application. For instance, Nmin = 20 and Nmax = 147.

The optimal value of b can be found by differentiating J with respect to b and setting the result to zero. This gives

$$
b = \frac{\displaystyle\sum_{n=m-N+1}^{m} s[n]\, s[n-N]}{\displaystyle\sum_{n=m-N+1}^{m} s^2[n-N]}. \tag{2.6}
$$
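For reference, the differentiation step can be written out explicitly. Only the numerator of (2.4) depends on b, so setting the derivative to zero gives the following (all sums run over n = m - N + 1 to m):

```latex
\frac{\partial J}{\partial b}
  = \frac{-2 \sum_{n} s[n-N] \bigl( s[n] - b\, s[n-N] \bigr)}{\sum_{n} s^2[n]} = 0
\quad\Longrightarrow\quad
\sum_{n} s[n]\, s[n-N] = b \sum_{n} s^2[n-N].
```

Solving for b gives (2.6), provided the past frame has nonzero energy. The second derivative, 2 Σ s²[n - N] / Σ s²[n], is positive in that case, so the stationary point is a minimum.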

* This indeed is the concept of linear prediction, where the signal of the current frame is predicted from the past, with the prediction calculated as the product of the past with a coefficient. Chapter 4 contains further details on the topic.

Substituting (2.6) in (2.4) and manipulating yields

$$
J[m, N] = 1 - \frac{\left( \displaystyle\sum_{n=m-N+1}^{m} s[n]\, s[n-N] \right)^{2}}{\displaystyle\sum_{n=m-N+1}^{m} s^2[n-N] \; \sum_{n=m-N+1}^{m} s^2[n]}. \tag{2.7}
$$
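The search for the optimal integer period based on (2.7) can be sketched as follows. This is an illustration under stated assumptions rather than the published algorithm: the function names are made up, the candidate range is taken as 20 to 147 with both end points included, silent frames are assigned J = 1, and the test signal is synthetic with a period of 80 samples. A second function evaluates (2.4) directly with b from (2.6) to confirm that the two forms agree.

```python
import math

def normalized_error(s, m, n_period):
    """J[m, N] in the closed form (2.7) for the candidate period N."""
    idx = range(m - n_period + 1, m + 1)
    cross = sum(s[n] * s[n - n_period] for n in idx)
    past = sum(s[n - n_period] ** 2 for n in idx)
    cur = sum(s[n] ** 2 for n in idx)
    if past == 0.0 or cur == 0.0:
        return 1.0      # assumption: a silent frame counts as unpredictable
    return 1.0 - cross * cross / (past * cur)

def normalized_error_direct(s, m, n_period):
    """J[m, N] from (2.4), using the optimal coefficient b of (2.6)."""
    idx = range(m - n_period + 1, m + 1)
    b = (sum(s[n] * s[n - n_period] for n in idx)
         / sum(s[n - n_period] ** 2 for n in idx))
    err = sum((s[n] - b * s[n - n_period]) ** 2 for n in idx)
    return err / sum(s[n] ** 2 for n in idx)

def optimal_integer_period(s, m, n_min=20, n_max=147):
    """Return the N in [n_min, n_max] with the smallest J[m, N]."""
    if m - 2 * n_max + 1 < 0:
        raise ValueError("two frames of length n_max must fit before m")
    return min(range(n_min, n_max + 1),
               key=lambda n: normalized_error(s, m, n))

# Synthetic periodic signal with a period of 80 samples.
period = 80
s = [math.sin(2 * math.pi * n / period) + 0.5 * math.sin(4 * math.pi * n / period)
     for n in range(400)]
n_opt = optimal_integer_period(s, m=399)
print(n_opt, normalized_error(s, 399, n_opt))
```

Note that the frame length here equals the candidate period N itself, so each candidate compares a different amount of signal; this is why the normalization in (2.4) is needed.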