- •Вопрос 1 Structure of a Speech Coding System
- •Вопрос 3 Desirable Properties of a Speech Coder
- •Вопрос 4 About Coding Delay
- •Вопрос 5 classification of speech coders
- •Вопрос 6 Origin of Speech Signals
- •Вопрос 7 Structure of the Human Auditory System
- •Вопрос 8 Absolute Threshold
- •Вопрос 9 speech coding standards
- •Вопрос 10 pitch period estimation
- •Вопрос 11 linear prediction
- •Вопрос 12 Error Minimization
- •Вопрос 13/14 Prediction Schemes
- •0 10 20
- •Вопрос 15 long-term linear prediction
- •0 0.5 1
- •0 0.5 1
- •Вопрос 16/17 Linear Predictive Coding (lpc)
- •16. Speech encoding. Lpc encoder
- •Overview
- •Lpc coefficient representations
- •Applications
- •20 / 21 . Speech encoding. Celp coder
- •22/23. Speech encoding. Ld-celp coder
- •14.1 Strategies to achieve low delay
- •24/25 Speech encoding. Acelp (g.729) coder
- •35. Jpeg2000 in video compression(mjpeg)
- •36. Coding for high quality moving pictures(mpeg-2)
Вопрос 10 pitch period estimation
One of the most important parameters in speech analysis, synthesis, and coding applications is the fundamental frequency, or pitch, of voiced speech. Pitch frequency is directly related to the speaker and sets the unique characteristic of a person. Voicing is generated when the airflow from the lungs is periodically inter- rupted by movements of the vocal cords. The time between successive vocal cord openings is called the fundamental period, or pitch period.
For men, the possible pitch frequency range is usually found somewhere between 50 and 250 Hz, while for women the range usually falls between 120 and 500 Hz. In terms of period, the range for a male is 4 to 20 ms, while for a female it is 2 to 8 ms.
Pitch period must be estimated at every frame. By comparing a frame with past samples, it is possible to identify the period in which the signal repeats itself, resulting in an estimate of the actual pitch period. Note that the estimation procedure makes sense only for voiced frames. Meaningless results are obtained for unvoiced frames due to their random nature.
Design of a pitch period estimation algorithm is a complex undertaking due to lack of perfectperiodicity, interference with formants of the vocal tract, uncertainty of the starting instance of a voiced segment, and other real-world elements such as noise and echo. In practice, pitch period estimation is implemented as a trade-off between computational complexity and performance. Many techniques have been proposed for the estimation of pitch period and only a few are included here.
The Autocorrelation Method
Assume we want to perform the estimation on the signal s[n], with n being the time index. We consider the frame that ends at time instant m, where the length of the frame is equal to N (i.e., from n ¼ m — N þ 1 to m). Then the autocorrelation value*
m
R½l; m] ¼ X
n ¼ m — N þ 1
s½n]s½n — l ] ð2:1Þ
reflects the similarity between the frame s[n], n ¼ m — N þ 1 to m, with respect to the time-shifted version s[n — l], where l is a positive integer representing a time lag. The range of lag is selected so that it covers a wide range of pitch period values.
For instance, for l ¼ 20 to 147 (2.5 to 18.3 ms), the possible pitch frequencyvalues range from 54.4 to 400 Hz at 8 kHz sampling rate. This range of l is applicable for
most speakers and can be encoded using 7 bits, since there are 27 ¼ 128 values of pitch period.
By calculating the autocorrelation values for the entire range of lag, it is possible to find the value of lag associated with the highest autocorrelation representing the pitch period estimate, since, in theory, autocorrelation is maximized when the lag is equal to the pitch period. The method is summarized with the following pseudocode:
It is important to mention that, in practice, the speech signal is often lowpass filtered before being used as input for pitch period estimation. Since the fundamental frequency associated with voicing is located in the low-frequency region (<500 Hz), lowpass filtering eliminates the interfering high-frequency components as well as out-of-band noise, leading to a more accurate estimate.
Magnitude Difference Function
One drawback of the autocorrelation method is the need for multiplication, which is relatively expensive for implementation, especially in those processors with limited functionality.Toovercomethis problem, the magnitude difference function is invented. This function is defined by
m
MDF½l; m] ¼ X
n ¼ m — N þ 1
js½n]— s½n — l]j: ð2:2Þ
For short segments of voiced speech it is reasonable to expect that s[n] — s[n — l] is small for l ¼ 0, T T, T 2T, ... , with T being the signal’s period. Thus, by computing the magnitude difference function for the lag range of interest, one can estimate the period by locating the lag value associated with the minimum
magnitude difference. Note that no products are needed for the implementation of the present method.
Fractional Pitch Period
The methods discussed earlier can only find integer-valued pitch periods. That is, the resultant period valuesare multiples of the sampling period (8 kHz)—1 ¼
0.125 ms. In many applications, higher resolution is necessary to achieve good performance. In fact, pitch period of the original continuous-time (before sampling) signal is a real number; thus, integer periods are only approximations introducing errors that might have negative impact on system performance.
Multirate signal processing techniques can be introduced to extend the resolution beyond the limits set by fixed sampling rate. Interpolation, for instance, is a widely used method, where the actual sampling rate is increased. Medan, Yair, and Chazan (1991) published an algorithm for pitch period determination, which is based on a simple linear interpolation technique. The method allows the finding of a real-valued pitch period and can be implemented efficiently in practice. This method is explained in detail as follows.
Optimal Integer-Valued Pitch Period
Consider a speech frame that ends at time instant n ¼ m, with a length of N
(Figure 2.4). The frame can be expressed by
s½n]¼ b · s½n — N]þ e½n]; m — N þ 1 Ç n Ç m: ð2:3Þ
The above equation expresses {s[n], m — N þ 1 Ç n Ç m} as the sum between the product of a coefficient b with the frame {s[n — N], m — 2N þ 1 Ç n Ç m — N} and
n
Figure 2.4 Signal frames in pitch period estimation.
the error signal* e[n]. Note from Figure 2.4 that two consecutive frames of length N are involved. The optimal pitch period at time instant n ¼ m can be defined as the particular value of N, denoted by No, that minimizes the normalized sum of squared error
m
P ðs½n]— bs½n — N]Þ2
n ¼ m — N þ 1
J½m; N] ¼
m
P
n ¼ m — N þ 1
s2½n]
: ð2:4Þ
The normalization term (denominator) is required to compensate for the variable size of the speech segments involved and the uneven energy distribution over the pitch interval. Denoting No as our optimal pitch period, we have
No ¼ fNjJ½m; N] Ç J½m; M]; Nmin Ç N; M < Nmaxg; ð2:5Þ
where Nmin and Nmax are the minimum and maximum limits for the pitch period, respectively. These two are parameters of the algorithm that can be set according to the application. For instance, Nmin ¼ 20 and Nmax ¼ 147.
The optimal value of b can be found by differentiating J with respect to b and setting the result to zero. This gives
m
P s½n]s½n — N]
b
¼ m
: ð2:6Þ
P
n ¼ m — N þ 1
s2½n — N]
* This indeed is the concept of linear prediction, where the signal of the current frame is predicted from the past, with the prediction calculated as the product of the past with a coefficient. Chapter 4 contains further details on the topic.
Substituting (2.6) in (2.4) and manipulating yields
. m .2
P s½n]s½n — N]
n ¼ m — N þ 1
J½m; N]¼ 1 —
m
P
n ¼ m — N þ 1
m
s2½n — N] P
n ¼ m — N þ 1
s2½n]
: ð2:7Þ