
14 Speech, Text and Braille Conversion Technology

suggestions for future directions for research and development with a view to resolving these problems are given (Section 14.7.3).

The following two remarks refer to topics which will not be covered in this chapter:

Readers require some understanding of the fundamentals of sound and hearing, in order to understand speech technology. These topics are not presented here, as they are covered in the first chapter of the previous volume of this AT book series (Hersh and Johnson 2003), to which readers are referred.

The algorithms used in speech signal processing will not be discussed. The interested reader is referred to standard textbooks, such as those by Deller et al. (1993) and Rabiner and Juang (1993).

14.2 Prerequisites for Speech and Text Conversion Technology

14.2.1 The Spectral Structure of Speech

From the speech signal to the spectrogram

From the physical point of view, the speech waveform is a function of sound pressure against time. The speech signal can be captured and digitised using a microphone, a lowpass (anti-aliasing) filter, a sample-and-hold circuit and an analogue-to-digital converter, giving a digital representation of the signal as a sequence of discrete measured values called samples. Good speech quality can be obtained with a bandwidth of 8 kHz. (This is greater than the bandwidth of telephone speech, which lies between 300 Hz and 3400 Hz.) From the sampling theorem of Kotelnikov and Shannon (Shannon 1949), a sampling frequency of at least 16 kHz is then required, giving a time interval of 62.5 μs between two neighbouring samples.
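These sampling relationships can be written out as a small worked example (the variable names are illustrative, not from the text):

```python
# Sampling parameters for speech, following the figures in the text.
bandwidth_hz = 8_000                 # desired speech bandwidth
fs_hz = 2 * bandwidth_hz             # sampling theorem: fs >= 2 x bandwidth
sample_interval_us = 1e6 / fs_hz     # time between neighbouring samples

print(fs_hz)                # 16000
print(sample_interval_us)   # 62.5
```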

The following steps will be illustrated by means of an example. Figure 14.1a shows the waveform of the word Amplitude. (It was pronounced in German by a male speaker, but the example is language independent.) To give a feel for the quantity of data involved in speech processing, note that this relatively short word, representing 1.3 s of speech, requires (1.3 s)/(62.5 μs) = 20,800 samples at a sampling frequency of 16 kHz, which only just satisfies the sampling theorem. Each sample requires 2 bytes of storage to ensure sufficient accuracy.
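The data quantity in this example can be checked directly (values taken from the text):

```python
# Data quantity for the 1.3 s example word at fs = 16 kHz.
duration_s = 1.3
fs_hz = 16_000
bytes_per_sample = 2                       # 16-bit samples

n_samples = round(duration_s * fs_hz)      # = duration / 62.5 microseconds
storage_bytes = n_samples * bytes_per_sample

print(n_samples)       # 20800
print(storage_bytes)   # 41600
```

So even one short word occupies over 40 kB of raw sample data.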

The human inner ear acts as a spectrum analyser. Consequently, it is useful for technical systems to produce a spectral representation of the speech signal. A spectrum describes the composition of a signal from simple (harmonic) signals at particular frequencies, that is, it is a representation of amplitude against frequency. The required transformations are well known in signal processing and are summarized in Table 14.1. The relevant formulae can be found in most textbooks on signal processing.

However, speech is not a stationary signal. Analysis should therefore be based on segments of the signal (called windows) which can be considered to be “quasi-stationary”. A window can be considered to be an “analysis period” and therefore

 

14.2 Prerequisites for Speech and Text Conversion Technology

501

Table 14.1. Overview of the different spectral transforms and the properties of their spectra

                          Time-continuous signals          Time-discrete signals

  Periodic signals        Fourier series                   Discrete Fourier transform (DFT);
                                                           special version:
                                                           fast Fourier transform (FFT)
                          Non-periodic line spectrum       Periodic line spectrum

  Non-periodic signals    Fourier transform                Discrete-time Fourier
                          (Fourier integral)               transform (DTFT)
                          Non-periodic continuous          Periodic continuous
                          spectrum                         spectrum

the entry in the upper row and the rightmost column of Table 14.1 identifies the discrete Fourier transform (DFT) as the appropriate transformation to apply.

The length of the window plays an important role: the longer the window, the more detail can be identified in the spectrum. In the example, this can be observed by comparing Figure 14.1b,c. Figure 14.1b was calculated for a longer window and therefore provides more spectral information. On the other hand, a shorter window allows better localisation of the spectrum on the time axis. The choice of an appropriate window length therefore requires a tradeoff between detailed spectral information and precise localisation in time, subject to the further practical constraint that the number of samples per window should be a power of two. This condition is required by the fast Fourier transform (FFT), an efficient algorithm for calculating the DFT. A choice of 256 samples, corresponding to a window of 16 ms at a 16 kHz sampling frequency, is a good compromise.
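This tradeoff can be illustrated numerically. The sketch below (assuming NumPy, with a synthetic harmonic signal standing in for real speech) compares a 512-sample (32 ms) window with a 128-sample (8 ms) window at a 16 kHz sampling rate; all parameter values are illustrative:

```python
import numpy as np

fs = 16_000                       # sampling frequency (Hz), as in the text
t = np.arange(fs) / fs            # one second of signal

# Synthetic stand-in for a voiced speech sound: a 120 Hz fundamental
# plus two harmonics.
x = (np.sin(2 * np.pi * 120 * t)
     + 0.5 * np.sin(2 * np.pi * 240 * t)
     + 0.25 * np.sin(2 * np.pi * 360 * t))

def window_spectrum(signal, n):
    """Magnitude spectrum of the first n samples (n a power of two for the FFT)."""
    seg = signal[:n] * np.hanning(n)   # taper the window edges
    return np.abs(np.fft.rfft(seg))

long_spec = window_spectrum(x, 512)    # 32 ms window: fine spectral detail
short_spec = window_spectrum(x, 128)   # 8 ms window: coarse detail, better time localisation

# The frequency resolution is fs/n, so the longer window resolves
# frequencies four times more finely.
print(fs / 512)   # 31.25 Hz per bin
print(fs / 128)   # 125.0 Hz per bin
```

With 31.25 Hz bins the three harmonics appear as separate peaks; with 125 Hz bins much of this fine harmonic structure is smeared out, which is exactly the narrowband/broadband distinction of Figure 14.1b,c.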

This process results in the analysis of a single window. The complete characterisation of a speech signal requires the window to be shifted along the signal in short time steps. This results in a sequence of separate short-time spectra, represented graphically in Figure 14.1d, in which the amplitude of the spectrum is plotted over the time-frequency plane. This is essentially the same input information as is available to a speech recogniser. The resulting graph, such as the “waterfall” shown in Figure 14.1d, is not easy for a human observer to interpret. An easier to understand visual representation can be obtained in the form of a quasi-geographical map, with the amplitude of the spectrum either coded in grey scale or represented by different colours. This is illustrated in Figure 14.2.
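The window-shifting procedure can be sketched as follows (a minimal NumPy implementation; the Hann window, the hop size of 64 samples and the test tone are illustrative assumptions, not values from the text):

```python
import numpy as np

def spectrogram(signal, n_window=256, hop=64):
    """Sequence of short-time magnitude spectra, obtained by shifting a
    window of n_window samples along the signal in steps of hop samples."""
    win = np.hanning(n_window)
    frames = []
    for start in range(0, len(signal) - n_window + 1, hop):
        seg = signal[start:start + n_window] * win
        frames.append(np.abs(np.fft.rfft(seg)))
    # Rows are time steps, columns are frequency bins: the "map" of Figure 14.2.
    return np.array(frames)

fs = 16_000
t = np.arange(fs // 2) / fs               # 0.5 s of signal
x = np.sin(2 * np.pi * 440 * t)           # placeholder tone instead of speech
S = spectrogram(x)
print(S.shape)                            # (number of time frames, 129 frequency bins)
```

Plotting S with time along one axis and frequency along the other, with magnitude as grey level or colour, yields exactly the spectrogram representation described above.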

The maps produced in this way are called spectrograms. What the resulting spectrogram looks like is generally influenced by the window length. In particular, a longer window, giving greater spectral detail as shown in Figure 14.1b, produces a narrowband spectrogram of the type shown in Figure 14.2a, whereas a shorter window, giving better time resolution but less spectral detail as shown in Figure 14.1c, produces a broadband spectrogram of the type shown in Figure 14.2b.


Figure 14.1a–d. Example showing the path from the speech signal to the spectrogram: a acoustic waveform (sound pressure vs time) of the word “Amplitude”, pronounced in German by a male speaker; b spectrum of the sound [i] of this word, calculated by a fast Fourier transform (FFT) of a speech segment (window) of 32 ms. For such a “long” window, the spectral details can be observed very well (narrowband spectrum); c spectrum of the same sound [i], calculated from a window of only 8 ms. In this case, a better presentation of the spectral envelope is obtained (broadband spectrum); d if the complete word is analysed window by window, we obtain a sequence of spectra according to b or c, respectively, which forms a relief of mountains and valleys over the time-frequency plane


Figure 14.2. Visualisation of the sequence of spectra by means of a spectrogram (continuing the example of Figure 14.1). Because the three-dimensional presentation of Figure 14.1d is hard to interpret, a map-like presentation, called a spectrogram, giving a top view of the spectral “landscape”, is preferred. In a spectrogram, the abscissa acts as the time axis, the ordinate as the frequency axis, and the spectral magnitude is coded in colours or in a grey scale: a narrowband spectrogram of our example word “Amplitude”, composed from spectra like Figure 14.1b; b broadband spectrogram of the same word, composed from spectra like Figure 14.1c

Excitation source and articulation tract

There are a number of different types of speech sounds, which are produced in slightly different ways. One of the main distinctions is between voiced and unvoiced sounds. Voiced sounds are produced by a process called phonation, in which an air stream from the lungs is conducted through the larynx and leads to a quasi-periodic opening and closing of the vocal cords. The resulting speech signal is itself quasi-periodic. Unvoiced sounds are produced without phonation.
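One standard way to expose this quasi-periodicity, not discussed in the text but widely used in speech analysis, is the autocorrelation function, which peaks at multiples of the fundamental period. A minimal NumPy sketch, with a synthetic voiced-like signal in place of real speech (the 125 Hz fundamental and the 50–400 Hz search range are illustrative assumptions):

```python
import numpy as np

fs = 16_000
t = np.arange(fs // 4) / fs
# Synthetic voiced segment: quasi-periodic signal with a 125 Hz fundamental.
x = np.sin(2 * np.pi * 125 * t) + 0.4 * np.sin(2 * np.pi * 250 * t)

# Autocorrelation peaks at multiples of the period for (quasi-)periodic signals.
ac = np.correlate(x, x, mode="full")[len(x) - 1:]

# Search for the strongest peak within plausible pitch lags (50-400 Hz range).
lo, hi = fs // 400, fs // 50
period = lo + int(np.argmax(ac[lo:hi]))
print(fs / period)   # estimated fundamental frequency, here 125.0 Hz
```

An unvoiced segment (noise-like, without phonation) would show no such dominant autocorrelation peak.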

In the spectrograms of Figure 14.2, voiced sounds are clearly apparent, whereas unvoiced sounds are much less distinctive. In particular, voiced sounds show clear regularities or periodicities, either in the frequency direction of the narrowband spectrogram or in the time direction of the broadband spectrogram. These periodicities reflect the periodic excitation by the larynx which produces voiced sounds.

Figure 14.3a–c. The articulation of sounds and its linear model: a the human articulatory organs; b linear model of the production of voiced sounds; c block diagram of a parametric speech synthesis system based on the linear model of b

Figure 14.3a illustrates that the sound produced by the vibrating vocal cords of the larynx is shaped by the different cavities of the articulation tract that it passes through on its way from the larynx to the mouth. By positioning the lips, tongue and other articulatory organs appropriately, the speaker modifies these cavities, and thereby the modulation of the sound, to produce the desired speech sound. Fortunately, this process is automatic for people speaking their native language. However, it is generally less easy for people trying to produce the correct sounds in a foreign language.
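The linear source-filter idea behind Figure 14.3b can be sketched as follows: a deliberately simplified NumPy illustration in which an impulse-train excitation (the larynx) is passed through a single two-pole resonator standing in for one vocal-tract cavity resonance (a formant). All parameter values are illustrative assumptions, and a real parametric synthesiser would use several time-varying resonances:

```python
import numpy as np

fs = 16_000
f0 = 120                          # fundamental (larynx excitation) frequency

# Source: impulse train modelling the quasi-periodic glottal excitation.
n = fs // 2                       # 0.5 s of signal
excitation = np.zeros(n)
excitation[::fs // f0] = 1.0

# Filter: a two-pole resonator representing one vocal-tract resonance
# (a formant) at 700 Hz with a 100 Hz bandwidth.
f_formant, bw = 700, 100
r = np.exp(-np.pi * bw / fs)              # pole radius from the bandwidth
theta = 2 * np.pi * f_formant / fs        # pole angle from the formant frequency
a1, a2 = 2 * r * np.cos(theta), -r * r    # y[i] = x[i] + a1*y[i-1] + a2*y[i-2]

y = np.zeros(n)
for i in range(n):
    y[i] = (excitation[i]
            + a1 * (y[i - 1] if i > 0 else 0.0)
            + a2 * (y[i - 2] if i > 1 else 0.0))

# The output is quasi-periodic at roughly f0, with its energy
# concentrated near the formant frequency.
```

Changing the resonator parameters while keeping the same excitation changes which “vowel-like” sound is produced, which is exactly the separation of excitation source and articulation tract exploited by the parametric synthesis system of Figure 14.3c.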