
14 Speech, Text and Braille Conversion Technology

suggestions for future directions for research and development with a view to resolving these problems are given (Section 14.7.3).

The following two remarks refer to topics which will not be covered in this chapter:

Readers require some understanding of the fundamentals of sound and hearing, in order to understand speech technology. These topics are not presented here, as they are covered in the first chapter of the previous volume of this AT book series (Hersh and Johnson 2003), to which readers are referred.

The algorithms used in speech signal processing will not be discussed. The interested reader is referred to standard textbooks, such as those by Deller et al. (1993) and Rabiner and Juang (1993).

14.2 Prerequisites for Speech and Text Conversion Technology

14.2.1 The Spectral Structure of Speech

From the speech signal to the spectrogram

From the physical point of view, the speech waveform is a function of sound pressure against time. The speech signal can be captured and digitised using a microphone, a lowpass (anti-aliasing) filter, a sample-and-hold circuit and an analogue-to-digital converter, giving a digital representation of the signal as a sequence of discrete measured values called samples. Good speech quality can be obtained with a bandwidth of 8 kHz. (This is greater than the bandwidth of telephone speech, which lies between 300 Hz and 3400 Hz.) From the sampling theorem of Kotelnikov and Shannon (Shannon 1949), a sampling frequency of at least 16 kHz is then required, giving a time interval of 62.5 μs between two neighbouring samples.
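These sampling relationships can be written out as a small worked example (the variable names are illustrative, not from the text):

```python
# Sampling parameters for speech, following the figures in the text.
bandwidth_hz = 8_000                 # desired speech bandwidth
fs_hz = 2 * bandwidth_hz             # sampling theorem: fs >= 2 x bandwidth
sample_interval_us = 1e6 / fs_hz     # time between neighbouring samples

print(fs_hz)                # 16000
print(sample_interval_us)   # 62.5
```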

The following steps will be illustrated by means of an example. Figure 14.1a shows the waveform of the word Amplitude. (It was pronounced in German by a male speaker, but the example is language independent.) To give a feel for the quantity of data involved in speech processing, note that this relatively short word, representing 1.3 s of speech, requires (1.3 s)/(62.5 μs) = 20,800 samples at a sampling frequency of 16 kHz, which only just satisfies the sampling theorem. Each sample requires 2 bytes of storage to ensure sufficient accuracy.
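The data quantity in this example can be checked directly (values taken from the text):

```python
# Data quantity for the 1.3 s example word at fs = 16 kHz.
duration_s = 1.3
fs_hz = 16_000
bytes_per_sample = 2                       # 16-bit samples

n_samples = round(duration_s * fs_hz)      # = duration / 62.5 microseconds
storage_bytes = n_samples * bytes_per_sample

print(n_samples)       # 20800
print(storage_bytes)   # 41600
```

So even one short word occupies over 40 kB of raw sample data.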

The human inner ear acts as a spectrum analyser. Consequently, it is useful for technical systems to produce a spectral representation of the speech signal. A spectrum describes the composition of a signal from simple (harmonic) signals at particular frequencies, that is, it is a representation of amplitude against frequency. The required transformations are well known in signal processing and are summarized in Table 14.1. The relevant formulae can be found in most textbooks on signal processing.

However, speech is not a stationary signal. Analysis should therefore be based on segments of the signal (called windows) which can be considered to be “quasi-stationary”. A window can be considered to be an “analysis period” and therefore

 

14.2 Prerequisites for Speech and Text Conversion Technology

501

Table 14.1. Overview of the different spectral transforms and the properties of their spectra

                          Time-continuous signals          Time-discrete signals

  Periodic signals        Fourier series                   Discrete Fourier transform (DFT);
                                                           special version:
                                                           fast Fourier transform (FFT)
                          Non-periodic line spectrum       Periodic line spectrum

  Non-periodic signals    Fourier transform                Discrete-time Fourier
                          (Fourier integral)               transform (DTFT)
                          Non-periodic continuous          Periodic continuous
                          spectrum                         spectrum

the entry in the upper row and the rightmost column of Table 14.1 identifies the discrete Fourier transform (DFT) as the appropriate transformation to apply.

The length of the window plays an important role: the longer the window, the more detail can be identified in the spectrum. In the example, this can be observed by comparing Figure 14.1b,c. Figure 14.1b was calculated for a longer window and therefore provides more spectral information. On the other hand, a shorter window allows better localisation of the spectrum on the time axis. The choice of an appropriate window length therefore requires a tradeoff between detailed spectral information and precise localisation in time, subject to the further practical constraint that the number of samples per window should be a power of two. This condition is required by the fast Fourier transform (FFT), an efficient algorithm for calculating the DFT. A choice of 256 samples, corresponding to a window of 16 ms at a 16 kHz sampling frequency, is a good compromise.
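This tradeoff can be illustrated numerically. The sketch below (assuming NumPy, with a synthetic harmonic signal standing in for real speech) compares a 512-sample (32 ms) window with a 128-sample (8 ms) window at a 16 kHz sampling rate; all parameter values are illustrative:

```python
import numpy as np

fs = 16_000                       # sampling frequency (Hz), as in the text
t = np.arange(fs) / fs            # one second of signal

# Synthetic stand-in for a voiced speech sound: a 120 Hz fundamental
# plus two harmonics.
x = (np.sin(2 * np.pi * 120 * t)
     + 0.5 * np.sin(2 * np.pi * 240 * t)
     + 0.25 * np.sin(2 * np.pi * 360 * t))

def window_spectrum(signal, n):
    """Magnitude spectrum of the first n samples (n a power of two for the FFT)."""
    seg = signal[:n] * np.hanning(n)   # taper the window edges
    return np.abs(np.fft.rfft(seg))

long_spec = window_spectrum(x, 512)    # 32 ms window: fine spectral detail
short_spec = window_spectrum(x, 128)   # 8 ms window: coarse detail, better time localisation

# The frequency resolution is fs/n, so the longer window resolves
# frequencies four times more finely.
print(fs / 512)   # 31.25 Hz per bin
print(fs / 128)   # 125.0 Hz per bin
```

With 31.25 Hz bins the three harmonics appear as separate peaks; with 125 Hz bins much of this fine harmonic structure is smeared out, which is exactly the narrowband/broadband distinction of Figure 14.1b,c.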

This process results in the analysis of a single window. The complete characterisation of a speech signal requires the window to be shifted along the signal in short time steps. This results in a sequence of separate short-time spectra, represented graphically in Figure 14.1d, in which the amplitude of the spectrum is plotted over the time-frequency plane. This is essentially the same input information as is available to a speech recogniser. The resulting graph, such as the “waterfall” shown in Figure 14.1d, is not easy for a human observer to interpret. An easier to understand visual representation can be obtained in the form of a quasi-geographical map, with the amplitude of the spectrum either coded in grey scale or represented by different colours. This is illustrated in Figure 14.2.
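The window-shifting procedure can be sketched as follows (a minimal NumPy implementation; the Hann window, the hop size of 64 samples and the test tone are illustrative assumptions, not values from the text):

```python
import numpy as np

def spectrogram(signal, n_window=256, hop=64):
    """Sequence of short-time magnitude spectra, obtained by shifting a
    window of n_window samples along the signal in steps of hop samples."""
    win = np.hanning(n_window)
    frames = []
    for start in range(0, len(signal) - n_window + 1, hop):
        seg = signal[start:start + n_window] * win
        frames.append(np.abs(np.fft.rfft(seg)))
    # Rows are time steps, columns are frequency bins: the "map" of Figure 14.2.
    return np.array(frames)

fs = 16_000
t = np.arange(fs // 2) / fs               # 0.5 s of signal
x = np.sin(2 * np.pi * 440 * t)           # placeholder tone instead of speech
S = spectrogram(x)
print(S.shape)                            # (number of time frames, 129 frequency bins)
```

Plotting S with time along one axis and frequency along the other, with magnitude as grey level or colour, yields exactly the spectrogram representation described above.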

The maps produced in this way are called spectrograms. What the resulting spectrogram looks like is generally influenced by the window length. In particular, a longer window, giving greater spectral detail as shown in Figure 14.1b, produces a narrowband spectrogram of the type shown in Figure 14.2a, whereas a shorter window, giving better time resolution but less spectral detail as shown in Figure 14.1c, produces a broadband spectrogram of the type shown in Figure 14.2b.


Figure 14.1a–d. Example showing the path from the speech signal to the spectrogram: a acoustic waveform (sound pressure vs time) of the word “Amplitude”, pronounced in German by a male speaker; b spectrum of the sound [i] of this word, calculated by a fast Fourier transform (FFT) of a speech segment (window) of 32 ms. For such a “long” window, the spectral details can be observed very well (narrowband spectrum); c spectrum of the same sound [i], calculated from a window of only 8 ms. In this case, a better presentation of the spectral envelope is obtained (broadband spectrum); d if the complete word is analysed window by window, we obtain a sequence of spectra according to b or c, respectively, which forms a relief of mountains and valleys over the time-frequency plane


Figure 14.2. Visualisation of the sequence of spectra by means of a spectrogram (continuing the example of Figure 14.1). Because the three-dimensional presentation of Figure 14.1d is hard to interpret, a map-like presentation, called a spectrogram, giving a top view of the spectral “landscape”, is preferred. In a spectrogram, the abscissa acts as the time axis, the ordinate as the frequency axis, and the spectral magnitude is coded in colours or in a grey scale: a narrowband spectrogram of our example word “Amplitude”, composed from spectra like Figure 14.1b; b broadband spectrogram of the same word, composed from spectra like Figure 14.1c

Excitation source and articulation tract

There are a number of different types of speech sounds, which are produced in slightly different ways. One of the main distinctions is between voiced and unvoiced sounds. Voiced sounds are produced by a process called phonation, in which an air stream from the lungs is conducted through the larynx and leads to a quasi-periodic opening and closing of the vocal cords. The resulting speech signal is itself quasi-periodic. Unvoiced sounds are produced without phonation.
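One standard way to expose this quasi-periodicity, not discussed in the text but widely used in speech analysis, is the autocorrelation function, which peaks at multiples of the fundamental period. A minimal NumPy sketch, with a synthetic voiced-like signal in place of real speech (the 125 Hz fundamental and the 50–400 Hz search range are illustrative assumptions):

```python
import numpy as np

fs = 16_000
t = np.arange(fs // 4) / fs
# Synthetic voiced segment: quasi-periodic signal with a 125 Hz fundamental.
x = np.sin(2 * np.pi * 125 * t) + 0.4 * np.sin(2 * np.pi * 250 * t)

# Autocorrelation peaks at multiples of the period for (quasi-)periodic signals.
ac = np.correlate(x, x, mode="full")[len(x) - 1:]

# Search for the strongest peak within plausible pitch lags (50-400 Hz range).
lo, hi = fs // 400, fs // 50
period = lo + int(np.argmax(ac[lo:hi]))
print(fs / period)   # estimated fundamental frequency, here 125.0 Hz
```

An unvoiced segment (noise-like, without phonation) would show no such dominant autocorrelation peak.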

In the spectrograms of Figure 14.2, voiced sounds are clearly apparent, whereas unvoiced sounds are much less distinctive. In particular, voiced sounds show clear regularities or periodicities, either in the frequency direction of the narrowband spectrogram or in the time direction of the broadband spectrogram. These periodicities reflect the periodic excitation by the larynx which produces voiced sounds.

Figure 14.3a–c. The articulation of sounds and its linear model: a the human articulatory organs; b linear model of the production of voiced sounds; c block diagram of a parametric speech synthesis system based on the linear model of b

Figure 14.3a illustrates that the sound produced by the vibrating vocal cords of the larynx is shaped by the different cavities of the articulation tract that it passes through on its way from the larynx to the mouth. By positioning the lips, tongue and other articulatory organs appropriately, the speaker modifies these cavities, and thereby the modulation of the sound, to produce the desired speech sound. Fortunately, this process is automatic for people speaking their native language. However, it is generally less easy for people trying to produce the correct sounds in a foreign language.
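The linear source-filter idea behind Figure 14.3b can be sketched as follows: a deliberately simplified NumPy illustration in which an impulse-train excitation (the larynx) is passed through a single two-pole resonator standing in for one vocal-tract cavity resonance (a formant). All parameter values are illustrative assumptions, and a real parametric synthesiser would use several time-varying resonances:

```python
import numpy as np

fs = 16_000
f0 = 120                          # fundamental (larynx excitation) frequency

# Source: impulse train modelling the quasi-periodic glottal excitation.
n = fs // 2                       # 0.5 s of signal
excitation = np.zeros(n)
excitation[::fs // f0] = 1.0

# Filter: a two-pole resonator representing one vocal-tract resonance
# (a formant) at 700 Hz with a 100 Hz bandwidth.
f_formant, bw = 700, 100
r = np.exp(-np.pi * bw / fs)              # pole radius from the bandwidth
theta = 2 * np.pi * f_formant / fs        # pole angle from the formant frequency
a1, a2 = 2 * r * np.cos(theta), -r * r    # y[i] = x[i] + a1*y[i-1] + a2*y[i-2]

y = np.zeros(n)
for i in range(n):
    y[i] = (excitation[i]
            + a1 * (y[i - 1] if i > 0 else 0.0)
            + a2 * (y[i - 2] if i > 1 else 0.0))

# The output is quasi-periodic at roughly f0, with its energy
# concentrated near the formant frequency.
```

Changing the resonator parameters while keeping the same excitation changes which “vowel-like” sound is produced, which is exactly the separation of excitation source and articulation tract exploited by the parametric synthesis system of Figure 14.3c.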