

14.4 Text-to-speech Conversion

14.4.1 Principles of Speech Production

Human and synthetic speech production

Human speech production is a very complex process (Levelt 1989). The complex steps required to produce an utterance can be divided into the following two categories:

The planning and decision processes in the brain required to produce a formulation following the grammatical rules of the relevant language from semantic contents or an intention to speak.

Activation of the muscles controlling the breath and the synchronous movement of the articulators to produce an acoustical waveform which is radiated by the mouth.

There is an area of AI called generation which models this complex interaction of thinking and speaking. Its main aim is the conversion of non-verbal information into natural language (Görz et al. 2000). An illustrative system containing a generation component has already been discussed and illustrated in Figure 14.14. Coupling the generation component with a speech synthesizer produces a contents-to-speech or concept-to-speech (CTS) system, which models the process of human speech production. However, in many applications the input information is already available as written language (text), and only the simpler structure of a text-to-speech (TTS) system is required.

Text-to-speech systems

This section will discuss the main principles of TTS systems. The block diagram of a TTS system is derived from the right (synthesis) branch of the general speech processing system in Figure 14.4, giving the structure in Figure 14.15.

Comparison of Figures 14.4 (universal analysis and synthesis system), 14.11 (speech-to-text) and 14.15 (text-to-speech) shows that the inclusion of prosodic information (intonation, sound duration, sound energy) is expressed by a separate box in Figure 14.15. Careful and correct treatment of prosodic elements increases the naturalness of synthesised speech and this is important for user acceptance. As discussed in Section 14.2.3, this task is not easy and a body of research in the last decade has focused on improving the quality of prosody in TTS systems (for example, see Hoffmann et al. 1999a).
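The flow of data through the boxes of Figure 14.15 can be sketched in code. All function names and the toy lexicon below are hypothetical stand-ins, not the components of any real TTS system; the sketch only shows how text passes through normalization, grapheme-to-phoneme conversion, prosody generation and acoustic synthesis.

```python
# Hypothetical sketch of the TTS pipeline of Figure 14.15.

def normalize(text):
    # Expand abbreviations, numbers etc. into plain spoken words.
    return text.lower().replace("dr.", "doctor")

def graphemes_to_phonemes(words):
    # Toy lookup; a real system uses a large lexicon plus
    # letter-to-sound rules for unknown words.
    lexicon = {"doctor": ["d", "o", "k", "t", "er"], "who": ["h", "uw"]}
    return [p for w in words for p in lexicon.get(w, list(w))]

def add_prosody(phonemes):
    # Attach a duration (ms) and a pitch value (Hz) to every phoneme;
    # a real system varies these to shape intonation and rhythm.
    return [(p, 80, 120.0) for p in phonemes]

def synthesize(prosodic_units):
    # Placeholder for the acoustic synthesis stage (Section 14.4.2):
    # here it just reserves one byte per millisecond of duration.
    return b"\x00" * sum(d for _, d, _ in prosodic_units)

units = add_prosody(graphemes_to_phonemes(normalize("Dr. who").split()))
audio = synthesize(units)
print(len(units), len(audio))   # 7 prosodic units, 560 "samples"
```

Careful prosody generation (the `add_prosody` stage) is, as noted above, where much of the perceived naturalness is won or lost.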

The most crucial part of a TTS system is the rightmost box in Figure 14.15 which aims to produce a speech signal (acoustical synthesis). There are two approaches, parametric and concatenative speech synthesis, which will be discussed in the next section.

522 14 Speech, Text and Braille Conversion Technology

Figure 14.15. Structure of a TTS system. This scheme corresponds to the synthesis branch of the UASR shown in Figure 14.4

14.4.2 Principles of Acoustical Synthesis

Parametric speech synthesis

A technical system for producing speech can be constructed in several ways. The most obvious solution applies a model of the human articulation system, whose parameters must be controlled to produce the different speech sounds. This concept of a parametric speech synthesizer was discussed briefly in Section 14.2.1 with reference to Figure 14.3c. In that particular case, the following parameters had to be controlled: the filter parameters, the gain, the position of the switch between voiced and unvoiced sounds, and the frequency of the generator for voiced sounds.
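The source-filter scheme of Figure 14.3c can be sketched as follows: an excitation (an impulse train at the voicing frequency for voiced sounds, noise for unvoiced ones, selected by the "switch"), a gain, and a linear filter, realized here as a single two-pole resonator producing one formant. This is a minimal illustration, not a real formant synthesizer, which would use several resonators with time-varying parameters.

```python
import math
import random

def resonator_coeffs(f_formant, bandwidth, fs):
    # Two-pole digital resonator (one formant) derived from its
    # centre frequency and bandwidth in Hz.
    r = math.exp(-math.pi * bandwidth / fs)
    a1 = -2.0 * r * math.cos(2.0 * math.pi * f_formant / fs)
    a2 = r * r
    return a1, a2

def synthesize(voiced, f0, f_formant, bandwidth, gain, n, fs=8000):
    # Excitation: impulse train at f0 for voiced sounds, white noise
    # otherwise (the voiced/unvoiced switch of Figure 14.3c), scaled
    # by the gain and shaped by the linear formant filter.
    period = int(fs / f0)
    a1, a2 = resonator_coeffs(f_formant, bandwidth, fs)
    y1 = y2 = 0.0
    out = []
    for i in range(n):
        if voiced:
            x = 1.0 if i % period == 0 else 0.0
        else:
            x = random.uniform(-1.0, 1.0)
        y = gain * x - a1 * y1 - a2 * y2     # second-order IIR filter
        y2, y1 = y1, y
        out.append(y)
    return out

# 0.1 s of a vowel-like sound: 120 Hz voicing, one formant at 700 Hz.
vowel = synthesize(True, f0=120, f_formant=700, bandwidth=90,
                   gain=1.0, n=800)
print(len(vowel))
```

Changing `f_formant` over time is what moves the synthesizer from one speech sound to the next, which is exactly the control task described above.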

The idea of parametric speech synthesis considerably predates electronics and the most successful and famous mechanical model of the articulation system was invented by Wolfgang von Kempelen (Kempelen 1791).

Since the problems of storing and transmitting speech were satisfactorily resolved relatively early in communications engineering, there were correspondingly early attempts at electronic speech synthesis. The first electronic speech synthesis systems were a consequence of the development of powerful transmission systems based on a German patent (Schmidt 1932). The earliest implementation was Dudley's Vocoder (voice coder) in 1936.

Following the development of the Vocoder, a number of parametric synthesis systems along the lines of Figure 14.3c were produced. The same principle can be found on all the hardware platforms of recent decades, from electronic valves via discrete transistor circuitry, integrated circuits and microprocessors to state-of-the-art DSPs (digital signal processors). Figure 14.16 illustrates this development by means of an example. In this example, the linear filter of the block diagram in Figure 14.3c is designed to produce a formant structure based on Figure 14.7 for formant synthesis. It should be noted that only two decades of development separate the very different implementations shown in the two photographs.

Figure 14.16a,b. Selected formant synthesizers developed at the TU Dresden. a Partial view of the three-formant synthesizer SYNI 2 from 1975. This device was based on germanium transistor technology and was controlled manually or by a paper tape reader. The subsequent synthesizer versions were computer controlled, first by process-control computers and later by microprocessors. b This layout photograph shows the end point of this line of development, the formant synthesizer chip VOICE 1, which was developed with the Fraunhofer Institute IMS in Dresden

The development of computers made parametric speech synthesis practical, as they could be used to send the control parameters to the synthesizer hardware in real time. However, the quality of the synthesized speech remained poor due to significant differences between the human speech production system and the model.


Concatenative speech synthesis

The limited quality of parametric speech synthesis led to repeated attempts to synthesize speech by concatenating short segments of speech previously spoken by real speakers. Before the development of digital computers, this so-called synthesis in the time domain required very complicated analog equipment. The introduction of computer control did not resolve the problems immediately, because early process-control computers, such as the PDP-8 series, had magnetic core memories of only a few kilobytes. This is clearly insufficient for storing even a very short digitized speech signal: as shown in Section 14.2.1, 1.3 s of speech requires more than 20,000 samples and approximately 40 kB of memory. The broad development of time-domain or concatenative synthesis therefore only started later, when cheap semiconductor memories became available and personal computers were developed.
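The storage figure quoted above can be checked directly. The sampling rate and sample width below are assumptions consistent with the numbers in the text (16-bit samples at a 16 kHz rate), not values taken from Section 14.2.1 itself:

```python
# Checking the memory requirement for 1.3 s of digitized speech,
# assuming 16 kHz sampling and 16-bit (2-byte) linear PCM samples.
duration_s = 1.3
sample_rate = 16000        # samples per second (assumed)
bytes_per_sample = 2       # 16-bit samples (assumed)

samples = int(duration_s * sample_rate)
memory_kb = samples * bytes_per_sample / 1000
print(samples, memory_kb)  # 20800 samples, 41.6 kB
```

Both values agree with the "more than 20,000 samples and approximately 40 kB" quoted in the text, and make clear why a few kilobytes of core memory could not hold even one short utterance.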

Producing concatenated speech from the stored waveforms of single sounds (allophones) poses a specific problem. Real speech is highly dynamic, and the transitions from one sound to the next (the effects of coarticulation) are difficult to model in the concatenation software. As a standard solution to this problem, combinations of two sounds, called diphones, are selected from natural speech and stored to form a diphone inventory for the synthesis system. The waveform of a diphone starts in the middle of the first sound and ends in the middle of the second sound, as shown in Figure 14.17.
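Cutting a diphone from mid-sound to mid-sound, as in Figure 14.17, is easy to express in code. The waveforms below are toy stand-in sample lists rather than real recordings:

```python
# Building a diphone from two recorded allophones: the diphone runs
# from the middle of the first sound to the middle of the second,
# so the coarticulation transition between them is kept intact.

def cut_diphone(sound_a, sound_b):
    mid_a = len(sound_a) // 2
    mid_b = len(sound_b) // 2
    return sound_a[mid_a:] + sound_b[:mid_b]

f_sound = [0.1] * 800     # stand-in samples for the fricative [f]
u_sound = [0.5] * 1200    # stand-in samples for the vowel [u]
diphone_fu = cut_diphone(f_sound, u_sound)
print(len(diphone_fu))    # 400 + 600 = 1000 samples
```

An inventory for a synthesis system is simply a table of such cut segments, one for every sound pair the language requires.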

A complete utterance can now be produced by forming a series of the corresponding diphones. To minimize audible distortions, the diphones cannot simply be abutted; they must be joined with a certain overlap. This is performed by an overlap-and-add (OLA) algorithm. The most commonly used OLA algorithm is known as TD-PSOLA (time-domain pitch-synchronous overlap-and-add); a description can be found in textbooks such as Dutoit (1997). This method gives a smooth concatenation and also allows the duration of the signal to be lengthened or shortened to a certain extent, as well as control of the fundamental frequency (pitch), which is the most important prosodic parameter (see Section 14.2.3).
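A much-simplified overlap-and-add join can be sketched as a linear cross-fade over the overlap region. This illustrates only the overlap-and-add idea itself; real TD-PSOLA additionally places its windows pitch-synchronously, which is what enables the duration and pitch modifications described above:

```python
# Simplified overlap-and-add (OLA) joining of two diphone waveforms:
# the last `overlap` samples of the first segment are cross-faded
# with the first `overlap` samples of the second.

def ola_join(seg_a, seg_b, overlap):
    joined = list(seg_a[:-overlap])
    for i in range(overlap):
        w = i / overlap                      # linear cross-fade weight
        joined.append((1 - w) * seg_a[len(seg_a) - overlap + i]
                      + w * seg_b[i])
    joined.extend(seg_b[overlap:])
    return joined

a = [1.0] * 100                              # stand-in diphone waveforms
b = [0.0] * 100
out = ola_join(a, b, overlap=20)
print(len(out))                              # 80 + 20 + 80 = 180 samples
```

The cross-fade avoids the audible click that a hard cut between two waveforms would produce at the concatenation point.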

There is a potential danger of reduced quality of the synthesized speech at the concatenation points. Concatenation algorithms other than PSOLA have been

Figure 14.17. Example of a diphone: the right half of the fricative sound [f] and the left half of the vowel [u] from the English word beautiful, pronounced by a female native speaker. The diphone length is approximately 200 ms