Source: Assistive Technology for Visually Impaired and Blind People, Hersh and Johnson, 2008

14.4 Text-to-speech Conversion


Table 14.5. Inventories of the TTS system of the Dresden University (Dresden Speech Synthesis DRESS) as available in 2003

| Language | Number and type of speech units | Speaker | Size (16 Bit PCM) |
|---|---|---|---|
| German | 1212 diphones | 1 male, 3 female | 5 MB |
| US English | 1595 diphones | 1 female | 7 MB |
| Russian | 572 allophones | 1 male | 0.5 MB |
| Mandarin Chinese | 3049 syllables | 1 male | 27 MB |
| Italian | 1224 diphones | 1 male | 4 MB |
| Klingon | 299 allophones | 1 male | 1.3 MB |

implemented with varying degrees of success. Speech quality can be improved by reducing the number of concatenation points, which is achieved by using speech units that are longer than diphones. However, this increases the number of units and the memory required to store them. The complete set of units (whether diphones or other speech units) is called the inventory. It forms the acoustical database in the block diagram in Figure 14.15. To give an idea of the order of magnitude of the memory required, Table 14.5 characterizes the inventories for a particular TTS system (Hoffmann et al. 1999a). It should be noted that the table includes databases of very different types of units and consequently of very different memory sizes.
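To make the differences between unit types concrete, the figures in Table 14.5 can be converted into an average size per unit. The following is a minimal sketch: it assumes binary megabytes (1 MB = 1024 × 1024 bytes) and treats each listed size as covering the whole inventory, which the table does not state explicitly (the German entry, for example, lists four speakers). The function name `average_unit_size` is illustrative.

```python
# Inventory figures copied from Table 14.5: (unit count, unit type, size in MB).
INVENTORIES = {
    "German": (1212, "diphones", 5.0),
    "US English": (1595, "diphones", 7.0),
    "Russian": (572, "allophones", 0.5),
    "Mandarin Chinese": (3049, "syllables", 27.0),
    "Italian": (1224, "diphones", 4.0),
    "Klingon": (299, "allophones", 1.3),
}

BYTES_PER_MB = 1024 * 1024  # assumption: binary megabytes
BYTES_PER_SAMPLE = 2        # 16-bit PCM, as stated in the table header


def average_unit_size(language):
    """Return (bytes per unit, samples per unit) averaged over one inventory."""
    count, _unit_type, size_mb = INVENTORIES[language]
    bytes_per_unit = size_mb * BYTES_PER_MB / count
    return bytes_per_unit, bytes_per_unit / BYTES_PER_SAMPLE


if __name__ == "__main__":
    for language, (count, unit_type, _size_mb) in INVENTORIES.items():
        per_unit, samples = average_unit_size(language)
        print(f"{language:17s} {count:5d} {unit_type:10s} "
              f"{per_unit / 1024:6.2f} KiB/unit  ~{samples:6.0f} samples/unit")
```

Under these assumptions a Mandarin syllable occupies roughly ten times as much memory on average as a Russian allophone, which is consistent with the remark that longer units raise the storage requirement.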

14.4.3 Equipment and Applications

Performance criteria

The structure of state-of-the-art TTS systems follows that of Figure 14.15. In nearly all cases, the acoustic component is implemented using the concatenative method in the time domain. The following criteria can be used to evaluate the performance of a TTS system:

Intelligibility. This is the primary criterion, but it plays only a minor role in contemporary discussions because most TTS systems produce speech with good intelligibility.

Naturalness. Listeners are very sensitive to the phenomena which make synthetic speech sound unlike that of a natural speaker. Naturalness is influenced by both the segmental quality (the quality of the units forming the acoustic database) and the prosody model. It is also very important that the text analysis block of Figure 14.15 supplies the prosody control module with exact input information. Increased naturalness can also be achieved by modelling the so-called spontaneous effects in human speech. They mainly include speech rhythm and variations in pronunciation (Werner et al. 2004).

Multilinguality. TTS systems which can produce speech in different languages are very useful and of greater commercial value. This requires one set of databases in the scheme of Figure 14.15 for each language (cf. the example in Table 14.5). Changing the language simply requires the system to change databases. However, database memory requirements for this type of system increase linearly with the number of languages, and therefore a “universal” inventory containing the sounds of all the languages required for a particular TTS application has been investigated. The quality of this so-called polyglot synthesis was very limited.

14 Speech, Text and Braille Conversion Technology

Speaker characteristics. As shown in Table 14.3, the speech signal carries para- and nonlinguistic information. Therefore, there are advantages in providing TTS systems with different speaking styles, the ability to express emotions, and a choice of speakers. Again, the simplest way to do so is to implement different databases which can be drawn on for the specific configuration. However, this solution leads to the same memory problem as the introduction of different languages, and therefore researchers are investigating algorithmic methods for influencing the signal directly. For instance, it is possible to switch from a male voice to a female one and vice versa using an algorithm which takes into account the gender-specific differences in human vocal tract length (Eichner et al. 2004).
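The idea behind a vocal-tract-length transformation can be illustrated by scaling the frequency axis of a spectral envelope, since formant frequencies shift upwards as the vocal tract becomes shorter. The sketch below is a generic illustration of such frequency warping, not the algorithm of Eichner et al.; the function name `warp_envelope`, the warping factor `alpha` and the toy envelope are all assumptions. A practical voice transformation would also have to modify the fundamental frequency.

```python
def warp_envelope(envelope, alpha):
    """Scale the frequency axis of a uniformly sampled spectral envelope.

    alpha > 1 moves spectral peaks to higher frequencies (shorter vocal
    tract), alpha < 1 moves them to lower frequencies. Values between the
    original bins are obtained by linear interpolation.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    n = len(envelope)
    warped = []
    for k in range(n):
        src = k / alpha          # position of bin k in the original envelope
        lo = int(src)
        if lo >= n - 1:          # beyond the last bin: hold the final value
            warped.append(envelope[-1])
            continue
        frac = src - lo
        warped.append((1 - frac) * envelope[lo] + frac * envelope[lo + 1])
    return warped


if __name__ == "__main__":
    # Toy envelope with a single triangular "formant" centred on bin 10.
    envelope = [max(0.0, 1.0 - abs(k - 10) / 5) for k in range(40)]
    warped = warp_envelope(envelope, alpha=1.2)
    peak = max(range(len(warped)), key=lambda k: warped[k])
    print("peak moved from bin 10 to bin", peak)
```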

Resources. The cost of a TTS application depends on the quantity of resources required by the system. These resources include the computing power required to calculate the synthetic speech signal in real time and the memory required by both the algorithms and the databases. It is frequently the memory requirements which have the greatest impact on the total cost. The total amount of memory required by a TTS system is called its footprint. The cost of the system is particularly important in bulk applications like TTS in mobile phones or other embedded solutions.

Diphone-based TTS systems became established as the baseline technology during the 1990s. Current developments are largely influenced by the need to reduce resource requirements. There are currently two different classes of TTS systems, PC- or server-based systems and embedded systems, with different trade-offs between speech quality and resource requirements in the two cases.

PC- or server-based systems

Resource requirements are not particularly significant for TTS systems which run on a PC or workstation. Therefore, it is feasible to use speech units which are larger than diphones in PC- or server-based TTS systems. In extreme cases, very large databases (corpora) of recorded speech are used. For instance, the Verbmobil system (Wahlster 2000) used a corpus of 3 h of speech to synthesize utterances from its restricted domain. However, such corpora require gigabytes of memory to achieve their aim of producing very natural-sounding speech and still have a number of unsolved problems (Hess 2003):

Cost functions. A given text can be synthesized by different segments of the corpus. The selection of the non-uniform segments that are best suited to the given utterance is a complicated optimization problem, which can be solved using cost functions. There is still ongoing research on the design of these cost functions.

Coverage. Even large corpora cannot contain all the word forms of a given language. Languages have a large number of rarely used words. If these rare words occur in a text to be synthesized, they must be composed of smaller units such as syllables or diphones. This results in the concentration of a large number of concatenation points in a specific part of the synthesized speech and the danger of audible effects.

Labelling. Before speech segments from the corpus can be used, they must be identified and labelled accordingly. This cannot be carried out manually for large corpora, but the quality of automatic labellers is still unsatisfactory.
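The optimization problem mentioned under "Cost functions" is commonly framed as a shortest-path search: each candidate segment receives a target cost (how well it matches the required unit) and a join cost (how audible the concatenation with its neighbour would be), and dynamic programming finds the cheapest sequence. The following is a minimal sketch under that assumption. The `Unit` record, the two toy cost functions and the numbers in `demo` are hypothetical and serve only to show the search; they are not taken from Verbmobil or any other system.

```python
from collections import namedtuple

# Hypothetical unit record: label, mean pitch in Hz, position in the corpus.
Unit = namedtuple("Unit", "label pitch position")


def target_cost(target, unit):
    """Toy target cost: distance between wanted and recorded pitch."""
    _label, wanted_pitch = target
    return abs(wanted_pitch - unit.pitch)


def join_cost(left, right):
    """Toy join cost: zero if the units were adjacent in the recording."""
    if right.position == left.position + 1:
        return 0
    return 5 + abs(left.pitch - right.pitch)


def select_units(targets, candidates, t_cost, j_cost):
    """Choose one candidate per target so that the summed cost is minimal.

    candidates[i] is the non-empty list of corpus units usable for targets[i].
    Returns (total_cost, chosen_units).
    """
    if not targets:
        return 0, []
    # Each row holds (cheapest cost of a path ending here, back-pointer).
    history = [[(t_cost(targets[0], c), None) for c in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for c in candidates[i]:
            prev_j, prev_cost = min(
                ((j, history[-1][j][0] + j_cost(p, c))
                 for j, p in enumerate(candidates[i - 1])),
                key=lambda item: item[1],
            )
            row.append((prev_cost + t_cost(targets[i], c), prev_j))
        history.append(row)
    # Trace the cheapest path back from the last position.
    j = min(range(len(history[-1])), key=lambda k: history[-1][k][0])
    total = history[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = history[i][j][1]
    path.reverse()
    return total, path


def demo():
    targets = [("h-e", 120), ("e-l", 125), ("l-o", 130)]
    candidates = [
        [Unit("h-e", 118, 10), Unit("h-e", 121, 40)],
        [Unit("e-l", 126, 11), Unit("e-l", 125, 70)],
        [Unit("l-o", 131, 12), Unit("l-o", 130, 90)],
    ]
    return select_units(targets, candidates, target_cost, join_cost)


if __name__ == "__main__":
    total, path = demo()
    print(total, [unit.position for unit in path])
```

In this toy example the search prefers the three units that were adjacent in the recording (positions 10, 11 and 12), even though other candidates match the wanted pitch more closely, because every additional concatenation point is penalized. How to weight such terms against each other is exactly the open design question the text refers to.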

Embedded systems

In contrast, for TTS systems which are integrated (embedded) into low-cost applications that do not rely on a PC, the cost of the system is largely determined by its footprint, which should therefore be minimized as far as possible. As an illustration, Table 14.6 gives the costs of on-chip RAM for several different memory capacities.

A complete TTS system with a footprint of less than 1 MB was achieved for the first time with microDRESS, a version of the TTS system DRESS presented in Table 14.5 (Hoffmann et al. 2003). Since 1 MB is rather small compared with the memory requirements of the data in the baseline system in Table 14.5, it is necessary to concentrate effort on reducing the inventory size. This can be done in two steps. The first step generally involves a reduction to telephone quality covering the bandwidth from 300 Hz to 3400 Hz, with an associated reduction in the number of samples required for the inventory. The second step involves the application of coding algorithms of the kind frequently used in communication systems. The application of simple coding schemes is sufficient to give a footprint of 1 MB. A smaller footprint could be obtained by using more powerful coding algorithms, but this would reduce the speech quality, as is known from the speech encoders and decoders (together known as codecs) which are applied in telecommunications (Chu 2003). This reduction in quality can be offset to a limited extent by carefully maintaining the inventory.
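The effect of the two reduction steps can be estimated with simple arithmetic. The sketch below is only an illustration: the original sampling rate of 16 kHz and the coded word lengths of 8 or 4 bits per sample are assumptions, since the text states neither. Only the 16-bit PCM baseline and the 300 Hz to 3400 Hz telephone band come from the source. A target rate of 8 kHz is assumed for telephone quality because it exceeds twice the 3400 Hz band edge.

```python
TELEPHONE_BAND_EDGE_HZ = 3400  # upper band edge quoted in the text


def reduced_inventory_size(size_mb, original_rate_hz, target_rate_hz=8000,
                           original_bits=16, coded_bits=8):
    """Estimate the inventory size after downsampling and sample coding.

    original_rate_hz, target_rate_hz and coded_bits are assumptions of this
    sketch; the memory needed by the algorithms themselves is not included.
    """
    if target_rate_hz <= 2 * TELEPHONE_BAND_EDGE_HZ:
        raise ValueError("target rate too low to carry the telephone band")
    if target_rate_hz > original_rate_hz:
        raise ValueError("target rate must not exceed the original rate")
    rate_factor = target_rate_hz / original_rate_hz  # step 1: fewer samples
    bit_factor = coded_bits / original_bits          # step 2: fewer bits each
    return size_mb * rate_factor * bit_factor


if __name__ == "__main__":
    german_mb = 5.0  # German inventory size from Table 14.5
    print(reduced_inventory_size(german_mb, original_rate_hz=16000))
    print(reduced_inventory_size(german_mb, original_rate_hz=16000,
                                 coded_bits=4))
```

Under these assumptions the 5 MB German inventory shrinks to 1.25 MB with 8-bit coding and to 0.625 MB with 4-bit coding. The actual figures depend on the original recording format and the codec chosen.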

Table 14.6. On-chip RAM as cost driver. Data from the year 2002 (Schnell et al. 2002)

| RAM (kB) | 0 | 120 | 144 | 292 | 512 | 1024 |
|---|---|---|---|---|---|---|
| Costs (EUR) | 0.83 | 1.27 | 1.49 | 1.93 | 3.19 | 5.00 |
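The figures in Table 14.6 can be used to gauge how strongly the footprint drives cost. The sketch below computes the price of each additional kilobyte between neighbouring table entries and estimates the cost of intermediate capacities. Linear interpolation between the listed points is an assumption made here for illustration; real components are sold in fixed sizes, so the table values themselves remain the only sourced data. The function names are illustrative.

```python
from bisect import bisect_right

# Data copied from Table 14.6 (year 2002).
RAM_KB = [0, 120, 144, 292, 512, 1024]
COST_EUR = [0.83, 1.27, 1.49, 1.93, 3.19, 5.00]


def marginal_costs():
    """EUR per additional kB between neighbouring table entries."""
    return [
        (COST_EUR[i + 1] - COST_EUR[i]) / (RAM_KB[i + 1] - RAM_KB[i])
        for i in range(len(RAM_KB) - 1)
    ]


def estimate_cost(ram_kb):
    """Linearly interpolate the chip cost for a capacity inside the table."""
    if not RAM_KB[0] <= ram_kb <= RAM_KB[-1]:
        raise ValueError("capacity outside the range covered by Table 14.6")
    # Index of the table interval that contains ram_kb.
    i = min(bisect_right(RAM_KB, ram_kb) - 1, len(RAM_KB) - 2)
    x0, x1 = RAM_KB[i], RAM_KB[i + 1]
    y0, y1 = COST_EUR[i], COST_EUR[i + 1]
    return y0 + (y1 - y0) * (ram_kb - x0) / (x1 - x0)


if __name__ == "__main__":
    for (lo, hi), rate in zip(zip(RAM_KB, RAM_KB[1:]), marginal_costs()):
        print(f"{lo:4d} to {hi:4d} kB: {rate * 1000:.2f} EUR per 1000 kB")
    print(f"768 kB: about {estimate_cost(768):.2f} EUR")
```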