Source file: Assistive Technology for Visually Impaired and Blinde People_Hersh,Jonson_2008.pdf


14.4 Text-to-speech Conversion

Table 14.5. Inventories of the TTS system of the Dresden University (Dresden Speech Synthesis DRESS) as available in 2003

Language	Number and type of speech units	Speaker	Size (16-bit PCM)

German	1212 diphones	1 male, 3 female	5 MB
US English	1595 diphones	1 female	7 MB
Russian	572 allophones	1 male	0.5 MB
Mandarin Chinese	3049 syllables	1 male	27 MB
Italian	1224 diphones	1 male	4 MB
Klingon	299 allophones	1 male	1.3 MB

implemented with varying degrees of success. Speech quality can be improved by using speech units which are longer than diphones, since this avoids concatenation points. However, this increases the number of units and the memory required to store them. The complete set of units (whether diphones or other speech units) is called the inventory. It forms the acoustical database in the block diagram in Figure 14.15. To give an idea of the order of magnitude of the memory required, Table 14.5 characterizes the inventories of a particular TTS system (Hoffmann et al. 1999a). Note that the table includes databases of very different types of units and consequently of very different memory sizes.
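The link between unit type and memory size can be made concrete by dividing each inventory size in Table 14.5 by its number of units. The following is a rough sketch only: it assumes decimal megabytes and treats the listed size as covering the whole inventory, neither of which is stated in the table.

```python
# Inventory data copied from Table 14.5: (number of units, unit type, size in MB).
INVENTORIES = {
    "German": (1212, "diphones", 5.0),
    "US English": (1595, "diphones", 7.0),
    "Russian": (572, "allophones", 0.5),
    "Mandarin Chinese": (3049, "syllables", 27.0),
    "Italian": (1224, "diphones", 4.0),
    "Klingon": (299, "allophones", 1.3),
}

BYTES_PER_MB = 1_000_000  # assumption: decimal megabytes
BYTES_PER_KB = 1_000      # assumption: decimal kilobytes


def mean_unit_size_kb(units: int, size_mb: float) -> float:
    """Average storage per speech unit, in kB."""
    return size_mb * BYTES_PER_MB / units / BYTES_PER_KB


def report() -> dict:
    """Map each language to its average unit size in kB, rounded to 2 places."""
    return {
        lang: round(mean_unit_size_kb(units, size_mb), 2)
        for lang, (units, _kind, size_mb) in INVENTORIES.items()
    }


if __name__ == "__main__":
    for lang, kb in report().items():
        kind = INVENTORIES[lang][1]
        print(f"{lang}: about {kb} kB per unit ({kind})")
```

Under these assumptions, the syllable inventory needs roughly ten times more storage per unit than the Russian allophone inventory, which is the kind of spread the text refers to.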

14.4.3 Equipment and Applications

Performance criteria

The structure of state-of-the-art TTS systems follows that of Figure 14.15. In nearly all cases, the acoustic component is implemented using the concatenative method in the time domain. The following criteria can be used to evaluate the performance of a TTS system:

• Intelligibility. This is the primary criterion, but it plays only a minor role in contemporary discussions, as most TTS systems produce speech with good intelligibility.

• Naturalness. Listeners are very sensitive to the phenomena which make synthetic speech sound unlike that of a natural speaker. Naturalness is influenced by both the segmental quality (the quality of the units forming the acoustic database) and the prosody model. It is also very important that the text analysis block of Figure 14.15 supplies the prosody control module with exact input information. Increased naturalness can also be achieved by modelling the so-called spontaneous effects in human speech, mainly speech rhythm and variations in pronunciation (Werner et al. 2004).

• Multilinguality. TTS systems which can produce speech in different languages are very useful and of greater commercial value. This requires one set of databases in the scheme of Figure 14.15 for each language (cf. the example in Table 14.5). Changing the language simply requires the system to change databases. However, the database memory requirements of this type of system increase linearly with the number of languages. A “universal” inventory containing the sounds of all the languages required for a particular TTS application has therefore been investigated, but the quality of this so-called polyglot synthesis was very limited.

• Speaker characteristics. As shown in Table 14.3, the speech signal carries para- and nonlinguistic information. There are therefore advantages in providing TTS systems with different speaking styles, the ability to express emotions, and a choice of speakers. Again, the simplest way to do so is to implement different databases which can be drawn on for the specific configuration. However, this solution leads to the same memory problem as the introduction of different languages, and researchers are therefore investigating algorithmic methods for influencing the signal directly. For instance, it is possible to switch from a male voice to a female one and vice versa using an algorithm which takes into account the gender-specific differences in human vocal tract length (Eichner et al. 2004).

• Resources. The cost of a TTS application depends on the quantity of resources required by the system. These resources include the computing power required to calculate the synthetic speech signal in real time and the memory required by both the algorithms and the databases. It is frequently the memory requirements which have the greatest impact on the total cost. The total amount of memory required by a TTS system is called its footprint. The cost of the system is particularly important in bulk applications such as TTS in mobile phones or other embedded solutions.
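The linear growth of the footprint with the number of languages can be illustrated with the inventory sizes from Table 14.5. This is a minimal sketch: it counts only the acoustic inventories, and the `algorithm_mb` parameter is a hypothetical placeholder for the memory taken by the algorithms, for which the text gives no figure.

```python
# Inventory sizes in MB, copied from Table 14.5.
DATABASE_MB = {
    "German": 5.0,
    "US English": 7.0,
    "Russian": 0.5,
    "Mandarin Chinese": 27.0,
    "Italian": 4.0,
    "Klingon": 1.3,
}


def footprint_mb(languages, algorithm_mb=0.0):
    """Footprint = memory for the algorithms + one database per language.

    algorithm_mb is a hypothetical value supplied by the caller.
    """
    return algorithm_mb + sum(DATABASE_MB[lang] for lang in languages)


if __name__ == "__main__":
    print(footprint_mb(["German"]))
    print(footprint_mb(["German", "US English"]))
    print(footprint_mb(["German", "US English", "Italian"]))
```

Each additional language adds its full database to the total, which is why a shared inventory or algorithmic voice modification is attractive when memory is the main cost driver.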

Diphone-based TTS systems became established as the baseline technology during the 1990s. Current developments are largely influenced by the need to reduce resource requirements. There are currently two different classes of TTS systems, PC- or server-based systems and embedded systems, which make different trade-offs between speech quality and resource requirements.

PC- or server-based systems

Resource requirements are not particularly significant for TTS systems which run on a PC or workstation. It is therefore feasible to use speech units which are larger than diphones in PC- or server-based TTS systems. In extreme cases, very large databases (corpora) of recorded speech are used. For instance, the Verbmobil system (Wahlster 2000) used a corpus of 3 h of speech to synthesize utterances from its restricted domain. However, such corpora require gigabytes of memory to achieve their aim of producing very natural-sounding speech, and they still have a number of unsolved problems (Hess 2003):

• Cost functions. A given text can be synthesized from different segments of the corpus. The selection of the non-uniform segments that are best suited to the given utterance is a complicated optimization problem, which can be solved using cost functions. The design of these cost functions is still the subject of ongoing research.

• Coverage. Even large corpora cannot contain all the word forms of a given language. Languages have a large number of rarely used words. If these rare words occur in a text to be synthesized, they must be composed of smaller units such as syllables or diphones. This concentrates a large number of concatenation points in a specific part of the synthesized speech, with the danger of audible effects.

• Labelling. Before speech segments from the corpus can be used, they must be identified and labelled accordingly. This cannot be carried out manually for large corpora, but the quality of automatic labellers is still unsatisfactory.
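The segment selection described under "Cost functions" is commonly formulated as a shortest-path search over candidate segments. The sketch below is a toy illustration, not the method of any system cited here: the `Unit` fields, the pitch-based target cost and the fixed join penalty are all assumptions chosen to keep the example small. It also shows the coverage problem as an error raised when the corpus has no segment for a requested label.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    label: str       # phonetic label of the recorded segment
    corpus_pos: int  # index of the segment within the recorded corpus
    pitch: float     # mean pitch in Hz (toy feature, an assumption)


def target_cost(unit: Unit, wanted_pitch: float) -> float:
    """How badly a candidate matches the requested prosody."""
    return abs(unit.pitch - wanted_pitch) / 100.0


def join_cost(prev: Unit, nxt: Unit) -> float:
    """Penalty for a concatenation point between two segments."""
    if nxt.corpus_pos == prev.corpus_pos + 1:
        return 0.0  # adjacent in the recording: no concatenation point
    return 1.0 + abs(prev.pitch - nxt.pitch) / 100.0


def select_units(targets, corpus):
    """Dynamic-programming search for the cheapest unit sequence.

    targets: list of (label, wanted_pitch) pairs.
    Returns (total_cost, list_of_units).
    """
    candidates = [[u for u in corpus if u.label == label] for label, _ in targets]
    for (label, _), cands in zip(targets, candidates):
        if not cands:
            raise ValueError(f"no segment in corpus for label {label!r}")

    # best holds, for each candidate at the current position,
    # the cheapest (cost, path) ending in that candidate.
    best = [(target_cost(u, targets[0][1]), [u]) for u in candidates[0]]
    for i in range(1, len(targets)):
        wanted = targets[i][1]
        new_best = []
        for u in candidates[i]:
            cost, path = min(
                ((c + join_cost(p[-1], u), p) for c, p in best),
                key=lambda pair: pair[0],
            )
            new_best.append((cost + target_cost(u, wanted), path + [u]))
        best = new_best
    return min(best, key=lambda pair: pair[0])


if __name__ == "__main__":
    corpus = [
        Unit("h", 0, 120.0), Unit("e", 1, 120.0), Unit("l", 2, 118.0),
        Unit("o", 3, 115.0), Unit("l", 10, 120.0), Unit("e", 11, 150.0),
    ]
    targets = [("h", 120.0), ("e", 120.0), ("l", 120.0), ("o", 115.0)]
    cost, path = select_units(targets, corpus)
    print(cost, [u.corpus_pos for u in path])
```

In this toy corpus the search prefers the contiguous recording over a slightly better pitch match elsewhere, because every extra concatenation point is penalized. How to weight such terms against each other is exactly the open design question mentioned above.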

Embedded systems

For TTS systems which are integrated (embedded) into PC-independent low-cost applications, the cost of the system is largely determined by its footprint, which should therefore be minimized as far as possible. As an illustration, Table 14.6 gives the cost of on-chip RAM for several different memory capacities.

A complete TTS system with a footprint of less than 1 MB was achieved for the first time with the microDRESS version of the TTS system DRESS (presented in Table 14.5) (Hoffmann et al. 2003). Since 1 MB is rather small compared with the memory requirements of the data in the baseline system in Table 14.5, effort must be concentrated on reducing the inventory size. This can be done in two steps. The first step generally involves a reduction to telephone quality, covering the bandwidth from 300 Hz to 3400 Hz, with an associated reduction in the number of samples required for the inventory. The second step involves the application of coding algorithms of the kind frequently used in communication systems. The application of simple coding schemes is sufficient to give a footprint of 1 MB. A smaller footprint could be obtained by using more powerful coding algorithms, but at the cost of reduced speech quality, as is known from the speech encoders and decoders (together known as codecs) used in telecommunications (Chu 2003). This reduction in quality can be offset to a limited extent by carefully maintaining the inventory.
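The effect of the two reduction steps on inventory size can be estimated with simple arithmetic. The sketch below rests on several assumptions that are not taken from the text: a 16 kHz baseline sampling rate, the conventional 8 kHz rate for telephone-band speech, and μ-law companding (as standardized in ITU-T G.711, here in its continuous form with a simplified 8-bit quantizer) as one example of a simple coding scheme. The coding actually used in microDRESS is not specified here.

```python
import math

MU = 255.0  # mu-law parameter used in G.711


def reduced_size_mb(size_mb, src_rate_hz, dst_rate_hz, src_bits, dst_bits):
    """Scale an inventory size by the sample-rate and bits-per-sample ratios."""
    return size_mb * (dst_rate_hz / src_rate_hz) * (dst_bits / src_bits)


def mulaw_encode(x: float) -> int:
    """Compress a sample in [-1, 1] to a signed 8-bit code (simplified quantizer)."""
    x = max(-1.0, min(1.0, x))
    y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    return round(y * 127)


def mulaw_decode(code: int) -> float:
    """Expand a signed 8-bit code back to a sample in [-1, 1]."""
    y = code / 127
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


if __name__ == "__main__":
    # German inventory from Table 14.5 (5 MB, 16-bit PCM);
    # the 16 kHz source rate is an assumption.
    step1 = reduced_size_mb(5.0, 16_000, 8_000, 16, 16)  # bandwidth reduction
    step2 = reduced_size_mb(5.0, 16_000, 8_000, 16, 8)   # plus 8-bit companding
    print(step1, step2)
    print(mulaw_decode(mulaw_encode(0.5)))
```

Halving the sampling rate and halving the bits per sample each cut the inventory size in half under these assumptions. The companded sample is recovered only approximately, which illustrates the quality loss that grows with more aggressive coding.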

Table 14.6. On-chip RAM as cost driver. Data from the year 2002 (Schnell et al. 2002)

RAM (kB)	0	120	144	292	512	1024
Costs (EUR)	0.83	1.27	1.49	1.93	3.19	5.00
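To get a feeling for how the footprint translates into cost, the figures of Table 14.6 can be interpolated. The sketch below assumes linear interpolation between the tabulated points, which is a simplification introduced here: real on-chip RAM is usually available only in fixed sizes, and the data are from 2002.

```python
from bisect import bisect_right

# Data copied from Table 14.6 (Schnell et al. 2002).
RAM_KB = [0, 120, 144, 292, 512, 1024]
COST_EUR = [0.83, 1.27, 1.49, 1.93, 3.19, 5.00]


def estimated_cost_eur(ram_kb: float) -> float:
    """Linearly interpolate the cost for a RAM size within the table range."""
    if not RAM_KB[0] <= ram_kb <= RAM_KB[-1]:
        raise ValueError("RAM size outside the range covered by Table 14.6")
    i = bisect_right(RAM_KB, ram_kb) - 1
    if i == len(RAM_KB) - 1:
        return COST_EUR[-1]  # exactly the largest tabulated size
    frac = (ram_kb - RAM_KB[i]) / (RAM_KB[i + 1] - RAM_KB[i])
    return COST_EUR[i] + frac * (COST_EUR[i + 1] - COST_EUR[i])


if __name__ == "__main__":
    for kb in (0, 256, 512, 768, 1024):
        print(kb, round(estimated_cost_eur(kb), 2))
```

According to the table, going from no on-chip RAM to 1024 kB multiplies the cost roughly sixfold, which explains the emphasis on small footprints in bulk applications.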
