14.6.6 Access to Telecommunication Devices

There has been considerable development of telecommunication systems over the last decade, and this has had an impact on telecommunication use by visually impaired and blind people. “Historically, people with a hearing disability have been the group facing the most problems when using telephones; however, with the ever increasing reliance on visual means for displaying information, it is increasingly visually impaired people who have been confronted with access problems” (Roe, 2001, p. 30).

Speech technology can provide potential solutions, as in the case of the following input/output functions for mobile phones:

Speech recognition is frequently used for voice-dialling. This feature was originally developed mainly for hands-free telephony in cars.

Speech synthesis will be increasingly used for improving the user interface (speech MMI), caller name announcement, reading short messages (SMS) and remote access to e-mails.

Although these features were developed originally to provide improved performance for sighted users, they are very useful for visually impaired people and illustrate the benefits of a design for all approach. The technical prerequisites are the development of embedded speech input/output solutions (Hoffmann et al. 2004).

Despite the benefits of design for all, it cannot resolve all problems, and visually impaired telecommunications users therefore also require some special equipment. For instance, Figure 14.21b illustrates the Braillino system in combination with the Nokia Communicator. However, it can be used more generally with any mobile phone that runs the Symbian operating system, the global industry standard operating system for smartphones (www.symbian.com). This includes the Series 60 phones (without an alphanumeric keyboard) and the Series 80 phones (with an organizer function and an alphanumeric keyboard). The connection can be wireless, via Bluetooth. From the functional point of view, the communication software (called Talks&Braille) acts as a screen reader for the Symbian operating system.

14.7 Discussion and the Future Outlook

14.7.1 End-user Studies

Potential users of speech technology would like to have (comparative) information on the performance of the available systems. However, it is difficult to obtain global comparative evaluations, due to the complexity of the systems and the fact that the evaluation criteria depend on the intended application. The studies carried out to date can be grouped and discussed as follows.

544 14 Speech, Text and Braille Conversion Technology

Evaluation of research systems

Progress in speech technology is normally measured in terms of improved word recognition rates (for recognizers) or improved scores when rating the naturalness (for synthesizers). Therefore, there are presentations giving an ongoing evaluation of research systems at the leading conferences. The availability of common databases allows the results of the evaluation of different systems to be compared. However, these research-oriented results relate to systems that are not yet commercially available, rather than the current state of the market.
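The recognition rates reported in such evaluations are usually derived from the word error rate (WER) between the recognizer output and a reference transcription. The following sketch computes WER via word-level Levenshtein distance; the sample utterances are invented for illustration.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed as word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[-1][-1] / len(ref)

# One substitution ("the" -> "a") out of four reference words:
wer = word_error_rate("switch the light on", "switch a light on")  # 0.25
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is one reason quoted recognition "rates" are not always directly comparable across studies.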

Comparison with human performance

It is natural to compare speech technology with human performance. Every user of speech technology soon notices that it does not perform nearly as well as a person, but there are few quantitative assessments of this difference in performance. A fundamental investigation was carried out by Lippmann (1997) for speech recognizers. He demonstrated how the recognition rate breaks down in the presence of environmental noise, whereas human listeners perform substantially better. Corresponding results can be obtained by rating the quality of speech synthesis using a mean opinion score (MOS) scale ranging from 1 (bad) to 5 (excellent). The naturalness of human speech is rated close to 5, but the output of TTS systems is generally rated somewhere in the middle range: between 1.73 and 3.74 according to the survey by Alvarez and Huckvale (2002). Considerable further research will be required to close the gap to a human speaker or listener in both speech synthesis and speech recognition, respectively.
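The MOS figures quoted above are simply arithmetic means of listener ratings on the five-point scale. A minimal sketch, using invented ratings for a single synthetic utterance:

```python
def mean_opinion_score(ratings):
    """MOS: arithmetic mean of listener ratings on the
    1 (bad) to 5 (excellent) scale."""
    if not ratings or any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must lie between 1 and 5")
    return sum(ratings) / len(ratings)

# Hypothetical ratings from ten listeners for one TTS utterance:
mos = mean_opinion_score([3, 4, 3, 2, 4, 3, 3, 4, 2, 3])  # 3.1
```

A score of 3.1 would fall squarely in the middle range reported by Alvarez and Huckvale for TTS systems.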

Evaluation of commercial systems

Before including speech technology in a product, a company generally evaluates a number of competing systems, though the results are only published occasionally. This type of study gives an interesting insight into the real performance of the available products. For example, Maase et al. (2003) investigated the performance of command and control speech recognizers for controlling a kitchen device. Usability studies showed that users accept this kind of control for recognition rates greater than 85%. Tests with eight different products showed that this performance was never reached in real environments. Representative results are shown in Figure 14.23.
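The acceptance criterion from the Maase et al. study reduces to comparing a measured recognition rate against the 85% threshold. A minimal sketch with invented trial outcomes (the study's own data appear in Figure 14.23):

```python
def recognition_rate(results):
    """Fraction of command trials that were correctly recognized."""
    return sum(results) / len(results)

def acceptable(rate, threshold=0.85):
    """Acceptance criterion from the usability study:
    the recognition rate must exceed 85%."""
    return rate > threshold

# Hypothetical outcome of 20 command-and-control trials
# (True = command correctly recognized):
trials = [True] * 16 + [False] * 4
rate = recognition_rate(trials)      # 0.8
ok = acceptable(rate)                # False: below the 85% threshold
```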

General product studies are very time-consuming and expensive. They therefore require a sponsor without a vested interest in any of the products. For instance, the study of ten different dictation systems (Flach et al. 2000) mentioned in Section 14.3.3 was originally produced for a computer journal. A more recent study of six dictation systems (Stiftung Warentest 2004) was carried out without publishing the recognition rates. The best-performing system is indicated in Table 14.8, shown previously.

Figure 14.23a,b. Selected results from the study of Maase et al. (2003). The diagrams show the recognition rate of selected C&C recognizers for different noises (a) and different speaker positions (b). The speaker positions correspond to different places in the usability lab with increasing distance (from 1 to 7 m). Reprinted by courtesy of the authors.

Evaluation for user groups with special needs

There is clearly a need for studies of speech support and dictation systems for blind and visually impaired people. Unfortunately, there is a distinct lack of large-scale user studies of speech support systems for this user group. However, there are several more general studies which include some consideration of speech technology. A number of such investigations have considered improving learning environments for blind students (Kahlisch 1998). Another emerging field is the study of the assistive technology needs of elderly people. Since many elderly people have acquired visual impairments, these studies include useful material on speech-related technologies. Figure 14.24 presents an example.

14.7.2 Discussion and Issues Arising

An overview of the remarks in this chapter shows that the performance of speech input/output systems is by no means perfect, despite improved algorithms, larger databases, increased memory, and growing computing power. In general, this still somewhat disappointing performance is due to the extreme complexity of human speech processing, which is difficult to approximate satisfactorily with technical systems. Although there is not space to discuss the reasons for this in detail, some of them are briefly summarized in Table 14.10.

Figure 14.24. Example of a usability study. The diagram shows the acceptance of speech-controlled services by different user groups according to the study of Hampicke (2004). Reprinted by courtesy of the author. Score of 6: in any case; score of 3: medium; score of 0: in no case. The legend describes the grade of visual impairment.

Examining this table leads to the following conclusions about important future directions for basic research in speech technology:

Speech understanding.

Acoustic front end.

Modelling human speech and language processing.

These topics are all highly interdisciplinary and will require close collaboration across the disciplines involved.

14.7.3 Future Developments

As discussed in this chapter, speech technology has established itself as a stable and successful component of assistive technology. It is also becoming increasingly successful in other fields with a greater economic impact, including the telecommunications area, for communication with call centres and telephone banking. Although beyond the remit of this chapter, a survey of user opinions of this technology would be interesting, since there is at least anecdotal evidence that users prefer to communicate with a person and are highly dissatisfied with call centres. According to recent data (Sohn 2004), worldwide turnover in business applications of speech technology will grow from $540 million currently to $1600 million in 2007.
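Assuming "currently" refers to 2004, the year of the Sohn reference, the quoted figures imply a compound annual growth rate of roughly 44%; a quick check:

```python
# Implied compound annual growth rate for the quoted market figures,
# assuming the baseline year is 2004 (the year of the Sohn reference):
start, end = 540, 1600          # turnover in millions of dollars
years = 2007 - 2004
cagr = (end / start) ** (1 / years) - 1
print(f"{cagr:.1%}")            # prints "43.6%"
```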

This growth in the use of speech technology is not surprising in view of the importance of speech in telephone applications and consequently also for their automation. The importance of speech input/output systems relative to other media is likely to grow, as can be seen from Table 14.11.

What will this tendency mean for blind and visually impaired people? Developments in speech technology will improve access to interfaces for an increasing range of applications for this group (though not for deafblind people). The resulting benefits are likely to be substantial and cover applications ranging from access to numerous knowledge sources to improved accessibility of household appliances.

Table 14.10. Current research problems in speech technology, explained by means of the general scheme of a speech processing system (Figure 14.4). The columns give: where the problems are localized in Figure 14.4; how they can be described; what research can do to solve them; and examples of first solutions.

At the top of the figure: Our systems do not understand what they do; the scheme ends at the text level without semantic components. Research direction: develop speech understanding, in cooperation with computational linguistics, AI and semiotics. First solutions: speech-to-speech translation systems such as Verbmobil (Wahlster 2000); in speech synthesis, concept-to-speech (CTS) instead of TTS.

At the bottom of the figure: The acoustic channel between the user and the converters (microphone or loudspeaker, respectively) is still neglected in most cases. Research direction: consider the system (recognizer or synthesizer, respectively) and the acoustic environment as a unit, and develop the “acoustic front end”. First solutions: acoustic signal processing such as microphone arrays, noise suppression, source separation and directed sound supply.

In the components of the figure: Because our understanding of human speech processing is far from an applicable level, the models we use are more or less mathematical or empirical. Research direction: although a technical system need not be a close copy of its biological counterpart, substantially more knowledge of human speech production and perception is required. First solutions: many activities in modelling prosody, in close cooperation between engineers and phoneticians during the last decade; research systems which model human acoustic processing.

Table 14.11. How to interact with future systems? An overview from Weyrich (2003)

Small devices: speech
Service robots: speech and gestures, artificial skin, emotions
Federation of systems: speech and gestures, emotions
e-Business: active dialogue systems, interactive multimedia
Augmented reality systems: speech, gestures

Talking products that are of interest to both sighted and visually impaired people are more attractive to companies because of their larger markets. This type of product is therefore more likely to be widely available from standard suppliers, and at a reasonable price, than specialised products for visually impaired people. For instance, blind and many visually impaired users require speech (or tactile) output to state the function of the key being pressed or the knob setting on the (complex) control panel of a washing machine. This audio option may also be of interest to sighted users. The inclusion of both speech and tactile output could be considered part of a design for all approach; but, as already indicated, though design for all should be part of good design practice, it will never totally replace the need for assistive devices.

There is therefore considerable potential for increasing accessibility for blind and visually impaired people, though further technical developments will be required. However, it should also be noted that access to new technologies is limited by a number of factors, including geography and poverty. The term ‘digital divide’ is often used to describe the difference between people who do and do not have access to modern technologies and the resulting disadvantages, whereas the term eInclusion is used for access to the information society by disabled people and other potentially disadvantaged groups. While it is important to ensure that blind and visually impaired people are able to participate fully in the information society, it should also be recognised that some people, both blind and sighted, do not like technology. It will therefore be important to ensure that low-technology accessibility solutions exist for blind and visually impaired people and that information is available in a number of different formats, including but not solely electronically.

Speech and language technology will always be compared to natural human speech and language. Therefore, regardless of progress, they are likely to be found wanting for a long time to come, if not permanently. This presents an ongoing challenge, which is probably much greater than that encountered in many other disciplines. As Waibel and Lee (1990) state in their preface to Readings in Speech Recognition: “Many advances have been made during these past decades; but every new technique and every solved puzzle opens a host of new questions and points us in new directions. Indeed, speech is such an intimate expression of our humanity—of our thoughts and emotions—that speech recognition is likely to remain an intellectual frontier as long as we search for a deeper understanding of ourselves in general, and intelligent behaviour in particular.”

Acknowledgement. As can be seen from the list of references, the material in this chapter is based on research results and teaching material of the chair for speech communication at the Technische Universität Dresden. The author would like to take the opportunity to thank his team for their fruitful cooperation on many projects.

Special thanks for helpful discussions and support to Professor Dieter Mehnert, formerly at the Humboldt-Universität zu Berlin, Professor Klaus Fellbaum, Brandenburgische Technische Universität Cottbus, Professor Gerhard Weber, Universität Kiel, and Dr. Lothar Seveke, Computer für Behinderte GmbH, Dresden.