

Although vision-based facial expression analysis provides (reasonably) good results nowadays, it is complex. Facial physiognomy varies considerably between individuals due to age, ethnicity, gender, facial hair, cosmetic products, and occluding objects (e.g., glasses and hair). Furthermore, the same face can appear markedly different under changes in pose and/or lighting. For a more elaborate discussion of these issues, I refer to the surveys of Fasel and Luettin [192] and Tian, Kanade, and Cohn [652]. For a comparative evaluation of 2D versus 3D FACS-based emotion recognition, I refer to the recent study of Savran, Sankur, and Bilge [574]. In sum, in highly controlled environments, vision-based emotion recognition works well; however, when these constraints are loosened, it can fail [192, 652, 680, 717]. This makes it a powerful but also fragile affective signal.

2.3 Speech

Although speech-based emotion recognition shows many parallels with vision-based emotion recognition, it also has some distinct characteristics; see also Tables 2.2 and 2.3. A prominent difference is that speech-based emotion recognition has been applied to both healthy subjects and subjects with various disorders (e.g., [677]). Research on speech-based emotion recognition with people with disorders has revealed much about emotional speech.

For decades, audio-based emotion recognition was conducted with a limited set of features (64; see also Table 2.3), without the use of any feature selection or reduction [579, 696]. During the last decade, a brute-force strategy has more often been employed [590], using hundreds or even thousands of features (e.g., see [644, 725]). In parallel with this explosion in the number of features, feature selection/reduction strategies gained prominence. Machine recognition rates for emotional speech range from Banse and Scherer [27], who report 25%/40% correct classification of 14 emotions, to Wu, Falk, and Chan [725], who report 87%/92% correct classification of 7 emotions. The latter results, however, contrast with the results on a structured benchmark reported by Schuller, Batliner, Steidl, and Seppi [590] for the InterSpeech 2009 emotion challenge: 66%-71% (2 classes) and 38%-44% (5 classes).
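To make the brute-force-plus-selection strategy concrete, the following minimal Python sketch feeds a large, pre-computed acoustic feature matrix through sequential forward selection (SFS) and an SVM, two of the techniques that recur in Table 2.3. It does not reproduce the setup of any cited study; the feature matrix, label vector, and all sizes are hypothetical placeholders.

```python
# Illustrative sketch (not any cited study's exact setup): a large acoustic
# feature vector is reduced with sequential forward selection (SFS) and then
# classified with an SVM, mirroring the "brute force + selection" strategy
# described above. Features are assumed to be pre-computed per utterance.
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_utterances, n_features, n_emotions = 200, 100, 7     # hypothetical sizes
X = rng.normal(size=(n_utterances, n_features))        # stand-in feature matrix
y = rng.integers(0, n_emotions, size=n_utterances)     # stand-in emotion labels

svm = SVC(kernel="rbf", C=1.0)
pipeline = make_pipeline(
    StandardScaler(),
    SequentialFeatureSelector(svm, n_features_to_select=10, direction="forward"),
    svm,
)

# Cross-validated accuracy; with real acoustic features this is where figures
# like those in the "classification" column of Table 2.3 would be estimated.
scores = cross_val_score(pipeline, X, y, cv=5)
print(f"mean accuracy: {scores.mean():.2f}")
```

With features actually extracted per utterance (e.g., F0, energy, and spectral statistics), the cross-validated accuracy printed here corresponds to the kind of recognition rate reported by the studies reviewed above.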

Similarly to vision-based approaches, audio-based emotion recognition suffers from environmental noise (e.g., from a radio or conversations in the background). Moreover, audio recordings are influenced by the acoustic characteristics of the environment; using templates for distinct environments could relieve this burden. The recording of sound could also be set up to cancel out some of the noise, although such an approach is in itself already very challenging. Like vision-based emotion recognition, audio-based emotion recognition is best employed in a controlled setting, as is the case in most studies.
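As one illustration of how some environmental noise could be cancelled, the sketch below applies generic spectral subtraction using a noise-only recording as a template for the environment. This particular technique is an illustrative choice on my part, not one prescribed by the studies cited above, and the function, frame sizes, and signals are assumptions.

```python
# Generic noise-reduction sketch (spectral subtraction). A noise-only recording
# serves as an environment "template" whose average magnitude spectrum is
# subtracted from each frame of the noisy speech signal.
import numpy as np

def spectral_subtraction(noisy, noise_profile, n_fft=512, hop=256):
    """Subtract an estimated noise magnitude spectrum from a noisy 1-D signal."""
    window = np.hanning(n_fft)

    # Average magnitude spectrum of the noise-only recording (the template).
    noise_frames = [np.abs(np.fft.rfft(window * noise_profile[i:i + n_fft]))
                    for i in range(0, len(noise_profile) - n_fft, hop)]
    noise_mag = np.mean(noise_frames, axis=0)

    out = np.zeros(len(noisy))
    for i in range(0, len(noisy) - n_fft, hop):
        spec = np.fft.rfft(window * noisy[i:i + n_fft])
        mag = np.maximum(np.abs(spec) - noise_mag, 0.0)      # subtract, floor at 0
        cleaned = np.fft.irfft(mag * np.exp(1j * np.angle(spec)), n=n_fft)
        out[i:i + n_fft] += cleaned * window                  # overlap-add
    return out
```

Such pre-processing only relieves part of the burden: reverberation and non-stationary noise (e.g., background conversations) remain hard to remove, which is why controlled recording settings are still preferred.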


Table 2.2: Speech signal analysis: A sample from history.

Throughout the previous century, extensive investigations were conducted on the functional anatomy of the muscles of the larynx [281, 386]. It was shown that when phonation starts, an increase in electrical activity emerges in the laryngeal muscles. Slight electrical activity was also found in the laryngeal muscles during respiration. These processes are highly complex, as speech is an act of large motor complexity, requiring the activity of over 100 muscles [386]. These studies helped to understand the mechanisms of the larynx during phonation; cf. [653]. Moreover, algorithms were developed to extract features (and their parameters) from the human voice. This aided further research towards the mapping of physical features, such as frequency, power, and time, onto their psychological counterparts: pitch, loudness, and duration [114].

In the current research, the physical features are assessed for one specific purpose: stress detection. One of the promising features for voice-induced stress detection is the fundamental frequency (F0), which is a core feature in this study. The F0 of speech is defined as the number of openings and closings of the vocal folds within a certain time window, which occur in a cyclic manner. These cycles are systematically reflected in the electrical impedance of the muscles of the larynx. In particular, the cricothyroid muscle has been shown to have a direct relation with all major F0 features [117]. In addition, it should be noted that F0 has a relation with another, very important, muscle: the heart. It was shown that the F0 of a sustained vowel is modulated over a time period equal to that of the speaker's heart cycle, illustrating its ability to express one's emotional state [501].

From recordings of speech signals, features such as amplitude and F0 can be conveniently determined. This has the advantage that no obtrusive measurement is necessary: only a microphone, an amplifier, and a recording device are needed. Subsequently, for the determination of F0, appropriate filters (either hardware or software) can increase the relative amplitude of the lowest frequencies and reduce the high- and mid-frequency energy in the signal. The resulting signal contains little energy above the first harmonic. In practice, the energy above the first harmonic is filtered out in a last phase of processing.

Harris and Weiss [263] were the first to apply Fourier analysis to compute the F0 of the speech signal. Some alternatives to this approach have been presented in the literature, for example, wavelets [712]. However, the use of Fourier analysis has become the dominant approach. Consequently, various modifications to the original work of Harris and Weiss [263] have been applied, and various software and hardware pitch extractors were introduced throughout the second half of the 20th century (cf. [168] and [537]). For the current study, we adopted the approach of Boersma [53] to determine the F0 of speech; a simplified sketch of the general recipe follows below.
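As an illustration of the recipe described above (low-pass filtering so that little energy remains above the first harmonic, followed by estimation of the fundamental period), the following minimal sketch estimates F0 for a single voiced frame using a plain autocorrelation peak search. It is a simplified stand-in, not Boersma's algorithm [53]; the sampling rate, F0 search range, and voicing threshold are assumptions for illustration only.

```python
# Minimal F0-estimation sketch for one voiced frame: low-pass filter to reduce
# mid/high-frequency energy, then pick the autocorrelation peak that marks the
# fundamental period. Simplified illustration, not Boersma's algorithm [53].
import numpy as np
from scipy.signal import butter, filtfilt

def estimate_f0(frame, fs, f0_min=75.0, f0_max=400.0):
    """Return an F0 estimate (Hz) for one short voiced frame, or None if unvoiced."""
    # Low-pass filter: keeps the plausible F0 range, attenuates higher harmonics.
    b, a = butter(4, f0_max / (fs / 2), btype="low")
    x = filtfilt(b, a, frame - np.mean(frame))

    # Autocorrelation; the strongest peak beyond lag 0 marks the glottal period.
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]
    lag_min, lag_max = int(fs / f0_max), int(fs / f0_min)
    lag = lag_min + np.argmax(ac[lag_min:lag_max])

    # Crude voicing check: reject frames with a weak periodicity peak.
    if ac[lag] < 0.3 * ac[0]:
        return None
    return fs / lag

# Example: a synthetic 150 Hz "vowel" frame sampled at 16 kHz.
fs = 16000
t = np.arange(int(0.04 * fs)) / fs
frame = np.sin(2 * np.pi * 150 * t) + 0.3 * np.sin(2 * np.pi * 300 * t)
print(estimate_f0(frame, fs))   # approximately 150 Hz
```

Applied frame by frame, such an estimator yields the F0 contour from which the statistics used as machine learning features (see Table 2.3) are derived.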


Table 2.3: Review of 12 representative machine learning studies employing speech to recognize emotions.

| source | year | # subjects | features | selection / reduction | classifiers | target | classification |
| [398] | 1962 | 3 | | | 10 humans | 8 utterances | 85% |
| [369] | 1985 | 65 | 12/8/4 | 3 experiments | | 7 scales | n.a. |
| [27] | 1996 | 12 | | | 12 humans | 14/11 emotions | 48%/55% |
| | | | 18 | | LRM, LDA | 14 emotions | 25%/40% |
| [491] | 2003 | 12 | | | 3 humans | 6 emotions | 66% |
| | | | 64 | distortion | HMM | | 78% |
| [503] | 2003 | 6 | | | 8 humans | 5 emotions | 57%/77% |
| | | | 200/20/20 | GA, NB, k-NN | k-NN, DT, ANN, K*, LRM, SVM, VFI, DT, NB, AdaBoost | 4 emotions | 50%-96% |
| [463] | 2007 | 11/12 | 38 | PCA, SFS, GA | SVM, k-NN, ANN, NB, K*, DT | 2/6 emotions | 71-79%/60-73% |
| [580] | 2009 | 10 | > 200 | k-means | k-NN | 7 emotions | 70% |
| | | | | | 20 humans | | 85% |
| [424] | 2010 | 10 | 383 | J1 criterion, LDA | SVM, GMM | 7 emotions | 78% |
| [644] | 2010 | 10 | 1054 | SFS | SVM | 7 emotions | 83%/85% |
| | | 4 | | | | 3 emotions | 88%/92% |
| [618] | 2010 | 10 | 173 | various | DT, SVM | 8 emotions | 75%/83% |
| [677] and Chapter 8 | 2011 | 25 | 65 | SBS | LRM | SUD scale | 83% |
| | | | | | | | 69% |
| [725] | 2011 | 10 | 442 | SFS, LDA | SVM | 7 emotions | 87%-92% |
| | | 19 | | SFS | correlation | 9 emotions | 81% |
| | | | | | human | | 65% |
| | | 28 | | SFS | correlation | | 69% |
| | | | | | human | | 56% |
| | | 19+28 | | SFS | correlation | | 73% |
| | | | | | human | | 61% |

Legend: AdaBoost: Adaptive Boosting; ANN: Artificial Neural Network; DT: Decision Tree; GA: Genetic Algorithm; GMM: Gaussian Mixture Models; HMM: Hidden Markov Models; K*: instance-based classifier with an entropy-based distance measure; k-NN: k-Nearest Neighbors; LDA: Fisher's Linear Discriminant Analysis; LRM: Linear Regression Model; NB: Naïve Bayes; PCA: Principal Component Analysis; SBS: Sequential Backward Selection; SFS: Sequential Forward Selection; SVM: Support Vector Machine; VFI: Voting Feature Intervals.
