Figure 5.2: The processing scheme of Unobtrusive Sensing of Emotions (USE). It shows how the physiological signals (i.e., speech and the ECG), the emotions as denoted by people, personality traits, people’s gender, and the environment are all combined in one ANOVA. Age was determined but not processed. Note that the ANOVA can also be replaced by a classifier or an agent, as a module of an AmI [694].
Explanation of the abbreviations: ECG: electrocardiogram; HR: heart rate; F0: fundamental frequency of pitch; SD: standard deviation; MAD: mean absolute deviation; and ANOVA: ANalysis Of VAriance.
5.5 Signal processing
This section describes how all of the data was recorded and, subsequently, processed (see also Figure 5.2). Speech utterances were recorded continuously by means of a standard Trust multi-function headset with microphone. The recording was performed in SoundForge 4.5.278 (sample rate: 44,100 Hz; sample size: 16 bit). In parallel with the speech recording, the ECG was recorded continuously through a modified Polar ECG measurement belt. The Polar ECG belt was connected to a data acquisition tool (NI USB-6008), whose output was recorded in a LabVIEW 7.1 program at a sample rate of 200 Hz.
5.5.1 Signal selection
The speech signal of three participants was not recorded due to technical problems. For one other participant, the speech signal was too noisy. These four participants were excluded from further analysis. For four other participants, the ECG was either heavily contaminated with noise or absent altogether. These participants were also omitted from further processing.
Since one of the main aims was to unveil the possible added value of speech and ECG features to each other, all data of the eight participants whose ECG or speech signals were not recorded appropriately were omitted from analysis. This resulted in a total of 32 participants (i.e., 16 men and 16 women) whose signals were processed. Regrettably and surprisingly, the eight participants whose data was not processed had all participated in the office-like environment. So, 20 participants took part in a home-like environment and 12 in an office-like environment. Conveniently, among these 32 participants, men and women were equally represented in both environments.
5.5.2 Speech signal
For each participant, approximately 25 minutes of sound was recorded during the study. However, since only the parts in which the participants spoke are of interest, the parts in which they did not speak were omitted from further processing. Some preprocessing of the speech signal was required before the features could actually be extracted. We started with the segmentation of the recorded speech signal, such that a separate speech signal was obtained for each picture. Next, abnormalities in the speech signals were removed; this resolved technical inconveniences such as recorded breathing, tapping on the table, coughing, throat clearing, and yawning. The result was a 'clean' signal, as illustrated in Figures 5.3a and 5.3b.
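How the spoken parts were separated from the silent parts is not detailed here; purely as an illustration, a short-time energy gate can approximate such a selection automatically. The sketch below is a minimal Python version; the frame length, threshold ratio, and function name are assumptions for illustration, not the procedure used in this study.

```python
import numpy as np

def select_speech_frames(x, fs, frame_ms=25, threshold_ratio=0.1):
    """Crude voice-activity gate: keep only frames whose short-time
    energy exceeds a fraction of the loudest frame's energy."""
    frame_len = int(fs * frame_ms / 1000)
    n_frames = len(x) // frame_len
    frames = np.reshape(x[:n_frames * frame_len], (n_frames, frame_len))
    energy = np.mean(frames ** 2, axis=1)      # short-time energy per frame
    keep = energy > threshold_ratio * energy.max()
    return frames[keep].ravel()                # concatenate retained frames
```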
After the selection of the appropriate speech signal segments and their normalization, the feature extraction was conducted. Several parameters derived from speech have been investigated in a variety of settings with respect to their use in the determination of people’s emotional state. Although no general consensus exists concerning the parameters to be used, much evidence exists for the standard deviation (SD) of the fundamental frequency (F0) (SD F0), the intensity of air pressure (I), and the energy of speech (E) [182, 590, 644, 725, 739]. We will limit the set of features to these, as an extensive comparison of speech features falls beyond the scope of this study.
For a domain [0, T], the energy (E) is defined as:
$$E = \frac{1}{T} \int_0^T x^2(t)\,\mathrm{d}t, \tag{5.1}$$
where x(t) is the amplitude or sound pressure of the signal in Pa (Pascal) [54]. Its discrete equivalent is:
$$E = \frac{1}{N} \sum_{i=0}^{N-1} x^2(t_i), \tag{5.2}$$
where N is the number of samples.
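In other words, Eq. 5.2 is simply the mean of the squared samples. A minimal sketch in Python (the function name is ours):

```python
import numpy as np

def speech_energy(x):
    """Energy E of a speech segment (Eq. 5.2): the mean of the squared
    samples, with x the sound pressure in Pa."""
    x = np.asarray(x, dtype=float)
    return np.mean(x ** 2)
```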
[Figure: four panels showing, from top to bottom, the speech signal's amplitude (mW), its pitch F0 (Hz), its energy as sound pressure (Pa), and its intensity as air pressure (dB), each plotted against time (seconds).]
(a) A speech signal and its features of a person in a relaxed state.
[Figure: the same four panels (amplitude, pitch, energy, and intensity versus time) for a second recording.]
(b) A speech signal and its features of a person in a sad, tensed state.
Figure 5.3: Two samples of speech signals from the same person (an adult man) and their accompanying extracted fundamental frequencies of pitch (F0) (Hz), energy of speech (Pa), and intensity of air pressure (dB). In both cases, energy and intensity of speech show a similar behavior. The difference in variability of F0 between (a) and (b) indicates the difference in experienced emotions.
For a domain [0, T], intensity (I) is defined as:
$$I = 10 \log_{10} \left( \frac{1}{T\,P_0^2} \int_0^T x^2(t)\,\mathrm{d}t \right), \tag{5.3}$$
where P0 = 2 · 10⁻⁵ Pa is the auditory threshold [54]. I is computed over the discrete signal in the following manner:
$$I = 10 \log_{10} \left( \frac{1}{N\,P_0^2} \sum_{i=0}^{N-1} x^2(t_i) \right). \tag{5.4}$$
It is expressed in dB (decibels) relative to P0.
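Eq. 5.4 thus amounts to the mean squared sound pressure, expressed on a dB scale relative to the auditory threshold. A minimal sketch in Python (the function name is ours):

```python
import numpy as np

P0 = 2e-5  # auditory threshold in Pa [54]

def speech_intensity(x, p0=P0):
    """Intensity I of a speech segment in dB relative to p0 (Eq. 5.4)."""
    x = np.asarray(x, dtype=float)
    return 10.0 * np.log10(np.mean(x ** 2) / p0 ** 2)
```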
Both I and E are calculated directly over the clean speech signal. To determine F0 from the clean speech signal, a fast Fourier transform is applied to the signal; subsequently, the SD of F0 is calculated (see also Eq. 5.5). For a more detailed description of the processing scheme, we refer to [53].
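The exact F0 extraction scheme is described in [53]; the sketch below substitutes the common autocorrelation method for it, so the frame length, the pitch search range (75-400 Hz), and the function name are all assumptions, and the voiced/unvoiced decision is omitted.

```python
import numpy as np

def f0_per_frame(x, fs, frame_ms=40, f0_min=75.0, f0_max=400.0):
    """Per-frame F0 estimate via autocorrelation: the lag of the strongest
    peak within the plausible pitch range gives the pitch period."""
    n = int(fs * frame_ms / 1000)
    lag_min, lag_max = int(fs / f0_max), int(fs / f0_min)
    f0s = []
    for start in range(0, len(x) - n + 1, n):
        frame = x[start:start + n] - np.mean(x[start:start + n])
        ac = np.correlate(frame, frame, mode='full')[n - 1:]  # lags 0..n-1
        lag = lag_min + int(np.argmax(ac[lag_min:lag_max]))
        f0s.append(fs / lag)
    return np.asarray(f0s)

# The feature used in this chapter is the variability of F0, e.g.:
# sd_f0 = np.std(f0_per_frame(clean_signal, fs=44100))
```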
5.5.3 Heart rate variability (HRV) extraction
From the ECG signal, a large number of features can be derived that are said to relate to the emotional state of people [14, 304, 327, 349, 672, 676]. However, this research did not aim to provide an extensive comparison of ECG features. Instead, it explored the combination of the ECG signal with the speech signal. Therefore, one well-known, distinctive feature of the ECG was chosen: the variability of the heart rate (HRV).
The output of the ECG measurement belt has a constant (baseline) value during the pause between two heart beats. Each new heart beat is characterized by a typical slope consisting of four elements, called P, Q, R, and S (see Figure 5.4). A heart beat is said to be
Figure 5.4: A schematic representation of an electrocardiogram (ECG) denoting four R-waves, from which three R-R intervals can be determined. Subsequently, the heart rate and its variance (denoted as standard deviation (SD), variability, or mean absolute deviation (MAD)) can be determined.
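Once the R peaks have been located, the features named in the caption follow directly from the R-R intervals. A minimal sketch, assuming the R-peak times are already available (peak detection itself is omitted):

```python
import numpy as np

def hrv_features(r_peak_times):
    """Heart rate and its variance-type features from R-peak times (in s)."""
    rr = np.diff(r_peak_times)   # R-R intervals in seconds
    hr = 60.0 / rr               # instantaneous heart rate in beats per minute
    return {
        "mean HR": np.mean(hr),
        "SD HR": np.std(hr),                          # standard deviation
        "MAD HR": np.mean(np.abs(hr - np.mean(hr))),  # mean absolute deviation
        "variance HR": np.var(hr),
    }
```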