
Chapter 9

Applications (II) – Speech Reception in Sound Fields

Our correlation-based auditory signal processing model can be applied to the perception of a single syllable in a sound field. Reception of speech signals reproduced in the frontal direction for a direct sound with a single echo is well described in terms of temporal factors that are extracted from the running ACF of the sound signal. Effects of noise disturbances from different horizontal angles on error rates of single syllable identification were investigated. Results show that syllable nonidentification (NI) rates can be predicted from temporal factors extracted from the ACF and from spatial factors extracted from the IACF. Perceptual dissimilarity of sounds due to different source locations was assessed in psychophysical experiments and accurately estimated using temporal and spatial factors.

9.1 Effects of Temporal Factors on Speech Reception

This section describes a method of calculating the speech intelligibility (SI) of each syllable in terms of the distance between the direct sound, taken as a template, and the sound field with a single echo. In the calculation of this distance, three temporal factors as well as the sound energy Φ(0) extracted from the ACF are taken into account. The speech transmission index (STI) for the global speech intelligibility of sound fields is well known (Houtgast et al., 1980). We have also proposed a method for calculating the speech intelligibility of single syllables (Ando et al., 1999, 2000).

A perceptual distance between a template source signal and a sound-field signal is introduced. This distance quantifies the extent to which room reverberation alters the perceptual representation of the signal and thereby degrades identification: the greater the perceptual distance between the signal at its source and the signal received by the listener, the more different the two are perceived to be, and the lower the likelihood that the listener will recognize and identify the signal correctly.

Y. Ando, P. Cariani (Guest ed.), Auditory and Visual Sensations, DOI 10.1007/b13253_9, © Springer Science+Business Media, LLC 2009

Let S_K^T denote the characteristics of an isolated template syllable K, and let S_X^SF denote those of another syllable X in the sound field; superscript T refers to the template, and SF to the sound field. Let C_K^T and C_X^SF be the corresponding characteristics of the isolated template syllable K and of syllable X in the sound field, as processed in the auditory-brain system. The distance between S_K^T and S_X^SF is then given by

\[
d_K = D\!\left(S_X^{SF}, S_K^{T}\right) = D\!\left(C_X^{SF}, C_K^{T}\right) \qquad (9.1)
\]

where

\[
\begin{aligned}
D_{\tau_e}(X,K) &= \frac{1}{I}\sum_{i=1}^{I}\left|\log\!\left(\tau_{e,i}\right)_X^{SF} - \log\!\left(\tau_{e,i}\right)_K^{T}\right| \\
D_{\phi_1}(X,K) &= \frac{1}{I}\sum_{i=1}^{I}\left|\log\!\left(\phi_{1,i}\right)_X^{SF} - \log\!\left(\phi_{1,i}\right)_K^{T}\right| \\
D_{\tau_1}(X,K) &= \frac{1}{I}\sum_{i=1}^{I}\left|\log\!\left(\tau_{1,i}\right)_X^{SF} - \log\!\left(\tau_{1,i}\right)_K^{T}\right| \\
D_{\Phi(0)}(X,K) &= \frac{1}{I}\sum_{i=1}^{I}\left|\log\!\left(\frac{\Phi(0)_i}{\Phi(0)_{\max}}\right)_{\!X}^{SF} - \log\!\left(\frac{\Phi(0)_i}{\Phi(0)_{\max}}\right)_{\!K}^{T}\right|
\end{aligned}
\qquad (9.2)
\]

where K and X represent the syllable number of the template (T) and of the syllable in the sound field (SF), respectively; i is the frame number of the running ACF, and I is the total number of frames.
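The per-factor distances of Eq. (9.2) are mean absolute differences of log-transformed factor values across the frames of the running ACF. A minimal sketch in Python follows; the function name and the array-based interface are our own, not the authors':

```python
import numpy as np

def factor_distance(template_frames, field_frames):
    """Mean absolute log-difference between the frame-wise tracks of one
    ACF factor (e.g. tau_e per frame) for a template syllable and for a
    syllable in the sound field, following the form of Eq. (9.2).

    Both inputs are 1-D arrays of equal length I holding strictly
    positive factor values, one per running-ACF frame."""
    t = np.asarray(template_frames, dtype=float)
    f = np.asarray(field_frames, dtype=float)
    if t.shape != f.shape:
        raise ValueError("template and field tracks must have equal length")
    if np.any(t <= 0) or np.any(f <= 0):
        raise ValueError("ACF factors must be positive for the log distance")
    # (1/I) * sum_i | log(f_i) - log(t_i) |
    return float(np.mean(np.abs(np.log(f) - np.log(t))))
```

The same helper applies to each of the four factors; for Φ(0), the tracks would first be normalized by their respective maxima, as in the last line of Eq. (9.2).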

Let N be the total number of single syllables; then the percentile SI of syllable K for one of the four factors may be obtained as

\[
SI_K(\psi) = \frac{100}{N}\,\exp\!\left[-\left(\frac{d_K}{d_1}+\cdots+\frac{d_K}{d_{K-1}}+\frac{d_K}{d_{K+1}}+\cdots+\frac{d_K}{d_N}\right)\right] \qquad (9.3)
\]

where ψ = τe, τ1, φ1, or Φ(0).
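Equation (9.3) can be sketched in code. Note that the printed formula is damaged in this copy, so the exact normalization (100/N) and the ratio structure d_K/d_j are our reading of the fragments; treat this as illustrative only:

```python
import numpy as np

def syllable_si(distances, K):
    """Percentile intelligibility of syllable K for one factor, from the
    distances d_1..d_N between the sound-field syllable and every
    template syllable (our reading of Eq. 9.3).

    distances: 1-D array of N positive distances; K: 0-based index of
    the target syllable."""
    d = np.asarray(distances, dtype=float)
    N = len(d)
    others = np.delete(d, K)        # all d_j with j != K
    ratios = d[K] / others          # d_K / d_j: small when K is the best match
    return 100.0 / N * np.exp(-np.sum(ratios))
```

Intuitively, when d_K is much smaller than every other distance, the exponent is near zero and SI approaches its maximum; when d_K is comparable to or larger than the other distances, SI falls off rapidly.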

Let us now demonstrate an example of estimating the intelligibility of single syllables in a sound field consisting of a direct sound and a single echo. The amplitude of the echo was the same as that of the direct sound, and the delay time of the single echo, Δt1, was varied between 0 and 480 ms. The test signal consisted of Japanese single syllables with maskers (Fig. 9.1). The masker was used to control the percentage of correctly identified signals over a wide range, thereby avoiding saturation effects when identification would otherwise approach 100%. The direct sound without maskers was used as the template. As shown in Fig. 9.2, the important initial part of the normalized running ACF, within the range given by Equation (9.4), was analyzed.


Fig. 9.1 An example of a single syllable with artificial nonmeaningful forward and backward maskers. The direct sound without maskers was used as a template. CV is a single syllable consisting of a consonant and a vowel

Fig. 9.2 Relative sound energy of a single syllable as a function of time. The important former half of the running ACF, for which Φ(0) > 0.5, of both a template and a test syllable was analyzed. C and V are the consonant and vowel parts, respectively

\[
\Phi(0) > 0.5 \qquad (9.4)
\]

In the analyses, two distances were calculated with the four factors: one between the template and the direct sound with maskers, the other between the template and the echo with maskers. The shorter of the two distances was selected in the calculation of syllable intelligibility. Using the distances given by Equation (9.2), we calculated the intelligibility of all syllables due to each factor by means of Equation (9.3). The total SI due to the four factors (see Section 5.2) is obtained by combining them linearly, so that

SI = a SI(τe) + b SI(τ1) + c SI(φ1) + d SI(Φ(0))

(9.5)

where a, b, c, and d are weighting coefficients signifying the contributions of the four factors, determined so as to maximize the agreement between calculated and tested SI. If the sound-pressure level is fixed and the spatial factors are invariant, then the fourth term is eliminated, and SI is described by the temporal factors associated with left-hemisphere specialization.
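The weighting coefficients a–d are obtained by multiple regression of the tested SI onto the four calculated per-factor SI values. A hedged sketch using ordinary least squares (function name and data layout are assumptions, not the authors' procedure):

```python
import numpy as np

def fit_weights(si_factors, si_tested):
    """Least-squares estimate of the coefficients a, b, c, d of Eq. (9.5).

    si_factors: (n_conditions, 4) array whose columns are the calculated
        SI(tau_e), SI(tau_1), SI(phi_1), SI(Phi(0)) for each test condition.
    si_tested: (n_conditions,) array of measured intelligibility.
    Returns the 4-vector of weights that best reproduces si_tested."""
    A = np.asarray(si_factors, dtype=float)
    y = np.asarray(si_tested, dtype=float)
    coeffs, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
    return coeffs
```

Normalizing the fitted weights by their maximum, as done for Table 9.2, then shows the relative contribution of each factor.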

Japanese syllables classified by the consonant and vowel categories of Table 9.1 were used because listeners recognize these phonetic elements in syllables with minimal confusion (Korenaga, 1997). The frontal loudspeaker in an anechoic chamber


Table 9.1 Categorization of Japanese single syllables

(a) Unvoiced consonants

                             Consonant
  Vowel                  K     S     T     H     P
  Not contracted    A    KA    SA    TA    HA    PA
  (Category A)      I    KI    SI    TI    HI    PI
                    U    KU    SU    TU    HU    PU
                    E    KE    SE    TE    HE    PE
                    O    KO    SO    TO    HO    PO
  Contracted        YA   KYA   SYA   TYA   HYA   PYA
  (Category B)      YU   KYU   SHU   TYU   HYU   PYU
                    YO   KYO   SHO   TYO   HYO   PYO

(b) Voiced consonants

                             Consonant
  Vowel                  N     M     Y     R     W     G     Z     D     B
  Not contracted    A    NA    MA    YA    RA    WA    GA    ZA    DA    BA
  (Category C)      I    NI    MI    -     RI    -     GI    ZI    -     BI
                    U    NU    MU    YU    RU    -     GU    ZU    -     BU
                    E    NE    ME    -     RE    -     GE    ZE    DE    BE
                    O    NO    MO    YO    RO    -     GO    ZO    DO    BO
  Contracted        YA   NYA   MYA   -     RYA   -     GYA   ZYA   -     BYA
  (Category D)      YU   NYU   MYU   -     RYU   -     GYU   ZYU   -     BYU
                    YO   NYO   MYO   -     RYO   -     GYO   ZYO   -     BYO

reproduced both the direct sound and the echo, as well as the noise masker. The running ACF was analyzed with an integration interval of 2T = 30 ms and a running step of 10 ms. Twenty-one subjects participated in the experiments, so that a single wrong judgment corresponds to an error of about 5% (= 1/21).
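The running-ACF analysis just described (2T = 30 ms integration interval, 10 ms step) can be sketched as follows; the framing and normalization details here are assumptions for illustration, not the authors' exact implementation:

```python
import numpy as np

def running_acf(x, fs, win_ms=30.0, step_ms=10.0):
    """Normalized running autocorrelation of a signal: one ACF per
    30-ms frame, hopped every 10 ms (the settings used in the text).

    x: 1-D signal array; fs: sampling rate in Hz.
    Returns an array of shape (n_frames, win_samples); each row is a
    one-sided ACF normalized so that phi(0) = 1."""
    win = int(fs * win_ms / 1000.0)
    step = int(fs * step_ms / 1000.0)
    frames = []
    for start in range(0, len(x) - win + 1, step):
        seg = x[start:start + win]
        # one-sided autocorrelation of the segment (lags 0..win-1)
        acf = np.correlate(seg, seg, mode="full")[win - 1:]
        frames.append(acf / acf[0])  # normalize by the zero-lag energy
    return np.array(frames)
```

From each row, the factors τ1 and φ1 (delay and amplitude of the first ACF peak) and τe (effective duration, from the decay of the ACF envelope) would then be extracted; the unnormalized zero-lag value acf[0] of each frame gives Φ(0).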

Results of both calculated and tested intelligibility for a single syllable from each category, as a function of the delay time of the echo, are shown in Fig. 9.3a–d. Averaged results for each category are shown in Fig. 9.4a–d, and averaged values for all syllables are shown in Fig. 9.5. Because the single syllable is heard twice when Δt1 > 100 ms, local maxima appear around 300 ms. For reflection delays longer than 300 ms, intelligibility decreases because of the masker. Even so, calculated values agree well with tested values, which implies that the four factors extracted from the ACF of the source signals and the sound-field signals are effective in identifying speech. Contributions of the four factors to SI, obtained by multiple regression, are listed in Table 9.2. The most significant of the four factors is the effective duration of the ACF, τe.

It is concluded, therefore, that:



Fig. 9.3 Examples of calculated and tested intelligibilities for the single syllable belonging to each category. (a) /ha/. (b) /be/. (c) /hya/. (d) /nyu/. Twenty-one subjects participated in the experiments


Fig. 9.4 Averaged results of calculated and tested intelligibilities for each category. (a) Category A. (b) Category B. (c) Category C. (d) Category D


Fig. 9.5 Mean values of calculated and tested intelligibilities for all syllables used

Table 9.2 Contribution of each temporal factor extracted from the running ACF to speech intelligibility. Values are normalized by the maximum of the four coefficients obtained from the multiple regression analyses

  Consonant   Vowel   τe     τ1     φ1     Φ(0)
  Unvoiced    A       0.50   0.60   0.36   1.00
              U       0.07   0.08   0.25   1.00
              E       1.00   0.66   0.23   0.46
              O       0.62   1.00   0.13   0.25
              YA      1.00   0.17   0.55   0.56
              YU      1.00   0.19   0.30   0.47
              YO      1.00   0.30   0.41   0.88
  Voiced      A       0.74   0.92   0.89   1.00
              I       0.88   0.38   0.01   1.00
              U       1.00   0.80   0.30   0.29
              E       0.19   0.42   1.00   0.01
              O       1.00   0.25   0.09   0.80
              YA      1.00   0.17   0.55   0.56
              YU      1.00   0.19   0.30   0.47
              YO      0.25   0.21   1.00   0.51

1. Speech identification in a sound field may be described by the four factors extracted from the running ACF of the target signal and of the sound field containing the signal.

2. The most significant factor is the effective duration of the ACF, τe, which is intimately related to [Δt1]p and whose cortical response correlates are mainly associated with the left hemisphere (Table 5.1).