
Towards Predicting Consonant Confusions of Degraded Speech


of 40). The lower-right four panels show the analogous display for the response of State-2 neurons to the word tont. The lower-left (and the upper-right) panels show the response to the opposite word. The response to stimuli matched to the state neurons peaks at a time instance associated with the end time of the initial diphone. For stimuli of the opposite token there is only a small response. Further study of the TMC is underway.

4 Summary

We are developing a model of diphone perception based on models of salient properties of peripheral and central auditory processing. The model comprises a closed-loop peripheral auditory model (which provides a representation of speech that is robust against sustained background noise) connected to a template-match circuit based upon basic concepts of neural processing. Our strategy is to tune the PAM in isolation from the TMC, then freeze the PAM and use it to tune the parameters of the TMC. As probe-stimuli we use speech in the presence of noise. Speech stimuli provide rich, relevant time-varying spectral patterns, and the presence of noise imposes focus on the salient speech cues. Our measure of success is the ability of the model to predict consonant confusions in noise.

Acknowledgment. This work is supported by the U.S. Air Force Office of Scientific Research.

References

Ghitza O (1993) Processing of spoken CVCs in the auditory periphery: I. Psychophysics. JASA 94(5):2507–2516

Goldstein JL (1990) Modeling rapid waveform compression on the basilar membrane as a multiple-bandpass-nonlinearity filtering. Hear Res 49:39–60

Hant JJ, Alwan A (2003) A psychoacoustic-masking model to predict the perception of speechlike stimuli in noise. Speech Commun 40:291–313

Hopfield JJ (2004) Encoding for computation: recognizing brief dynamical patterns by exploiting effects of weak rhythms on action-potential timing. PNAS 101(16):6255–6260

Voiers WD (1983) Evaluating processed speech using the diagnostic rhyme test. Speech Technol 1(4):30–39

Comment by Greenberg

The task used in your study has a very limited number of alternatives (two); it is essentially a 1-bit response paradigm. Although there are compelling reasons for using a task with such limited entropy, do you believe that the improvement in performance observed using the efferent system component in your representation would generalize to tasks with far greater entropy (such as open-set word identification)?

Reply

Our model comprises an efferent-inspired peripheral auditory model (PAM) connected to a template-match circuit (TMC). We believe that robustness against background noise is provided principally by the signal processing performed by the peripheral circuitry, and that consonant confusions (e.g. in open-set word identification) result from errors in the internal representation (since the “shield” provided by the periphery is not perfect) and from computational properties of the neural template-matching circuit. Your comment, therefore, raises two questions: (1) does the resulting closed-loop cochlear model, which matches human performance in the DRT task for noisy speech, capture the signal processing principles that are indeed responsible for providing the shield against background noise? and (2) assuming that the answer to question (1) is yes, do we believe that the template-matching circuit (suggested in Sect. 3) can be tuned to predict consonant confusions in an open-set task? The second question is currently being studied; hence we cannot provide an answer yet. As for question (1), our methodology calls for adjusting the parameters of the PAM with minimal interference from the TMC. The reason for choosing the DRT paradigm, which is binary in nature, is to reduce the role of the back-end to a minimum. Note that although we predict human performance in a binary task, the parameters of the model were tuned to match errors between minimal pairs jointly along all Jakobsonian dimensions. Hence we believe that the spectro-temporal patterns generated by the resulting closed-loop cochlear model are an adequate model of the internal representation of degraded speech.

59 The Influence of Masker Type on the Binaural Intelligibility Level Difference

S. THEO GOVERTS¹, MARIEKE DELREUX², JOOST M. FESTEN¹, AND TAMMO HOUTGAST¹

1 Introduction

The Binaural Intelligibility Level Difference (BILD) was first described by Licklider (1948). It is a manifestation of binaural unmasking, the advantage of binaural over monaural hearing of a signal S against the background of a spatially separated noise N. Headphone experiments often compare an N0S0 presentation with an N0Sπ presentation, in which the noise is presented homophasic and the signal either homophasic or antiphasic. The BILD is then defined as the difference between the speech reception thresholds (SRTs) in the N0S0 and N0Sπ presentation modes. Blauert (1997) provides an overview of experimental work on the binaural masking level difference (BMLD) and the BILD. Estimating the BILD requires SRT measurements in both presentation modes. Since the diotic N0S0 stimuli contain no binaural information, the N0S0 SRT can be considered an estimate of monaural speech perception (Siegel and Colburn 1983). The BILD for a stationary masker is known to be about 4–7 dB (e.g. Blauert 1997; Johansson and Arlinger 2002).

We are interested in the BILD for fluctuating maskers because of their relevance for daily life. When assessing speech intelligibility in the presence of a fluctuating masker in an N0Sπ condition, several components should be taken into account (see Fig. 1). The point of departure is the diotic speech reception threshold (SRT) for stationary noise (Fig. 1a). If temporal modulations are introduced there will be a release of masking (masking release, MR), so the SRT is reduced (Fig. 1b). A typical value of this masking release is 10 dB. On the other hand, if an interaural phase shift is introduced in the speech signal, the SRT will be reduced because of binaural unmasking (Fig. 1c). A typical value of this binaural unmasking is 5 dB. The question addressed in this chapter is whether there is an interaction between the diotic (“monaural”) release of masking and the binaural unmasking. In other words, can the reduction in SRT in the condition with modulated noise and interaurally phase-shifted speech be predicted by adding the values of masking release and binaural unmasking, i.e. 10 + 5 = 15 dB, or is it different?

¹ Audiology, ENT, VU University Medical Center, Amsterdam, Netherlands, st.goverts@vumc.nl, jm.festen@vumc.nl, t.houtgast@vumc.nl

² EXP ORL, Leuven University, Leuven, Belgium, marieke.delreux@student.kuleuven.be

Hearing – From Sensory Processing to Perception

B. Kollmeier, G. Klump, V. Hohmann, U. Langemann, M. Mauermann, S. Uppenkamp, and J. Verhey (Eds.) © Springer-Verlag Berlin Heidelberg 2007


S.T. Goverts et al.

Fig. 1 Envelope of speech and masker (with estimated forward masking): a point of departure, stationary masker and an N0S0 presentation; b masking release, fluctuating masker and an N0S0 presentation; c binaural unmasking, fluctuating masker and an N0Sπ presentation; d masking release + binaural unmasking, fluctuating masker and an N0Sπ presentation

The BILD is investigated for: (1) 16-Hz block-modulated speech-shaped noise; (2) speech-shaped noise with speech-like fluctuations; and (3) ongoing speech of an interfering same-sex talker. Since the spectro-temporal content of maskers (2) and (3) is roughly the same, we hypothesize that the effect of informational masking can be assessed by comparing the results for these two maskers.

2 Materials and Method

2.1 Stimuli

The speech stimuli consisted of lists of 13 everyday Dutch sentences of eight to nine syllables read by a female speaker (Versfeld et al. 2000). Based on this speech material, a stationary masking noise was derived with a long-term spectrum that resembled the long-term spectrum of the female voice. From this stationary noise two fluctuating noises were derived: (1) a 16-Hz block-modulated speech-shaped noise with a duty cycle of 50% and a modulation depth of 100%; and (2) a speech-shaped noise with speech-like fluctuations (Festen and Plomp 1990). Finally, continuous speech of an interfering female speaker (Plomp and Mimpen 1979) was used as a masker with additional informational content. The entire experiment was controlled by a personal computer. Subjects were tested individually in a soundproof room.

Table 1 List of conditions

| Condition | Masker type | Level [dB] |
|---|---|---|
| ST65 | Stationary noise | 65 |
| BL65 | Block-modulated noise | 65 |
| BL75 | Block-modulated noise | 75 |
| FL65 | Noise with speech-like fluctuations | 65 |
| FL75 | Noise with speech-like fluctuations | 75 |
| TA65 | Interfering, same-sex talker | 65 |
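The block modulation of masker (1) can be illustrated with a short sketch. It gates a noise signal on and off at 16 Hz with a 50% duty cycle and 100% modulation depth, as described above. The function name `block_modulate`, the 16-kHz sampling rate, and the use of white Gaussian noise as a stand-in for the speech-shaped noise are assumptions made for illustration only; the chapter does not specify how the stimuli were generated.

```python
import random


def block_modulate(noise, fs_hz, mod_rate_hz=16.0, duty_cycle=0.5, depth=1.0):
    """Apply on/off (block) modulation to a list of noise samples.

    During the first `duty_cycle` fraction of each modulation period the
    noise passes unchanged; during the rest it is scaled by (1 - depth),
    i.e. switched off completely for a depth of 100%.
    """
    period = fs_hz / mod_rate_hz  # samples per modulation cycle
    out = []
    for n, x in enumerate(noise):
        phase = (n % period) / period  # position within the cycle, 0..1
        gain = 1.0 if phase < duty_cycle else 1.0 - depth
        out.append(x * gain)
    return out


if __name__ == "__main__":
    fs_hz = 16000  # assumed sampling rate
    rng = random.Random(0)
    # White Gaussian noise as a placeholder for speech-shaped noise.
    noise = [rng.gauss(0.0, 1.0) for _ in range(fs_hz)]
    masker = block_modulate(noise, fs_hz)
    silent = sum(1 for x in masker if x == 0.0)
    print(f"fraction of gated-off samples: {silent / len(masker):.2f}")
```

With these settings each 62.5-ms cycle consists of 31.25 ms of noise followed by 31.25 ms of silence.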

2.2 Subjects

Five female and seven male normal-hearing subjects participated in this study. Their thresholds did not exceed 15 dB HL and their ages ranged between 20 and 30 years. All subjects were native speakers of Dutch.

2.3 Conditions and Procedure

The BILD, defined as the difference in threshold for speech in noise between diotic presentation of speech (SRT_N0S0) and dichotic presentation of speech (SRT_N0Sπ), is deduced for the six masker conditions listed in Table 1. The SRT in noise is defined as the signal-to-noise ratio (SNR) at which 50% of the sentences are reproduced correctly. Measurement procedures were in accordance with Plomp and Mimpen (1979), varying the SNR adaptively in an up-down procedure using 2-dB steps. Each single SRT measurement is based on one 13-sentence list. All measurements were performed three times. Because we were interested in differences between the BILDs in the six conditions, the order of conditions was balanced over subjects.
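The adaptive tracking idea can be sketched as follows. This is a minimal one-up/one-down rule with 2-dB steps over a 13-sentence list, which converges on the 50%-correct point; it is not a reproduction of the Plomp and Mimpen (1979) procedure, whose details (for example, how the first sentence is handled and which trials enter the average) are not given in this chapter. The function name `adaptive_srt`, the starting SNR, the choice to discard the first four trials, and the deterministic simulated listener are all assumptions for illustration.

```python
from typing import Callable, List


def adaptive_srt(
    respond: Callable[[float], bool],
    start_snr_db: float = 0.0,
    step_db: float = 2.0,
    n_sentences: int = 13,
    n_discard: int = 4,
) -> float:
    """Estimate the SRT with a simple one-up/one-down staircase.

    `respond(snr_db)` returns True if the sentence presented at that SNR
    was reproduced correctly. After a correct response the SNR is lowered
    by `step_db`; after an incorrect one it is raised by `step_db`.
    The estimate is the mean presentation SNR after the first `n_discard`
    trials (an assumed averaging rule).
    """
    snr = start_snr_db
    track: List[float] = []
    for _ in range(n_sentences):
        track.append(snr)
        correct = respond(snr)
        snr += -step_db if correct else step_db
    kept = track[n_discard:]
    return sum(kept) / len(kept)


if __name__ == "__main__":
    # Hypothetical listener who is correct whenever the SNR is at or
    # above -4.5 dB (the mean diotic SRT for stationary noise).
    estimate = adaptive_srt(lambda snr_db: snr_db >= -4.5)
    print(f"estimated SRT: {estimate:.1f} dB")
```

A probabilistic listener (for example, a psychometric function sampled with a random number generator) can be passed in place of the deterministic one; the estimate then varies from run to run, which is one reason each measurement was repeated three times.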

3 Results

For the 12 normal-hearing subjects the SRT was measured with N0S0 and N0Sπ presentation as described in the Methods section. For each condition the BILD was calculated as SRT_N0S0 minus SRT_N0Sπ, and the MR as the diotic SRT for the stationary masker minus the diotic SRT for that condition (SRT_N0S0 for ST65 minus SRT_N0S0 for the condition).
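Written out, the two derived quantities are as follows. The subscript c for the masker condition is notation introduced here for clarity and is not used in the chapter; the worked values use the mean SRTs reported in Table 2.

```latex
\begin{align}
\mathrm{BILD}_c &= \mathrm{SRT}_{c,\,N_0S_0} - \mathrm{SRT}_{c,\,N_0S_\pi} \\
\mathrm{MR}_c   &= \mathrm{SRT}_{\mathrm{ST65},\,N_0S_0} - \mathrm{SRT}_{c,\,N_0S_0}
\end{align}
% Worked examples from the Table 2 means:
\begin{align}
\mathrm{BILD}_{\mathrm{ST65}} &= -4.5 - (-9.1) = 4.6\ \mathrm{dB} \\
\mathrm{MR}_{\mathrm{BL65}}   &= -4.5 - (-14.8) = 10.3\ \mathrm{dB}
\end{align}
```

Values recomputed from the rounded table means can differ from the tabulated BILD and MR by about 0.1 dB (for example, −14.8 − (−17.3) = 2.5 dB against a tabulated 2.6 dB for BL65), presumably because the tabulated values were averaged before rounding.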


Average results and standard deviations are listed in Table 2. An ANOVA was performed on the data to test the significance of differences. For the BILD, results that differ significantly from the ST65 condition are marked with an asterisk (*).

Table 2 Mean data and standard deviations of the measured SRTs and calculated BILDs and MRs. For the BILD, results that are significantly different from the BILD for ST65 are indicated by *

| Condition | SRT_N0S0 [dB] mean | SRT_N0S0 [dB] stdev | SRT_N0Sπ [dB] mean | SRT_N0Sπ [dB] stdev | BILD [dB] mean | BILD [dB] stdev | MR [dB] mean | MR [dB] stdev |
|---|---|---|---|---|---|---|---|---|
| ST65 | −4.5 | 0.5 | −9.1 | 0.7 | 4.6 | 0.5 | | |
| BL65 | −14.8 | 1.8 | −17.3 | 1.8 | 2.6* | 0.8 | 10.3 | 2.0 |
| BL75 | −18.5 | 2.7 | −20.4 | 2.2 | 1.9* | 1.0 | 14.0 | 2.9 |
| FL65 | −11.5 | 0.8 | −14.5 | 0.8 | 2.9* | 1.2 | 7.1 | 0.8 |
| FL75 | −11.4 | 1.2 | −14.8 | 1.1 | 3.4 | 1.6 | 6.9 | 1.3 |
| TA65 | −11.8 | 1.5 | −14.2 | 1.4 | 2.3* | 1.8 | 7.3 | 1.4 |

Fig. 2 a Mean BILD data and standard deviations. b Mean SRT data for N0S0 (open circles) and N0Sπ and standard deviations. c Mean BILD data (and standard deviations) plotted vs the absolute level of the diotic speech for the different conditions. d Mean BILD data (and standard deviations) plotted vs the diotic SRT for the different conditions, which is directly related to the masking release for those conditions

The data for the BILD and for the SRT in N0S0 and N0Sπ presentation mode are also plotted in Fig. 2a,b. The SRT_N0S0 data for the different conditions, and thus the MR data, are in line with the literature. The BILD results show that the binaural unmasking for the fluctuating maskers is reduced compared to the stationary noise. This is in line with Carhart et al. (1966), who found a slightly reduced BILD for modulated compared to stationary noise (3.9 vs 4.5 dB). Coming back to the question posed in the introduction, the total advantage in the BL65 N0Sπ presentation mode compared to the diotic ST65 condition is about 13 dB (−4.5 − (−17.3) = 12.8 dB), which is less than the sum of binaural unmasking and masking release, 4.6 + 10.3 = 14.9 ≈ 15 dB.
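The additivity question can be checked for every fluctuating masker with a few lines of arithmetic. The sketch below uses only the mean SRTs from Table 2; the names `MEAN_SRT_DB` and `additivity_check` are introduced here for illustration. Because it works from rounded means, results may differ from the tabulated BILD and MR values by about 0.1 dB.

```python
# Mean SRTs in dB from Table 2: condition -> (SRT_N0S0, SRT_N0Spi)
MEAN_SRT_DB = {
    "ST65": (-4.5, -9.1),
    "BL65": (-14.8, -17.3),
    "BL75": (-18.5, -20.4),
    "FL65": (-11.5, -14.5),
    "FL75": (-11.4, -14.8),
    "TA65": (-11.8, -14.2),
}


def additivity_check(condition, reference="ST65"):
    """Return (additive prediction, observed total advantage) in dB.

    Additive prediction: BILD of the stationary reference plus the
    masking release of the condition.
    Observed total: diotic reference SRT minus the dichotic SRT of the
    condition.
    """
    ref_diotic, ref_dichotic = MEAN_SRT_DB[reference]
    diotic, dichotic = MEAN_SRT_DB[condition]
    bild_reference = ref_diotic - ref_dichotic
    masking_release = ref_diotic - diotic
    predicted = bild_reference + masking_release
    observed = ref_diotic - dichotic
    return round(predicted, 1), round(observed, 1)


if __name__ == "__main__":
    for name in ("BL65", "BL75", "FL65", "FL75", "TA65"):
        predicted, observed = additivity_check(name)
        print(f"{name}: additive prediction {predicted} dB, observed {observed} dB")
```

Algebraically, the observed total equals the masking release plus the BILD measured in that same condition, so the shortfall relative to the additive prediction is simply the difference between the stationary-noise BILD and the BILD for the fluctuating masker.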

4 Analysis and Discussion

4.1 Relation of BILD with Absolute Level of the Diotic Speech

The reduced binaural unmasking might be caused by a reduced absolute level (Blauert 1997). In an earlier study, we found a dependence of binaural unmasking on level, typically for levels below 50 dB SPL (Goverts 2004; Goverts and Houtgast 2007). To investigate this, the mean BILD is plotted in Fig. 2c vs the absolute level of the diotic speech for the different conditions. There is no strong relation between the BILD and the average absolute level of the diotic speech, so the difference in absolute level does not appear to explain the difference in binaural unmasking. This is in line with other literature findings (e.g. Carhart et al. 1966, 1969).

4.2 The Relation Between Masking Release and BILD

In Fig. 2d the mean BILD is plotted vs the SRT of the diotic speech for the different conditions, which is, of course, directly related to masking release. The BILD appears to be related to the SRT for diotic speech: the data suggest less binaural unmasking for conditions in which a larger diotic (“monaural”) masking release is found. To understand this relation we should inspect the envelopes of the different maskers. Figure 3a–c shows, for ST65, BL65, and FL65 respectively, the masker envelope together with the long-term average level of the speech in the diotic condition. The proportion of masked speech in the diotic presentation varies considerably among the three maskers. This is further illustrated in Fig. 3d–f, which gives the distribution of instantaneous signal-to-noise ratios. Our hypothesis is that the reduced BILD for the fluctuating maskers is caused by a reduced proportion of time for which the instantaneous signal-to-noise ratio is in the range in which binaural unmasking is active.

Fig. 3 a–c The envelopes of the maskers ST65, BL65, and FL65, respectively, are plotted as dots. For this qualitative analysis, envelopes of the broadband signals are calculated. Forward masking is estimated assuming a decay to zero in 200 ms on a log-timescale. For each condition the long-term average of speech at the diotic SRT level is given by the drawn line. d–f The distribution of instantaneous signal-to-noise ratios for those conditions

To model these results, we assume a binaural unmasking of 5 dB for all signal-to-noise ratios up to a critical value (CV), where unmasking is active, and no unmasking at all for higher signal-to-noise ratios. For this critical value two possibilities can be considered: (1) CV = 15 dB, since for signal-to-noise ratios above 15 dB speech perception is not influenced by the noise (as is well known from the Speech Intelligibility Index (SII) approach, for example); and (2) CV = 0 dB, since for signal-to-noise ratios above 0 dB normal-hearing listeners reach an intelligibility of 100% for sentences in stationary noise. To evaluate this hypothesis in a very qualitative way we computed a weighted BILD, multiplying the distribution by a simple weighting function of 5 for signal-to-noise ratios up to CV and 0 for higher ratios. The results are given in Fig. 4. Comparing these to the actual BILD data of Fig. 2a, we find a rather good correspondence, especially for CV = 0. Thus the reduced binaural unmasking for fluctuating maskers can be understood in terms of the reduced presence of the effective diotic masker, compared to a stationary masker. Because of the temporal gaps in the noise, during which the masker level in the diotic condition falls below the long-term average level of the speech, binaural unmasking cannot be as effective as in stationary noise. Along these lines we can understand the data for all maskers used.

Fig. 4 Weighted BILD values for the six conditions for two values of the critical value (CV) above which binaural unmasking is modeled to be not effective
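The weighting described in this section amounts to multiplying 5 dB by the proportion of time for which the instantaneous signal-to-noise ratio is at or below the critical value. A minimal sketch, assuming the distribution is available as a list of equally spaced instantaneous SNR samples, is given below. The function name `weighted_bild` and the example SNR values are hypothetical; they are not the envelopes or distributions of Fig. 3, and the treatment of an SNR exactly equal to CV as "active" is also an assumption.

```python
from typing import Sequence


def weighted_bild(
    inst_snr_db: Sequence[float],
    cv_db: float,
    unmasking_db: float = 5.0,
) -> float:
    """Weighted BILD for a set of instantaneous SNR samples.

    Each sample contributes `unmasking_db` if its SNR is at or below the
    critical value `cv_db` and 0 dB otherwise; the result is the mean
    contribution over all samples.
    """
    if len(inst_snr_db) == 0:
        raise ValueError("at least one SNR sample is required")
    active = sum(1 for snr in inst_snr_db if snr <= cv_db)
    return unmasking_db * active / len(inst_snr_db)


if __name__ == "__main__":
    # Hypothetical SNR samples: a stationary masker keeps the SNR low at
    # all times, while an on/off masker leaves half of the time unmasked.
    stationary = [-5.0] * 10
    gated = [-5.0] * 5 + [30.0] * 5
    for cv_db in (0.0, 15.0):
        print(
            f"CV = {cv_db:4.1f} dB: "
            f"stationary {weighted_bild(stationary, cv_db):.1f} dB, "
            f"gated {weighted_bild(gated, cv_db):.1f} dB"
        )
```

In this toy case the gated masker yields half the unmasking of the stationary one, which illustrates the direction of the effect reported in the chapter; the actual analysis additionally includes the estimated forward masking in the gaps.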

4.3 Informational Masking

Comparing the data for the conditions FL65 and TA65, we see no additional effect of informational masking. The results for the TA65 condition can be understood on the basis of the spectro-temporal and energetic properties of the masker. Binaural unmasking is not influenced by informational masking, at least not of the type and degree used in this study. This is in line with Carhart et al. (1969), who found similar masking release and binaural unmasking for modulated noise and interfering speech.


5 Conclusion

Binaural unmasking of speech for fluctuating maskers is reduced in comparison to results in stationary noise. Departing from diotic (“monaural”) speech intelligibility in noise, the effects of masking release and binaural unmasking cannot simply be added to predict dichotic speech intelligibility in a masker with temporal fluctuations. The reduction of binaural unmasking can be understood in terms of the reduced presence of the effective diotic masker in conditions of masking release. Using this hypothesis we can, in a very qualitative way, predict binaural unmasking for a block-modulated masker, a masker with speech-like modulations, and an interfering talker. This means that the relative importance of binaural unmasking diminishes in daily life for normal-hearing subjects.

Acknowledgements. We thank Hans van Beek for technical assistance.

References

Blauert J (1997) Spatial hearing: the psychophysics of human sound localization, rev edn. MIT Press, Cambridge

Carhart RT, Tillman TW, Greetis ES (1966) Binaural masking of speech by periodically modulated noise. J Acoust Soc Am 39:1037–1050

Carhart RT, Tillman TW, Greetis ES (1969) Perceptual masking in multiple sound backgrounds. J Acoust Soc Am 45:694–703

Festen JM, Plomp R (1990) Effects of fluctuating noise and interfering speech on the speech-reception threshold for impaired and normal hearing. J Acoust Soc Am 88:1725–1736

Goverts ST (2004) Assessment of spatial and binaural hearing in hearing impaired listeners. PhD thesis, VU University, Amsterdam

Goverts ST, Houtgast T (2007) The BILD of hearing-impaired subjects – the role of suprathreshold coding. To be submitted to J Acoust Soc Am

Johansson MSK, Arlinger SD (2002) Binaural masking level difference for speech signals in noise. Int J Aud 41:279–284

Licklider J (1948) The influence of interaural phase relations upon the masking of speech by white noise. J Acoust Soc Am 20:150–159

Plomp R, Mimpen AM (1979) Improving the reliability of testing the speech reception threshold for sentences. Audiology 18:43–52

Siegel RA, Colburn HS (1983) Internal and external noise in binaural detection. Hear Res 11:117–123

Versfeld NJ, Daalder L, Festen JM, Houtgast T (2000) Method for the selection of sentence materials for efficient measurement of the speech reception threshold. J Acoust Soc Am 106:1671–1684