
Perceptual Compensation for Reverberation


4 Conclusions

The perceptual compensation mechanism seems to be able to pick up information about reverberation not only from running speech but also from other types of sounds that form a context for test words. In these cases, the effects of reverberation on neighbouring words are reduced when the context’s reverberation is increased to match the test word’s reverberation. Contexts that are effective in this respect include broad-band ‘noise-like’ sounds that have a steady spectrum with no changes in the shape of the short-term spectral envelope over time. In addition, certain ‘tonal’ contexts can be effective, as long as they have several component frequency-bands.

Single-band ‘tonal’ contexts were sharply-filtered noise, with the centre frequency and corresponding bandwidth of an auditory filter. These sounds were processed to give the temporal envelope found in the auditory filter concerned when the speech context was played through it. Results with these sounds indicate that they give little or no compensation at the centre frequencies that were tested. Nevertheless, the temporal envelope of each of these sounds does bring about compensation when it is heard in a broad range of the ear’s frequency channels, as shown by the results with the broad-band, ‘noise-like’ versions of these contexts. These findings suggest that compensation is confined to the frequency region occupied by the context, which leaves the bulk of the test-word’s frequency-content unaffected by a single-band tonal context.

The three- and five-band tonal contexts were more ‘speech-like’, as they were the sum of single-band contexts using auditory filters with different centre frequencies. Results with these contexts indicate that they are effective in generating compensation, and that the five-band context is the most effective.

When the overall, broad-band temporal envelope of the five-band context was heard in a wide range of the ear’s frequency channels, as in its ‘noise-like’ counterpart, the compensation effect diminished. Compensation in the five-band noise-like condition was less than in the tonal conditions, and less than in any of the single-band noise-like conditions. These results are consistent with the idea that compensation is informed by the presence or absence of sharp offsets in the context’s temporal envelope (Watkins 2005), which are less prominent in the broad-band temporal envelope of a five-band context than they are in a narrow-band temporal envelope from this sound. This ‘smoothing’ of the broad-band temporal envelope would seem to be a straightforward consequence of the spectro-temporal fluctuations that inhere in speech, which give imperfectly correlated temporal envelopes in different frequency bands. Consequently, temporal envelopes of speech in the ear’s numerous frequency channels are able to give more information about a room’s acoustic properties than is apparent from the broad-band temporal envelope of the signal.
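The ‘smoothing’ argument can be illustrated numerically: when sub-band temporal envelopes are only partly correlated, their sum (the broad-band envelope) has shallower dips and less abrupt offsets than the individual bands. The minimal sketch below computes sub-band envelopes and their correlations for a noise signal; the filter design, band edges and Hilbert envelope are illustrative choices rather than the analysis used in these experiments.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def band_envelope(x, fs, f_lo, f_hi):
    """Temporal envelope of one frequency band (illustrative 4th-order band-pass)."""
    sos = butter(4, [f_lo, f_hi], btype="bandpass", fs=fs, output="sos")
    return np.abs(hilbert(sosfiltfilt(sos, x)))

fs = 16000
x = np.random.randn(fs)                           # 1 s of noise as a stand-in signal
bands = [(200, 400), (800, 1200), (3000, 5000)]   # hypothetical analysis bands
envs = np.array([band_envelope(x, fs, lo, hi) for lo, hi in bands])
broadband_env = envs.sum(axis=0)                  # crude broad-band envelope

# Imperfect correlation between the sub-band envelopes is what smooths their sum
print(np.round(np.corrcoef(envs), 2))
```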


Table 2 Compensation effects with noise-like contexts

Noise-like context            Mean compensation   Standard error   t(5)    p (if <0.05, else n.s.)
Single band, fc = 0.25 kHz    3.39                0.34             10.03   <0.001
Single band, fc = 1 kHz       2.33                0.32              7.25   <0.001
Single band, fc = 4 kHz       3.39                0.26             12.83   <0.001
Three-band                    2.61                0.79              3.30   <0.05
Five-band                     2.06                0.36              5.72   <0.01

Speech contexts tended to be more effective than any of the ‘tonal’ or ‘noise-like’ contexts, giving larger compensation effects. This probably reflects the wide range of frequency channels available when listening to speech. The temporal envelopes in some of these channels might well be more informative about the presence of reverberation than any of the channels that were selected for study in the present experiments.

Acknowledgment. This research was supported by a grant to the first author from EPSRC.

References

Allen JB, Berkley DA (1979) Image method for efficiently simulating small-room acoustics. J Acoust Soc Am 62:943–950

Furui S (1986) On the role of spectral transition for speech perception. J Acoust Soc Am 80:1016–1025

Gardner B, Martin K (1994) HRTF measurements of a KEMAR dummy-head microphone. Perceptual Computing - Technical Report #280. MIT Media Lab

Hartmann WM (1998) Signals, sound and sensation. Springer, Berlin Heidelberg New York

ISO (1997) Acoustics – measurement of the reverberation time of rooms with reference to other acoustical parameters. ISO 3382. International Organization for Standardization, Geneva

Schroeder MR (1965) New method of measuring reverberation time. J Acoust Soc Am 37:409–412

Stecker GC, Hafter ER (2000) An effect of temporal asymmetry on loudness. J Acoust Soc Am 107:3358–3368

Watkins AJ (1992) Perceptual compensation for effects of reverberation on amplitude-envelope cues to the ‘slay’-‘splay’ distinction. Proc Inst Acoust 14:125–132

Watkins AJ (2005) Perceptual compensation for effects of reverberation in speech identification. J Acoust Soc Am 118:249–262

58 Towards Predicting Consonant Confusions of Degraded Speech

O. Ghitza1, D. Messing2, L. Delhorne2, L. Braida2, E. Bruckert1, and M. Sondhi3

1 Introduction

The work described here arose from the need to understand and predict speech confusions caused by acoustic interference and by hearing impairment. Current predictors of speech intelligibility are inadequate for making such predictions (even for normal-hearing listeners). The Articulation Index and related measures, the STI and the SII, are geared to predicting speech intelligibility, but they predict only average intelligibility, not error patterns, and they make predictions only for a limited set of acoustic conditions (linear filtering, additive noise, reverberation).

We aim to predict the consonant confusions made by normal-hearing listeners listening to degraded speech. Our prediction engine comprises an efferent-inspired peripheral auditory model (PAM) connected to a template-match circuit (TMC) based upon basic concepts of neural processing. The extent to which this engine is an accurate model of auditory perception will be measured by its ability to predict consonant confusions in the presence of noise. The approach we have taken involves two separate steps. First, we tune the parameters of the PAM in isolation from the TMC. We then freeze the resulting PAM and use it to tune the parameters of the TMC. In Sect. 2 we describe a closed-loop model of the auditory periphery that comprises a nonlinear model of the cochlea (Goldstein 1990) with efferent-inspired feedback. To adjust the parameters of the PAM with minimal interference from the TMC, we use confusion patterns for speech segments generated in a paradigm with a minimal cognitive load (the DRT; Voiers 1983). To reduce PAM–TMC interaction further, we have synthesized DRT word-pairs, restricting stimulus differences to the initial diphones. In Sect. 3 we describe initial steps in a study towards predicting confusions of naturally spoken diphones, i.e. tokens that inherently exhibit phonemic variability. We describe a TMC inspired by principles of cortical neural processing (Hopfield 2004). A desirable property of the circuit is insensitivity to time-scale variations of the input stimuli. We demonstrate the validity of this hypothesis in the context of the DRT consonant discrimination task.

1 Sensimetrics Corporation, Somerville, Massachusetts, USA, oded@sens.com

2 Massachusetts Institute of Technology, Cambridge, Massachusetts, USA

3 Avaya Research Laboratory, Basking Ridge, New Jersey, USA



2 Peripheral Auditory Model (PAM)

We have developed a closed-loop model of the auditory periphery (PAM) that was inspired by current evidence about the role of the efferent system in regulating the operating point of the cochlea. This regulation results in an auditory-nerve (AN) representation that is less sensitive to changes in environmental conditions. In implementing the PAM we use a bank of overlapping cochlear channels uniformly distributed along the ERB scale, four channels per ERB. Each cochlear channel comprises a nonlinear filter and a generic model of the inner hair cell (IHC): half-wave rectification followed by low-pass filtering, representing the reduction of synchrony with increasing CF. The dynamic range of the simulated IHC response is restricted to a dynamic-range window (DRW), representing the observed dynamic range at the AN level.
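As a rough illustration of this channel model, the sketch below applies half-wave rectification, a low-pass ‘synchrony’ filter and a dynamic-range window to one cochlear channel’s output. The cutoff frequency, filter order and 30-dB window are placeholder values for illustration, not the parameters of the PAM described here.

```python
import numpy as np
from scipy.signal import butter, lfilter

def ihc_stage(channel_out, fs, sync_cutoff_hz=1000.0, drw_db=30.0):
    """Generic IHC stage sketch: half-wave rectification, low-pass filtering
    (standing in for the loss of synchrony towards high CFs), and a
    dynamic-range window (DRW) that limits the simulated AN dynamic range."""
    rectified = np.maximum(channel_out, 0.0)        # half-wave rectification
    b, a = butter(2, sync_cutoff_hz, fs=fs)         # low-pass "synchrony" filter
    smoothed = lfilter(b, a, rectified)
    level_db = 20.0 * np.log10(np.maximum(smoothed, 1e-12))
    floor_db = level_db.max() - drw_db              # keep only drw_db below the peak
    return np.clip(level_db, floor_db, None)        # DRW-limited response in dB
```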

The filter is Goldstein’s model of nonlinear cochlear mechanics (MBPNL; Goldstein 1990). This model operates in the time domain and changes its gain and bandwidth with changes in the input intensity, in accordance with observed physiological and psychophysical behavior. The model is shown in Fig. 1. The lower path (H1/H2) is a compressive nonlinear filter that represents the sensitive, narrow-band nonlinearity at the tip of the basilar-membrane tuning curves. The upper path (H3/H2) is a linear filter that represents the insensitive, broad-band linear tail response of the basilar-membrane tuning curves. A parameter G controls the gain of the tip of the basilar-membrane tuning curves. To best mimic psychophysical tuning curves of a healthy cochlea in quiet, the tip gain is set to G = 40 dB (Goldstein 1990). The “iso-input” frequency response of an MBPNL filter at a CF of 3400 Hz is shown in Fig. 2, upper-left panel.

 

Fig. 1 Goldstein’s MBPNL model (Goldstein 1990). [Block diagram: the stapes input feeds a lower path (H1(ω) followed by a compressing memoryless nonlinearity) and an upper path (an expanding memoryless nonlinearity followed by H3(ω)); the two paths are summed and passed through H2(ω) to give the basilar-membrane output, with the tip GAIN set by the efferent control signal]


Fig. 2 “Iso-input” frequency response, CF = 3400 Hz. Inside the box are input levels in dB SPL

As for the efferent-inspired part of the model, we mimic the effect of the medial olivocochlear (MOC) efferent pathway. Morphologically, MOC neurons project to different places along the cochlear partition in a tonotopic manner, making synaptic connections to the outer hair cells and hence affecting the mechanical properties of the cochlea (e.g. an increase in basilar-membrane stiffness).

Therefore, we introduce a frequency-dependent feedback mechanism that controls the tip gain (G) of each MBPNL channel according to the intensity level of the sustained noise in that frequency band. Reducing G attenuates the MBPNL response to weaker stimuli (e.g. background noise). The lower-right panel of Fig. 2, for example, shows the MBPNL response for G = 10 dB. Compared to G = 40 dB, the response to high-energy stimuli is hardly affected, while the response to low-energy stimuli (e.g. 20 dB SPL) is reduced by some 30 dB. In our realization of the model, the value of the tip gain (G) in each cochlear channel is adjusted so that the intensity of the background noise at the output does not exceed a prescribed value.
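A minimal sketch of one possible form of this per-channel rule is given below. It assumes access to a function that returns the channel’s output level in response to the sustained background noise for a given tip gain; the descending 1-dB search is an illustrative choice, not the adjustment procedure of the actual model.

```python
def adjust_tip_gain(noise_output_db_for_gain, ceiling_db, g_max=40.0, g_min=0.0, step=1.0):
    """Lower the MBPNL tip gain G of one cochlear channel until the channel's
    response to the sustained background noise no longer exceeds a prescribed
    output level (ceiling_db).

    noise_output_db_for_gain: callable mapping a tip gain G (dB) to the
    channel's output level (dB) for the measured noise in that band."""
    g = g_max
    while g > g_min and noise_output_db_for_gain(g) > ceiling_db:
        g -= step
    return g

# Hypothetical usage: a channel whose noise response is 55 dB at G = 40 dB and
# falls by roughly 1 dB per 1 dB of tip-gain reduction for this weak stimulus.
print(adjust_tip_gain(lambda g: 15.0 + g, ceiling_db=35.0))   # -> 20.0
```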


Fig. 3 Simulated IHC response for open-loop PAM (left) and for closed-loop PAM (right)

Figure 3 shows – in terms of a spectrogram – simulated IHC responses to the diphone je (as in “jab”) in two noise conditions (70 dB SPL/10 dB SNR and 50 dB SPL/10 dB SNR), for an open-loop MBPNL-based system (left-hand side) and for the closed-loop system (right-hand side). Due to the nature of the noise-responsive feedback, the closed-loop system produces spectrograms that fluctuate less with changes in noise intensity compared to spectrograms produced by the open-loop system. This property is desirable for stabilizing the performance of the template-matching operation under varying noise conditions, as reflected in the quantitative evaluation reported next.

2.1 Quantitative Evaluation – Isolating PAM from Template Matching

The evaluation system comprises the PAM followed by the TMC. Ideally, to eliminate PAM–TMC interaction, errors due to template matching should be reduced to zero. In reality we could only minimize the interaction. This was achieved using the following three steps: (1) we use the simplest possible psychophysical task in the context of speech perception, namely a binary discrimination test. In particular, we use Voiers’ DRT (Voiers 1983), which presents the subject with a two-alternative forced choice between two CVC words that differ in their initial consonants (i.e. a minimal pair). Such a task minimizes the influence of cognitive and memory factors while maintaining the complex acoustic cues that differentiate initial diphones (recall the central role of diphones in speech perception, e.g. Ghitza 1993); (2) we use the DRT paradigm with synthetic speech stimuli. An acoustic realization of the DRT word-pairs was synthesized so that the target values for the formants of the vowel in a word-pair are identical, restricting stimulus differences to the initial diphones; and (3) we use a “frozen speech” methodology (e.g. Hant and Alwan 2003), namely, the same acoustic speech token is used for training and for testing, so that testing tokens differ from training tokens only in the acoustic distortion.

These three steps presumably reduce the number of errors induced by the template matching. Recall that a template-match operation comprises measuring the distance of the unknown token to the templates, and labeling the unknown token as the template with the smaller distance. Hence, template matching is defined by the distance measure and the choice of templates. As a distance measure we use the minimum mean-square error. This is an effective choice here because: (1) by using synthetic speech stimuli, the identical target values of the vowel formants for the two words result in zero error in time-frequency cells associated with the final diphone; and (2) by using frozen-speech stimuli, a distortion in a given time-frequency cell is generated locally (by the noise component within the range of the cell) and is independent of noise in other cells.
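For this frozen-speech DRT setting the template-match operation reduces to a minimum mean-square-error comparison of time-frequency representations. A minimal sketch follows; the dictionary-of-templates interface and the array shapes are assumptions for illustration only.

```python
import numpy as np

def mse(a, b):
    """Mean-square error between two equally shaped time-frequency arrays."""
    return float(np.mean((a - b) ** 2))

def match_word(unknown_tf, templates):
    """Label the unknown token as the word whose stored PAM time-frequency
    template is closest in the minimum mean-square-error sense."""
    return min(templates, key=lambda word: mse(unknown_tf, templates[word]))

# Hypothetical usage with a DRT word pair differing only in the initial diphone:
#   templates = {"jab": pam_tf_jab, "dab": pam_tf_dab}
#   choice = match_word(pam_tf_of_noisy_token, templates)
```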

We have conducted formal DRT sessions using the synthetic stimuli in quiet and in additive noise, using speech-shaped noise at three levels (70, 60 and 50 dB SPL) and at three SNRs (10, 5 and 0 dB). The data were collected from six subjects (four repetitions each), all students with normal hearing. All subjects had zero errors in quiet. Figure 4 shows errors produced by a DRT mimic with open-loop (upper panel) and closed-loop (lower panel) PAMs. Signal conditions are the same as those used to collect the human data. The DRT-mimic data are averaged over four exemplars of the database, each differing in the realization of the added noise. Templates were created for the 60 dB SPL/5 dB SNR condition. The abscissa marks the six Jakobsonian dimensions: Voicing, Nasality, Sustention, Sibilation, Graveness and Compactness (denoted VC, NS, ST, SB, GV and CM, respectively). The “+” sign stands for attribute present and the “−” sign for attribute absent. Bars show the difference between mean machine and human scores. The lines indicate plus and minus one standard deviation of the human data. Gray bars indicate that the difference is greater than one standard deviation. Scores with the open-loop PAM are worse than human scores. Scores with the closed-loop PAM are superior to human scores, and the difference is similar across conditions. We are currently developing an iterative procedure for adjusting the parameters of the PAM (constrained by physiological plausibility) so as to match its scores to those achieved by humans. The resulting PAM will then be frozen and used to formulate the template-match operation.
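The comparison plotted in Fig. 4 can be expressed compactly as below: per dimension, the bar is the mimic error minus the mean human error, flagged when it exceeds one standard deviation of the human data. Array shapes and variable names are assumptions for illustration.

```python
import numpy as np

def score_differences(mimic_err, human_err):
    """mimic_err: per-dimension error rates of the DRT mimic, shape (12,).
    human_err: per-subject/repetition human error rates, shape (runs, 12).
    Returns the plotted differences and a flag where |difference| > 1 SD."""
    human_mean = human_err.mean(axis=0)
    human_sd = human_err.std(axis=0, ddof=1)
    diff = mimic_err - human_mean
    return diff, np.abs(diff) > human_sd

# The 12 columns correspond to the + and − poles of VC, NS, ST, SB, GV and CM.
```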


Fig. 4 DRT mimic scores for open-loop PAM (upper panel) and closed-loop PAM (lower panel). [Bar charts of Error, Mimic − Human (%), for the + and − poles of each Jakobsonian dimension (VC, NS, ST, SB, GV, CM); absolute mean difference 15% for the open-loop PAM and 8% for the closed-loop PAM]

3 A Template-Matching Circuit (TMC)

In developing the PAM (Sect. 2) we deployed a psychophysical task with a minimal cognitive load and used speech stimuli with restricted phonemic variation. In contrast, the parameters of the TMC will be tuned so as to predict human performance in a consonant identification task (i.e. to predict a confusion matrix). Towards this goal we seek a perceptually relevant distortion measure between speech tokens that inherently exhibit phonemic variability. In this section we describe a template-matching circuit inspired by principles of cortical neural processing (Hopfield 2004). A block diagram of the circuit is shown in Fig. 5. It comprises three stages: a front-end, a layer of “integrate and fire” (IAF) neurons (Layer-I neurons) and a layer of coincidence neurons (Layer-II neurons). The front-end is the auditory model described in Sect. 2. Each neuron in Layer-I is characterized by the differential equation du(t)/dt + u(t)/RC = i(t)/C, where i(t) is the input current, u(t) the output voltage and RC the time-constant of the circuit. Once u(t) reaches a prescribed threshold value the neuron fires and u(t) is shunted to zero. The parameters of all Layer-I neurons are identical except for the threshold of firing. All Layer-I neurons are driven by one global, underlying sub-threshold oscillatory current A·cos(γt). Hence, the input current to the n-th IAF cell is i_n(t) = x_n(t) + A·cos(γt), where x_n(t) is the output of the n-th cochlear channel. In our realization RC = 20 ms and the frequency of the Gamma oscillator is 25 Hz.

Fig. 5 A block diagram of the template-match circuit. [Front-end (26 cochlear channels) → Layer-I “patch” IAF neurons (M = 26 × 100), all driven by a common “Gamma” oscillator → Layer-II coincidence neurons (N = 6000)]

Each channel drives 100 Layer-I neurons, which differ only in their threshold of firing. Therefore, in our realization the number of Layer-I neurons is M = 2600. The final stage comprises N = 6000 Layer-II coincidence neurons. Each Layer-II neuron is driven by K randomly selected “patches” of Layer-I neurons (in our system K = 6). A patch is composed of L Layer-I neurons with successive thresholds, all driven by the same frequency channel (here L = 10). The computational principle realized by the proposed circuit can be summarized as follows. A given Layer-II neuron fires at time t0 only if all K Layer-I patches fire simultaneously at time t0, and a patch of Layer-I neurons fires at time t0 only if the time-evolution of its frequency channel prior to that time drives one of the L neurons in the patch to its threshold precisely at time t0. Hence each Layer-II neuron is “tuned” to a particular time-frequency template expressed in terms of the time evolution of K frequency channels. The same Layer-II neuron will also fire, albeit at a delayed time, if the time-evolution of all K cochlear channels is scaled by the same factor (this is so because all the corresponding Layer-I neurons will reach their thresholds with a similar time delay).
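To make this concrete, the sketch below integrates the leaky integrate-and-fire equation du/dt + u/RC = i/C for a single Layer-I neuron driven by one channel output plus the global sub-threshold Gamma oscillation, and states the Layer-II rule as a coincidence test over its K patches. The Euler step, capacitance value, threshold and coincidence-window width are illustrative assumptions, not parameters taken from the circuit described here.

```python
import numpy as np

def iaf_spike_times(x, fs, threshold=1.0, rc=0.020, c=1.0, gamma_hz=25.0, a=0.1):
    """One Layer-I neuron: du/dt + u/RC = i/C with reset to zero on firing.
    The input current is the cochlear-channel output x plus the global
    sub-threshold oscillation a*cos(2*pi*gamma*t)."""
    dt = 1.0 / fs
    t = np.arange(x.size) * dt
    i = x + a * np.cos(2.0 * np.pi * gamma_hz * t)
    u, spikes = 0.0, []
    for n in range(x.size):
        u += dt * (i[n] / c - u / rc)       # forward-Euler step of the membrane equation
        if u >= threshold:
            spikes.append(t[n])
            u = 0.0                          # shunt the voltage to zero after a firing
    return np.array(spikes)

def layer2_fires(patch_spike_times, t0, window=0.002):
    """A Layer-II coincidence neuron fires at t0 only if each of its K patches
    contains a spike within a (hypothetical) coincidence window around t0."""
    return all(np.any(np.abs(spikes - t0) < window) for spikes in patch_spike_times)
```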

Fig. 6 Illustrating the performance of the TMC in the context of the DRT discrimination task. [Spectrographic displays of the front-end response to tokens of “daunt” and “taunt”, with time-histograms of the fraction of State-1 and State-2 neurons firing]

Figure 6 illustrates the discrimination power of the circuit in the DRT context. Assume that we have identified 40 Layer-II neurons that are most sensitive to the time-frequency template of the initial diphone of the word “daunt” (phonetically transcribed as dont). Similarly, we have identified 140 neurons for the word “taunt” (transcribed as tont). We term these sets of neurons “State-1” and “State-2” neurons, respectively. The upper-left two panels of Fig. 6 show spectrographic displays of the front-end response to the first 200 ms of two realizations of the word dont spoken by a single speaker (note the phonemic variability). Below each spectrogram is a time-histogram of the number of state neurons firing to the corresponding stimuli (shown is the fraction out