
when the percept changed from “one stream” to “two streams” and in the opposite direction separately in the two ∆f conditions (z = 0). In the ∆f ≈ 1/6 octave condition, where the “one stream” percept was dominant, more activation was observed at the transition from “one stream” to “two streams” than at the transition in the opposite direction. Conversely, in the ∆f ≈ 1/2 octave condition, where the “two streams” percept was dominant, more activation was observed at the transition from “two streams” to “one stream”. Apparently, escaping from a dominant percept is associated with higher activation.
The involvement of the auditory areas in auditory streaming has been shown by single-unit recordings in animals (Micheyl et al. 2005) and by magnetoencephalography in humans (Gutschalk et al. 2005), but not by fMRI (Cusack 2005). The discrepancy may be due to the poor temporal resolution of fMRI. The present study, however, revealed activation of the auditory areas related to auditory streaming using fMRI, by taking advantage of event-related image acquisition time-locked to the changes in percept. Our findings indicate that the formation of streams may involve multiple neural sites from subcortical to suprasensory levels.
4 Conclusions
We have uncovered several features of perceptual transitions in auditory streaming, including stochasticity, interval distributions, and time-dependent transition rates. These features cannot readily be explained by previous theories of streaming such as the peripheral channeling theory (Beauvois and Meddis 1991; Hartmann and Johnson 1991), the pitch-jump detector theory (Anstis and Saida 1985), and the accumulation-of-evidence theory (Bregman 1990). Some recent studies assume that neural habituation observed in the auditory areas provides a basis for streaming (Fishman et al. 2001; Micheyl et al. 2005). However, such a theory cannot explain the interaction between the direction of transition and ∆f shown in the present fMRI experiment. We suggest that a theory of auditory streaming should incorporate neural dynamics. Detailed data on the perceptual transitions in auditory streaming and their neural correlates would provide important constraints on the development of such models.
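As a purely illustrative aside (not the authors' model), the features listed above can be mimicked by a toy simulation in which successive phase durations are drawn from gamma distributions, a common empirical description of dominance times in bistable perception (cf. Zhou et al. 2004); the shape and scale values below are arbitrary assumptions chosen only to make one percept dominant.

import numpy as np

rng = np.random.default_rng(0)

def simulate_phases(n_phases=200, shape_one=2.5, scale_one=4.0,
                    shape_two=2.5, scale_two=2.0):
    # Alternate between the two percepts; each phase duration (in seconds)
    # is an independent gamma draw for the current percept.
    labels, durations = [], []
    state = "one stream"
    for _ in range(n_phases):
        if state == "one stream":
            durations.append(rng.gamma(shape_one, scale_one))
            labels.append(state)
            state = "two streams"
        else:
            durations.append(rng.gamma(shape_two, scale_two))
            labels.append(state)
            state = "one stream"
    return labels, np.array(durations)

labels, durations = simulate_phases()
for percept in ("one stream", "two streams"):
    mask = np.array([lab == percept for lab in labels])
    print(percept, round(durations[mask].mean(), 2), "s mean dominance time")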
References
Alain C, Cortese F, Picton TW (1998) Event-related activity associated with auditory pattern processing. Neuroreport 9:3537–3541
Anstis S, Saida S (1985) Adaptation to auditory streaming of frequency-modulated tones. J Exp Psychol Hum Percept Perform 11:257–271
Beauvois MW, Meddis R (1991) A computer model of auditory stream segregation. Q J Exp Psychol 43:517–541
Bregman AS (1990) Auditory scene analysis: the perceptual organization of sound. MIT, Cambridge, MA
Carlyon RP (2004) How the brain separates sounds. Trends Cogn Sci 8:465–471
Carlyon RP, Cusack R, Foxton JM, Robertson IH (2001) Effects of attention and unilateral neglect on auditory stream segregation. J Exp Psychol Hum Percept Perform 27:115–127
Cusack R (2005) Intraparietal sulcus and perceptual organization. J Cogn Neurosci 17:641–651
Fishman YI, Reser DH, Arezzo JC, Steinschneider M (2001) Neural correlates of auditory stream segregation in primary auditory cortex of the awake monkey. Hear Res 151:167–187
Fishman YI, Arezzo JC, Steinschneider M (2004) Auditory stream segregation in monkey auditory cortex: effects of frequency separation, presentation rate, and tone duration. J Acoust Soc Am 116:1656–1670
Gutschalk A, Micheyl C, Melcher JR, Rupp A, Scherg M, Oxenham AJ (2005) Neuromagnetic correlates of streaming in human auditory cortex. J Neurosci 25:5382–5388
Hartmann WM, Johnson D (1991) Stream segregation and peripheral channeling. Music Percept 9:155–183
McCabe SL, Denham MJ (1997) A model of auditory streaming. J Acoust Soc Am 101:1611–1621
Micheyl C, Tian B, Carlyon RP, Rauschecker JP (2005) Perceptual organization of tone sequences in the auditory cortex of awake macaques. Neuron 48:139–148
Näätänen R, Tervaniemi M, Sussman E, Paavilainen P, Winkler I (2001) ‘Primitive intelligence’ in the auditory cortex. Trends Neurosci 24:283–288
Sussman E, Ritter W, Vaughan HG Jr (1999) An investigation of the auditory streaming effect using event-related brain potentials. Psychophysiology 36:22–34
van Noorden LPAS (1975) Temporal coherence in the perception of tone sequences. Unpublished doctoral dissertation, Eindhoven University of Technology
Zhou YH, Gao JB, White KD, Merk I, Yao K (2004) Perceptual dominance time distributions in multistable visual perception. Biol Cybern 90:256–263
Comment by Langner
Looking at a Necker cube for some minutes, I see the two possible three-dimensional perspectives oscillate, first with a slow period of several seconds, then faster and faster, until at the end the figure is totally blurred into a two-dimensional set of lines. It seems to me that at least one of the transition curves in Fig. 5 shows periodic peaks, which may indicate a similar oscillation behaviour for your paradigm.
Reply
We reexamined the individual data on perceptual transitions, but did not find clear evidence for such an oscillation. Related findings have been reported by Pressnitzer and Hupe (2006), who found that the first perceptual phase is significantly longer than subsequent ones in both visual and auditory bistable perception, but that there is no long-term trend in the duration of the phases after the first one.
References
Pressnitzer D, Hupe JM (2006) Temporal dynamics of auditory and visual bistability reveal common principles of perceptual organization. Curr Biol 16:1351–1357

31 Auditory Stream Segregation Based on Speaker Size, and Identification of Size-Modulated Vowel Sequences
MINORU TSUZAKI1, CHIHIRO TAKESHIMA1, TOSHIO IRINO2, AND ROY D. PATTERSON3
1 Introduction
When a receiver of acoustic signals is surrounded by several vibrating bodies, it becomes important to “sort out” the sound energy into subparts that appropriately represent the original sources. This issue is known as the problem of source segregation, and it has been investigated in several ways as a core of auditory scene analysis. Pitch, a perceptual attribute corresponding to the fundamental periodicity, has been regarded as one of the important cues for sound segregation. It is also known that “timbre” can function as another cue (Bregman 1990). However, there are still problems with the ambiguity in the definition of timbre.
The work by Irino and Patterson (2002) on the wavelet-Mellin image has drawn attention to the scale dimension in natural sounds, and to the information it carries about the size of the resonators in a source. The existence of this scale information illustrates the ambiguity of timbre: is it just a dimension of timbre, or is it a dimension of perception like pitch? If it is a dimension that can be separated from the rest of the timbre information using the Mellin transform, as suggested by Irino and Patterson (2002), this would explain listeners’ ability to estimate speaker size as reported recently by Ives et al. (2005).
Tsuzaki and Irino (2004) tried to estimate the temporal resolution of this computational process by investigating the identification of vowel sequences whose “size” was modulated sinusoidally. A puzzling aspect of the experimental results was that performance did not show a monotonic, low-pass characteristic. Although it tended to drop when the modulation period was around 250 ms, performance became better at shorter modulation periods. The sinusoidal modulation of Tsuzaki and Irino (2004) was applied independently of the vowel durations in the sequence. Informal listening indicated that the stimuli with short modulation periods, i.e., the most rapidly size-modulated speech, sounded as if two people were speaking identical utterances simultaneously.
1Kyoto City University of Arts, Kyoto, Japan, minoru.tsuzaki@kcua.ac.jp
2Department of Design Information Sciences, Wakayama University, Wakayama, Japan, irino@sys.wakayama-u.ac.jp
3Centre for the Neural Basis of Hearing, Cambridge University, Cambridge, UK, rdp1@cam.ac.uk
The implication was that listeners segregated the speech into two auditory streams based on the size information. The fast modulation condition in the study of Tsuzaki and Irino (2004) produced the perception of size modulation without disrupting the vowel-type information. We wondered how the perception would change if the size and the vowel-type (shape) information changed coincidentally. Would listeners hear two concurrent speakers saying different things? Or would the perception simply become chaotic? To answer these questions, two experiments were conducted. The first experiment investigated the identification of the vowels in size-modulated sequences. The second experiment evaluated the detection of a target vowel in size-modulated vowel sequences.
2 Experiment 1: Identification of Vowels in Size-Modulated Sequences
The purpose of the first experiment was to investigate the effects of the depth and speed of size modulation on the identification of vowel sequences whose size parameter alternated between two values vowel by vowel. If the auditory system is able to extract size information and to build images of two concurrent sources, identification of the whole sequence should become more difficult as faster and deeper modulation is applied, because of the difficulty in judging the order of vowels across streams.
2.1 Stimulus
Vowel sequences were synthesized with a channel vocoder, STRAIGHT (Kawahara et al. 1999; Kawahara and Irino 2005), based on sampled natural utterances by a Japanese male speaker. All the sequences had six segments, and each segment contained one of the five Japanese vowels, i.e., “a”, “e”, “i”, “o”, and “u”. The sequences were generated by concatenating “doublets” of two vowels. For example, a sequence “aiu” was generated by concatenating “ai” and “iu”, where the middle of the “i” segment was used for the transition. In the transition, the spectrum changed gradually from that of the first “i” into that of the second, to minimize the discontinuity associated with the concatenation. The forty sequences shown in Table 1 were prepared as base sequences. The first and last segments always contained the same vowel, and the middle four segments were permutations of the other four vowels.
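As a rough illustration of this construction rule (not the authors' stimulus-generation code), the snippet below enumerates every six-segment sequence whose first and last segments share a vowel and whose middle four segments permute the remaining vowels. The rule admits 5 × 4! = 120 candidates, of which Table 1 lists the 40 actually used (not reproduced here).

from itertools import permutations

VOWELS = ("a", "e", "i", "o", "u")

def candidate_sequences():
    # First and last segments share one vowel; the middle four segments are a
    # permutation of the remaining four vowels.
    for outer in VOWELS:
        rest = [v for v in VOWELS if v != outer]
        for middle in permutations(rest):
            yield outer + "".join(middle) + outer

seqs = list(candidate_sequences())
print(len(seqs))   # 120 candidates satisfy the rule
print(seqs[:3])    # ['aeioua', 'aeiuoa', 'aeoiua']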
Size modification was applied by dilating or compressing the frequency axis of the STRAIGHT spectra for the vowel. Dilation raises the formant frequencies proportionally, which corresponds to a reduction of vocal tract size; conversely, compression lowers the formants. The size modulation was achieved by alternating dilation and compression segment by segment. The modulation depth, defined as the amount of size modification in one direction, was either a quarter or half an octave.
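The frequency-axis manipulation can be sketched as follows. This is only a schematic stand-in for the STRAIGHT-based processing: it rescales a generic spectral envelope by a factor of 2 raised to the modulation depth in octaves, so that dilation raises a formant-like peak and compression lowers it. The grid, the toy envelope, and the function name are illustrative assumptions.

import numpy as np

def rescale_frequency_axis(envelope, freqs_hz, depth_octaves):
    # Dilation (depth > 0) maps energy at f to f * 2**depth, raising formant
    # frequencies (shorter vocal tract); compression (depth < 0) lowers them.
    factor = 2.0 ** depth_octaves
    return np.interp(freqs_hz / factor, freqs_hz, envelope)

# Toy envelope: a single formant-like peak at 1 kHz on a 0-8 kHz grid.
freqs = np.linspace(0.0, 8000.0, 4097)
envelope = np.exp(-0.5 * ((freqs - 1000.0) / 80.0) ** 2)

shorter = rescale_frequency_axis(envelope, freqs, +0.25)  # dilate by 1/4 octave
longer = rescale_frequency_axis(envelope, freqs, -0.25)   # compress by 1/4 octave
print(freqs[np.argmax(shorter)])  # ~1189 Hz, i.e. 1000 * 2**0.25
print(freqs[np.argmax(longer)])   # ~841 Hz, i.e. 1000 * 2**(-0.25)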

Table 1 List of vowel sequences
The other main factor was the speed of size modulation. The speaking rate of the original utterances was slower than natural speech: the average segment duration was 340 ms. Two conditions were prepared from the recordings by reducing the segment duration to either one half or one quarter of the original duration. For convenience, the quarter-duration condition will be referred to as the “fast” condition and the half-duration condition as the “slow” condition.
The F0 pattern of the original sequence was used for all the stimuli. Accordingly, there were no abrupt changes in the F0 contour at the segment boundaries.
2.2 Listeners
Six students of Kyoto City University of Arts participated in the experiment. Their audiograms were normal, and they were paid for their participation. Four listeners were assigned to the slow condition and the other two to the fast condition.
2.3 Procedure
Modulation depth was a within-listener factor, while modulation speed was a between-group factor. Each listener was presented with three modulation depths, i.e., 0, 1/4, and 1/2 octave, in either the fast or the slow form. The task of the listeners was to identify all six of the vowels in each sequence in the correct order, using virtual buttons on a GUI labeled with the five vowel names. No feedback was provided.
Each listener received 40 trials at each modulation depth and 20 trials in the no-modulation condition for each of the 40 sequences, giving 4000 trials per listener in total.
The stimuli were synthesized off-line in advance on a workstation (Apple PowerMac G5) and presented to the listeners by a DSP system controlled by a workstation (Symbolic Sound Capybara 350 + Apple iMac G5) through headphones (Sennheiser HD 600, amplified with a Luxman P1).

2.4 Results and Discussion
The vowels in the initial and final positions were regarded as “fillers” and were discarded from the analysis because of strong primacy and recency effects. For each modulation speed, the percentage of trials on which listeners identified the four central vowels correctly was calculated and plotted as a function of modulation depth in Fig. 1. Percent correct decreases as modulation depth increases in both the fast and slow conditions. In addition, the listeners performed better in the slow condition than in the fast condition.
Performance in the control condition with no modulation was not perfect. To estimate the deterioration caused by the size modulation, the ratio of percent correct in each test condition to percent correct in the control condition was calculated for each listener and each token. The geometric mean of these scores is plotted as a function of modulation depth in Fig. 2.
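The normalization just described, a per-listener/per-token ratio of test to control percent correct combined with a geometric mean, can be written compactly as below. The input arrays are invented placeholders, not the data behind Fig. 2.

import numpy as np

def geometric_mean_ratio(test_pc, control_pc):
    # Per-listener/per-token ratio of test to control percent correct,
    # combined with a geometric mean (exp of the mean log ratio).
    ratios = np.asarray(test_pc, dtype=float) / np.asarray(control_pc, dtype=float)
    return float(np.exp(np.mean(np.log(ratios))))

control_pc = [90.0, 85.0, 95.0, 80.0]      # hypothetical control-condition scores
quarter_oct_pc = [70.0, 60.0, 80.0, 55.0]  # hypothetical 1/4-octave scores
print(round(geometric_mean_ratio(quarter_oct_pc, control_pc), 3))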
The data are consistent with the hypothesis that the auditory system segregates the sequence based on speaker size. If the sequence were segregated into two streams, i.e., one from a “longer” vocal tract and the other from a “shorter” vocal tract, it would become difficult to perceive the correct order of the vowels. Because the task was to identify each sequence in the correct order, performance would suffer if the sequence broke into two streams. It is reasonable to assume that the segregation is augmented when the separation of the sizes becomes larger, as well as when the alternation occurs faster, as in the case of pitch-based segregation.
Fig. 1 Percent correct sequence identification plotted as a function of the modulation depth, with the modulation speed as the parameter

It could also be that the observed deterioration with increasing size modulation was caused by a constraint of Mellin Image construction, i.e., a limit on the temporal resolution of the process. This seems unlikely, however, for the following reason. The two lines in Figs. 1 and 2 are almost parallel, which indicates that there is no interaction between these factors. If the observed deterioration with increasing size modulation were caused by a restriction in the rate of Mellin Image construction, we would expect an interaction.

Fig. 2 Ratio of percent correct in the test and control conditions as a function of modulation depth, with modulation speed as a parameter
The segregation hypothesis implies that the size information is properly extracted and normalized. This predicts that the identification of individual vowels will not suffer significantly from the size modulation. The purpose of Experiment 2 was to check this prediction by requiring listeners to detect a target vowel in size-modulated sequences.
3 Experiment 2: Detection of a Target Vowel in Size-Modulated Vowel Sequences
The purpose of Experiment 2 was to investigate the identification of individual vowels in the size-modulated sequences. A target detection task was chosen to avoid judgments about order and to minimize the change in stimulus characteristics.
3.1 Stimulus
Half of the stimuli were identical to those in Experiment 1. They are called “positive” stimuli, i.e., stimuli containing a target vowel, which was either the third or the fourth vowel in the sequence. The “negative” stimuli were copies of the positive stimuli in which the target vowel was replaced by a different
vowel with the restriction that it was not the same as the first/last vowel, and not the same as the preceding or following vowel. For example, if the positive stimulus was “aeioua” with “i” as the target vowel, the negative stimulus was “aeuoua”.
The size modulation was applied in the same manner as in Experiment 1, using the two modulation depths, 1/4 and 1/2 octave. There were also control conditions with stimuli having no modulation. The speed of modulation was limited to the slow condition in this experiment.
3.2 Procedure
The task of the listeners was simply to say whether the target vowel existed in the sequence or not. At the start of each trial, the target vowel was displayed on the computer screen, on both positive and negative trials. Listeners gave their answer (yes or no) by clicking one of two buttons on the GUI. Feedback was provided at the end of each trial. Each listener was presented with 20 trials at each modulation depth (0, 1/4, or 1/2 octave) in each target position (3rd or 4th) for each of 40 pairs of positive and negative stimuli. Thus, there were 4800 trials per listener, run over 10 experimental sessions.
3.3 Listeners
Four students of Kyoto City University of Arts participated in the experiment. Their audiograms were normal, and they were paid for their participation.
3.4 Results and Discussion
The percentage of correct responses averaged over listeners and sequence type is shown in Table 2. The reduction in percent correct with increasing modulation depth was smaller than in Experiment 1. To evaluate the effect of size modulation, the ratio of percent correct in each test condition to that in the control condition was calculated, for each listener and each token, as in Experiment 1. The geometric mean is plotted as a function of modulation depth in Fig. 3, together with the results from Experiment 1. If there were no effect of size modulation, the scores would be close to unity. The effect of modulation depth is much smaller than in Experiment 1. This supports the segregation hypothesis, which assumes that the perceptual problem with the size-modulated stimuli was mainly caused by the difficulty in judging the correct order of the vowels due to stream segregation.

Table 2 Average percent correct in Experiment 2

Modulation depth     0 oct    1/4 oct    1/2 oct
Percent correct      99       96         87

Fig. 3 Ratio of percent correct in the modulated condition to that in the unmodulated condition, plotted as a function of the modulation depth, for both Experiments 1 and 2
4 General Discussion
The results of Experiment 1 suggest that it becomes difficult to perceive the vowels as a single sequence when the vocal tract size alternates segment by segment. On the other hand, it was less difficult to recognize a single vowel, as shown in Experiment 2. One can explain this task difference by assuming that the sequences were segregated on the basis of perceived size. For example, when a sequence like “aeuioa” was presented with the size factor alternating between “long” and “short” segment by segment, it would be perceived as two concurrent streams, i.e., a big person saying “a-u-o-” and a small person saying “-e-i-a”. It would then be difficult to judge whether the “e” in the second position came before the “u” in the third position or after it. Problems in judging order are typical when sounds are segregated into streams.
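The order-ambiguity argument can be made concrete with a small enumeration (a toy illustration, not part of the original analysis): once the within-stream orders “a-u-o” and “e-i-a” are all a listener retains, many full sequences are consistent with them, and the original “aeuioa” is only one of twenty.

from itertools import combinations

def consistent_interleavings(stream_a, stream_b):
    # All full sequences that preserve the internal order of both streams.
    n = len(stream_a) + len(stream_b)
    results = []
    for positions_a in combinations(range(n), len(stream_a)):
        it_a, it_b = iter(stream_a), iter(stream_b)
        seq = [next(it_a) if i in positions_a else next(it_b) for i in range(n)]
        results.append("".join(seq))
    return results

long_tract = "auo"   # odd-position vowels of "aeuioa" (the "big person")
short_tract = "eia"  # even-position vowels of "aeuioa" (the "small person")
options = consistent_interleavings(long_tract, short_tract)
print(len(options))         # 20 orderings are consistent with the two streams
print("aeuioa" in options)  # the original order is only one of them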
The fact that perceived speaker size functions as a streaming cue suggests that it is used in source identification, as might be expected. Although body size increases as animals mature, it is a very gradual
process, and over the course of a communication, size is normally fixed for a given source.
It is worth noting that the listeners in Experiment 2 had to organize the stimuli pre-attentively. Although the target vowel was presented visually at the start of the trial, they were not told the position of the target within its stream, nor which stream it would appear in. The listeners had to store and remember the two streams to make the judgment.
Acknowledgments. Work supported in part by the Grant-in-Aid for Scientific Research (C) No. 17530529, and (B) 18300060, JSPS. Author RP was supported by the UK MRC (G9900369, G0500221) during the research.
References
Bregman AS (1990) Auditory scene analysis: the perceptual organization of sound. MIT Press, Cambridge, MA
Irino T, Patterson R (2002) Segregating information about the size and shape of the vocal tract using a time-domain auditory model: The stabilised wavelet-Mellin transform. Speech Commun 36:181–203
Ives DT, Smith DRR, Patterson RD (2005) Discrimination of speaker size from syllable phrases. J Acoust Soc Am 118(6):3816–3822
Kawahara H, Irino T (2005) Underlying principles of a high-quality speech manipulation system STRAIGHT and its application to speech segregation. In: Divenyi P (ed) Speech separation by humans and machines. Kluwer Academic Publishers, Dordrecht, pp 167–180
Kawahara H, Masuda-Katsuse I, de Cheveigné A (1999) Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: possible role of a repetitive structure in sounds. Speech Commun 27:187–207
Tsuzaki M, Irino T (2004) Perception of size-modulated speech: the relation between the modulation period and the vowel identification. Trans Tech Committee Psychol Physiolog Acoust, Acoust Soc Jpn H-2004-125, 34
Comment by Divenyi
You are asking your subjects to identify a sequence of six vowels, modulated in the size domain, as two interleaved three-vowel sequences. You consider a correct identification as an indication that the two sizes did not form two streams. If, on the other hand, two streams do form, I think that one of the sizes (I guess the smaller and higher-pitched one) will be more salient than the other. If this is true, I think that a smart subject could produce a correct identification on every trial on which the sequence separates into two streams. He/she would do it by simply listening to the half sequence in the salient stream and reconstructing the whole sequence from that information, plus the knowledge that every sequence contains all five vowels and that the first and last vowels are identical.