Auditory Perception: An Analysis and Synthesis (Warren, 2008)



The ANSI recommendation considers that the pitch of any particular sound can be described in terms of the frequency of a sinusoidal tone judged to have the same pitch, so that pitch is limited to the audible frequency range of sinusoidal tones extending from about 20 through 16,000 Hz. However, acoustic repetition of waveforms can be perceived as a global percept at rates well below the pitch limit for waveforms other than sinusoids, and we will call such periodic sounds infratones or infratonal stimuli, and their corresponding sensory attribute infrapitch. Thus, the topic of detectable acoustic repetition involves both tonal and infratonal sounds producing sensations of pitch and infrapitch, respectively. The term iterance will be used as a general term encompassing the perceptual attributes of both pitch and infrapitch.

Classical pitch studies

The ancient Greeks appreciated that sounds correspond to vibratory movement of the air, and analogies were made between sound vibrations and water waves (see Hunt, 1978). They had an interest in the nature of pitch and the basis for musical intervals, and Pythagoras in the sixth century BCE noted that simple integral ratios of the length of two vibrating strings corresponded to the common intervals (e.g., a ratio of 2:1 for an octave, a ratio of 3:2 for a fifth). In the seventeenth century, Galileo noted that if the Greeks had varied the pitches produced by a string by altering either the diameter or the tension rather than length, then the pitch would have been found proportional to the square root of the physical dimension, and the octave and fifth would correspond to ratios of 4:1 and 9:4 respectively. Galileo then described an elegant experiment demonstrating that the octave did indeed correspond to a frequency ratio of 2:1. He observed that when the rim of a goblet containing water was stroked, standing waves appeared on the surface of the liquid. By slight changes in the manner of stroking it was possible to have the pitch jump an octave and, when that occurred, the standing waves changed in length by a factor of precisely two. Galileo also noted that when a hard metal point was drawn over the surface of a soft brass plate, a particular pitch could be heard while, at the same time, a series of grooves with a periodic pattern (basically, a phonographic recording) appeared on the brass surface. With the appropriate velocity and pressure of the stylus, two sets of periodic patterns corresponding to the musical interval of a fifth were generated. When the spacings of the patterns were compared, Galileo found them to have the ratio of 3:2, indicating that this was the ratio of acoustic periodicities producing this interval.
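The square-root relation that Galileo described can be checked with a short sketch. It assumes Mersenne's law for an ideal string, f = (1/2L)·sqrt(T/μ), which the text does not name; the function name and the numerical values are illustrative only.

```python
import math

def string_frequency(length_m, tension_n, mass_per_length_kg_m):
    """Fundamental frequency of an ideal string (Mersenne's law, assumed here)."""
    return math.sqrt(tension_n / mass_per_length_kg_m) / (2.0 * length_m)

# Hypothetical reference string: 1 m long, 100 N tension, 0.01 kg/m.
base = string_frequency(1.0, 100.0, 0.01)

# Halving the length raises the frequency by an octave (2:1).
octave_by_length = string_frequency(0.5, 100.0, 0.01) / base

# With the length fixed, the tension must change by the square of the
# frequency ratio: 4:1 for an octave and 9:4 for a fifth.
octave_by_tension = string_frequency(1.0, 400.0, 0.01) / base
fifth_by_tension = string_frequency(1.0, 225.0, 0.01) / base

print(octave_by_length, octave_by_tension, fifth_by_tension)
```

Under these assumptions a 4:1 change in tension yields the same 2:1 frequency ratio as a 2:1 change in length, and a 9:4 change in tension yields the 3:2 ratio of a fifth.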

66 Perception of acoustic repetition: pitch and infrapitch

Figure 3.1 The acoustic siren as used in Seebeck’s time. Compressed air passing through the tube (c) releases a puff of air each time it is aligned with a hole in the disk (A) which is rotated by a cord (f) passing over a grooved driveshaft (b). (From Helmholtz, 1954/1877.)

Modern experimental work on pitch perception may be considered to have started with Seebeck’s experiments with a siren (see Figure 3.1). By forcing puffs of compressed air through holes in a disk rotating at a constant speed, periodic sounds consisting of a variety of puff-patterns were produced corresponding to the choice of distances separating the holes. For example, when the disk contained holes separated by the distances a, then b, then a, etc. (a, b, a, b, a, . . . ), the pitch heard was equivalent to that produced by a disk containing half the number of holes with a single distance c (equal to a + b) separating adjacent openings. When the distances a and b were made equal, the apparent period was halved, and the pitch increased by one octave. As a result of these experiments with repeated patterns consisting of two puffs, as well as more complex patterns, Seebeck (1841) concluded that the pitches heard corresponded to the period of the overall repeated pattern. Thus, it appeared to him that the number of complete statements of the periodic waveform per second determined the pitch. Ohm (1843) stated that Seebeck’s contention that the pitches heard were based upon the repetition period of the overall pattern of puffs was incorrect, and that a Fourier analysis of the periodic signals into harmonic components took place with the pitch being determined by the frequency of the spectral fundamental. Seebeck (1843) countered by claiming that the spectral fundamental was not necessary for hearing a pitch equivalent to that frequency; he pointed out that, even when a spectral analysis showed that the fundamental was very weak or absent, the pitch of the fundamental (which corresponded to the waveform-repetition frequency) was still the dominant pitch heard. The spectra of some of the stimuli generated by Seebeck as summarized by Schouten (1940a) are illustrated in Figure 3.2. Seebeck suggested that the higher harmonic components might combine to cause the pitch corresponding to the fundamental to be heard, even when the fundamental was absent. As Schouten (1970) pointed out, this suggestion concerning the role of upper harmonics foreshadowed later nonspectral theories (including his own). In addition, there is more recent evidence that the stimuli employed by Seebeck (pulse trains with unequal alternate intervals) produce discharge patterns in auditory nerve fibers that are correlated with the pitches that are heard (see Evans, 1986).

Figure 3.2 Seebeck’s waveforms and their corresponding spectra. The spacing of the holes in the siren’s disk producing these sounds is given in degrees. The numbers used to describe the harmonics of the line spectra shown on the right are based upon a waveform period corresponding to 20° (shown as A). (Adapted from Schouten, 1940a.)
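The dispute between Seebeck and Ohm turns on how the spectrum of a puff pattern depends on the spacings a and b. The following sketch is an idealization, not a reconstruction of Seebeck's stimuli: each puff is treated as a unit impulse, the spacings 10/10 and 9.5/10.5 (in degrees of disk rotation) are illustrative values, and the function name is hypothetical.

```python
import cmath
import math

def harmonic_amplitudes(a, b, n_harmonics=4):
    """Relative amplitudes of harmonics 1..n of an idealized pulse train.

    One period (length a + b) holds two unit impulses, at 0 and at a.
    Harmonic numbers refer to the overall repetition period a + b.
    """
    period = a + b
    amplitudes = []
    for n in range(1, n_harmonics + 1):
        coeff = (1 + cmath.exp(-2j * math.pi * n * a / period)) / period
        amplitudes.append(abs(coeff))
    return amplitudes

equal = harmonic_amplitudes(10.0, 10.0)    # a = b: odd harmonics cancel
unequal = harmonic_amplitudes(9.5, 10.5)   # slightly unequal spacing

print(equal)
print(unequal)
```

With equal spacings the first harmonic of the period a + b vanishes and the second takes over as the lowest component, which corresponds to the octave rise described in the text. With slightly unequal spacings the first harmonic is present but much weaker than the second, the situation in which Seebeck nevertheless reported a pitch matching the overall repetition period.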

Ohm (1844) had dismissed Seebeck’s observations that a pitch could be heard corresponding to an absent or weak fundamental as merely an auditory illusion, to which Seebeck (1844) replied that the term “illusion” was inappropriate since only the ear could decide how tones should be heard. (For review of this controversy between Seebeck and Ohm, see Schouten (1970) and de Boer (1976).) In the second half of the nineteenth century, Helmholtz (1954/1877) backed Ohm’s position in this controversy. Considering the ear to be an imperfect spectral analyzer, Helmholtz described distortion products which were capable of generating the fundamental frequency within the ear, even when missing as a Fourier component of a periodic stimulus. Helmholtz was well aware that a complex tone appears to have a single pitch rather than a cluster of pitches corresponding to the individual harmonics. He attributed the perception of a single pitch to the adoption by unskilled listeners of a “synthetic” mode of listening to the entire complex aggregate of components (resulting in a single pitch corresponding to the fundamental frequency and having a timbre, or quality, reflecting the harmonic composition), rather than an “analytical” mode available to skilled listeners in which the pitches of component harmonics could be abstracted. Helmholtz stated that unskilled listeners could be trained to hear individual lower harmonics in a complex tone “with comparative ease.” One recommended procedure involved first playing the harmonic component by itself at a soft level, and then immediately substituting the complex tone at a louder level: the harmonic could then be heard to continue as a component within the complex tone.

Helmholtz’s position that the analysis of complex tones into a harmonic series of sinusoidal components was responsible for the pitch of complex tones had great influence, largely because he also offered a plausible explanation of how this spectral analysis could be accomplished by the ear. As described in Chapter 1, he suggested that the cochlea acted as if it contained a set of graded resonators, each of which responded selectively to a particular component frequency. Low frequencies were considered to produce sympathetic vibrations at the apical end, and high frequencies at the basal end of the cochlea. His first version of the theory identified the rods of Corti as the resonant bodies, but this was later amended to consider resonating transverse fibers embedded in the basilar membrane as being responsible for spectral analysis. We have seen in Chapter 1 that Békésy modified the basis of Helmholtz’s place theory from simple resonance to a traveling wave, with spectral analysis corresponding to the loci of maximal displacements produced by slow-velocity waves (much slower than sound waves) sweeping along the basilar membrane from base to apex. Chapter 1 also discussed recent evidence indicating that spectral analysis may involve not only the loci of maximal displacements along the basilar membrane, but also resonant tuning of the stereocilia of the receptor cells in a manner reminiscent of Helmholtz’s first version of his resonance theory involving the sympathetic vibration of the rods of Corti.

There appears to be general agreement today that sound is subject to nonlinear distortions within the ear as Helmholtz had suggested. These distortions can introduce harmonic components when the stimulus consists of a sinusoidal tone, and can produce “combination tones” through the interaction of pairs of component frequencies. Among the more thoroughly studied combination tones are the simple difference tone (f2 − f1), and the cubic difference tone (2f1 − f2). Thus, sinusoidal tones of 900 Hz and 1,100 Hz can produce a simple difference tone of 200 Hz and a cubic difference tone of 700 Hz.


Whereas Helmholtz attributed the production of nonlinear distortion to the movements of the tympanic membrane and the ossicular chain within the middle ear, more recent evidence has emphasized nonlinearity within the inner ear (see Rhode and Robles, 1974; Plomp, 1976; Cooper and Rhode, 1993).
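The combination-tone arithmetic given above for 900 Hz and 1,100 Hz can be written as a small helper. The function name and the dictionary keys are hypothetical; only the two formulas, f2 − f1 and 2f1 − f2 with f1 the lower frequency, come from the text.

```python
def combination_tones(f1, f2):
    """Simple and cubic difference tones for a pair of primary frequencies (Hz)."""
    low, high = sorted((f1, f2))
    return {
        "simple_difference": high - low,     # f2 - f1
        "cubic_difference": 2 * low - high,  # 2*f1 - f2
    }

tones = combination_tones(900, 1100)
print(tones)  # {'simple_difference': 200, 'cubic_difference': 700}
```

The sketch gives only the frequencies at which distortion products may appear; it says nothing about their level or about where in the ear they arise.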

Nonlinear distortions and the spectral analyses leading to pitch perception occur within the ear prior to neural stimulation. In addition, a temporal analysis of recurrent patterns of neural response appears to be involved in the perception of pitch. Before dealing with the temporal analysis of acoustic repetition, let us consider the topics of masking and critical bands which can help in understanding the nature of both frequency (place) and temporal (periodicity) coding.

Masking

It is well known that a louder sound can, under some conditions, mask (or prevent us from hearing) an otherwise audible fainter sound. Masking is not only a topic of direct practical interest, but also has been used widely to further our understanding of auditory processing.

Wegel and Lane (1924) used pure tones ranging from 200 Hz through 3,500 Hz as maskers. The masker was presented at a fixed SPL, and the threshold for a second sinusoidal (masked) tone was determined for various frequencies. Figure 3.3 presents their masked threshold function (sometimes called a masked audiogram) for different frequencies with a 1,200 Hz masker at 80 dB SPL. It can be observed that higher thresholds (corresponding to greater masking) were obtained for frequencies which were above rather than below the masking tone (the so-called upward spread of masking). The masked audiogram for frequencies near and at intervals above the frequency of the masker is marked by discontinuities or notches. There is general agreement on the basis for the notch centered on the masker frequency of 1,200 Hz: when two tones close to each other in frequency are mixed, a single pitch of intermediate frequency is perceived, and the loudness fluctuations of first-order beats are heard. The beat rate is equal to the difference in frequencies of the tones, so that if the tones are 1,200 Hz and 1,206 Hz, beats are heard at the rate of six per second, with loudness minima occurring when the pressure crests of one sinusoidal waveform coincide with pressure troughs of the other. The threshold for detecting the addition of a sinusoidal tone having the same frequency and phase as the louder “masker” is the just noticeable difference (jnd) in intensity, and this procedure has been used to measure jnds (see Reisz, 1928). The basis for the occurrence of notches at harmonics of the masker frequency is somewhat more controversial. Wegel and Lane attributed these dips to fluctuations in the intensity of the masked tone caused by interactions with aural harmonics of the masker (that is, harmonic distortion products generated within the ear), and these notches in the masking function have been used to estimate the extent of harmonic distortion (Fletcher, 1930; Opheim and Flottorp, 1955; Lawrence and Yantis, 1956). However, this explanation has been criticized on a variety of grounds (see Chocholle and Legouix, 1957a, 1957b; Meyer, 1957). Plomp (1967b) studied the perceptual interaction of tones mistuned slightly from consonance, and provided evidence suggesting that higher-order beats (that is, beats involving mistuning from a consonance other than unison) are based upon detection of periodic fluctuations in the phase relations of the spectral components, even when these components are separated by several octaves. This comparison of temporal or phase information from widely separated cochlear loci could be responsible for the notches at harmonics of the lower frequency tone (2,400 Hz and 3,600 Hz) that appear in Figure 3.3.

Figure 3.3 Masking of one tone by another. The 1,200 Hz masker has a fixed intensity (80 dB SPL), and the masked threshold is given as sensation level (or dB above unmasked threshold). Horizontal axis: frequency of masked tone (Hz), 400 to 3,600, with the masker frequency marked; vertical axis: sensation level of signal (dB), 0 to 90. (Adapted from Wegel and Lane, 1924.)
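The first-order beats described for the 1,200 Hz and 1,206 Hz tones follow from the identity sin A + sin B = 2 sin((A + B)/2) cos((A − B)/2). The sketch below assumes two tones of equal amplitude starting in phase, which is a simplification of the masking situation; the function names are hypothetical.

```python
import math

def beat_description(f1, f2):
    """Beat rate (per second) and the intermediate frequency (Hz) of the mixture."""
    return abs(f1 - f2), (f1 + f2) / 2.0

def envelope(t, f1, f2):
    """Envelope of sin(2*pi*f1*t) + sin(2*pi*f2*t) for equal amplitudes."""
    return abs(2.0 * math.cos(math.pi * (f1 - f2) * t))

rate, mean = beat_description(1200.0, 1206.0)
first_minimum_s = 0.5 / rate   # crests of one tone meet troughs of the other

print(rate, mean, first_minimum_s)
print(envelope(0.0, 1200.0, 1206.0), envelope(first_minimum_s, 1200.0, 1206.0))
```

Under these assumptions the mixture behaves like a 1,203 Hz tone whose amplitude rises and falls six times per second, with the first minimum one twelfth of a second after the in-phase start.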

The complicating effects of beats in masking experiments can be reduced by using narrow-band noise as the masker. Figure 3.4 shows masked audiograms measured by Egan and Hake (1950) using a narrow-band noise centered at 410 Hz presented at various intensity levels. The notches observed with tonal maskers, if present at all, are very much reduced in magnitude, and the tonal threshold curves are almost symmetrical about the logarithm of the center frequency of the band of masking noise when it is present at its lowest intensity levels. An increase in the level of the masking noise band results in an asymmetrical spread of excitation along the basilar membrane, and produces an upward spread of masking that becomes quite pronounced at the highest intensity level, as can be seen in Figure 3.4.

Figure 3.4 Masking of tones by different levels of a narrow-band noise centered at 410 Hz. (Adapted from Egan and Hake, 1950.)

In addition to simultaneous masking, there are two types of nonsimultaneous masking. In “forward” masking, a louder preceding sound prevents detection of a brief faint sound. In “backward” masking, a brief faint sound is made inaudible by a louder subsequent sound. Both types of nonsimultaneous masking usually have effective durations of less than 100 ms.

Forward masking may correspond in part to the time required for the receptors to regain their sensitivity and/or the persistence of activity after exposure to a louder sound. This time is quite short, usually only tens of milliseconds. Backward masking is more difficult to account for. One possible basis that has been suggested is that the subsequent louder sound produces neural activity which travels at a greater velocity and overtakes the fainter stimulus on the way to the central nervous system (R. L. Miller, 1947), thus effectively becoming a special case of simultaneous masking. It is also possible that central processing of the fainter sound takes an appreciable amount of time, and the louder sound disrupts the processing at some critical stage. (For a detailed discussion and comparison of the various types of masking, see Buus, 1997).


Critical bands

Fletcher (1940) interpreted earlier studies of masking as indicating that the basilar membrane operates as a bank of filters, each having a limited resolving power corresponding to what he called a “critical band.” He considered that louder sounds could prevent detection of (or mask) fainter sounds when they stimulated the same critical bands. Fletcher attempted to measure widths of critical bands at various center frequencies by using noise to mask tonal signals. The long-term average power in a 1 Hz wide band within a broadband noise is called the “noise power density” and abbreviated as N0. Fletcher started with broadband noise having a constant N0 at all frequencies (that is, white or Gaussian noise), and measured the masked threshold for a tone presented along with the noise. He decreased the bandwidth of the noise keeping N0 of the remaining noise fixed, and found that little or no effect was observed on the masked threshold when the frequencies removed from the noise were beyond a critical distance from the tone. Fletcher’s conclusion that a narrow “critical band” is mainly responsible for masking is now generally accepted, and has proved extremely valuable in understanding the interaction of components having different frequencies. However, Fletcher made the further assumption that the total power of a noise within a critical band is the same as the power of a tone centered within that band at its masked threshold. This assumption has been questioned by Scharf (1970), who suggested that estimates of the width of the critical band based on the ratio of tonal power at masked threshold to N0 are roughly 40 percent less than the width of the critical band measured by other methods involving direct frequency interactions on the basilar membrane. Values obtained using Fletcher’s method are sometimes called “critical ratios,” with the term “critical band” being reserved for values obtained with other procedures as shown in Figure 3.5. However, Spiegel (1981) has argued that Fletcher’s method is valid when used with suitable precautions, either for pure tones masked with noise of various bandwidths, or for narrow-band noise masked with wider-band noise. The widths of critical bands have also been calculated in terms of equivalent rectangular bands (ERBs) (see Moore, 2003).
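Fletcher's assumption, that the tone power at masked threshold equals the noise power within one band, means that a bandwidth can be read directly from two levels. The sketch below works in decibels; the measurement values (57 dB SPL threshold, 40 dB spectrum level) are hypothetical, the function name is an invention for illustration, and the factor of 2.5 is the one the caption of Figure 3.5 reports for converting Hawkins and Stevens's critical ratios.

```python
def critical_ratio_hz(tone_threshold_db, noise_spectrum_level_db):
    """Bandwidth (Hz) implied by Fletcher's equal-power assumption.

    tone power = N0 * bandwidth, so bandwidth = tone power / N0,
    which in decibels is a simple level difference.
    """
    return 10.0 ** ((tone_threshold_db - noise_spectrum_level_db) / 10.0)

# Hypothetical measurement: tone just detectable at 57 dB SPL
# in white noise with a spectrum level (N0) of 40 dB per Hz.
cr = critical_ratio_hz(57.0, 40.0)

# Rescaling a critical ratio by 2.5, as done for Figure 3.5.
cb_estimate = 2.5 * cr

print(round(cr, 1), round(cb_estimate, 1))
```

A 17 dB difference between threshold and spectrum level corresponds to a critical ratio of about 50 Hz; whether that figure should be rescaled, and by how much, is the point at issue between Fletcher's method and the other procedures.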

Greenwood (1961, 1990) has presented evidence that each critical band covers about the same distance on the basilar membrane in mammals, and has estimated the width corresponding to a critical band to be approximately 1 mm for humans.
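The claim that a critical band spans roughly 1 mm can be explored with a frequency-position function. The sketch assumes the form F = A(10^(ax) − k) with the human constants commonly quoted from Greenwood (1990), A = 165.4, a = 0.06 per mm and k = 0.88, where x is the distance from the apex in mm; these constants are not given in the text and the function names are hypothetical.

```python
import math

# Constants assumed from Greenwood (1990) for the human cochlea;
# x is the distance from the apex in millimeters.
A, ALPHA, K = 165.4, 0.06, 0.88

def place_to_frequency(x_mm):
    """Characteristic frequency (Hz) at a distance x_mm from the apex."""
    return A * (10.0 ** (ALPHA * x_mm) - K)

def frequency_to_place(freq_hz):
    """Distance from the apex (mm) at which freq_hz is represented."""
    return math.log10(freq_hz / A + K) / ALPHA

def bandwidth_of_one_mm(center_hz):
    """Frequency span (Hz) of a 1 mm segment centered on the place of center_hz."""
    x = frequency_to_place(center_hz)
    return place_to_frequency(x + 0.5) - place_to_frequency(x - 0.5)

for f in (500.0, 1000.0, 4000.0):
    print(f, round(bandwidth_of_one_mm(f), 1))
```

Under these assumptions a 1 mm segment centered on the place for 500 Hz spans a range on the order of the 100 Hz critical bandwidth cited for that frequency, and the span widens as the center frequency rises.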

Comodulation and masking reduction

As described in the previous section, when random or stochastic noise is presented along with a tone, only the spectral components of the noise that lie within a critical bandwidth of the tone have an appreciable effect in elevating the tone’s threshold. Thus, if a tone is maintained at the center of a noise band having a uniform power density and the bandwidth of the noise is expanded from an initial value below a critical bandwidth, the threshold of the tone increases only until the boundaries of the critical band are reached; further increase of bandwidth has no appreciable effect. However, under certain conditions, the presence of noise beyond the limits of the critical band can decrease the threshold of the tone (Hall, Haggard, and Fernandes, 1984). This “comodulation masking release” (CMR) has been the subject of several studies. Usually, the tone is centered in one noise band (the target band) and the influence of a second noise band (the cue band) is investigated. CMR occurs when the target band and the cue band are comodulated, that is, when the amplitudes of the two bands fluctuate together. Comodulation of different frequency bands in everyday life usually indicates a common sound source, but there is an uncertainty concerning the mechanism by which the correlation of amplitude changes within the two noise bands reduces the threshold for the tone (for a summary and extended discussion of CMR, see Moore, 2003).

Figure 3.5 The width of critical bands as a function of center frequency. Values are presented from several sources. Data of Hawkins and Stevens were transformed by multiplying their values (considered as “critical ratios”) by 2.5. (From Scharf, 1970.)
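What it means for two bands to be comodulated can be illustrated with a toy construction. This is a sketch only, not the stimulus design of any cited study: sinusoidal carriers stand in for the noise bands, the envelope is a random gain per 20 ms frame, and every name and parameter value is hypothetical.

```python
import math
import random

def random_envelope(rng, n_frames):
    """Piecewise-constant amplitude envelope, one random gain per frame."""
    return [rng.uniform(0.2, 1.0) for _ in range(n_frames)]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def modulated_band(envelope, carrier_hz, frame_s=0.02, sample_rate=8000):
    """Apply a frame-wise envelope to a sinusoidal carrier (stand-in for a noise band)."""
    samples_per_frame = round(frame_s * sample_rate)
    signal = []
    for i, gain in enumerate(envelope):
        for j in range(samples_per_frame):
            t = (i * samples_per_frame + j) / sample_rate
            signal.append(gain * math.sin(2 * math.pi * carrier_hz * t))
    return signal

rng = random.Random(0)
target_env = random_envelope(rng, 500)
independent_env = random_envelope(rng, 500)

# Comodulated pair: the cue band reuses the target band's envelope.
target_band = modulated_band(target_env[:50], 1000.0)
cue_band = modulated_band(target_env[:50], 2000.0)

comodulated_r = pearson(target_env, target_env)
independent_r = pearson(target_env, independent_env)
print(comodulated_r, independent_r)
```

The envelope correlation is 1 for the comodulated pair and close to 0 for independently fluctuating bands. The sketch shows only how such stimuli differ; it does not model the threshold reduction itself, whose mechanism the text describes as uncertain.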

Place theory of pitch

Plomp (1968) and de Boer (1976) have pointed out that there are two classical place theories: one considers that there is a Fourier analysis of limited resolution along the basilar membrane, with lower frequencies stimulating the apical end and higher frequencies the basal end; the other considers that pitch is determined by the places stimulated. The first place theory is supported by overwhelming evidence; the second seems to be only partly true: that is, place is not the sole determinant of pitch. For a sinusoidal tone, the locus of maximum stimulation changes regularly with frequency only from about 50 through 16,000 Hz, so that place cannot account for low pitches from 20 through 50 Hz. In addition, the pitch of a sinusoidal tone does not change appreciably with amplitude despite the neurophysiological evidence that the place of maximal excitation does change appreciably, shifting toward the base of the cochlea as the level is increased. Measurements reported by Chatterjee and Zwislocki (1997) indicate that this shift can correspond to a distance equivalent to as much as one or two octaves over an 80 dB range. Further, the very small just noticeable differences (jnds) in frequency of sinusoidal tones are difficult to account for by spectral resolution along the basilar membrane. Figure 3.5 shows that at 500 Hz the critical bandwidth is about 100 Hz, yet jnds having values less than 1 Hz have been reported (Nordmark, 1968; Moore, 1974). Although there are mechanisms based on place which have been proposed for discriminating tones separated by considerably less than a critical band (Békésy, 1960; Tonndorf, 1970; Zwicker, 1970), they have difficulties handling jnds for pure tones as small as those reported by Nordmark and by Moore, as well as the relatively small changes in jnds with frequencies from about 500 Hz to 2,000 Hz.
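The size of the mismatch between frequency jnds and critical bandwidth at 500 Hz can be put in numbers. The sketch takes the two values from the text (a critical band of about 100 Hz and a jnd of 1 Hz as an upper bound); expressing the jnd in cents is an added convenience, and the variable and function names are hypothetical.

```python
import math

def cents(f_low, f_high):
    """Musical interval between two frequencies in cents (1,200 per octave)."""
    return 1200.0 * math.log2(f_high / f_low)

center_hz = 500.0
critical_band_hz = 100.0   # approximate value the text reads from Figure 3.5
jnd_hz = 1.0               # upper bound on the jnds cited in the text

weber_fraction = jnd_hz / center_hz       # relative frequency change
band_to_jnd = critical_band_hz / jnd_hz   # how many jnds fit in one band
jnd_cents = cents(center_hz, center_hz + jnd_hz)

print(weber_fraction, band_to_jnd, round(jnd_cents, 2))
```

On these figures a listener resolves changes of 0.2 percent in frequency, about one hundredth of the critical bandwidth, which is the gap that place mechanisms alone have difficulty explaining.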

Classical place theorists have encountered a number of problems in dealing with the pitch of complex tones. As we have seen, a single pitch corresponding to the spectral fundamental is generally heard despite the presence of