- •Preface
- •Introduction
- •1.1 Spatial coordinate systems
- •1.2 Sound fields and their physical characteristics
- •1.2.1 Free-field and sound waves generated by simple sound sources
- •1.2.2 Reflections from boundaries
- •1.2.3 Directivity of sound source radiation
- •1.2.4 Statistical analysis of acoustics in an enclosed space
- •1.2.5 Principle of sound receivers
- •1.3 Auditory system and perception
- •1.3.1 Auditory system and its functions
- •1.3.2 Hearing threshold and loudness
- •1.3.3 Masking
- •1.3.4 Critical band and auditory filter
- •1.4 Artificial head models and binaural signals
- •1.4.1 Artificial head models
- •1.4.2 Binaural signals and head-related transfer functions
- •1.5 Outline of spatial hearing
- •1.6 Localization cues for a single sound source
- •1.6.1 Interaural time difference
- •1.6.2 Interaural level difference
- •1.6.3 Cone of confusion and head movement
- •1.6.4 Spectral cues
- •1.6.5 Discussion on directional localization cues
- •1.6.6 Auditory distance perception
- •1.7 Summing localization and spatial hearing with multiple sources
- •1.7.1 Summing localization with two sound sources
- •1.7.2 The precedence effect
- •1.7.3 Spatial auditory perceptions with partially correlated and uncorrelated source signals
- •1.7.4 Auditory scene analysis and spatial hearing
- •1.7.5 Cocktail party effect
- •1.8 Room reflections and auditory spatial impression
- •1.8.1 Auditory spatial impression
- •1.8.2 Sound field-related measures and auditory spatial impression
- •1.8.3 Binaural-related measures and auditory spatial impression
- •1.9.1 Basic principle of spatial sound
- •1.9.2 Classification of spatial sound
- •1.9.3 Developments and applications of spatial sound
- •1.10 Summary
- •2.1 Basic principle of a two-channel stereophonic sound
- •2.1.1 Interchannel level difference and summing localization equation
- •2.1.2 Effect of frequency
- •2.1.3 Effect of interchannel phase difference
- •2.1.4 Virtual source created by interchannel time difference
- •2.1.5 Limitation of two-channel stereophonic sound
- •2.2.1 XY microphone pair
- •2.2.2 MS transformation and the MS microphone pair
- •2.2.3 Spaced microphone technique
- •2.2.4 Near-coincident microphone technique
- •2.2.5 Spot microphone and pan-pot technique
- •2.2.6 Discussion on microphone and signal simulation techniques for two-channel stereophonic sound
- •2.3 Upmixing and downmixing between two-channel stereophonic and mono signals
- •2.4 Two-channel stereophonic reproduction
- •2.4.1 Standard loudspeaker configuration of two-channel stereophonic sound
- •2.4.2 Influence of front-back deviation of the head
- •2.5 Summary
- •3.1 Physical and psychoacoustic principles of multichannel surround sound
- •3.2 Summing localization in multichannel horizontal surround sound
- •3.2.1 Summing localization equations for multiple horizontal loudspeakers
- •3.2.2 Analysis of the velocity and energy localization vectors of the superposed sound field
- •3.2.3 Discussion on horizontal summing localization equations
- •3.3 Multiple loudspeakers with partly correlated and low-correlated signals
- •3.4 Summary
- •4.1 Discrete quadraphone
- •4.1.1 Outline of the quadraphone
- •4.1.2 Discrete quadraphone with pair-wise amplitude panning
- •4.1.3 Discrete quadraphone with the first-order sound field signal mixing
- •4.1.4 Some discussions on discrete quadraphones
- •4.2 Other horizontal surround sounds with regular loudspeaker configurations
- •4.2.1 Six-channel reproduction with pair-wise amplitude panning
- •4.2.2 The first-order sound field signal mixing and reproduction with M ≥ 3 loudspeakers
- •4.3 Transformation of horizontal sound field signals and Ambisonics
- •4.3.1 Transformation of the first-order horizontal sound field signals
- •4.3.2 The first-order horizontal Ambisonics
- •4.3.3 The higher-order horizontal Ambisonics
- •4.3.4 Discussion and implementation of the horizontal Ambisonics
- •4.4 Summary
- •5.1 Outline of surround sounds with accompanying picture and general uses
- •5.2 5.1-Channel surround sound and its signal mixing analysis
- •5.2.1 Outline of 5.1-channel surround sound
- •5.2.2 Pair-wise amplitude panning for 5.1-channel surround sound
- •5.2.3 Global Ambisonic-like signal mixing for 5.1-channel sound
- •5.2.4 Optimization of three frontal loudspeaker signals and local Ambisonic-like signal mixing
- •5.2.5 Time panning for 5.1-channel surround sound
- •5.3 Other multichannel horizontal surround sounds
- •5.4 Low-frequency effect channel
- •5.5 Summary
- •6.1 Summing localization in multichannel spatial surround sound
- •6.1.1 Summing localization equations for spatial multiple loudspeaker configurations
- •6.1.2 Velocity and energy localization vector analysis for multichannel spatial surround sound
- •6.1.3 Discussion on spatial summing localization equations
- •6.1.4 Relationship with the horizontal summing localization equations
- •6.2 Signal mixing methods for a pair of vertical loudspeakers in the median and sagittal plane
- •6.3 Vector base amplitude panning
- •6.4 Spatial Ambisonic signal mixing and reproduction
- •6.4.1 Principle of spatial Ambisonics
- •6.4.2 Some examples of the first-order spatial Ambisonics
- •6.4.4 Recreating a top virtual source with a horizontal loudspeaker arrangement and Ambisonic signal mixing
- •6.5 Advanced multichannel spatial surround sounds and problems
- •6.5.1 Some advanced multichannel spatial surround sound techniques and systems
- •6.5.2 Object-based spatial sound
- •6.5.3 Some problems related to multichannel spatial surround sound
- •6.6 Summary
- •7.1 Basic considerations on the microphone and signal simulation techniques for multichannel sounds
- •7.2 Microphone techniques for 5.1-channel sound recording
- •7.2.1 Outline of microphone techniques for 5.1-channel sound recording
- •7.2.2 Main microphone techniques for 5.1-channel sound recording
- •7.2.3 Microphone techniques for the recording of three frontal channels
- •7.2.4 Microphone techniques for ambience recording and combination with frontal localization information recording
- •7.2.5 Stereophonic plus center channel recording
- •7.3 Microphone techniques for other multichannel sounds
- •7.3.1 Microphone techniques for other discrete multichannel sounds
- •7.3.2 Microphone techniques for Ambisonic recording
- •7.4 Simulation of localization signals for multichannel sounds
- •7.4.1 Methods of the simulation of directional localization signals
- •7.4.2 Simulation of virtual source distance and extension
- •7.4.3 Simulation of a moving virtual source
- •7.5 Simulation of reflections for stereophonic and multichannel sounds
- •7.5.1 Delay algorithms and discrete reflection simulation
- •7.5.2 IIR filter algorithm of late reverberation
- •7.5.3 FIR, hybrid FIR, and recursive filter algorithms of late reverberation
- •7.5.4 Algorithms of audio signal decorrelation
- •7.5.5 Simulation of room reflections based on physical measurement and calculation
- •7.6 Directional audio coding and multichannel sound signal synthesis
- •7.7 Summary
- •8.1 Matrix surround sound
- •8.1.1 Matrix quadraphone
- •8.1.2 Dolby Surround system
- •8.1.3 Dolby Pro-Logic decoding technique
- •8.1.4 Some developments on matrix surround sound and logic decoding techniques
- •8.2 Downmixing of multichannel sound signals
- •8.3 Upmixing of multichannel sound signals
- •8.3.1 Some considerations in upmixing
- •8.3.2 Simple upmixing methods for front-channel signals
- •8.3.3 Simple methods for ambient component separation
- •8.3.4 Model and statistical characteristics of two-channel stereophonic signals
- •8.3.5 A scale-signal-based algorithm for upmixing
- •8.3.6 Upmixing algorithm based on principal component analysis
- •8.3.7 Algorithm based on the least mean square error for upmixing
- •8.3.8 Adaptive normalized algorithm based on the least mean square for upmixing
- •8.3.9 Some advanced upmixing algorithms
- •8.4 Summary
- •9.1 Each order approximation of ideal reproduction and Ambisonics
- •9.1.1 Each order approximation of ideal horizontal reproduction
- •9.1.2 Each order approximation of ideal three-dimensional reproduction
- •9.2 General formulation of multichannel sound field reconstruction
- •9.2.1 General formulation of multichannel sound field reconstruction in the spatial domain
- •9.2.2 Formulation of spatial-spectral domain analysis of circular secondary source array
- •9.2.3 Formulation of spatial-spectral domain analysis for a secondary source array on spherical surface
- •9.3 Spatial-spectral domain analysis and driving signals of Ambisonics
- •9.3.1 Reconstructed sound field of horizontal Ambisonics
- •9.3.2 Reconstructed sound field of spatial Ambisonics
- •9.3.3 Mixed-order Ambisonics
- •9.3.4 Near-field compensated higher-order Ambisonics
- •9.3.5 Ambisonic encoding of complex source information
- •9.3.6 Some special applications of spatial-spectral domain analysis of Ambisonics
- •9.4 Some problems related to Ambisonics
- •9.4.1 Secondary source array and stability of Ambisonics
- •9.4.2 Spatial transformation of Ambisonic sound field
- •9.5 Error analysis of Ambisonic-reconstructed sound field
- •9.5.1 Integral error of Ambisonic-reconstructed wavefront
- •9.5.2 Discrete secondary source array and spatial-spectral aliasing error in Ambisonics
- •9.6 Multichannel reconstructed sound field analysis in the spatial domain
- •9.6.1 Basic method for analysis in the spatial domain
- •9.6.2 Minimizing error in reconstructed sound field and summing localization equation
- •9.6.3 Multiple receiver position matching method and its relation to the mode-matching method
- •9.7 Listening room reflection compensation in multichannel sound reproduction
- •9.8 Microphone array for multichannel sound field signal recording
- •9.8.1 Circular microphone array for horizontal Ambisonic recording
- •9.8.2 Spherical microphone array for spatial Ambisonic recording
- •9.8.3 Discussion on microphone array recording
- •9.9 Summary
- •10.1 Basic principle and implementation of wave field synthesis
- •10.1.1 Kirchhoff–Helmholtz boundary integral and WFS
- •10.1.2 Simplification of the types of secondary sources
- •10.1.3 WFS in a horizontal plane with a linear array of secondary sources
- •10.1.4 Finite secondary source array and effect of spatial truncation
- •10.1.5 Discrete secondary source array and spatial aliasing
- •10.1.6 Some issues and related problems on WFS implementation
- •10.2 General theory of WFS
- •10.2.1 Green’s function of Helmholtz equation
- •10.2.2 General theory of three-dimensional WFS
- •10.2.3 General theory of two-dimensional WFS
- •10.2.4 Focused source in WFS
- •10.3 Analysis of WFS in the spatial-spectral domain
- •10.3.1 General formulation and analysis of WFS in the spatial-spectral domain
- •10.3.2 Analysis of the spatial aliasing in WFS
- •10.3.3 Spatial-spectral division method of WFS
- •10.4 Further discussion on sound field reconstruction
- •10.4.1 Comparison among various methods of sound field reconstruction
- •10.4.2 Further analysis of the relationship between acoustical holography and sound field reconstruction
- •10.4.3 Further analysis of the relationship between acoustical holography and Ambisonics
- •10.4.4 Comparison between WFS and Ambisonics
- •10.5 Equalization of WFS under nonideal conditions
- •10.6 Summary
- •11.1 Basic principles of binaural reproduction and virtual auditory display
- •11.1.1 Binaural recording and reproduction
- •11.1.2 Virtual auditory display
- •11.2 Acquisition of HRTFs
- •11.2.1 HRTF measurement
- •11.2.2 HRTF calculation
- •11.2.3 HRTF customization
- •11.3 Basic physical features of HRTFs
- •11.3.1 Time-domain features of far-field HRIRs
- •11.3.2 Frequency domain features of far-field HRTFs
- •11.3.3 Features of near-field HRTFs
- •11.4 HRTF-based filters for binaural synthesis
- •11.5 Spatial interpolation and decomposition of HRTFs
- •11.5.1 Directional interpolation of HRTFs
- •11.5.2 Spatial basis function decomposition and spatial sampling theorem of HRTFs
- •11.5.3 HRTF spatial interpolation and signal mixing for multichannel sound
- •11.5.4 Spectral shape basis function decomposition of HRTFs
- •11.6 Simplification of signal processing for binaural synthesis
- •11.6.1 Virtual loudspeaker-based algorithms
- •11.6.2 Basis function decomposition-based algorithms
- •11.7.1 Principle of headphone equalization
- •11.7.2 Some problems with binaural reproduction and VAD
- •11.8 Binaural reproduction through loudspeakers
- •11.8.1 Basic principle of binaural reproduction through loudspeakers
- •11.8.2 Virtual source distribution in two-front loudspeaker reproduction
- •11.8.3 Head movement and stability of virtual sources in transaural reproduction
- •11.8.4 Timbre coloration and equalization in transaural reproduction
- •11.9 Virtual reproduction of stereophonic and multichannel surround sound
- •11.9.1 Binaural reproduction of stereophonic and multichannel sound through headphones
- •11.9.2 Stereophonic expansion and enhancement
- •11.9.3 Virtual reproduction of multichannel sound through loudspeakers
- •11.10.1 Binaural room modeling
- •11.10.2 Dynamic virtual auditory environments system
- •11.11 Summary
- •12.1 Physical analysis of binaural pressures in summing virtual source and auditory events
- •12.1.1 Evaluation of binaural pressures and localization cues
- •12.1.2 Method for summing localization analysis
- •12.1.3 Binaural pressure analysis of stereophonic and multichannel sound with amplitude panning
- •12.1.4 Analysis of summing localization with interchannel time difference
- •12.1.5 Analysis of summing localization at the off-central listening position
- •12.1.6 Analysis of interchannel correlation and spatial auditory sensations
- •12.2 Binaural auditory models and analysis of spatial sound reproduction
- •12.2.1 Analysis of lateral localization by using auditory models
- •12.2.2 Analysis of front-back and vertical localization by using a binaural auditory model
- •12.2.3 Binaural loudness models and analysis of the timbre of spatial sound reproduction
- •12.3 Binaural measurement system for assessing spatial sound reproduction
- •12.4 Summary
- •13.1 Analog audio storage and transmission
- •13.1.1 45°/45° Disk recording system
- •13.1.2 Analog magnetic tape audio recorder
- •13.1.3 Analog stereo broadcasting
- •13.2 Basic concepts of digital audio storage and transmission
- •13.3 Quantization noise and shaping
- •13.3.1 Signal-to-quantization noise ratio
- •13.3.2 Quantization noise shaping and 1-Bit DSD coding
- •13.4 Basic principle of digital audio compression and coding
- •13.4.1 Outline of digital audio compression and coding
- •13.4.2 Adaptive differential pulse-code modulation
- •13.4.3 Perceptual audio coding in the time-frequency domain
- •13.4.4 Vector quantization
- •13.4.5 Spatial audio coding
- •13.4.6 Spectral band replication
- •13.4.7 Entropy coding
- •13.4.8 Object-based audio coding
- •13.5 MPEG series of audio coding techniques and standards
- •13.5.1 MPEG-1 audio coding technique
- •13.5.2 MPEG-2 BC audio coding
- •13.5.3 MPEG-2 advanced audio coding
- •13.5.4 MPEG-4 audio coding
- •13.5.5 MPEG parametric coding of multichannel sound and unified speech and audio coding
- •13.5.6 MPEG-H 3D audio
- •13.6 Dolby series of coding techniques
- •13.6.1 Dolby digital coding technique
- •13.6.2 Some advanced Dolby coding techniques
- •13.7 DTS series of coding technique
- •13.8 MLP lossless coding technique
- •13.9 ATRAC technique
- •13.10 Audio video coding standard
- •13.11 Optical disks for audio storage
- •13.11.1 Structure, principle, and classification of optical disks
- •13.11.2 CD family and its audio formats
- •13.11.3 DVD family and its audio formats
- •13.11.4 SACD and its audio formats
- •13.11.5 BD and its audio formats
- •13.12 Digital radio and television broadcasting
- •13.12.1 Outline of digital radio and television broadcasting
- •13.12.2 Eureka-147 digital audio broadcasting
- •13.12.3 Digital radio mondiale
- •13.12.4 In-band on-channel digital audio broadcasting
- •13.12.5 Audio for digital television
- •13.13 Audio storage and transmission by personal computer
- •13.14 Summary
- •14.1 Outline of acoustic conditions and requirements for spatial sound intended for domestic reproduction
- •14.2 Acoustic consideration and design of listening rooms
- •14.3 Arrangement and characteristics of loudspeakers
- •14.3.1 Arrangement of the main loudspeakers in listening rooms
- •14.3.2 Characteristics of the main loudspeakers
- •14.3.3 Bass management and arrangement of subwoofers
- •14.4 Signal and listening level alignment
- •14.5 Standards and guidance for conditions of spatial sound reproduction
- •14.6 Headphones and binaural monitors of spatial sound reproduction
- •14.7 Acoustic conditions for cinema sound reproduction and monitoring
- •14.8 Summary
- •15.1 Outline of psychoacoustic and subjective assessment experiments
- •15.2 Contents and attributes for spatial sound assessment
- •15.3 Auditory comparison and discrimination experiment
- •15.3.1 Paradigms of auditory comparison and discrimination experiment
- •15.3.2 Examples of auditory comparison and discrimination experiment
- •15.4 Subjective assessment of small impairments in spatial sound systems
- •15.5 Subjective assessment of a spatial sound system with intermediate quality
- •15.6 Virtual source localization experiment
- •15.6.1 Basic methods for virtual source localization experiments
- •15.6.2 Preliminary analysis of the results of virtual source localization experiments
- •15.6.3 Some results of virtual source localization experiments
- •15.7 Summary
- •16.1.1 Application to commercial cinema and related problems
- •16.1.2 Applications to domestic reproduction and related problems
- •16.1.3 Applications to automobile audio
- •16.2.1 Applications to virtual reality
- •16.2.2 Applications to communication and information systems
- •16.2.3 Applications to multimedia
- •16.2.4 Applications to mobile and handheld devices
- •16.3 Applications to the scientific experiments of spatial hearing and psychoacoustics
- •16.4 Applications to sound field auralization
- •16.4.1 Auralization in room acoustics
- •16.4.2 Other applications of auralization technique
- •16.5 Applications to clinical medicine
- •16.6 Summary
- •References
- •Index
3. Scalable sampling rate profile
The scalable sampling rate profile is the simplest of the three profiles. It includes a gain control module and a limited-order TNS, and it does not involve prediction or intensity stereo coding. It can also provide a frequency-scalable signal.
AAC decoding is the inverse of the aforementioned coding process and is omitted here. Subjective experiments (Kirby et al., 1996) have indicated that at a compression ratio of 12:1 (at a sampling frequency of 48 kHz, with a bit rate of 64 kbit/s per channel or 320 kbit/s for five channels), the MPEG-2 AAC main profile provides an "indistinguishable" perceived quality. The overall perceived quality of MPEG-2 AAC at 320 kbit/s is better than that of MPEG-2 BC Layer II at 640 kbit/s, and the average quality of the latter is not better than that of MPEG-2 AAC at 256 kbit/s. For two-channel stereophonic signals, AAC at 96 kbit/s exhibits an average quality comparable with that of MPEG-1 Layer II at 192 kbit/s or Dolby Digital at 160 kbit/s (Herre and Dietz, 2008). Some earlier subjective experiments (Soulodre et al., 1998) demonstrated that, for two-channel stereophonic signals, AAC and Dolby Digital achieve the highest quality at 128 and 192 kbit/s, respectively. Therefore, MPEG-2 AAC is a highly efficient coding method.
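As a quick check on the quoted ratio, assuming 16-bit linear PCM as the uncompressed reference: 48 kHz × 16 bit = 768 kbit/s per channel, and 768 kbit/s ÷ 64 kbit/s = 12, i.e., a 12:1 compression ratio.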
13.5.4 MPEG-4 audio coding
MPEG-4 Audio is a low-bit-rate coding standard for multimedia communication and entertainment applications. It was formulated in 1995, with the first and second editions released in 1999 and 2000, respectively (Brandenburg and Bosi, 1997; Väänänen and Huopaniemi, 2004). MPEG-4 Audio combines previous techniques for high-quality audio coding, speech coding, and computer music with great flexibility and extensibility. It supports synthetic audio coding (such as computer music), natural audio coding (such as music and speech), and synthetic-natural hybrid coding.
MPEG-4 natural audio coding provides three schemes: parametric audio coding (Section 13.4.1), code-excited linear prediction (CELP), and general audio (waveform) coding. The first two schemes are appropriate for speech or audio coding at low bit rates. Parametric coding includes the tools of harmonic vector excitation coding (HVXC) as well as harmonic and individual lines plus noise (HILN). For natural audio with a sampling frequency higher than 8 kHz and a bit rate of 16–64 kbit/s (or higher), MPEG-4 Audio directly codes the waveform. The core scheme of waveform coding is the AAC described in Section 13.5.3, and the block diagram of MPEG-4 AAC is similar to that in Figure 13.18. In comparison with MPEG-2 AAC, MPEG-4 AAC adds the perceptual noise substitution (PNS) and long-term prediction (LTP) tools. PNS aims to improve the coding efficiency for noise-like signals. When PNS is used, a noise substitution flag and a designation of the power of the coefficients are transmitted instead of the quantized spectral components, and the decoder inserts pseudo-random values scaled to the proper noise power level. Tone-like signals require a much higher coding resolution than noise-like signals; however, they are predictable because of their long-term periodicity. The LTP tool uses forward-adaptive long-term prediction to remove the redundancy among successive blocks.
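The decoder-side idea behind PNS can be sketched as follows. This is only an illustrative Python sketch assuming the transmitted parameter is the total noise energy of a scale-factor band; it is not the normative MPEG-4 PNS procedure, and the function name is hypothetical.

```python
import numpy as np

def pns_fill_band(num_coeffs, band_noise_energy, rng=None):
    """Fill one scale-factor band with pseudo-random spectral values whose
    total energy matches the signalled noise energy (illustrative sketch,
    not the normative MPEG-4 PNS procedure)."""
    rng = rng or np.random.default_rng()
    noise = rng.standard_normal(num_coeffs)                  # pseudo-random values
    noise *= np.sqrt(band_noise_energy / np.sum(noise ** 2))  # scale to target energy
    return noise

band = pns_fill_band(16, 2.5)   # 16 coefficients, signalled noise energy 2.5
print(np.sum(band ** 2))        # equals 2.5 up to floating-point error
```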
The first version of MPEG-4 high-efficiency AAC (MPEG-4 HE-AAC v1) was developed in 2003 to improve the coding efficiency for low-bit-rate audio (Herre and Dietz, 2008). Based on the architecture of MPEG-4 AAC, MPEG-4 HE-AAC v1 uses the SBR tool described in Section 13.4.6. At bit rates of 20, 32, and 48 kbit/s, the SBR ranges are 4.5–15.4, 6.8–16.9, and 8.3–16.9 kHz, respectively. Afterward, a parametric stereo coding module was combined into MPEG-4 HE-AAC, resulting in MPEG-4 HE-AAC v2. The bit streams for the side information of SBR and parametric stereo coding are transmitted in previously unused parts of
the AAC bit stream, enabling compatibility with existing AAC decoders. The typical bit rate for this side information is a few kilobits per second. The typical bit rate of HE-AAC v2 is 32 kbit/s for stereophonic sound and 160 kbit/s for 5.1-channel sound to achieve near-transparent audio quality (which plain AAC without extensions achieves at a bit rate of 320 kbit/s). At a bit rate of 24 kbit/s per channel, HE-AAC improves the coding efficiency by 25% compared with that of previous AAC. For the same quality, the bit rate of HE-AAC v1 is 33% higher than that of HE-AAC v2. The bit stream of HE-AAC supports up to 48 channels. Although HE-AAC is part of MPEG-4, it is not limited to interactive multimedia video and audio. Because HE-AAC possesses high coding efficiency, it can also be used independently for audio coding with strictly limited bandwidth, such as DAB and wireless music downloads on mobile phones.
One feature of MPEG-4 is that it allows object-based synthetic and natural audio coding. It treats every sound source signal (natural or synthesized) in the auditory scene as an independently transmitted object or element and then re-synthesizes the objects into a complete auditory (or, more precisely, audio-visual) scene at the user terminal. MPEG-4 adopts the Audio Binary Format for Scenes (audio BIFS) as a tool to describe sound scene parameters and to combine sound scenes, while retaining flexibility in defining the combination methods. Users can flexibly compile and combine these objects, and local interaction is allowed so that scenes can be synthesized from different viewing (listening) positions and angles. MPEG-4 supports virtual auditory environment applications, which have actually become part of MPEG-4, and substantial research has been devoted to such applications (Scheirer et al., 1999; Väänänen and Huopaniemi, 2004; Jot and Trivi, 2006; Dantele et al., 2003; Seo et al., 2003).
The second edition of MPEG-4 provides parameters that describe three-dimensional acoustic environments in advanced audio BIFS, including the parameters of rectangular rooms (e.g., room size and frequency-dependent reverberation time), the parameters of sound source characteristics (e.g., frequency-dependent directivity, position, and intensity), and the acoustic parameters of surface materials (e.g., frequency-dependent reflection or absorption coefficients). Auditory scenes are synthesized at the user's terminal from these parameters. Because MPEG-4 does not specify sound synthesis and reproduction methods, many types of sound synthesis and reproduction technologies can be adopted, depending on the application requirements and the hardware performance at the user's terminal. A real-time and dynamic virtual auditory environment system (Section 11.10) is usually an appropriate choice. In this case, a listener's movement in the virtual space causes changes in the binaural signals, and interactive signal processing is supported to simulate the dynamic behavior of binaural signals in accordance with the listener's instantaneous head orientation.
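The following hypothetical Python data structure illustrates the kind of information these environment parameters convey; the class and field names are purely illustrative and are not actual BIFS node names.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class Room:
    size_m: Tuple[float, float, float]   # rectangular room (length, width, height) in m
    rt60_s: Dict[float, float]           # frequency in Hz -> reverberation time in s

@dataclass
class Source:
    position_m: Tuple[float, float, float]
    intensity_db: float
    directivity: Dict[float, float]      # frequency in Hz -> directivity index

@dataclass
class Surface:
    absorption: Dict[float, float]       # frequency in Hz -> absorption coefficient

@dataclass
class AcousticScene:
    room: Room
    sources: List[Source] = field(default_factory=list)
    surfaces: List[Surface] = field(default_factory=list)

scene = AcousticScene(
    room=Room(size_m=(6.0, 4.0, 3.0), rt60_s={125.0: 0.8, 1000.0: 0.6, 4000.0: 0.4}),
    sources=[Source(position_m=(1.0, 2.0, 1.5), intensity_db=70.0,
                    directivity={1000.0: 3.0})],
    surfaces=[Surface(absorption={125.0: 0.10, 1000.0: 0.30, 4000.0: 0.55})],
)
```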
13.5.5 MPEG parametric coding of multichannel sound and unified speech and audio coding
After MPEG-2 AAC, MPEG-4 AAC, and MPEG-4 HE-AAC v1 and v2, MPEG released MPEG-D MPEG Surround (MPS) in 2007, a technique and standard that provides a generalized means for the highly efficient parametric coding of channel-based multichannel sound signals. As shown in Figure 13.20, in an MPS coder, the multichannel inputs are downmixed into mono or stereophonic signals and then coded. The MPS spatial parameters that describe the relationship among the multichannel inputs are extracted and transmitted as side information. In addition, residual signals containing the error of the parametric representation are calculated and coded by the low-complexity-profile MPEG-2 AAC. A decoder restores the multichannel signals from the coded downmix signals, spatial parameters, and residual signals by re-upmixing (ISO/IEC 23003-1, 2007; Hilpert and Disch, 2009; Villemoes et al., 2006; Breebaart et al., 2007; Breebaart and Faller, 2007; Herre et al., 2008). MPS supports up to 32 output channels.
Figure 13.20 Block diagram of MPEG-D MPEG Surround coding and decoding (adapted from Hilpert and Disch, 2009).
Signal downmixing in MPS coding provides downward compatibility with stereophonic sound. In addition, a two-channel matrix-compatible downmix similar to those in Section 8.1.4 can be chosen so that legacy receivers without MPS spatial parameter processing can still decode multichannel signals by conventional matrix decoding. MPS spatial parameters, such as the level difference and correlation between channels in the time-frequency domain, are evaluated from the outputs of the QMF bands and transmitted at a bit rate of 3–32 kbit/s or higher. Existing techniques, such as MPEG-4 AAC, MPEG-4 HE-AAC, or MPEG-1 Layer II, are applicable to the core coding of the downmixed mono or stereophonic signals in MPS. A subjective assessment experiment using the MUSHRA method in Section 15.5 indicated that, for MPS with HE-AAC as the core codec, the average perceived quality reaches the good region at a bit rate of 64 kbit/s, crosses the border of the excellent region (a score of 80 on the 100-point scale) at 96 kbit/s, and achieves excellent quality at 160 kbit/s.
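The parametric idea behind MPS can be sketched in highly reduced form for a single stereo pair treated as one time-frequency tile; real MPS operates on QMF subband samples, handles more channels, and drives a decorrelator with the correlation parameter. The function names below are illustrative, not part of the standard.

```python
import numpy as np

def stereo_to_parametric(left, right, eps=1e-12):
    """Mono downmix plus inter-channel level difference (dB) and correlation."""
    downmix = 0.5 * (left + right)
    e_l, e_r = np.sum(left ** 2) + eps, np.sum(right ** 2) + eps
    level_difference_db = 10.0 * np.log10(e_l / e_r)
    correlation = np.sum(left * right) / np.sqrt(e_l * e_r)
    return downmix, level_difference_db, correlation

def parametric_to_stereo(downmix, level_difference_db, correlation):
    """Naive upmix restoring only the level difference; the correlation would
    normally drive a decorrelator, omitted here."""
    g = 10.0 ** (level_difference_db / 20.0)
    left = downmix * 2.0 * g / (1.0 + g)
    right = downmix * 2.0 / (1.0 + g)
    return left, right

t = np.linspace(0.0, 1.0, 48000)
L = np.sin(2 * np.pi * 440 * t)          # left channel: 440 Hz tone
R = 0.5 * L                              # right channel: same tone, 6 dB lower
dmx, ld_db, rho = stereo_to_parametric(L, R)
L_hat, R_hat = parametric_to_stereo(dmx, ld_db, rho)
# L_hat and R_hat recover L and R exactly here because the channels are fully correlated.
```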
MPS is a parametric coding technique for channel-based spatial sound, and its decoder yields signals for a certain loudspeaker configuration. MPEG-D spatial audio object coding (SAOC), which was finalized in 2010, is a parametric coding technique for multiple objects (ISO/IEC 23003-2, 2010; Herre et al., 2012). As shown in Figure 13.21 (a), in a coder, multiple objects are downmixed into stereophonic or mono signals and then coded. At the same time, the parameters describing each object and the relations among the objects are extracted and transmitted as SAOC parameters (side information). An SAOC decoder comprises an object decoder and a mixer/renderer. The object decoder extracts the objects from the downmix bit stream according to the SAOC parameters. According to the side information of each object and the actual loudspeaker configuration, the mixer/renderer maps the object signals to loudspeaker signals through a rendering matrix. The object decoder and mixer can also be integrated into one stage to improve the decoding efficiency, as shown in Figure 13.21 (b).
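A minimal sketch of the mixer/renderer step, assuming already-decoded object signals and an illustrative (non-normative) rendering matrix for a five-loudspeaker layout:

```python
import numpy as np

# Stand-ins for three already-decoded object signals (rows), 1 s at 48 kHz.
num_objects, num_samples = 3, 48000
objects = np.random.randn(num_objects, num_samples)

# Illustrative rendering matrix for a 5-loudspeaker layout (L, R, C, Ls, Rs):
# each row gives the gains with which the objects feed one loudspeaker.
rendering_matrix = np.array([
    [1.0, 0.0, 0.5],   # L
    [0.0, 1.0, 0.5],   # R
    [0.7, 0.7, 0.0],   # C
    [0.0, 0.0, 0.3],   # Ls
    [0.0, 0.0, 0.3],   # Rs
])

loudspeaker_signals = rendering_matrix @ objects   # shape (5, 48000)
```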
SAOC-downmixed signals can be coded with existing coding schemes, such as HE-AAC. SAOC parameters include object level differences, inter-object correlations, downmix gains, and object energies. The SAOC object parameters are given at a certain time-frequency resolution and transmitted as ancillary data at a bit rate as low as 2–3 kbit/s per object or 3 kbit/s per scene. For objects whose audio quality needs to be enhanced, a residual signal (the difference between the parametric reconstruction and the original signal) is transmitted in the SAOC bit stream with an AAC-based scheme so that the decoder can reconstruct the object signal exactly.
SAOC has two decoding and rendering modes. The first is the SAOC decoder processing mode, which provides mono, stereophonic, and binaural outputs.
Figure 13.21 Block diagram of MPEG-D spatial audio object coding: (a) separate decoder and mixer; (b) integrated decoder and mixer (adapted from Herre et al., 2012).
Figure 13.22 Two SAOC decoding and rendering modes: (a) SAOC decoder processing mode; (b) transcoder processing mode (adapted from Herre et al., 2012).
As shown in Figure 13.22 (a), the SAOC bit stream, the rendering matrix, and the head-related transfer function (HRTF) parameters (for binaural output) are sent to an SAOC processor. A downmix processor then generates the output signals directly from the downmix signals and the output of the SAOC processor. In other words, object signal extraction, rendering, and even binaural synthesis are integrated into one stage to improve the processing efficiency (see the sketch following this paragraph). An open SAOC interface is also included, which enables users to supply varying HRTF parameters, and dynamic binaural synthesis with a head tracker is also allowed. The second mode is the SAOC transcoder processing mode, which provides multichannel outputs. As shown in Figure 13.22 (b), the SAOC downmix signals and parameters are converted to an MPS bit stream and parameters and then decoded by an MPS decoder.
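The efficiency gain from integrating object extraction and rendering into one stage can be illustrated with assumed matrix shapes; real SAOC processing is parameter-driven and performed per time-frequency tile, so the matrices below are only stand-ins.

```python
import numpy as np

num_speakers, num_objects, num_downmix, num_samples = 5, 8, 2, 48000
E = np.random.randn(num_objects, num_downmix)    # parametric object-extraction matrix
R = np.random.randn(num_speakers, num_objects)   # rendering matrix for the target layout
downmix = np.random.randn(num_downmix, num_samples)

# Two-stage processing: first extract the objects, then render them.
out_two_stage = R @ (E @ downmix)

# One-stage processing: combine the two matrices once and apply a single small
# (5 x 2) matrix to the downmix, which is what the integrated stage exploits.
out_one_stage = (R @ E) @ downmix

assert np.allclose(out_two_stage, out_one_stage)
```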
