
630  Spatial Sound

3. Scalable sampling rate profile

The scalable sampling rate profile is the simplest of the three profiles. It includes a gain control module and a limited-order TNS but involves neither prediction nor intensity stereo coding. It can also provide a frequency-scalable signal.

AAC decoding is the inverse of the aforementioned coding process and is omitted here. Subjective experiments (Kirby et al., 1996) have indicated that at a compression ratio of 12:1 (at a sampling frequency of 48 kHz, with a bit rate of 64 kbit/s per channel or 320 kbit/s for five channels), the MPEG-2 AAC main profile provides an "indistinguishable" perceived quality. The overall perceived quality of MPEG-2 AAC at 320 kbit/s is better than that of MPEG-2 BC Layer II at 640 kbit/s, and the average quality of the latter is not better than that of MPEG-2 AAC at 256 kbit/s. For two-channel stereophonic signals, AAC at 96 kbit/s exhibits an average quality comparable with that of MPEG-1 Layer II at 192 kbit/s or Dolby Digital at 160 kbit/s (Herre and Dietz, 2008). Earlier subjective experiments (Soulodre et al., 1998) demonstrated that AAC and Dolby Digital achieve the highest quality at 128 and 192 kbit/s, respectively, for two-channel stereophonic signals. Therefore, MPEG-2 AAC is a highly efficient coding method.
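The 12:1 compression ratio follows directly from the uncompressed PCM rate; a quick arithmetic check, assuming a 16-bit word length for the uncompressed reference:

```python
# Compression ratio of MPEG-2 AAC at 64 kbit/s per channel, taking
# 16-bit PCM at 48 kHz as the uncompressed reference (assumed word length).
fs = 48_000                      # sampling frequency, Hz
bits = 16                        # bits per sample (assumed PCM word length)
pcm_rate = fs * bits / 1000      # uncompressed rate: 768 kbit/s per channel
ratio = pcm_rate / 64            # coded at 64 kbit/s per channel
print(ratio)                     # → 12.0
```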

13.5.4  MPEG-4 audio coding

MPEG-4 Audio is a low-bit-rate coding standard for multimedia communication and entertainment applications. It was formulated in 1995, with the first and second editions released in 1999 and 2000, respectively (Brandenburg and Bosi, 1997; Väänänen and Huopaniemi, 2004). MPEG-4 Audio combines previous techniques of high-quality audio coding, speech coding, and computer music with great flexibility and extensibility. It supports synthetic audio coding (such as computer music), natural audio coding (such as music and speech), and synthetic-natural hybrid coding.

MPEG-4 natural audio coding provides three schemes: parametric audio coding (Section 13.4.1), code-excited linear prediction (CELP), and general audio (waveform) coding. The first two schemes are appropriate for speech or audio coding at low bit rates. Parametric coding includes the tools of harmonic vector excitation coding (HVXC) as well as harmonic and individual line plus noise (HILN). For natural audio with a sampling frequency higher than 8 kHz and a bit rate of 16–64 kbit/s (or higher), MPEG-4 Audio directly codes the waveform. The core scheme of waveform coding is the AAC described in Section 13.5.3. The block diagram of MPEG-4 AAC is similar to that in Figure 13.18. Compared with MPEG-2 AAC, MPEG-4 AAC adds the perceptual noise substitution (PNS) and long-term prediction (LTP) tools. The PNS tool aims to improve the coding efficiency for noise-like signals. When PNS is used, a noise substitution flag and a designation of the power of the coefficients are transmitted instead of quantized spectral components; the decoder inserts pseudo-random values scaled to the proper noise power level. Tone-like signals require a much higher coding resolution than noise-like signals; however, they are predictable because of their long-term periodicity. The LTP tool uses forward-adaptive long-term prediction to remove the redundancy among successive blocks.
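The PNS decoding step described above can be sketched as follows; `pns_decode` is an illustrative name, and the Gaussian fill with exact power normalization is an assumption for clarity rather than the normative AAC procedure:

```python
import numpy as np

def pns_decode(noise_power, n_coeffs, rng=None):
    """Reconstruct spectral coefficients for a PNS-coded scale-factor band.

    Instead of quantized spectral values, the bit stream carries only the
    band's noise power; the decoder fills the band with pseudo-random
    values scaled to that power (illustrative sketch).
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.standard_normal(n_coeffs)               # pseudo-random fill
    noise *= np.sqrt(noise_power / np.mean(noise ** 2)) # scale to target power
    return noise
```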

The first version of MPEG-4 high-efficiency AAC (MPEG-4 HE-AAC v1) was developed in 2003 to improve coding efficiency for low-bit-rate audio (Herre and Dietz, 2008). Based on the MPEG-4 AAC architecture, MPEG-4 HE-AAC v1 uses the SBR tool described in Section 13.4.6. At bit rates of 20, 32, and 48 kbit/s, the SBR ranges are 4.5–15.4, 6.8–16.9, and 8.3–16.9 kHz, respectively. Afterward, a parametric stereo coding module was integrated into MPEG-4 HE-AAC, resulting in MPEG-4 HE-AAC v2. The bit streams for the side information of SBR and parametric stereo coding are transmitted in previously unused parts of the AAC bit stream, enabling compatibility with existing AAC. The typical bit rate for this side information is a few kilobits per second. The typical bit rate of HE-AAC v2 is 32 kbit/s for stereophonic sound and 160 kbit/s for 5.1-channel sound to achieve near-transparent audio quality (which AAC without extension obtains at a bit rate of 320 kbit/s). At a bit rate of 24 kbit/s per channel, HE-AAC improves the coding efficiency by 25% compared with previous AAC. At the same quality, the bit rate of HE-AAC v1 is 33% higher than that of HE-AAC v2. The bit stream of HE-AAC supports up to 48 channels. Although HE-AAC is part of MPEG-4, its use is not limited to interactive multimedia video and audio. Because HE-AAC possesses high coding efficiency, it can be used independently for audio coding with a strictly limited bandwidth, such as DAB and wireless music downloads on mobile phones.

One feature of MPEG-4 is that it allows object-based synthetic and natural audio coding. It treats every sound source signal (natural or synthesized) in the auditory scene as an independently transmitted object or element and then re-synthesizes the objects into a complete auditory (or, more precisely, audio-visual) scene at the user terminal. MPEG-4 adopts the Audio Binary Format for Scene Description (AudioBIFS) as a tool to describe sound scene parameters and achieve scene composition while retaining flexibility in defining composition methods. Users can flexibly compile and combine these objects, and local interaction is allowed, with scenes synthesized for different viewing (listening) positions and angles. MPEG-4 supports virtual auditory environment applications, which have actually become part of MPEG-4. Substantial research has been devoted to such applications (Scheirer et al., 1999; Väänänen and Huopaniemi, 2004; Jot and Trivi, 2006; Dantele et al., 2003; Seo et al., 2003).

The second edition of MPEG-4 provides parameters that describe three-dimensional acoustic environments in advanced AudioBIFS, including the parameters of rectangular rooms (e.g., room size and frequency-dependent reverberation time), the parameters of sound source characteristics (e.g., frequency-dependent directivity, position, and intensity), and the acoustic parameters of surface materials (e.g., frequency-dependent reflection or absorption coefficients). Auditory scenes are synthesized from these parameters at the user's terminal. Because MPEG-4 does not specify sound synthesis and reproduction methods, many types of sound synthesis and reproduction technologies can be adopted, depending on the application requirements and the hardware performance of the user's terminal. A real-time, dynamic virtual auditory environment system (Section 11.10) is usually an appropriate choice. In this case, a listener's movement in the virtual space changes the binaural signals, and interactive signal processing is supported to simulate the dynamic behavior of binaural signals in accordance with the listener's instantaneous head orientation.
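Purely as an illustration (this is not actual AudioBIFS syntax), the three parameter groups named above could be organized as:

```python
from dataclasses import dataclass

@dataclass
class RoomAcoustics:
    """Rectangular-room parameters: size and frequency-dependent RT60."""
    size_m: tuple   # (length, width, height) in metres
    rt60_s: dict    # {band_centre_hz: reverberation time in seconds}

@dataclass
class SourceDescription:
    """Sound-source characteristics: position, intensity, directivity."""
    position_m: tuple   # (x, y, z) in the room
    intensity: float    # source level (illustrative linear gain)
    directivity: dict   # {band_centre_hz: directivity gain}

@dataclass
class SurfaceMaterial:
    """Acoustic surface material: frequency-dependent absorption."""
    absorption: dict    # {band_centre_hz: absorption coefficient, 0..1}
```

All field names here are hypothetical; the standard defines its own node names and encodings.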

13.5.5  MPEG parametric coding of multichannel sound and unified speech and audio coding

After MPEG-2 AAC, MPEG-4 AAC, and MPEG-4 HE-AAC v1 and v2, MPEG released MPEG-D MPEG Surround (MPS) in 2007, a generalized technique and standard for highly efficient parametric coding of channel-based multichannel sound signals. As shown in Figure 13.20, in an MPS coder, the multichannel inputs are downmixed into mono or stereophonic signals and then coded. The MPS spatial parameters that describe the relationships among the multichannel inputs are extracted and transmitted as side information. In addition, residual signals containing the error of the parametric representation are calculated and coded by the low-complexity-profile MPEG-2 AAC. A decoder restores the multichannel signals from the coded signals, spatial parameters, and residual signals by re-upmixing (ISO/IEC 23003-1, 2007; Hilpert and Disch, 2009; Villemoes et al., 2006; Breebaart et al., 2007; Breebaart and Faller, 2007; Herre et al., 2008). MPS supports up to 32 channel outputs.

Figure 13.20 Block diagram of MPEG-D MPEG Surround coding and decoding (adapted from Hilpert and Disch 2009).

Signal downmixing in MPS coding enables backward compatibility with stereophonic sound. In addition, a two-channel matrix-compatible downmix similar to those in Section 8.1.4 can be chosen so that legacy receivers without MPS spatial parameter processing can still decode multichannel signals by conventional matrix decoding. MPS spatial parameters, such as the level differences and correlations between channels in the time-frequency domain, can be evaluated from the outputs of the QMF bands and transmitted at a bit rate of 3–32 kbit/s or higher. Existing techniques, such as MPEG-4 AAC, MPEG-4 HE-AAC, and MPEG-1 Layer II, are applicable to the core coding of the downmixed mono or stereophonic signals in MPS. A subjective assessment experiment using the MUSHRA method described in Section 15.5 indicated that, with HE-AAC as the core coding, the average perceived quality of MPS reaches the good region at a bit rate of 64 kbit/s, crosses the border of the excellent region (a score of 80) at 96 kbit/s, and achieves excellent quality at 160 kbit/s.
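The extraction of MPS-style spatial parameters from a channel pair can be sketched for a single time-frequency tile; `mps_downmix_and_params` is a hypothetical helper, and the normative definitions in ISO/IEC 23003-1 differ in detail (quantization, tiling, and tree-structured channel pairing are omitted):

```python
import numpy as np

def mps_downmix_and_params(left, right, eps=1e-12):
    """Per-tile downmix plus illustrative MPS-style spatial parameters.

    For one time-frequency tile of a channel pair, compute the mono
    downmix, the channel level difference (CLD, in dB), and the
    inter-channel correlation (ICC) a parametric decoder would use
    to re-upmix.
    """
    downmix = 0.5 * (left + right)
    p_l = np.sum(np.abs(left) ** 2) + eps
    p_r = np.sum(np.abs(right) ** 2) + eps
    cld_db = 10.0 * np.log10(p_l / p_r)                          # level difference
    icc = np.real(np.sum(left * np.conj(right))) / np.sqrt(p_l * p_r)
    return downmix, cld_db, icc
```

For identical left and right signals, the sketch yields a CLD of 0 dB and an ICC of 1, as expected for a perfectly correlated, balanced pair.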

MPS is a parametric coding technique for channel-based spatial sound, and the decoder yields signals for a certain loudspeaker configuration. MPEG-D spatial audio object coding (SAOC), which was finalized in 2010, is a parametric coding technique for multiple objects (ISO/IEC 23003-2, 2010; Herre et al., 2012). As shown in Figure 13.21 (a), in a coder, multiple objects are downmixed into stereophonic or mono signals and then coded. At the same time, the parameters describing each object and the relations among objects are extracted and transmitted as SAOC parameters (side information). An SAOC decoder involves an object decoder and a mixer/renderer. The object decoder extracts the objects from the downmix bit stream according to the SAOC parameters. Based on the side information of each object and the actual loudspeaker configuration, the mixer/renderer mixes the object signals into loudspeaker signals by a rendering matrix. The object decoder and mixer can be integrated into one stage to improve decoding efficiency, as shown in Figure 13.21 (b).
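The mixer/renderer stage reduces to a matrix product; a minimal sketch, with the function name and array shapes chosen for illustration:

```python
import numpy as np

def saoc_render(objects, rendering_matrix):
    """Mix decoded object signals into loudspeaker feeds (illustrative).

    objects:          array of shape (n_objects, n_samples)
    rendering_matrix: array of shape (n_speakers, n_objects), chosen by
                      the mixer/renderer from the target loudspeaker
                      configuration and the per-object side information.
    Returns loudspeaker signals of shape (n_speakers, n_samples).
    """
    return rendering_matrix @ objects
```

For example, a 1 × 2 matrix of equal gains mixes two objects into a single mono feed.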

SAOC-downmixed signals can be coded with existing coding schemes, such as HE-AAC. SAOC parameters include object-level differences, inter-object correlations, downmixing gains, and object energies. The SAOC object parameters are given at a certain time-frequency resolution and transmitted as ancillary data at a bit rate as low as 2–3 kbit/s per object (or 3 kbit/s per scene). For objects whose audio quality needs to be enhanced, the residual signal (the difference between the parametric reconstruction and the original signal) is transmitted in the SAOC bit stream with an AAC-based scheme so that the decoder can reconstruct the object signal exactly.
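The object-level differences and residual signals described above can be sketched as follows; the names are illustrative, and the normative quantization and time-frequency tiling are omitted:

```python
import numpy as np

def saoc_object_params(objects, eps=1e-12):
    """Illustrative SAOC-style parameters for one time-frequency tile.

    Returns the object energies and object-level differences (OLDs),
    i.e., each object's energy relative to the strongest object, in dB.
    """
    energies = np.array([np.sum(np.abs(o) ** 2) for o in objects]) + eps
    old_db = 10.0 * np.log10(energies / energies.max())
    return energies, old_db

def saoc_residual(original, parametric_reconstruction):
    """Residual transmitted for objects requiring enhanced quality."""
    return original - parametric_reconstruction
```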

SAOC has two decoding and rendering modes. The first mode is the SAOC decoder processing mode, which provides mono, stereophonic, and binaural outputs. As shown in Figure 13.22 (a), the SAOC bitstream, rendering matrix, and head-related transfer function (HRTF)



Figure 13.21 Block diagram of MPEG-D spatial audio object coding: (a) separate decoder and mixer; (b) integrated decoder and mixer (adapted from Herre et al., 2012).


Figure 13.22 Two SAOC decoding and rendering modes: (a) SAOC decoder processing mode; (b) transcoder processing mode (adapted from Herre et al., 2012).

parameters (for binaural outputs) are sent to an SAOC processor. A downmix processor directly generates the output signals from the downmix signals and the output of the SAOC processor. In other words, object signal extraction, rendering, and even binaural synthesis are integrated into one stage to improve processing efficiency. An open SAOC interface is also included, which enables users to supply their own HRTF parameters, and dynamic binaural synthesis with a head tracker is also allowed. The second mode is the SAOC transcoder processing mode, which provides multichannel outputs. As shown in Figure 13.22 (b), the SAOC downmix signals and parameters are transcoded into an MPS bit stream and parameters and then decoded by an MPS decoder.
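Binaural synthesis from object signals can be sketched with HRIR convolution; this is a plain time-domain illustration with hypothetical names, whereas the actual SAOC processor operates on downmix signals in a filter-bank domain and fuses extraction, rendering, and binaural synthesis into one stage:

```python
import numpy as np

def binaural_render(objects, gains, hrirs_left, hrirs_right):
    """Binaural output from object signals (illustrative sketch).

    Each object is weighted by its rendering gain and convolved with the
    head-related impulse responses (HRIRs) for its virtual direction;
    the left- and right-ear contributions of all objects are summed.
    """
    n_out = objects.shape[1] + hrirs_left.shape[1] - 1
    left = np.zeros(n_out)
    right = np.zeros(n_out)
    for sig, g, hl, hr in zip(objects, gains, hrirs_left, hrirs_right):
        left += g * np.convolve(sig, hl)
        right += g * np.convolve(sig, hr)
    return left, right
```

With head tracking, the HRIR set would be updated as the listener's head orientation changes, which is what the open SAOC interface permits.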