MPEG Audio proceeds by first applying a filter bank to the input, to break the input into its frequency components. In parallel, it applies a psychoacoustic model to the data, and this model is used in a bit-allocation block. Then the number of bits allocated is used to quantize the information from the filter bank. The overall result is that quantization provides the compression, and bits are allocated where they are most needed to lower the quantization noise below an audible level.

MPEG Layers

MP3 is a popular audio compression standard. The "3" stands for Layer 3, and "MP" stands for the MPEG - 1 standard. Recall that we looked at MPEG video compression in Chapter. However, the MPEG standard actually delineates three different aspects of multimedia: audio, video, and systems. MP3 forms part of the audio component of this first phase of MPEG. It was released in 1992 and resulted in the international standard ISO / IEC 11172 - 3, published in 1993.

MPEG audio sets out three downward - compatible layers of audio compression, each able to understand the lower layers. Each offers more complexity in the psychoacoustic model applied and correspondingly better compression for a given level of audio quality. However, an increase in complexity, and concomitantly in compression effectiveness, is accompanied by extra delay.

Layers 1 to 3 in MPEG Audio are compatible, because all layers include the same file header information. Layer 1 quality can be quite good, provided a comparatively high bitrate is available. The Digital Compact Cassette (DCC) used Layer 1. Layer 2 has more complexity and was proposed for use in digital audio broadcasting. Layer 3 (MP3) is most complex and was originally aimed at audio transmission over ISDN lines. Each of the layers also uses a different frequency transform.

Most of the complexity increase is at the encoder rather than at the decoder side, and this accounts for the popularity of MP3 players. Layer 1 incorporates the simplest psychoacoustic model, and Layer 3 uses the most complex. The objective is a good tradeoff between quality and bitrate. "Quality" is defined in terms of listening test scores (the psychologists hold sway here), where a quality measure is defined by:

  • 5.0 = "Transparent": undetectable difference from original signal; equivalent to CD-quality audio at 14- to 16-bit PCM
  • 4.0 = Perceptible difference, but not annoying
  • 3.0 = Slightly annoying
  • 2.0 = Annoying
  • 1.0 = Very annoying

(Now that's scientific!) At 64 kbps per channel, Layer 2 scores between 2.1 and 2.6, and Layer 3 scores between 3.6 and 3.8. So Layer 3 provides a substantial improvement but is still not perfect by any means.

MPEG Audio Strategy

Compression is certainly called for, since even audio can take fairly substantial bandwidth: CD audio is sampled at 44.1 kHz and 16 bits / channel, so for two channels needs a bitrate of about 1.4 Mbps. MPEG - 1 aims at about 1.5 Mbps overall, with 1.2 Mbps for video and 256 kbps for audio.
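The CD figure quoted above follows directly from sampling rate, sample size, and channel count. A minimal sketch of that arithmetic (the function name is just for illustration):

```python
def pcm_bitrate_bps(sample_rate_hz, bits_per_sample, channels):
    """Uncompressed PCM bitrate = sampling rate x bits per sample x channels."""
    return sample_rate_hz * bits_per_sample * channels

cd_bps = pcm_bitrate_bps(44_100, 16, 2)
print(f"CD audio: {cd_bps} bps = {cd_bps / 1e6:.2f} Mbps")
```

This gives 1,411,200 bps, which is the "about 1.4 Mbps" in the text and almost the entire 1.5 Mbps MPEG-1 budget before any compression.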

The MPEG approach to compression relies on quantization, of course, but also recognizes that the human auditory system is not accurate within the width of a critical band, both in terms of perceived loudness and audibility of a test frequency. The encoder employs a bank of filters that act to first analyze the frequency (spectral) components of the audio signal by calculating a frequency transform of a window of signal values. The bank of filters decomposes the signal into subbands. Layer 1 and Layer 2 codecs use a quadrature-mirror filter bank, while the Layer 3 codec adds a DCT. For the psychoacoustic model, a Fourier transform is used.

Then frequency masking can be brought to bear by using a psychoacoustic model to estimate the just noticeable noise level. In its quantization and coding stage, the encoder balances the masking behavior and the available number of bits by discarding inaudible frequencies and scaling quantization according to the sound level left over, above masking levels.

A sophisticated model would take into account the actual width of the critical bands centered at different frequencies. Within a critical band, our auditory system cannot finely resolve neighboring frequencies and instead tends to blur them. As mentioned earlier, audible frequencies are usually divided into 25 main critical bands, inspired by the auditory critical bands.

However, in keeping with design simplicity, the model adopts a uniform width for all frequency analysis filters, using 32 overlapping subbands. This means that at lower frequencies, each of the frequency analysis "subbands" covers the width of several critical bands of the auditory system, whereas at higher frequencies this is not so, since a critical band's width is less than 100 Hz at the low end and more than 4 kHz at the high end. For each frequency band, the sound level above the masking level dictates how many bits must be assigned to code signal values, so that quantization noise is kept below the masking level and hence cannot be heard. In Layer 1, the psychoacoustic model uses only frequency masking.

Layer 1 bitrates range from 32 kbps (mono) to 448 kbps (stereo). Near-CD stereo quality is possible with a bitrate of 256-384 kbps. Layer 2 uses some temporal masking by accumulating more samples and examining temporal masking between the current block of samples and the ones just before and just after. Layer 2 bitrates can be 32-192 kbps (mono) and 64-384 kbps (stereo). Stereo CD-audio quality requires a bitrate of about 192-256 kbps.

(a) Basic MPEG Audio encoder; and (b) decoder

However, temporal masking is less important for compression than is frequency masking, which is why it is sometimes disregarded entirely in lower - complexity coders. Layer 3 is directed toward lower bitrate applications and uses a more sophisticated subband analysis, with nonuniform subband widths. It also adds nonuniform quantization and entropy coding. Bitrates are standardized at 32 - 320 kbps.

MPEG Audio Compression Algorithm

Basic Algorithm. The above figure shows the basic MPEG audio compression algorithm. It proceeds by dividing the input into 32 frequency subbands, via a filter bank. This is a linear operation that takes as its input a set of 32 PCM samples, sampled in time, and produces as its output 32 frequency coefficients. If the sampling rate is fs, say fs = 48 ksps (kilosamples per second; i.e., 48 kHz), then by the Nyquist theorem, the maximum frequency mapped will be fs / 2. Thus the mapped bandwidth is divided into 32 equal - width segments, each of width fs/64 (these segments overlap somewhat).
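As a quick check on the subband width, the nominal band edges can be listed for a given sampling rate. This sketch ignores the overlap between neighboring filters mentioned above and only computes the ideal, equal-width partition of 0 to fs/2; the function name is illustrative.

```python
def subband_edges(fs_hz, n_subbands=32):
    """Nominal (non-overlapping) edges of equal-width subbands covering 0..fs/2."""
    nyquist = fs_hz / 2
    width = nyquist / n_subbands  # equals fs / 64 when n_subbands is 32
    return [(k * width, (k + 1) * width) for k in range(n_subbands)]

edges = subband_edges(48_000)
print("subband width (Hz):", edges[0][1] - edges[0][0])
print("first subband:", edges[0])
print("last subband:", edges[-1])
```

At fs = 48 kHz each subband is 750 Hz wide, which already exceeds the width of a low-frequency critical band.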

In the Layer 1 encoder, the sets of 32 PCM values are first assembled into a set of 12 groups of 32s. Hence, the coder has an inherent time lag, equal to the time to accumulate 384 (i.e., 12 × 32) samples. For example, if sampling proceeds at 32 ksps (32 kHz), then a time duration of 12 msec is required, since one set of 32 samples arrives each millisecond. These sets of 12 groups, each of size 32, are called segments. The point of assembling them is to examine 12 sets of values at once in each of the 32 subbands, after frequency analysis has been carried out, then base quantization on just a summary figure for all 12 values.
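The buffering delay is simply the number of samples per frame divided by the sampling rate. The sketch below works this out for the 384-sample Layer 1 segment, and for the 1,152-sample frame that Layers 2 and 3 use; the function name is illustrative.

```python
def frame_duration_ms(samples_per_frame, fs_hz):
    """Time needed to accumulate one frame of PCM samples, in milliseconds."""
    return 1000.0 * samples_per_frame / fs_hz

layer1_ms = frame_duration_ms(12 * 32, 32_000)   # 384 samples at 32 kHz
layer23_ms = frame_duration_ms(1152, 32_000)     # 1,152 samples at 32 kHz
print(layer1_ms, layer23_ms)
```

At a 32 kHz sampling rate this gives 12 ms for Layer 1 and 36 ms for the longer frame, before any header or processing overhead.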

Example MPEG Audio frame

The delay is actually somewhat longer than that required to accumulate 384 samples, since header information is also required. As well, ancillary data, such as multilingual data and surround-sound data, is allowed. Higher layers also allow more than 384 samples to be analyzed, so the format of the subband-samples (SBS) is also added, with a resulting frame of data, as in the above figure. The header contains a synchronization code (twelve 1s: 111111111111), the sampling rate used, the bitrate, and stereo information. The frame format also contains room for so-called "ancillary" (extra) information. (In fact, an MPEG-1 audio decoder can at least partially decode an MPEG-2 audio bitstream, since the file header begins with an MPEG-1 header and places the MPEG-2 datastream into the MPEG-1 Ancillary Data location.)
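A decoder locates a frame by scanning for the twelve-1s synchronization code. The sketch below assumes the code is byte-aligned, so it appears as a 0xFF byte followed by a byte whose top four bits are all 1. The function name and the sample bytes are made up for illustration, and the remaining header fields are not parsed.

```python
def find_sync(data: bytes, start: int = 0) -> int:
    """Return the index of the first byte-aligned run of twelve 1 bits, or -1."""
    for i in range(start, len(data) - 1):
        if data[i] == 0xFF and (data[i + 1] & 0xF0) == 0xF0:
            return i
    return -1

stream = bytes([0x00, 0x12, 0xFF, 0xFB, 0x90, 0x00])  # hypothetical bytes
print(find_sync(stream))
```

A real decoder would also validate the fields that follow, since the same bit pattern can occur by chance inside audio data.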

MPEG Audio is set up to be able to handle stereo or mono channels, of course. A special joint-stereo mode produces a single stream by taking into account the redundancy between the two channels in stereo. This is the audio version of a composite video signal. It can also deal with dual-monophonic mode: two channels coded independently. This is useful for parallel treatment of audio, for example, two speech streams, one in English and one in Spanish.

Consider the 32 x 12 segment as a 32 x 12 matrix. The next stage of the algorithm is concerned with scale, so that proper quantization levels can be set. For each of the 32 subbands, the maximum amplitude of the 12 samples in that row of the array is found, which is the scaling factor for that subband. This maximum is then passed to the bit - allocation block of the algorithm, along with the SBS (subband samples). The key point of the bit - allocation block is to determine how to apportion the total number of code bits available for the quantization of subband signals to minimize the audibility of the quantization noise.
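The scaling step can be sketched as a row-wise maximum over the 32 × 12 matrix. This is only an illustration: it returns the raw maximum magnitude for each subband and leaves out the later step in which the scaling factor is itself quantized. The test signal is invented.

```python
import math

def scale_factors(segment):
    """segment: one row per subband, one column per sample.
    Returns the largest absolute amplitude found in each row."""
    return [max(abs(v) for v in row) for row in segment]

# Hypothetical 32 x 12 segment: amplitude falls off with subband index.
segment = [[math.sin(0.1 * (k + 1) * t) / (k + 1) for t in range(12)]
           for k in range(32)]
factors = scale_factors(segment)
print(len(factors), factors[:3])
```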

As we know, the psychoacoustic model is fairly complex — more than just a set of lookup tables (and in fact this model is not standardized in the specification — it forms part of the "art" content of an audio encoder and is one major reason all encoders are not the same). In Layer 1, a decision step is included to decide whether each frequency band is basically like a tone or like noise. From that decision and the scaling factor, a masking threshold is calculated for each band and compared with the threshold of hearing.

The model's output consists of a set of what are known as signal-to-mask ratios (SMRs) that flag frequency components with amplitude below the masking level. The SMR is the ratio of the short-term signal power within each frequency band to the minimum masking threshold for the subband. The SMR gives the amplitude resolution needed and therefore also controls the bit allocations that should be given to the subband. After determination of the SMR, the scaling factors discussed above are used to set quantization levels such that quantization error itself falls below the masking level. This ensures that more bits are used in regions where hearing is most sensitive. In sum, the coder uses fewer bits in critical bands when fewer can be used without making quantization noise audible.

MPEG Audio frame sizes

The scaling factor is first quantized, using 6 bits. The 12 values in each subband are then quantized. Using 4 bits, the bit allocations for each subband are transmitted, after an iterative bit allocation scheme is used. Then the data is transmitted, with appropriate bit depths for each subband. Altogether, the data consisting of the quantized scaling factor and the 12 codewords are grouped into a collection known as the Subband - Sample format.

On the decoder side, the values are de - quantized, and magnitudes of the 32 samples are reestablished. These are passed to a bank of synthesis filters, which reconstitute a set of 32 PCM samples. Note that the psychoacoustic model is not needed in the decoder.
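The quantize and de-quantize round trip for one subband can be sketched with a plain uniform quantizer. This is an assumption for illustration: samples are normalized by the scaling factor and rounded to signed integer codes, which is not necessarily the exact quantizer layout of the standard.

```python
def quantize(samples, scale, bits):
    """Normalize by the scaling factor and round to signed integer codes."""
    if bits < 2:
        raise ValueError("this sketch needs at least 2 bits per sample")
    half = (1 << (bits - 1)) - 1          # largest code magnitude, e.g. 7 for 4 bits
    if scale == 0:
        return [0] * len(samples)
    return [round(s / scale * half) for s in samples]

def dequantize(codes, scale, bits):
    """Decoder side: map integer codes back to amplitudes."""
    half = (1 << (bits - 1)) - 1
    return [c / half * scale for c in codes]

samples = [0.5, -0.25, 0.1, 0.0, -0.5]    # hypothetical subband values
scale, bits = 0.5, 4
codes = quantize(samples, scale, bits)
restored = dequantize(codes, scale, bits)
max_error = max(abs(a - b) for a, b in zip(samples, restored))
print(codes, max_error)
```

The reconstruction error is bounded by half a quantizer step, so each extra bit roughly halves the quantization noise in that subband.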

The above figure shows how samples are organized. A Layer 2 or Layer 3 frame actually accumulates more than 12 samples for each subband: instead of 384 samples, a frame includes 1,152 samples.

Bit Allocation. The bit-allocation algorithm is not part of the standard, and it can therefore be done in many possible ways. The aim is to ensure that all the quantization noise is below the masking thresholds. However, this is usually not the case for low bitrates. The psychoacoustic model is brought into play for such cases, to allocate more bits, from the number available, to the subbands where increased resolution will be most beneficial. One common scheme is as follows. For each subband, the psychoacoustic model calculates the Signal-to-Mask Ratio, in dB. A lookup table in the MPEG Audio standard also provides an estimate of the SNR (signal-to-noise ratio), assuming quantization to a given number of quantizer levels.

Mask - to - noise ratio and signal - to - mask ratio. A qualitative view of SNR, SMR and MNR, with one dominant masker and m bits allocated to a particular critical band

Then the Mask-to-Noise Ratio (MNR) is defined as the difference

$$\mathrm{MNR}_{\mathrm{dB}} = \mathrm{SNR}_{\mathrm{dB}} - \mathrm{SMR}_{\mathrm{dB}}$$

The lowest MNR is determined, over all the subbands, and the number of code - bits allocated to this subband is incremented. Then a new estimate of the SNR is made, and the process iterates until no more bits are left to allocate.
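The iteration just described can be sketched as a greedy loop. Several simplifications here are assumptions, not part of the standard: the SNR lookup table is replaced by the rule of thumb of about 6.02 dB per bit, each increment costs one extra bit for each of the 12 samples in the subband, and side-information bits are ignored. The SMR values are invented.

```python
def snr_db(bits):
    """Stand-in for the standard's SNR table: roughly 6.02 dB per bit."""
    return 6.02 * bits

def allocate_bits(smr_db, bit_pool, samples_per_band=12, max_bits=15):
    """Repeatedly give one more bit per sample to the subband with the lowest MNR."""
    bits = [0] * len(smr_db)
    cost = samples_per_band               # bits consumed by one increment
    while bit_pool >= cost:
        candidates = [b for b in range(len(bits)) if bits[b] < max_bits]
        if not candidates:
            break
        # MNR = SNR - SMR (in dB); the most negative band is the most audible.
        worst = min(candidates, key=lambda b: snr_db(bits[b]) - smr_db[b])
        bits[worst] += 1
        bit_pool -= cost
    return bits

smr = [20.0, 5.0, -10.0, 12.0]            # hypothetical SMRs for four subbands
allocation = allocate_bits(smr, bit_pool=72)
print(allocation)
```

With these inputs the subband whose signal is already 10 dB below the mask receives no bits at all, while the band with the largest SMR receives the most.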

Mask calculations are performed in parallel with subband filtering, as in the following figure. The masking curve calculation requires an accurate frequency decomposition of the input signal, using a Discrete Fourier Transform (DFT). The frequency spectrum is usually calculated with a 1,024 - point Fast Fourier Transform (FFT).

Layer 1. 16 uniform quantizers are pre - calculated, and for each subband the quantizer giving the lowest distortion is chosen. The index of the quantizer is sent as 4 bits of side information for each subband. The maximum resolution of each quantizer is 15 bits.

Layer 2. Layer 2 of the MPEG - 1 Audio codec includes small changes to effect bitrate reduction and quality improvement, at the price of an increase in complexity: The main difference in Layer 2 is that three groups of 12 samples are encoded in each frame, and temporal masking is brought into play, as well as frequency masking. One advantage is that if the scaling factor is similar for each of the three groups, a single scaling factor can be used for all three. But using three frames in the filter (before, current, and next), for a total of 1,152 samples per channel, approximates taking temporal masking into account.

MPEG - 1 Audio Layers 1 and 2

As well, the psychoacoustic model does better at modeling slowly - changing sound if the time window used is longer. Bit allocation is applied to window lengths of 36 samples instead of 12, and resolution of the quantizers is increased from 15 bits to 16. To ensure that this greater accuracy does not mean poorer compression, the number of quantizers to choose from decreases for higher subbands.

Layer 3. Layer 3, or MP3, uses a bitrate similar to Layers 1 and 2 but produces substantially better audio quality, again at the price of increased complexity.

A filter bank similar to that used in Layer 2 is employed, except that now perceptual critical bands are more closely adhered to by using a set of filters with nonequal frequencies. This layer also takes into account stereo redundancy. It also uses a refinement of the Fourier transform: the Modified Discrete Cosine Transform (MDCT) addresses problems the DCT has at boundaries of the window used. The Discrete Fourier Transform can produce block edge effects. When such data is quantized and then transformed back to the time domain, the beginning and ending samples of a block may not be coordinated with the preceding and subsequent blocks, causing audible periodic noise.

The MDCT also gives better frequency resolution for the masking and bit allocation operations. Optionally, the window size can be reduced back to 12 samples from 36. Even so, since the window is 50% overlapped, a 12 - sample window still includes an extra 6 samples. A size - 36 window includes an extra 18 points. Since lower frequencies are more often tone like rather than noise like, they need not be analyzed as carefully, so a mixed mode is also available, with 36 - point windows used for the lowest two frequency subbands and 12 - point windows used for the rest.
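The way 50% overlap removes block-edge artifacts can be checked numerically. The sketch below is a direct, unoptimized MDCT and inverse MDCT with a sine window, which is a common choice and an assumption here rather than the exact window shapes of the standard. Each block of 2m input samples yields m coefficients, and overlap-adding the inverse transforms of neighboring blocks cancels the aliasing. Here m = 6 and m = 18 stand for the short and long cases.

```python
import math

def sine_window(m):
    """Window of length 2m whose squared halves sum to one."""
    return [math.sin(math.pi * (n + 0.5) / (2 * m)) for n in range(2 * m)]

def mdct(block, window):
    m = len(block) // 2
    return [sum(window[n] * block[n]
                * math.cos(math.pi / m * (n + 0.5 + m / 2) * (k + 0.5))
                for n in range(2 * m))
            for k in range(m)]

def imdct(coeffs, window):
    m = len(coeffs)
    return [window[n] * (2.0 / m)
            * sum(coeffs[k] * math.cos(math.pi / m * (n + 0.5 + m / 2) * (k + 0.5))
                  for k in range(m))
            for n in range(2 * m)]

def roundtrip(signal, m):
    """Analyze with hop m (50% overlap), then overlap-add the inverse blocks."""
    window = sine_window(m)
    padded = [0.0] * m + list(signal) + [0.0] * m
    while len(padded) % m:
        padded.append(0.0)
    out = [0.0] * len(padded)
    for start in range(0, len(padded) - 2 * m + 1, m):
        block = imdct(mdct(padded[start:start + 2 * m], window), window)
        for n, value in enumerate(block):
            out[start + n] += value
    return out[m:m + len(signal)]

signal = [math.sin(0.3 * i) + 0.5 * (i % 5) for i in range(40)]  # invented test data
errors = {m: max(abs(a - b) for a, b in zip(signal, roundtrip(signal, m)))
          for m in (6, 18)}
print(errors)
```

Without quantization the reconstruction error is at floating-point level for both block sizes, even though each individual block, taken alone, does not reproduce its input.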

MPEG - 1 Audio Layer 3

As well, instead of assigning scaling factors to uniform - width subbands, MDCT coefficients are grouped in terms of the auditory system's actual critical bands, and scaling factors, called scale factor bands, are calculated from these.

More bits are saved by carrying out entropy coding and making use of nonuniform quantizers. And, finally, a different bit allocation scheme is used, with two parts. Firstly, a nested loop is used, with an inner loop that adjusts the shape of the quantizer, and an outer loop that then evaluates the distortion from that bit configuration. If the error ("distortion") is too high, the scale factor band is amplified. Second, a bit reservoir banks bits from frames that don't need them and allocates them to frames that do.
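The bit reservoir idea can be illustrated with a small simulation. The per-frame budget, the reservoir capacity, and the per-frame demands below are all invented numbers. Each frame may spend its own budget plus whatever earlier frames left unused, up to the capacity.

```python
def run_reservoir(demands, per_frame_budget, capacity):
    """Return the bits actually granted to each frame."""
    reservoir = 0
    grants = []
    for need in demands:
        available = per_frame_budget + reservoir
        used = min(need, available)
        grants.append(used)
        # Unused bits are banked for later frames, limited by the capacity.
        reservoir = min(capacity, available - used)
    return grants

grants = run_reservoir([800, 1200, 1500, 600], per_frame_budget=1000, capacity=500)
print(grants)
```

The second frame is able to exceed its nominal 1,000-bit budget only because the first frame needed less.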

The following table shows various achievable MP3 compression ratios. In particular, CD - quality audio is achieved with compression ratios in the range of 12:1 to 8:1 (i.e., bitrates of 128 to 192 kbps).
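The quoted ratios can be checked with simple arithmetic. They come out exactly as 12:1 and 8:1 if the uncompressed source is taken to be 48 kHz, 16-bit stereo; that reference rate is an inference from the numbers, not something stated above. For a 44.1 kHz CD source the ratios are slightly lower.

```python
def compression_ratio(sample_rate_hz, bits_per_sample, channels, coded_kbps):
    """Uncompressed PCM bitrate divided by the coded bitrate."""
    return sample_rate_hz * bits_per_sample * channels / (coded_kbps * 1000)

for fs in (48_000, 44_100):
    for kbps in (128, 192):
        ratio = compression_ratio(fs, 16, 2, kbps)
        print(f"{fs} Hz source at {kbps} kbps: {ratio:.2f}:1")
```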

MPEG - 2 AAC (Advanced Audio Coding)

The MPEG - 2 standard is widely employed, since it is the standard vehicle for DVDs, and it, too, has an audio component. The MPEG - 2 Advanced Audio Coding (AAC) standard was aimed at transparent sound reproduction for theaters. It can deliver this at 320 kbps for five channels, so that sound can be played from five directions: left, right, center, left - surround, and right - surround. So - called 5.1 channel systems also include a low - frequency enhancement (LFE) channel (a "woofer"). On the other hand, MPEG - 2 AAC is also capable of delivering high - quality stereo sound at bitrates below 128 kbps. It is the audio coding technology for the DVD - Audio Recordable (DVD - AR) format and is also adopted by XM Radio, one of the two satellite radio services in North America.

MPEG - 2 audio can support up to 48 channels, sampling rates between 8 kHz and 96 kHz, and bitrates up to 576 kbps per channel. Like MPEG - 1, MPEG - 2 supports three different "profiles", but with a different purpose. These are the Main, Low Complexity (LC), and the Scalable Sampling Rate (SSR). The LC profile requires less computation than the Main profile, but the SSR profile breaks up the signal so that different bitrates and sampling rates can be used by different decoders.

The three profiles follow mostly the same scheme, with a few modifications. First, an MDCT transform is carried out, either on a "long" window with 2,048 samples or a "short" window with 256 samples. The MDCT coefficients are then filtered by a Temporal Noise Shaping (TNS) tool, with the objective of reducing pre-masking effects and better encoding signals with stable pitch.

Table MP3 compression performance

The MDCT coefficients are then grouped into 49 scale factor bands, approximately equivalent to a good - resolution version of the human acoustic system's critical bands. In parallel with the frequency transform, a psychoacoustic model similar to the one in MPEG - 1 is carried out, to find masking thresholds.

The Main profile uses a predictor. Based on the previous two frames, and only for frequency coefficients up to 16 kHz, MPEG - 2 subtracts a prediction from the frequency coefficients, provided this step will indeed reduce distortion. Quantization is governed by two rules: keep distortion below the masking threshold, and keep the average number of bits used per frame controlled, using a bit reservoir. Quantization uses scaling factors - which can be used to amplify some of the scale factor bands - and nonuniform quantization. MPEG - 2 AAC also uses entropy coding for both scale factors and frequency coefficients.

Again, a nested loop is used for bit allocation. The inner loop adapts the nonlinear quantizer, then applies entropy coding to the quantized data. If the bit limit is reached for the current frame, the quantizer step size is increased to use fewer bits. The outer loop decides whether for each scale factor band the distortion is below the masking threshold. If a band is too distorted, it is amplified to increase the SNR of that band, at the price of using more bits.
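The inner (rate) loop can be sketched as follows. This is a heavily simplified illustration: a uniform quantizer stands in for the nonlinear one, a crude bit count stands in for entropy coding, and the growth factor for the step size is arbitrary. The outer distortion loop is not shown.

```python
def quantize_band(coeffs, step):
    """Uniform quantizer (an assumed stand-in for the nonlinear quantizer)."""
    return [round(c / step) for c in coeffs]

def estimated_bits(codes):
    """Crude stand-in for entropy coding: 1 bit for a zero,
    otherwise magnitude bits plus a sign bit."""
    return sum(1 if q == 0 else abs(q).bit_length() + 1 for q in codes)

def rate_loop(coeffs, bit_limit, step=1.0, growth=1.25):
    """Increase the step size until the coded frame fits within bit_limit."""
    if bit_limit < len(coeffs):
        raise ValueError("bit_limit cannot be met even with all-zero codes")
    while True:
        codes = quantize_band(coeffs, step)
        if estimated_bits(codes) <= bit_limit:
            return step, codes
        step *= growth

coeffs = [10.0, -7.5, 3.2, 0.4, -0.1, 0.0]   # hypothetical frequency coefficients
step, codes = rate_loop(coeffs, bit_limit=12)
print(step, codes, estimated_bits(codes))
```

A larger step size means coarser codes and fewer bits, which is exactly the trade the outer loop then has to check against the masking threshold.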

In the SSR profile, a Polyphase Quadrature Filter (PQF) bank is used. The meaning of this phrase is that the signal is first split into four frequency bands of equal width, and then an MDCT is applied. The point of the first step is that the decoder can decide to ignore one of the four frequency parts if the bitrate must be reduced.

MPEG - 4 Audio

MPEG - 4 audio integrates several different audio components into one standard: speech compression, perceptually based coders, text - to - speech, and MIDI. The primary general audio coder, MPEG - 4 AAC, is similar to the MPEG - 2 AAC standard, with some minor changes.

Perceptual Coders. One change is to incorporate a Perceptual Noise Substitution module, which looks at scale factor bands above 4 kHz and includes a decision as to whether they are noise like or tone like. A noise like scale factor band itself is not transmitted; instead, just its energy is transmitted, and the frequency coefficient is set to zero. The decoder then inserts noise with that energy.
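Perceptual Noise Substitution can be sketched in a few lines: the encoder keeps only the energy of a noise-like band, and the decoder fills the band with random values rescaled to that energy. Gaussian noise and the helper names here are illustrative assumptions.

```python
import math
import random

def band_energy(coeffs):
    """Total energy of a band: the only value the encoder sends for a noise-like band."""
    return sum(c * c for c in coeffs)

def pns_decode(energy, width, rng):
    """Decoder side: generate noise and rescale it to the transmitted energy."""
    noise = [rng.gauss(0.0, 1.0) for _ in range(width)]
    noise_energy = band_energy(noise)
    if noise_energy == 0.0:
        return [0.0] * width
    gain = math.sqrt(energy / noise_energy)
    return [gain * v for v in noise]

band = [0.3, -0.2, 0.5, 0.1]                 # hypothetical noise-like band
sent = band_energy(band)
filled = pns_decode(sent, len(band), random.Random(0))
print(sent, band_energy(filled))
```

The decoded coefficients differ from the originals, but the band carries the same energy, which is what matters perceptually for noise-like content.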

Another modification is to include a Bit-Sliced Arithmetic Coding (BSAC) module. This is an algorithm for increasing bitrate scalability, by allowing the decoder side to be able to decode a 64 kbps stream using only a 16 kbps baseline output (and steps of 1 kbps from that minimum).

MPEG-4 audio also includes a second perceptual audio coder, a vector-quantization method entitled Transform-domain Weighted Interleave Vector Quantization (TwinVQ). This is aimed at low bitrates and allows the decoder to discard portions of the bitstream to implement both adjustable bitrate and sampling rate. The basic strategy of MPEG-4 audio is to allow decoders to apply as many or as few audio tools as bandwidth allows.

Structured Coders. To have a low bitrate delivery option, MPEG - 4 takes what is termed a Synthetic / Natural Hybrid Coding (SNHC) approach. The objective is to integrate both "natural" multimedia sequences, both video and audio, with those arising synthetically. In audio, the latter are termed structured audio. The idea is that for low bitrate operation, we can simply send a pointer to the audio model we are working with and then send audio model parameters.

In video, such a model-based approach might involve sending face-animation data rather than natural video frames of faces. In audio, we could send the information that English is being modeled, then send codes for the base sounds (phonemes) of English, along with other assembler-like codes specifying duration and pitch.

MPEG - 4 takes a toolbox approach and allows specification of many such models. For example, Text - To - Speech (TTS) is an ultra - low bitrate method and actually works, provided we need not care what the speaker actually sounds like. Assuming we went on to derive Face Animation Parameters from such low bitrate information, we arrive directly at a very low bitrate videoconferencing system.

Table Comparison of audio coding systems

Another "tool" in structured audio is called Structured Audio Orchestra Language (SAOL, pronounced "sail"), which allows simple specification of sound synthesis, including special effects such as reverberation. Overall, structured audio takes advantage of redundancies in music to greatly compress sound descriptions.
