Vocoders

The coders (encoding/decoding algorithms) we have studied so far could have been applied to any signals, not just speech. Vocoders are specifically voice coders. As such, they cannot be usefully applied when other analog signals, such as modem signals, are in use.

Vocoders are concerned with modeling speech, so that the salient features are captured in as few bits as possible. They use either a model of the speech waveform in time (Linear Predictive Coding (LPC) vocoding), or else break down the signal into frequency components and model these (channel vocoders and formant vocoders). Incidentally, we likely all know that vocoder simulation of the voice is not wonderful yet — when the library calls you with your overdue notification, the automated voice is strangely lacking in zest.

Phase Insensitivity

Recall from Section that we can break down a signal into its constituent frequencies by analyzing it using some variant of Fourier analysis; in principle, we can also reconstitute the signal from the frequency coefficients developed that way. But it turns out that a complete reconstitution of the speech waveform is unnecessary, perceptually: all that is needed is for the amount of energy at any time to be about right, and the signal will sound about right. "Phase" is a shift in the time argument, inside a function of time.

Suppose we strike a piano key and generate a roughly sinusoidal sound cos(ωt), with ω = 2πf. If we wait long enough to generate a phase shift of π/2 and then strike another key, with sound cos(2ωt + π/2), we generate a waveform like the solid line in the following figure. This waveform is the sum cos(ωt) + cos(2ωt + π/2).

If we did not wait before striking the second note (a wait of 1/4 msec in the following figure), our waveform would be cos(ωt) + cos(2ωt). But perceptually, the two notes would sound the same, even though in actuality they would be shifted in phase.

Hence, if we can get the energy spectrum right (where we hear loudness and quiet), then we don't really have to worry about the exact waveform.
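To make the point concrete, here is a minimal Python sketch (the 440 Hz fundamental and 8 kHz sampling rate are arbitrary illustrative choices, not values from the text): the two waveforms below differ in phase, so their sample values differ, but their magnitude spectra, and hence their perceived sound, are the same.

```python
import numpy as np

fs = 8000                       # assumed sampling rate (Hz)
t = np.arange(fs) / fs          # one second of samples
w = 2 * np.pi * 440             # assumed fundamental frequency (rad/s)

x1 = np.cos(w * t) + np.cos(2 * w * t + np.pi / 2)   # second note shifted in phase by pi/2
x2 = np.cos(w * t) + np.cos(2 * w * t)               # both notes struck together

mag1 = np.abs(np.fft.rfft(x1))
mag2 = np.abs(np.fft.rfft(x2))

print(np.max(np.abs(x1 - x2)))      # large: the waveforms are clearly different
print(np.max(np.abs(mag1 - mag2)))  # ~0: the energy spectra are identical
```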

Channel Vocoder

Subband filtering is the process of applying a bank of band-pass filters to the analog signal, thus actually carrying out the frequency decomposition indicated in a Fourier analysis. Subband coding is the process of making use of the information derived from this filtering to achieve better compression. For example, an older ITU recommendation, G.722, uses subband filtering of analog signals into just two bands: voice frequencies from 50 Hz to 3.5 kHz, and from 3.5 kHz to 7 kHz.

The solid line shows the superposition of two cosines, with a phase shift. The dashed line shows the same with no phase shift. The waveforms are very different, yet the sound is the same, perceptually


Then the set of two signals is transmitted at 48 kbps for the low frequencies, where we can hear discrepancies well, and at only 16 kbps for the high frequencies. Vocoders can operate at low bitrates, 1 to 2 kbps. To do so, a channel vocoder first applies a filter bank to separate out the different frequency components, as in the following figure. However, as we saw above, only the energy is important, so the waveform is first "rectified" to its absolute value. The filter bank then derives relative power levels for each frequency range.
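As a rough sketch of this analysis step (the band edges, filter order, and frame length below are illustrative assumptions, not values from the text), the following Python fragment band-pass filters the signal, rectifies each band, and averages the result to get per-band power levels.

```python
import numpy as np
from scipy.signal import butter, sosfilt

fs = 8000
bands = [(200, 400), (400, 800), (800, 1600), (1600, 3200)]   # assumed band edges (Hz)
frame = 180                                                   # 22.5 msec at 8 kHz

def band_powers(speech):
    """Per-frame, per-band power levels of a speech signal."""
    n_frames = len(speech) // frame
    powers = np.zeros((n_frames, len(bands)))
    for b, (lo, hi) in enumerate(bands):
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        rectified = np.abs(sosfilt(sos, speech))    # keep only the energy envelope
        for k in range(n_frames):
            powers[k, b] = rectified[k * frame:(k + 1) * frame].mean()
    return powers

rng = np.random.default_rng(0)
print(band_powers(rng.standard_normal(fs)).shape)   # (frames, bands)
```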

A subband coder would not rectify the signal and would use wider frequency bands. A channel vocoder also analyzes the signal to determine the general pitch of the speech — low (bass), or high (tenor) — and also the excitation of the speech. Speech excitation is mainly concerned with whether a sound is voiced or unvoiced. A sound is unvoiced if its signal simply looks like noise. Sounds such as the vowels a, e, and o are voiced, and their waveform looks periodic. The o at the end of the word "audio" is fairly periodic.

During a vowel sound, air is forced through the vocal cords in a stream of regular, short puffs, occurring at the rate of 75 - 150 pulses per second for men and 150 - 250 per second for women. Consonants can be voiced or unvoiced. For the nasal sounds of the letters m and n, the vocal cords vibrate, and air is exhaled through the nose rather than the mouth. These consonants are therefore voiced. The sounds b, d, and g, in which the mouth starts closed but then opens to the following vowel over a transition lasting a few milliseconds, are also voiced.

The energy of voiced consonants is greater than that of unvoiced consonants but less than that of vowel sounds. Examples of unvoiced consonants include the sounds sh, th, and h when used at the front of a word.

A channel vocoder applies a vocal - tract transfer model to generate a vector of excitation parameters that describe a model of the sound. The vocoder also guesses whether the sound is voiced or unvoiced and, for voiced sounds, estimates the period (i.e., the sound's pitch).

Channel vocoder


Because voiced sounds can be approximated by sinusoids, a periodic pulse generator recreates voiced sounds. Since unvoiced sounds are noise-like, a pseudo-noise generator is applied, and all values are scaled by the energy estimates given by the band-pass filter set. A channel vocoder can achieve an intelligible but synthetic voice using 2,400 bps.
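A hedged sketch of the matching synthesis step follows: a pulse train (voiced) or noise (unvoiced) excites the same bank of band-pass filters, scaled by the transmitted band powers. The band edges, sizes, and test parameters mirror the analysis sketch above and are illustrative only.

```python
import numpy as np
from scipy.signal import butter, sosfilt

fs, frame = 8000, 180
bands = [(200, 400), (400, 800), (800, 1600), (1600, 3200)]

def synthesize(band_powers, voiced, pitch_period):
    out = np.zeros(len(band_powers) * frame)
    for k, levels in enumerate(band_powers):
        if voiced[k]:
            exc = np.zeros(frame)
            exc[::pitch_period[k]] = 1.0            # periodic pulse train
        else:
            exc = np.random.randn(frame)            # pseudo-noise excitation
        for (lo, hi), level in zip(bands, levels):
            sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
            out[k * frame:(k + 1) * frame] += level * sosfilt(sos, exc)
    return out

speech = synthesize(np.ones((10, 4)), voiced=[True] * 10, pitch_period=[60] * 10)
print(speech.shape)
```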

Formant Vocoder

It turns out that not all frequencies present in speech are equally represented. Instead, only certain frequencies show up strongly, and others are weak. This is a direct consequence of how speech sounds are formed, by resonance in only a few chambers of the mouth, throat, and nose. The important frequency peaks are called formants. The following figure shows how this appears: only a few peaks of energy, usually just four or so, are present at certain frequencies. However, just where the peaks occur changes over time as speech continues.

For example, two different vowel sounds would activate different sets of formants; this reflects the different vocal-tract configurations necessary to form each vowel. Usually, a small segment of speech is analyzed, say 10-40 msec, and the formants are found. A formant vocoder works by encoding only the most important frequencies. Formant vocoders can produce reasonably intelligible speech at only 1,000 bps.

Linear Predictive Coding

LPC vocoders extract salient features of speech directly from the waveform rather than transforming the signal to the frequency domain. LPC coding uses a time - varying model of vocal - tract sound generated from a given excitation. What is transmitted is a set of parameters modeling the shape and excitation of the vocal tract, not actual signals or differences.

Formants are the salient frequency components present in a sample of speech. Here, the solid line shows frequencies present in the first 40 msec of the speech sample in Figure. The dashed line shows that while similar frequencies are still present one second later, they have shifted


Since what is sent is an analysis of the sound rather than sound itself, the bitrate using LPC can be small. This is like using a simple descriptor such as MIDI to generate music: we send just the description parameters and let the sound generator do its best to create appropriate music. The difference is that as well as pitch, duration, and loudness variables, here we also send vocal tract excitation parameters.

After a block of digitized samples, called a segment or frame, is analyzed, the speech signal generated by the vocal-tract model is calculated as a function of the current excitation plus a term that is linear in previous output samples, weighted by the model coefficients. This is how "linear" in the coder's name arises. The model is adaptive: the encoder side sends a new set of coefficients for each new segment. The typical number of previous samples used is N = 10 (the "model order" is 10), and such an LPC-10 [3] system typically uses a rate of 2.4 kbps.

The model coefficients a_i act as predictor coefficients, multiplying previous speech output sample values. LPC starts by deciding whether the current segment is voiced or unvoiced. For unvoiced speech, a wide-band noise generator is used to create sample values f(n) that act as input to the vocal-tract simulator. For voiced speech, a pulse-train generator creates the values f(n). Model parameters a_i are calculated by a least-squares set of equations that minimize the difference between the actual speech and the speech generated by the vocal-tract model, excited by the noise or pulse-train generator. If the output values generated are denoted s(n) and the input values f(n), then the output depends on p previous output sample values, via

s(n) = Σ_{i=1}^{p} a_i s(n - i) + G f(n)

Here, G is known as the gain factor. Note that the coefficients a_i act as values in a linear predictor model.
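The equation above translates directly into code. A minimal sketch follows (the function name and toy values are illustrative): each output sample is a weighted sum of the p previous outputs plus the scaled excitation G f(n).

```python
import numpy as np

def lpc_synthesize(a, G, f):
    """a: predictor coefficients a_1..a_p, G: gain, f: excitation samples."""
    p = len(a)
    s = np.zeros(len(f))
    for n in range(len(f)):
        for i in range(1, p + 1):
            if n - i >= 0:
                s[n] += a[i - 1] * s[n - i]   # contribution of previous outputs
        s[n] += G * f[n]                      # scaled excitation
    return s

# Order-2 toy model excited by a single pulse (its impulse response)
print(lpc_synthesize(a=[1.3, -0.7], G=1.0, f=np.r_[1.0, np.zeros(19)]))
```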

The speech encoder works in a blockwise fashion. The input digital speech signal is analyzed in small, fixed-length segments called speech frames. For the LPC speech coder, the frame length is usually selected as 22.5 msec, which corresponds to 180 samples for 8 kHz sampled digital speech.

The speech encoder analyzes the speech frames to obtain parameters such as the LP coefficients a_i, i = 1..p, the gain G, the pitch P, and the voiced/unvoiced decision U/V.

To calculate the LP coefficients, we can solve the following minimization problem for a_j:

E = Σ_n [ s(n) - Σ_{j=1}^{p} a_j s(n - j) ]²

By taking the derivative with respect to a_i and setting it to zero, we get a set of p equations:

Σ_n [ s(n) - Σ_{j=1}^{p} a_j s(n - j) ] s(n - i) = 0,   i = 1, ..., p
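A minimal sketch of solving these equations by the autocorrelation method is shown below; numpy's generic linear solver stands in for the Levinson-Durbin recursion that practical coders use, and the frame content is random placeholder data.

```python
import numpy as np

def lp_coefficients(frame, p=10):
    # Autocorrelation values r[0..p] of the frame
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:]) for k in range(p + 1)])
    # Toeplitz system R a = r[1..p], i.e. the p normal equations above
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(R, r[1:p + 1])

rng = np.random.default_rng(1)
print(lp_coefficients(rng.standard_normal(180)))   # ten LP coefficients
```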

The searching range [P_min, P_max] is often selected as [12, 140] for 8 kHz sampled speech. Denote by P the lag at which the correlation measure v used for pitch detection peaks. If v(P) is less than some given threshold, the current frame is classified as unvoiced and will be reconstructed at the receiving end by excitation with a white-noise sequence. Otherwise, the frame is classified as voiced and is excited with a periodic waveform at the reconstruction stage.
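The pitch search and voiced/unvoiced decision just described might be sketched as follows, using a normalized correlation over lags 12 to 140; the 0.3 threshold is an illustrative assumption.

```python
import numpy as np

def pitch_and_uv(frame, p_min=12, p_max=140, threshold=0.3):
    best_v, best_P = -1.0, p_min
    for P in range(p_min, min(p_max, len(frame) - 1) + 1):
        x, y = frame[:-P], frame[P:]
        v = np.dot(x, y) / (np.sqrt(np.dot(x, x) * np.dot(y, y)) + 1e-12)
        if v > best_v:
            best_v, best_P = v, P
    return best_P, best_v >= threshold            # (pitch lag, voiced?)

n = np.arange(360)
rng = np.random.default_rng(2)
print(pitch_and_uv(np.sin(2 * np.pi * n / 80)))   # periodic frame: voiced, P = 80
print(pitch_and_uv(rng.standard_normal(360)))     # noise-like frame: typically unvoiced
```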

In practical LPC speech coders, the pitch estimation and U/V decision procedure are usually based on a dynamic programming scheme, so as to correct the frequently occurring pitch-doubling or pitch-halving errors of a single-frame scheme. In LPC-10, each segment is 180 samples, or 22.5 msec at 8 kHz. The speech parameters transmitted are the coefficients a_k; G, the gain factor; a voiced/unvoiced flag (1 bit); and the pitch period if the speech is voiced.

CELP

CELP, Code Excited Linear Prediction (sometimes Codebook Excited), is a more complex family of coders that attempts to mitigate the lack of quality of the simple LPC model by using a more complex description of the excitation. An entire set (a codebook) of excitation vectors is matched to the actual speech, and the index of the best match is sent to the receiver. This complexity increases the bitrate to 4,800 - 9,600 bps, typically.

In CELP, since all speech segments make use of the same set of templates from the template codebook, the resulting speech is perceived as much more natural than the two - mode excitation scheme in the LPC - 10 coder. The quality achieved is considered sufficient for audio conferencing. A low bitrate is required for conferencing, but the perceived quality of the speech must still be of an acceptable standard.

In CELP coders two kinds of prediction, Long Time Prediction (LTP) and Short Time Prediction (STP), are used to eliminate the redundancy in speech signals. STP is an analysis of samples — it attempts to predict the next sample from several previous ones. Here, redundancy is due to the fact that usually one sample will not change drastically from the next. LTP is based on the idea that in a segment of speech, or perhaps from segment to segment, especially for voiced sounds, a basic periodicity or pitch will cause a waveform that more or less repeats. We can reduce this redundancy by finding the pitch.

For concreteness, suppose we sample at 8,000 samples/sec and use a 10 msec frame, containing 80 samples. Then we can roughly expect a pitch that corresponds to an approximately repeating pattern every 12 to 140 samples or so. (Notice that the pitch period may actually be longer than the chosen frame size.) STP is based on a short-time LPC analysis, discussed in the last section.

It is "short - time" in that the prediction involves only a few samples, not a whole frame or several frames. STP is also based on minimizing the residue error over the whole speech frame, but it captures the correlation over just a short range of samples (10 for order - 10 LPC). After STP, we can subtract signal minus prediction to arrive at a differential coding situation. However, even in a set of errors e(n), the basic pitch of the sequence may still remain. This is estimated by means of LTP. That is, LTP is used to further eliminate the periodic redundancy inherent in the voiced speech signals.

Essentially, STP captures the formant structure of the short-term speech spectrum, while LTP recovers the long-term correlation in the speech signal that represents its periodicity. Thus there are always two stages, and the order is in fact usually STP followed by LTP, since we always start off assuming zero error and then remove the pitch component. (If we use a closed-loop scheme, STP is usually done first.) LTP proceeds using whole frames, or, more often, subframes equal to one quarter of a frame. The following figure shows these two stages.

LTP is often implemented as adaptive codebook searching. The "codeword" in the adaptive codebook is a shifted speech residue segment indexed by the lag τ corresponding to the current speech frame or subframe. The idea is to look in a codebook of waveforms to find one that matches the current subframe. We generally look in the codebook using a normalized subframe of speech, so as well as a speech segment match, we also obtain a scaling value (the gain).

The gain corresponding to the codeword is denoted g_0. There are two types of codeword search: open-loop and closed-loop. Open-loop adaptive codebook searching tries to minimize the long-term prediction error, rather than the perceptually weighted reconstructed speech error:

E(τ) = Σ_{n=0}^{L-1} [ s(n) - g_0 s(n - τ) ]²

By setting the partial derivative with respect to g_0 to zero, we get

g_0 = [ Σ_{n=0}^{L-1} s(n) s(n - τ) ] / [ Σ_{n=0}^{L-1} s²(n - τ) ]

CELP analysis model with adaptive and stochastic codebooks


and hence a minimum summed-error value

E_min(τ) = Σ_{n=0}^{L-1} s²(n) - [ Σ_{n=0}^{L-1} s(n) s(n - τ) ]² / Σ_{n=0}^{L-1} s²(n - τ)

Notice that the sample s(n - τ) could be in the previous frame. Now, to obtain the optimum adaptive codebook index τ, we can carry out a search exclusively in a small range determined by the pitch period.
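The open-loop lag search, the optimal gain g_0, and the minimum-error expression above fit together as in the following sketch; the function name, lag-range handling, and test data are illustrative assumptions.

```python
import numpy as np

def open_loop_ltp(history, subframe, lag_range=(20, 147)):
    """Return (tau, g0, error) for the best single-lag long-term predictor."""
    best = (None, 0.0, np.inf)
    buf = np.concatenate([history, subframe])
    L, off = len(subframe), len(history)
    for tau in range(lag_range[0], lag_range[1] + 1):
        past = buf[off - tau: off - tau + L]         # s(n - tau); may lie in a previous frame
        denom = np.dot(past, past)
        if denom <= 0.0:
            continue
        g0 = np.dot(subframe, past) / denom          # optimal gain from the formula above
        err = np.dot(subframe, subframe) - np.dot(subframe, past) ** 2 / denom
        if err < best[2]:
            best = (tau, g0, err)
    return best

rng = np.random.default_rng(3)
hist = rng.standard_normal(160)
sub = 0.9 * hist[-60:] + 0.05 * rng.standard_normal(60)   # roughly repeats with lag 60
print(open_loop_ltp(hist, sub))                           # expect tau = 60, g0 near 0.9
```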

More often, CELP coders use a closed - loop search. Rather than simply considering sum - of - squares, speech is reconstructed, with perceptual error minimized via an adaptive codebook search. So in a closed - loop, adaptive codebook search, the best candidate in the adaptive codebook is selected to minimize the distortion of locally reconstructed speech.

Parameters are found by minimizing a measure (usually the mean square) of the difference between the original and the reconstructed speech. Since this means that we are simultaneously incorporating synthesis as well as analysis of the speech segment, this method is also called analysis-by-synthesis, or A-B-S.

The residue signal after STP based on LPC analysis and LTP based on adaptive codeword searching is like white noise and is encoded by codeword matching in the stochastic (random or probabilistic) codebook.

This kind of sequential optimization of the adaptive codeword and stochastic codeword methods is used because jointly optimizing the adaptive and stochastic codewords is often too complex to meet real - time demands.

The decoding direction is just the reverse of the above process and works by combining the contribution from the two types of excitations.

DOD 4.8 kbps CELP (FS1016)

DOD 4.8 kbps CELP is an early CELP coder adopted as a U.S. federal standard to update the 2.4 kbps LPC-10e (FS1015) vocoder. This vocoder is now a basic benchmark against which other low-bitrate vocoders are tested. FS1016 uses an 8 kHz sampling rate and a 30 msec frame size. Each frame is further split into four 7.5 msec subframes. In FS1016, STP is based on an open-loop order-10 LPC analysis.

To improve coding efficiency, a fairly sophisticated type of transform coding is carried out, and quantization and compression are done in terms of the transform coefficients. First, in this field it is common to use the z-transform. Here, z is a complex number and represents a kind of complex "frequency": if z = e^(-2πi/N), the discrete z-transform reduces to a discrete Fourier transform. The z-transform makes Fourier transforms look like polynomials. Now we can write the error in a prediction equation

e(n) = s(n) - Σ_{i=1}^{p} a_i s(n - i)

in the z domain as

E(z) = A(z) S(z)

where E(z) is the z-transform of the error and S(z) is the transform of the signal. The term A(z) is the transfer function in the z domain, and equals

A(z) = 1 - Σ_{i=1}^{p} a_i z^(-i)

with the same coefficients a_i as appear in the prediction equation above. How speech is reconstructed, then, is via

S(z) = E(z) / A(z)

with the estimated error. For this reason, A(z) is usually stated in terms of 1 / A(z).

The idea of going to the z-transform domain is to convert the LP coefficients to Line Spectrum Pair (LSP) coefficients, which are defined in this domain. The reason is that the LSP space has several good properties with respect to quantization. LSP representation has become standard and has been applied to nearly all recent LPC-based speech coders, such as G.723.1, G.729, and MELP. To get the LSP coefficients, we construct two polynomials

P(z) = A(z) + z^(-(p+1)) A(z^(-1))
Q(z) = A(z) - z^(-(p+1)) A(z^(-1))

where p is the order of the LPC analysis and A(z) is the transfer function of the LP filter, with z the transform-domain variable. The z-transform is just like the Fourier transform but with a complex "frequency."

The roots of these two polynomials are spaced around the unit circle in the z plane and have mirror symmetry with respect to the x-axis. Assume p is even and denote the phase angles of the roots of P(z) and Q(z) above the x-axis as θ1 < θ2 < ... < θp/2 and φ1 < φ2 < ... < φp/2, respectively. Then the vector (cos θ1, cos φ1, cos θ2, cos φ2, ..., cos θp/2, cos φp/2) is the LSP coefficient vector, and the vector (θ1, φ1, θ2, φ2, ..., θp/2, φp/2) is usually called the Line Spectrum Frequency, or LSF. Based on the relationship A(z) = [P(z) + Q(z)]/2, we can reconstruct the LP coefficients at the decoder end from the LSP or LSF coefficients.
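A hedged sketch of the LP-to-LSF conversion is given below. It builds P(z) and Q(z) as defined above and reads off the root angles with numpy's generic root finder, which stands in for the efficient search used in real coders; the function name and the order-4 coefficients (a toy stable predictor) are illustrative assumptions.

```python
import numpy as np

def lsf_from_lpc(a):
    """a: LP coefficients a_1..a_p of A(z) = 1 - sum_i a_i z^-i."""
    A = np.concatenate(([1.0], -np.asarray(a)))       # coefficients of A(z)
    Afwd = np.concatenate((A, [0.0]))                 # A(z), padded to degree p+1
    Arev = np.concatenate(([0.0], A[::-1]))           # z^-(p+1) A(z^-1)
    P, Q = Afwd + Arev, Afwd - Arev
    def angles(poly):
        # Keep root angles strictly between 0 and pi (drop the trivial roots at +/-1)
        return np.sort([np.angle(r) for r in np.roots(poly)
                        if 1e-9 < np.angle(r) < np.pi - 1e-9])
    return angles(P), angles(Q)

theta, phi = lsf_from_lpc([1.5, -0.92, 0.282, -0.036])   # toy order-4 predictor
print(theta, phi)                                         # interleaved LSF angles
```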

Adaptive codebook searching in FS1016 is via a closed-loop search based on perceptually weighted errors. As opposed to considering just the mean squared error, here errors are weighted so as to take human perception into account. In terms of the z-transform, it is found that the following multiplier does a good job:

W(z) = A(z) / A(z/γ)

with a constant parameter γ.

The adaptive codebook has 256 codewords: 128 integer delays and 128 noninteger delays (at half-sample intervals, for better resolution), the former ranging from 20 to 147. To reduce search complexity, even subframes are searched in an interval relative to the previous odd subframe, and the difference is coded with 6 bits. The gain is nonuniformly scalar-quantized between -1 and 2 with 5 bits. A stochastic codebook search is applied for each of the four subframes.

The stochastic codebook of FS1016 is generated by clipping a unit-variance Gaussian random sequence at a threshold of absolute value 1.2 and quantizing to the three values -1, 0, and 1. The stochastic codebook has 512 codewords. The codewords are overlapped, and each is shifted by 2 samples with respect to the previous codeword. This kind of stochastic design is called an Algebraic Codebook. It has many variations and is widely applied in recent CELP coders.
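Under one reading of this construction (center-clipping at 1.2, so that small samples become 0 and the rest become ±1, with a 60-sample codeword matching the 7.5 msec subframe), the overlapped codebook could be generated as follows; these interpretations are assumptions, not taken verbatim from the text.

```python
import numpy as np

def make_stochastic_codebook(num_codewords=512, codeword_len=60, seed=0):
    rng = np.random.default_rng(seed)
    # One long ternary sequence; consecutive codewords overlap, shifted by 2 samples
    seq = rng.standard_normal(codeword_len + 2 * (num_codewords - 1))
    ternary = np.where(np.abs(seq) < 1.2, 0.0, np.sign(seq))   # values in {-1, 0, +1}
    return np.array([ternary[2 * i: 2 * i + codeword_len] for i in range(num_codewords)])

cb = make_stochastic_codebook()
print(cb.shape, np.unique(cb))   # (512, 60) [-1.  0.  1.]
```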

Denoting the excitation vector as v^(i), the periodic component obtained in the first stage is v^(0), and v^(1) is the stochastic component obtained in the second stage. In closed-loop searching, the reconstructed speech can be represented as

ŝ = H (v^(i) + u) + s_0

where u is equal to zero at the first stage and to v^(0) at the second stage, and s_0 is the zero-input response of the LPC reconstruction filter. The matrix H is the truncated unit impulse-response matrix of the LPC reconstruction filter:

H = | h(0)     0        ...  0      |
    | h(1)     h(0)     ...  0      |
    | ...      ...      ...  ...    |
    | h(L-1)   h(L-2)   ...  h(0)   |

where L is the length of the subframe (so multiplication by H simply represents a convolution). Similarly, defining W as the unit impulse-response matrix of the perceptual weighting filter, the perceptually weighted error of the reconstructed speech is

e = W (s - ŝ) = W (s - s_0 - H (v^(i) + u))

where s is the vector of original speech samples in the current subframe.

The codebook searching process is to find a codeword y^(i) in the codebook and a corresponding gain a^(i) such that v^(i) = a^(i) y^(i) and e e^T is minimized. To make the problem tractable, the adaptive and stochastic codebooks are searched sequentially. Denoting the quantized gain by â^(i), the criterion of codeword searching in the adaptive codebook or the stochastic codebook is to minimize e e^T over all y^(i), with

e = W (s - s_0 - H (u + â^(i) y^(i)))

The decoder of the CELP codec is a reverse process of the encoder. Because of the asymmetric complexity of vector quantization, the complexity on the decoder side is usually much lower.

G.723.1

G.723.1 is an ITU standard aimed at multimedia communication. It has been incorporated into H.324 for audio encoding in videoconferencing applications. G.723.1 is a dual-rate CELP-type speech coder that can work at bitrates of 5.3 kbps and 6.3 kbps.

G.723.1 uses many techniques similar to FS1016, discussed in the last section. The input speech is again 8 kHz, sampled in 16 - bit linear PCM format. The speech frame size is also 30 msec and is further divided into four equal - sized subframes. Order - 10 LPC coefficients are estimated in each subframe. LP coefficients are further converted to LSP vectors and quantized by predictive splitting VQ. LP coefficients are also used to form the perceptually weighted filter.

G.723.1 first uses an open-loop pitch estimator to get a coarse pitch estimate over an interval of every two subframes. Closed-loop pitch searching is then done in every speech subframe by searching in a range around the open-loop pitch. After LP filtering and removal of the harmonic components by LTP, the stochastic residue is quantized by Multi-pulse Maximum Likelihood Quantization (MP-MLQ) for the 6.3 kbps coder, which has slightly higher speech quality, or by Algebraic-Code-Excited Linear Prediction (ACELP) for the 5.3 kbps coder.

These two modes can be switched at any boundary of the 30 msec speech frames. In MP-MLQ, the contribution of the stochastic component is represented as a sequence of pulses

v(n) = Σ_{i=1}^{M} g_i δ(n - m_i)

where M is the number of pulses and g_i is the gain of the pulse at position m_i. The closed-loop search is done by minimizing

E = Σ_n [ r(n) - Σ_{i=1}^{M} g_i h(n - m_i) ]²

where r(n) is the speech component after perceptual weighting and after eliminating the zero-response and periodic-component contributions, and h(n) is the unit impulse response of the perceptually weighted synthesis filter. Based on methods similar to those presented in the last section, we can sequentially optimize the gain and position of each pulse. Say we first assume there is only one pulse and find its best gain and position. After removing the contribution of this pulse, we can find the next optimal pulse by the same method. This process is repeated until all M pulses are obtained.
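A sketch of this greedy pulse-by-pulse search is shown below. The impulse response h, the number of pulses, and the toy target are illustrative assumptions; the loop fits the best position and gain for one pulse, subtracts its contribution, and repeats.

```python
import numpy as np

def mp_mlq(r, h, num_pulses):
    target, pulses = r.copy(), []
    for _ in range(num_pulses):
        best = (0, 0.0, -np.inf)                  # (position, gain, error reduction)
        for m in range(len(r)):
            hm = np.zeros(len(r))
            hm[m:] = h[:len(r) - m]               # response of a unit pulse at position m
            denom = np.dot(hm, hm)
            if denom == 0.0:
                continue
            g = np.dot(target, hm) / denom        # optimal gain for this position
            reduction = np.dot(target, hm) ** 2 / denom
            if reduction > best[2]:
                best = (m, g, reduction)
        m, g, _ = best
        target[m:] -= g * h[:len(r) - m]          # remove this pulse's contribution
        pulses.append((m, g))
    return pulses

h = 0.8 ** np.arange(60)                          # toy decaying impulse response
r = np.zeros(60); r[[5, 20]] = [1.0, -0.5]        # two "true" pulses
print(mp_mlq(np.convolve(r, h)[:60], h, num_pulses=2))
```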

The stochastic codebook structure for the ACELP model is different from FS1016. The following table shows the ACELP excitation codebook:

Amplitude    Positions
±1           0, 8, 16, 24, 32, 40, 48, 56
±1           2, 10, 18, 26, 34, 42, 50, 58
±1           4, 12, 20, 28, 36, 44, 52, 60
±1           6, 14, 22, 30, 38, 46, 54, 62

There are only four pulses. Each can be in eight positions, coded by three bits each. The sign of each pulse takes one bit, and another bit shifts all possible positions to odd. Thus, the index of a codeword has 17 bits. Because of the special structure of the algebraic codebook, a fast algorithm exists for efficient codeword searching.
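As a toy illustration of this bit budget (the packing order below is an assumption, not the actual G.723.1 bitstream layout), the 17-bit index could be assembled like this:

```python
def pack_acelp_index(positions, signs, shift_to_odd):
    """positions: four values in 0..7, signs: four values in {0, 1}, shift_to_odd: 0 or 1."""
    index = shift_to_odd                      # 1 grid-shift bit
    for pos, sign in zip(positions, signs):
        index = (index << 3) | pos            # 3 position bits per pulse
        index = (index << 1) | sign           # 1 sign bit per pulse
    return index                              # 1 + 4*(3 + 1) = 17 bits

print(bin(pack_acelp_index([7, 2, 5, 0], [1, 0, 0, 1], 1)))
```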

Besides the CELP coder we discussed above, there are many other CELP - type codecs, developed mainly for wireless communication systems. The basic concepts of these coders are similar, except for different implementation details on parameter analysis and codebook structuring.

Some examples include the 12.2 kbps GSM Enhanced Full Rate (EFR) algebraic CELP codec and the IS-641 EFR codec, designed for the North American digital cellular IS-136 TDMA system. G.728 is a low-delay CELP speech coder. G.729 is another CELP-based ITU standard aimed at toll-quality speech communications. G.729 is a Conjugate-Structure Algebraic-Code-Excited Linear Prediction (CS-ACELP) codec. It uses a 10 msec speech analysis frame and thus has lower delay than G.723.1, which uses a 30 msec speech frame. G.729 also has some inherent protection schemes to deal with packet loss in applications such as VoIP.

Hybrid Excitation Vocoders

Hybrid excitation vocoders are another large class of speech coders. They differ from CELP, in which the excitation is represented as the contributions of the adaptive and stochastic codewords. Instead, hybrid excitation coders use model-based methods to introduce a multi-model excitation.

MBE. The Multi-Band Excitation (MBE) vocoder was developed by MIT's Lincoln Laboratory. The 4.15 kbps IMBE codec has become the standard for INMARSAT. MBE is also a blockwise codec, in which speech analysis is done on a speech frame of about 20 to 30 msec. In the analysis part of the MBE coder, a spectrum analysis such as an FFT is first applied to the windowed speech of the current frame.

The short-time speech spectrum is further divided into different spectrum bands. The bandwidth is usually an integer multiple of the basic frequency, which equals the inverse of the pitch. Each band is described as "voiced" or "unvoiced". The parameters of the MBE coder thus include the spectrum envelope, the pitch, and the unvoiced/voiced (U/V) decisions for the different bands. Depending on the bitrate demands, the phase of the spectrum can be parameterized or discarded. In the speech decoding process, voiced bands and unvoiced bands are synthesized by different schemes and combined to generate the final output. MBE utilizes the analysis-by-synthesis scheme in parameter estimation.

Parameters such as the basic frequency, the spectrum envelope, and the subband U/V decisions are all estimated via closed-loop searching. The criterion of the closed-loop optimization is to minimize the perceptually weighted reconstructed speech error, which can be represented in the frequency domain as

ε = (1 / 2π) ∫_{-π}^{π} G(ω) |S_w(ω) - S_wr(ω)|² dω

where S_w(ω) and S_wr(ω) are the original and reconstructed speech short-time spectra, and G(ω) is the spectrum of the perceptual weighting filter.

Similar to the closed-loop searching scheme in CELP, a sequential optimization method is used to make the problem tractable. In the first step, all bands are assumed to be voiced, and the spectrum envelope and basic frequency are estimated. Rewriting the spectrum error under the all-voiced assumption, we have

ε = (1 / 2π) Σ_{m=1}^{M} ∫_{α_m}^{β_m} |S_w(ω) - A_m E_wr(ω)|² dω

in which M is the number of bands in [0, π], A_m is the spectrum envelope of band m, E_wr(ω) is the short-time window spectrum, and α_m = (m - 1/2) ω_0, β_m = (m + 1/2) ω_0, with ω_0 the basic frequency. Minimizing this error gives the envelope of each band as

A_m = [ ∫_{α_m}^{β_m} S_w(ω) E_wr*(ω) dω ] / [ ∫_{α_m}^{β_m} |E_wr(ω)|² dω ]

The basic frequency is obtained at the same time by searching over a frequency interval to minimize ε. Based on the estimated spectrum envelope, an adaptive thresholding scheme tests the degree of matching for each band. We label a band as voiced if there is a good match; otherwise, we declare the band unvoiced and re-estimate the envelope of the unvoiced band as

A_m = [ ∫_{α_m}^{β_m} |S_w(ω)|² dω / (β_m - α_m) ]^(1/2)

The decoder uses separate methods to synthesize unvoiced and voiced speech, based on the unvoiced and voiced bands. The two types of reconstructed components are then combined to generate the synthesized speech for the frame. The final step is to overlap-add the synthesized speech of successive frames to produce the final output.

MELP. The Mixed Excitation Linear Prediction (MELP) speech codec is a newer U.S. federal standard to replace the old LPC-10 (FS1015) standard, with an application focus on low-bitrate secure communications. At 2.4 kbps, MELP delivers speech quality comparable to the 4.8 kbps DOD-CELP (FS1016) and good robustness in noisy environments.

MELP is also based on LPC analysis. Unlike the hard-decision voiced/unvoiced model adopted in LPC-10, MELP uses a multiband soft-decision model for the excitation signal. The LP residue is band-pass filtered, and a voicing strength parameter is estimated for each band. The decoder reconstructs the excitation signal by combining periodic pulses and white noise, based on the voicing strength in each band. Speech can then be reconstructed by passing the excitation through the LPC synthesis filter.

Unlike MBE, MELP divides the excitation into five fixed bands of 0-500, 500-1000, 1000-2000, 2000-3000, and 3000-4000 Hz. It estimates a voicing strength parameter in each band, based on the normalized correlation function of the speech signal and of the smoothed, rectified signal in the non-DC band. Let s_k(n) denote the speech signal in band k, and u_k(n) the DC-removed, smoothed, rectified version of s_k(n). The correlation function is defined as

R_x(P) = [ Σ_{n=0}^{N-1} x(n) x(n + P) ] / [ Σ_{n=0}^{N-1} x²(n) · Σ_{n=0}^{N-1} x²(n + P) ]^(1/2)

where P is the pitch of the current frame, N is the frame length, and x stands for either s_k or u_k. The voicing strength for band k is then defined as max(R_sk(P), R_uk(P)).
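A sketch of this per-band voicing measure is given below; the smoothing length used for the rectified envelope is an illustrative assumption.

```python
import numpy as np

def norm_corr(x, P):
    a, b = x[:-P], x[P:]
    return np.dot(a, b) / (np.sqrt(np.dot(a, a) * np.dot(b, b)) + 1e-12)

def band_voicing_strength(band_signal, P, smooth_len=20):
    env = np.convolve(np.abs(band_signal), np.ones(smooth_len) / smooth_len, "same")
    env -= env.mean()                                  # remove the DC component
    return max(norm_corr(band_signal, P), norm_corr(env, P))

n = np.arange(360)
print(band_voicing_strength(np.sin(2 * np.pi * n / 80), P=80))   # near 1: strongly voiced
```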

To further remove the buzziness of traditional LPC-10 coders in voiced speech segments, MELP adopts a jittery voiced state to model marginally voiced speech segments. The jittery state is indicated by an aperiodic flag. If the aperiodic flag is set at the analysis end, the receiver adds a random shift to the periodic pulse excitation; the shift can be as large as P/4. The jittery state is determined by the peakiness of the full-wave-rectified LP residue e(n):

peakiness = [ (1/N) Σ_{n=1}^{N} e²(n) ]^(1/2) / [ (1/N) Σ_{n=1}^{N} |e(n)| ]

If the peakiness is greater than some threshold, the frame is classified as jittered. To better reconstruct the short-time spectrum of the speech signal, the spectrum of the residue signal is not assumed to be flat, as it is in the LPC-10 coder. After normalizing the LP residue signal, MELP preserves the magnitudes corresponding to the first min(10, P/4) harmonics of the basic frequency (the basic frequency is the inverse of the pitch period). The higher harmonics are discarded and assumed to have a flat (unity) spectrum.
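The peakiness measure itself is a one-liner; the sketch below uses an assumed threshold purely to illustrate the jittered-frame decision.

```python
import numpy as np

def peakiness(e):
    e = np.abs(e)                                       # full-wave rectified residue
    return np.sqrt(np.mean(e ** 2)) / (np.mean(e) + 1e-12)

rng = np.random.default_rng(4)
smooth = rng.standard_normal(180)
spiky = smooth.copy(); spiky[::45] += 10.0              # a few strong pulses
print(peakiness(smooth), peakiness(spiky))              # roughly 1.25 vs. clearly larger
print(peakiness(spiky) > 1.34)                          # assumed threshold: frame is jittered
```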

The 10-dimensional magnitude vector is quantized with 8-bit vector quantization, using a perceptually weighted distance measure. As in most modern LPC quantization schemes, MELP also converts the LPC parameters to LSF and uses four-stage vector quantization; the bits allocated to the four stages are 7, 6, 6, and 6, respectively. Apart from integral pitch estimation similar to LPC-10, MELP applies a fractional pitch refinement procedure to improve the accuracy of the pitch estimate.

In the speech reconstruction process, MELP does not use a simple periodic pulse to represent the periodic excitation signal; it uses a dispersed waveform. To disperse the pulses, a finite impulse response (FIR) filter is applied to them. MELP also applies a perceptual-weighting post-filter to the reconstructed speech, so as to suppress quantization noise and improve the subjective speech quality.

