Research Topic:
Analysis-by-Synthesis Multimode Harmonic Speech Coding

Objective:
Develop new techniques that allow a harmonic speech coder to achieve toll quality at around 4 kb/s.

Problem Statement and Proposed Solutions:

(1) Speech Model:
Sinusoidal-model-based harmonic coders are well suited to representing the quasi-periodic signals typical of voiced speech and the noise-like signals typical of unvoiced speech. However, the model is ineffective for transition regions such as voicing onsets and offsets, plosives, and non-periodic pulses. To improve model accuracy for these segments, we proposed a frequency-domain model for transition speech. This model preserves the temporal information that is important to the perception of transition speech, and it is amenable to a closed-loop analysis-by-synthesis procedure for parameter estimation. To further improve the harmonic representation of voiced speech, we also proposed a new voicing model that accounts for the pitch jitter of natural speech.
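
As a point of reference for the voiced case, the basic harmonic model can be sketched as a sum of sinusoids at integer multiples of the pitch frequency. This is a minimal illustration of the generic sinusoidal model only, not the AbS-MHC synthesizer: the function name, the 8 kHz sampling rate, and the example magnitudes and phases are assumptions chosen for the example, and the transition and voicing models described above are not included.

```python
import math


def synthesize_harmonic_frame(f0_hz, magnitudes, phases, n_samples, fs_hz=8000.0):
    """Generic harmonic model: s[n] = sum_k A_k * cos(k * w0 * n + phi_k).

    Hypothetical helper for illustration; fs_hz = 8000 is an assumed
    narrowband sampling rate.
    """
    w0 = 2.0 * math.pi * f0_hz / fs_hz  # pitch frequency in radians per sample
    frame = []
    for n in range(n_samples):
        sample = 0.0
        # Harmonic index k starts at 1 (the fundamental).
        for k, (a_k, phi_k) in enumerate(zip(magnitudes, phases), start=1):
            sample += a_k * math.cos(k * w0 * n + phi_k)
        frame.append(sample)
    return frame


# A 20 ms frame (160 samples at 8 kHz) with a 200 Hz pitch and two harmonics.
frame = synthesize_harmonic_frame(200.0, [1.0, 0.5], [0.0, 0.0], 160)
```

With a 200 Hz pitch at 8 kHz, the frame repeats every 40 samples, which is the quasi-periodic structure the model captures well and which a plosive or onset does not have.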

(2) Model Parameter Estimation:
Because harmonic coders are parametric coders, errors in estimating the model parameters can significantly degrade speech quality. One way to improve the accuracy and robustness of parameter estimation is to use closed-loop analysis-by-synthesis techniques. However, low-rate harmonic coders transmit no phase information, so the synthesized speech loses time alignment with the original speech. This loss of alignment makes waveform matching difficult and interferes with time-domain closed-loop parameter estimation. We proposed a generalized time-domain analysis-by-synthesis parameter estimation scheme within the harmonic coding framework. The scheme uses a time-scale signal modification technique to make waveform matching possible in harmonic coding. The concept is demonstrated in our Analysis-by-Synthesis Multimode Harmonic Coder (AbS-MHC), which includes a specific method for efficient closed-loop pitch estimation and speech classification.
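
To illustrate why alignment matters for waveform matching, the sketch below searches for the integer shift that best aligns a synthesized frame with the target before the two are compared. This is a deliberately simplified stand-in: it uses a plain shift search with normalized cross-correlation, not the time-scale signal modification technique used in AbS-MHC, and the function name, search range, and test signal are all assumptions.

```python
import math


def best_alignment_shift(target, synth, max_shift):
    """Return (shift, score): the integer shift of `synth` that maximizes
    its normalized cross-correlation with `target` over the overlap."""
    best_shift, best_score = 0, -math.inf
    for shift in range(-max_shift, max_shift + 1):
        num = energy_t = energy_s = 0.0
        for i in range(len(target)):
            j = i + shift
            if 0 <= j < len(synth):  # only samples where both signals exist
                num += target[i] * synth[j]
                energy_t += target[i] ** 2
                energy_s += synth[j] ** 2
        if energy_t > 0.0 and energy_s > 0.0:
            score = num / math.sqrt(energy_t * energy_s)
            if score > best_score:
                best_shift, best_score = shift, score
    return best_shift, best_score


# Hypothetical example: the "synthesized" frame is the target delayed by 3 samples.
target = [0.0] * 10 + [1.0, 0.8, -0.5, 0.3] + [0.0] * 10
synth = [0.0] * 3 + target[:-3]
shift, score = best_alignment_shift(target, synth, max_shift=5)
```

Without the shift, a sample-by-sample error between `target` and `synth` would be large even though the two waveforms are identical in shape, which is the difficulty a closed-loop estimator faces when phase is not transmitted.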

(3) Parameter Quantization:
The parameters that must be quantized in a harmonic coder are generally the LPC parameters (LSFs), pitch, voicing information, harmonic spectral magnitudes, and gains. Efficient quantization of the LSFs is no longer a problem: in our experiments, a spectral distortion (SD) below 1 dB is obtained with 22-24 bits per 20 ms for narrowband speech. Among all the parameters, the harmonic spectral magnitudes are the most challenging to quantize. The spectral magnitude vector is obtained by sampling the speech magnitude spectrum, or the LP residual magnitude spectrum, at multiples of the pitch frequency, so its dimension varies with the pitch from frame to frame. Standard fixed-dimension VQ techniques are therefore difficult to apply directly. We studied a quantization technique called Weighted Non-Square Transform Vector Quantization (WNSTVQ), which addresses the problems of variable-dimension vector quantization by combining a fixed-dimension vector quantizer with a variable-sized non-square transform. We show that WNSTVQ is the generalized form of all linear dimension conversion methods. We find that the choice of transform, the choice of fixed dimension, and the number of codebooks determine the tradeoffs among complexity, memory requirements, and performance across WNSTVQ systems.
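
The dimension-conversion idea can be sketched with a truncated or zero-padded DCT-II used as the non-square transform. This is one plausible construction under stated assumptions, not the WNSTVQ design itself: the orthonormal DCT-II scaling, the fixed dimension of 8, and all function names are hypothetical, and the perceptual weighting and the codebook search are omitted.

```python
import math


def nonsquare_dct(n_dim, fixed_dim):
    """fixed_dim x n_dim matrix: the first fixed_dim rows of the orthonormal
    n_dim-point DCT-II, with all-zero rows added when fixed_dim > n_dim."""
    rows = []
    for m in range(fixed_dim):
        if m >= n_dim:
            rows.append([0.0] * n_dim)  # zero padding
            continue
        scale = math.sqrt((1.0 if m == 0 else 2.0) / n_dim)
        rows.append([scale * math.cos(math.pi * (n + 0.5) * m / n_dim)
                     for n in range(n_dim)])
    return rows


def to_fixed_dimension(x, fixed_dim):
    """Map a variable-dimension vector x to a fixed-dimension vector y = B x."""
    b = nonsquare_dct(len(x), fixed_dim)
    return [sum(coef * value for coef, value in zip(row, x)) for row in b]


def to_variable_dimension(y, n_dim):
    """Map a fixed-dimension vector back to n_dim samples with x = B^T y."""
    b = nonsquare_dct(n_dim, len(y))
    return [sum(b[m][n] * y[m] for m in range(len(y))) for n in range(n_dim)]


# 5 harmonics -> fixed dimension 8 -> back to 5 harmonics.
x_short = [3.0, 1.0, 4.0, 1.0, 5.0]
y_short = to_fixed_dimension(x_short, 8)
x_short_back = to_variable_dimension(y_short, 5)

# 12 harmonics with a flat spectrum -> fixed dimension 8 -> back to 12.
x_flat = [2.0] * 12
x_flat_back = to_variable_dimension(to_fixed_dimension(x_flat, 8), 12)
```

In a coder, the fixed-dimension vector would be quantized with an ordinary fixed-dimension codebook between the two mappings. When the number of harmonics exceeds the fixed dimension, this particular transform discards the highest-order DCT coefficients, so the round trip is exact only for sufficiently smooth magnitude vectors such as the flat example.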

Demos: (.wav files)

Female 1: sentence1, sentence2, sentence3
Male 1: sentence1, sentence2, sentence3
Female 2: sentence1, sentence2, sentence3
Male 2: sentence1, sentence2, sentence3

Note:
Sentence1: Original speech sentence (modified IRS filtered speech sentence)

Sentence2: AbS-MHC coder output without quantization. Features of the AbS-MHC coder include: an enhanced frequency-domain transition model used in conjunction with sinusoidal-model-based harmonic coding of voiced and unvoiced speech; a new voicing model for voiced speech; closed-loop pitch estimation using combined time-domain and frequency-domain pitch candidates; and closed-loop pitch estimation and classification using a time-scale signal modification technique.

Sentence3: AbS-MHC coder output with quantized LSFs (24 bits/20 ms), quantized harmonic spectral magnitudes (14 bits/10 ms, DCT-II-based WNSTVQ), and quantized gain (6 bits/10 ms).
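
For orientation, the bit allocations listed for Sentence3 can be converted to bit rates with a few lines of arithmetic. The figures are taken from the note above; the dictionary layout and variable names are only illustrative, and pitch and voicing bits are not included because no allocation is given for them here.

```python
# (bits per update, update interval in ms) for each quantized parameter,
# as listed for Sentence3.
allocations = {
    "LSFs": (24, 20),
    "harmonic magnitudes": (14, 10),
    "gain": (6, 10),
}

# Bit rate in bits per second = bits * (1000 ms / interval in ms).
rates_bps = {name: bits * 1000 // interval_ms
             for name, (bits, interval_ms) in allocations.items()}
total_bps = sum(rates_bps.values())

for name, rate in rates_bps.items():
    print(f"{name}: {rate} b/s")
print(f"total for listed parameters: {total_bps} b/s")
```

The three listed parameters account for 1200 + 1400 + 600 = 3200 b/s. Measured against a nominal 4000 b/s target, that would leave roughly 800 b/s for the parameters not quantized in this demo.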