Acoustic Operations and Foundations of Speech Processing
Core signal processing operations that underpin all speech recognition and synthesis systems.
Introduction
ASR (Automatic Speech Recognition) systems transform audio waveforms into text. At the heart of this process lies the acoustic model—a statistical mapping from audio features to linguistic units called phonemes.
What is a Phoneme?
A phoneme is the smallest unit of sound that distinguishes meaning. English has ~44 phonemes, French has 37, Arabic has 32:
English and French sound rich because they have many distinct vowel and consonant sounds. Arabic has fewer basic sounds, but they change shape depending on the surrounding sounds. So English and French are rich in number, while Arabic is rich in variation.
| Type | Examples | Words |
|---|---|---|
| Consonants | /b/, /p/, /t/, /k/ | bat, pat, tap, cap |
| Vowels | /æ/, /ɪ/, /uː/ | cat, sit, boot |
Changing one phoneme changes the word: “bat” → “mat” (/b/ → /m/). We will cover phonemes in more detail in a later post.
The Acoustic Model Pipeline
Audio → Framing → Windowing → FFT → Mel Filterbank → Log → DCT → MFCCs → Model → Phonemes
This article covers the mathematical operations in this pipeline. Understanding these fundamentals is essential—neural networks don’t replace DSP (Digital Signal Processing), they build upon it.
0. Digital Signal Processing Foundations Recap
These operations are covered in detail in previous posts. Here’s the quick reference for speech processing:
Convolution $(x * h)[n]$
— how filters and systems transform signals. Convolution in time = multiplication in frequency (FFT—Fast Fourier Transform—speedup).
Correlation $R_{xx}[n]$
— measures self-similarity at different lags. Peaks in autocorrelation reveal pitch period.
DFT (Discrete Fourier Transform) $X[k] = \sum_n x[n] e^{-i2\pi kn/N}$
— decomposes signal into frequency components. Magnitude = “how much”, phase = “when”.
Windowing
— When we extract a finite frame from a continuous signal, the abrupt edges create artificial discontinuities. The DFT assumes periodicity, so these sharp cuts cause spectral leakage—energy spreading into adjacent frequency bins where it shouldn’t be.
A window function tapers the frame smoothly to zero at the edges:
| Window | Main Lobe Width | Side Lobe Level | Use Case |
|---|---|---|---|
| Rectangular | Narrowest | Highest (-13 dB) | Maximum frequency resolution |
| Hamming | Medium | Low (-43 dB) | General speech analysis |
| Hann | Medium | Moderate (-31 dB) | Spectral analysis |
| Blackman | Widest | Lowest (-58 dB) | When leakage must be minimal |
Trade-off: Narrower main lobe = better frequency resolution. Lower side lobes = less leakage. You can’t optimize both—this is the time-frequency uncertainty principle.
For speech processing, Hamming window is the standard choice: good balance between frequency resolution and leakage suppression.
STFT (Short-Time Fourier Transform) $X[m,k]$ — sliding window DFT producing spectrograms. For speech: 20-30 ms frames, 10 ms hop. Time-frequency uncertainty: $\Delta t \cdot \Delta f \geq \frac{1}{4\pi}$.
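As a concrete reference for these parameters, here is a minimal NumPy sketch of frame extraction plus windowed FFT; the 25 ms frame, 10 ms hop, Hamming window, and 512-point FFT are the typical speech settings mentioned above, not requirements:

```python
import numpy as np

def stft_frames(x, fs, frame_ms=25, hop_ms=10, n_fft=512):
    """Slice x into overlapping frames, apply a Hamming window, take the FFT."""
    frame_len = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    window = np.hamming(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    spectra = np.stack([
        np.fft.rfft(window * x[i * hop:i * hop + frame_len], n=n_fft)
        for i in range(n_frames)
    ])
    return spectra  # shape: (n_frames, n_fft // 2 + 1), complex-valued

# Example: one second of 16 kHz audio -> 98 frames of 257 frequency bins
x = np.random.randn(16000)
print(stft_frames(x, fs=16000).shape)
```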
1. Filtering - Frequency Selection
With these foundational operations established, we can now build speech processing systems. The first step is often filtering—selectively passing or blocking certain frequencies to prepare the signal for analysis.
Figure 1.0: Filter types and their frequency responses
Filter Types
| Type | Passes | Blocks | Use Case |
|---|---|---|---|
| Low-pass | $f < f_c$ | $f > f_c$ | Anti-aliasing, smoothing |
| High-pass | $f > f_c$ | $f < f_c$ | Remove DC, pre-emphasis |
| Band-pass | $f_1 < f < f_2$ | else | Filterbank channels |
| Band-stop | $f < f_1$ or $f > f_2$ | $f_1 < f < f_2$ | Notch filter (remove hum) |
FIR vs IIR
Digital filters come in two fundamental types, distinguished by whether they use feedback.
FIR (Finite Impulse Response): A filter whose output depends only on current and past inputs—no feedback.
\[y[n] = \sum_{k=0}^{M} b_k \cdot x[n-k]\]
Intuition: Output is a weighted sum of the current and past $M$ input samples only. If you feed an impulse (single spike), the output dies after $M$ samples—hence “finite.”
| Property | Explanation |
|---|---|
| Always stable | No feedback → no risk of runaway oscillation |
| Linear phase | Symmetric coefficients preserve waveform shape |
| Higher order needed | Need many taps for sharp cutoffs |
IIR (Infinite Impulse Response): A filter with feedback—output depends on past outputs too.
\[y[n] = \sum_{k=0}^{M} b_k \cdot x[n-k] - \sum_{k=1}^{N} a_k \cdot y[n-k]\]
Intuition: Output depends on past inputs AND past outputs (feedback). An impulse can theoretically ring forever—hence “infinite.” The feedback creates resonances that achieve sharp frequency responses with fewer coefficients.
| Property | Explanation |
|---|---|
| Can be unstable | Feedback can cause output to explode if poles outside unit circle |
| Lower order | Feedback provides “free” filtering; 2nd-order IIR ≈ 50th-order FIR |
| Phase distortion | Different frequencies delayed differently (problematic for some applications) |
IIR filters can become unstable if not designed carefully. Always verify that all poles are inside the unit circle in the z-plane.
In speech processing:
- FIR: Mel filterbanks (linear phase preserves temporal structure)
- IIR: Pre-emphasis (simple 1st-order, low latency)
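To make the FIR/IIR distinction concrete, here is a small sketch using scipy.signal; the sampling rate, cutoff, and filter orders are illustrative choices, not values fixed by this article:

```python
import numpy as np
from scipy import signal

fs = 16000  # assumed sampling rate

# FIR low-pass: 101 taps, linear phase, always stable
fir_taps = signal.firwin(101, cutoff=4000, fs=fs)

# IIR low-pass: 4th-order Butterworth, far fewer coefficients
b, a = signal.butter(4, Wn=4000, btype='low', fs=fs)

# Stability check for the IIR design: all poles inside the unit circle
poles = np.roots(a)
assert np.all(np.abs(poles) < 1.0), "unstable IIR filter"

# Apply both to a signal (for the FIR case the denominator is just 1.0)
x = np.random.randn(fs)              # stand-in for one second of audio
y_fir = signal.lfilter(fir_taps, 1.0, x)
y_iir = signal.lfilter(b, a, x)
```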
Pre-emphasis Filter
\[y[n] = x[n] - \alpha x[n-1], \quad \alpha \approx 0.97\]
Transfer function: $H(z) = 1 - \alpha z^{-1}$
Frequency response: $\lvert H(e^{i\omega}) \rvert = \sqrt{1 + \alpha^2 - 2\alpha\cos\omega}$
Boosts high frequencies ~6 dB/octave.
Why Pre-emphasis?
The glottal source (vocal cord vibration) has a natural spectral tilt: energy decreases ~20 dB/decade at higher frequencies. Pre-emphasis compensates for this, giving high-frequency formants more weight in analysis. Without it, low frequencies would dominate MFCC computation.
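A short NumPy sketch of this filter (how the first sample is handled is an implementation choice):

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    # y[n] = x[n] - alpha * x[n-1]; the first sample is passed through unchanged
    return np.append(x[0], x[1:] - alpha * x[:-1])
```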
2. Cepstral Analysis
Now that we can filter and shape the spectrum, we need a way to separate the two main components of speech: the excitation source (vocal cords) and the vocal tract filter. The cepstrum provides exactly this capability.
Figure 2.0: Cepstrum separating source and filter
Homomorphic Deconvolution
Speech = source convolved with filter: $s[n] = e[n] * h[n]$
In frequency: $S(f) = E(f) \cdot H(f)$
Take log: $\log S = \log E + \log H$
Cepstrum: inverse DFT of log magnitude spectrum
\[c[n] = \mathcal{F}^{-1}\{\log \lvert X[k] \rvert\}\]
Quefrency Domain
Quefrency is the independent variable in the cepstral domain—it has units of time (samples or milliseconds) but represents “rate of change in the spectrum.” The name is an anagram of “frequency,” following the cepstrum/spectrum wordplay.
- Low quefrency: slow spectral variations (vocal tract = formants)
- High quefrency: fast spectral variations (pitch harmonics)
Liftering
Keep only low-quefrency components:
\[\hat{c}[n] = c[n] \cdot l[n]\]
where $l[n]$ is a low-pass lifter.
Cepstrum Intuition
Etymology: “Cepstrum” is an anagram of “spectrum”—we’re analyzing the spectrum of a spectrum.
Separation principle: The vocal tract (slow-varying formants) appears at low quefrencies. The pitch harmonics (fast-varying) appear at high quefrencies. Liftering removes pitch, leaving vocal tract shape—the basis for speaker-independent recognition.
Connection to MFCCs: MFCCs are essentially cepstral coefficients computed on a mel-warped spectrum. The DCT decorrelates the log mel energies, producing a compact representation of the spectral envelope.
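A minimal NumPy sketch of the real cepstrum and low-quefrency liftering; the FFT size, Hamming window, and the cutoff of 30 cepstral coefficients are assumptions for illustration:

```python
import numpy as np

def real_cepstrum(frame, n_fft=512):
    """Real cepstrum of a frame: inverse DFT of the log magnitude spectrum."""
    spectrum = np.fft.rfft(frame * np.hamming(len(frame)), n=n_fft)
    log_mag = np.log(np.abs(spectrum) + 1e-10)      # floor avoids log(0)
    return np.fft.irfft(log_mag, n=n_fft)

def lifter_envelope(cepstrum, n_keep=30, n_fft=512):
    """Keep only low-quefrency coefficients, then return to the log-spectral
    domain to obtain the smooth vocal-tract envelope."""
    liftered = np.zeros_like(cepstrum)
    liftered[:n_keep] = cepstrum[:n_keep]
    liftered[-n_keep + 1:] = cepstrum[-n_keep + 1:]  # keep the symmetric part
    return np.fft.rfft(liftered, n=n_fft).real       # smoothed log spectrum
```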
3. Mel-Frequency Analysis
The cepstrum works on linear frequency. But human hearing doesn’t perceive frequencies linearly—we’re more sensitive to differences at low frequencies than high. The mel scale models this perception, leading to MFCCs (Mel-Frequency Cepstral Coefficients)—the most widely used features in speech recognition.
Figure 3.0: Mel filterbank on linear frequency axis
Mel Scale
\[m = 2595 \cdot \log_{10}\left(1 + \frac{f}{700}\right)\]
\[f = 700 \cdot \left(10^{m/2595} - 1\right)\]
Perceptual motivation: Equal mel intervals = equal perceived pitch intervals.
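These two formulas translate directly into code; the quick check below illustrates the perceptual warping (the specific frequencies are arbitrary examples):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# The same 100 Hz step is perceptually much larger at low frequencies:
print(hz_to_mel(200) - hz_to_mel(100))    # ~133 mel
print(hz_to_mel(4100) - hz_to_mel(4000))  # ~24 mel
```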
Mel Filterbank
Triangular filters uniformly spaced in mel domain:
\[H_m[k] = \begin{cases} 0 & k < f[m-1] \\ \frac{k - f[m-1]}{f[m] - f[m-1]} & f[m-1] \leq k < f[m] \\ \frac{f[m+1] - k}{f[m+1] - f[m]} & f[m] \leq k < f[m+1] \\ 0 & k \geq f[m+1] \end{cases}\]
Filterbank Energies
\[E_m = \sum_{k=0}^{N/2} \lvert X[k] \rvert^2 \cdot H_m[k]\]
MFCC Computation
- Compute power spectrum: $\lvert X[k] \rvert^2$
- Apply mel filterbank: $E_m$
- Log compress: $\log E_m$
- DCT: $c_i = \sum_{m=1}^{M} \log E_m \cdot \cos\left(\frac{\pi i (m-0.5)}{M}\right)$
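The four steps above fit into a short NumPy/SciPy sketch for a single analysis frame (windowing is applied inside); the FFT size, 26 filters, and 13 retained coefficients are common defaults, not fixed requirements:

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_from_frame(frame, sample_rate, n_filters=26, n_mfcc=13, n_fft=512):
    # 1. Power spectrum of a Hamming-windowed frame
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n=n_fft)) ** 2

    # 2. Mel filterbank: triangular filters uniformly spaced on the mel scale
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_points = np.linspace(0, hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    energies = fbank @ spectrum

    # 3. Log compression (small floor avoids log(0))
    log_energies = np.log(np.maximum(energies, 1e-10))

    # 4. DCT-II, keep the first n_mfcc coefficients
    return dct(log_energies, type=2, norm='ortho')[:n_mfcc]

frame = np.random.randn(400)                 # stand-in for a 25 ms frame at 16 kHz
print(mfcc_from_frame(frame, 16000).shape)   # (13,)
```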
Why DCT for MFCCs?
Decorrelation: Mel filterbank outputs are correlated (adjacent filters overlap). DCT produces uncorrelated coefficients, beneficial for diagonal-covariance GMMs.
Energy compaction: Most speech information concentrates in the first 12-13 coefficients. Higher coefficients represent fine spectral detail (often discarded).
Dynamic Features: Deltas and Delta-Deltas
Static MFCCs capture spectral shape at a single instant. Speech is inherently dynamic—phoneme transitions carry critical information.
Delta coefficients (velocity): first derivative of MFCCs across time:
\[\Delta c_t = \frac{\sum_{n=1}^{N} n(c_{t+n} - c_{t-n})}{2\sum_{n=1}^{N} n^2}\]
Delta-delta coefficients (acceleration): Second derivative, computed the same way on deltas.
| Coefficient Type | Captures | Example |
|---|---|---|
| Static MFCC | Spectral envelope | Vowel identity |
| Delta | Rate of change | Consonant-vowel transitions |
| Delta-delta | Acceleration | Emphasis, speaking rate |
Standard feature vector: 39 dimensions per frame
- 13 static (12 MFCCs + energy)
- 13 delta
- 13 delta-delta
The 39-dimensional MFCC+delta+delta-delta feature vector has been the de facto standard for speech recognition for decades. Even with modern neural approaches, it remains a strong baseline.
This captures both “what sound” and “how it’s changing”—essential for distinguishing coarticulated phonemes.
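A sketch of the regression formula above applied to a (frames × coefficients) MFCC matrix; the window size $N=2$ and edge padding are common conventions assumed here:

```python
import numpy as np

def add_deltas(features, N=2):
    """Append delta and delta-delta coefficients to a (frames, coeffs) array."""
    def delta(feat):
        denom = 2 * sum(n * n for n in range(1, N + 1))
        padded = np.pad(feat, ((N, N), (0, 0)), mode='edge')  # repeat edge frames
        out = np.zeros_like(feat)
        for t in range(feat.shape[0]):
            out[t] = sum(n * (padded[t + N + n] - padded[t + N - n])
                         for n in range(1, N + 1)) / denom
        return out
    d = delta(features)          # velocity
    dd = delta(d)                # acceleration
    return np.hstack([features, d, dd])   # (frames, 3 * coeffs), e.g. 39-dim
```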
4. Discrete Cosine Transform (DCT)
The DCT (Discrete Cosine Transform) is a transform similar to the DFT but uses only cosine functions, producing real-valued coefficients. It’s widely used in compression (JPEG, MP3) because it concentrates signal energy into fewer coefficients than the DFT.
Figure 4.0: DCT basis functions and energy compaction
Definition (DCT-II)
\[C[k] = \sum_{n=0}^{N-1} x[n] \cdot \cos\left(\frac{\pi k (2n+1)}{2N}\right)\]
Why DCT?
- Real-valued: No complex numbers
- Energy compaction: Most energy in first few coefficients
- Decorrelation: Approximates the KLT (Karhunen-Loève Transform, the optimal decorrelating transform) for Markov-1 signals
DCT vs DFT
| Property | DFT | DCT |
|---|---|---|
| Values | Complex | Real |
| Assumes | Periodic | Symmetric extension |
| Boundary | Discontinuity | Smooth |
| Compaction | Good | Better |
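A toy illustration of energy compaction: reconstruct a smooth, log-mel-like vector from only its first 5 DCT coefficients (the test vector is an arbitrary smooth example):

```python
import numpy as np
from scipy.fftpack import dct, idct

# Smooth 26-point vector standing in for log mel filterbank energies
x = np.log(np.abs(np.sin(np.linspace(0.1, 3.0, 26))) + 1.0)

C = dct(x, type=2, norm='ortho')
C_trunc = np.zeros_like(C)
C_trunc[:5] = C[:5]                         # keep only low-order coefficients
x_hat = idct(C_trunc, type=2, norm='ortho')

print("relative reconstruction error:",
      np.linalg.norm(x - x_hat) / np.linalg.norm(x))
```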
5. Linear Prediction (LPC)
MFCCs capture spectral shape through filterbanks. LPC (Linear Predictive Coding) takes a different approach: it models the vocal tract as an all-pole filter and finds coefficients that best predict the signal. This yields another powerful representation—one that’s particularly useful for speech coding and formant analysis.
Formants are the resonance frequencies of the vocal tract (labeled F1, F2, F3…). They determine vowel identity—for example, the difference between /i/ (“ee”) and /a/ (“ah”) is primarily in F1 and F2 positions.
Figure 5.0: Linear prediction as all-pole filter modeling
The Model
Predict current sample from past samples:
\[\hat{x}[n] = -\sum_{k=1}^{p} a_k \cdot x[n-k]\]
Prediction error: $e[n] = x[n] - \hat{x}[n]$
All-Pole Filter
\[H(z) = \frac{1}{1 + \sum_{k=1}^{p} a_k z^{-k}} = \frac{1}{A(z)}\]
Models vocal tract transfer function (resonances = formants).
Solving for Coefficients
Minimize mean squared error:
\[E = \sum_n e^2[n] = \sum_n \left(x[n] + \sum_{k=1}^{p} a_k x[n-k]\right)^2\]
Take derivatives, set to zero → Yule-Walker equations:
\[\sum_{k=1}^{p} a_k R[i-k] = -R[i], \quad i = 1, \ldots, p\]
where $R[k]$ is autocorrelation.
The Yule-Walker equations (named after statisticians George Udny Yule and Gilbert Walker) form a linear system that relates the LPC coefficients to the autocorrelation of the signal. The resulting matrix is Toeplitz—a special structure where each descending diagonal contains the same value. This structure enables efficient algorithms.
Levinson-Durbin Algorithm
Solving Yule-Walker directly requires $O(p^3)$ operations (matrix inversion). The Levinson-Durbin algorithm exploits the Toeplitz structure of the autocorrelation matrix to solve it in $O(p^2)$.
Key insight: The solution for order $i$ can be built from order $i-1$. We compute coefficients recursively:
Algorithm steps:
Initialize: $E_0 = R[0]$ (signal energy)
For each order $i = 1, 2, \ldots, p$:
Compute reflection coefficient: \(k_i = -\frac{R[i] + \sum_{j=1}^{i-1} a_j^{(i-1)} R[i-j]}{E_{i-1}}\) (the minus sign makes order 1 reduce to $a_1 = -R[1]/R[0]$, consistent with the Yule-Walker equations above)
Update coefficients: \(a_i^{(i)} = k_i\) \(a_j^{(i)} = a_j^{(i-1)} + k_i \cdot a_{i-j}^{(i-1)}, \quad j = 1, \ldots, i-1\)
Update prediction error: \(E_i = (1 - k_i^2) E_{i-1}\)
Output: Final coefficients $a_1, \ldots, a_p$
Reflection coefficients $k_i$:
These have a physical interpretation—they represent the reflection at each “stage” of a lattice filter (like acoustic reflections in a tube model of the vocal tract).
Stability guarantee: If $\lvert k_i \rvert < 1$ for all $i$, the filter is stable. This is always true when computed from a valid (positive definite) autocorrelation.
Unlike general IIR filter design, Levinson-Durbin always produces stable filters when starting from a valid autocorrelation sequence—no need for manual stability checks.
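A compact NumPy sketch of the recursion, following the sign conventions used above; the returned $a$ includes the leading 1 of $A(z)$, and the usage below (order 12 on a random stand-in frame) is illustrative rather than production code:

```python
import numpy as np

def levinson_durbin(r, p):
    """Solve the Yule-Walker equations of order p from autocorrelation r[0..p]."""
    a = np.zeros(p + 1)
    a[0] = 1.0                      # A(z) = 1 + a_1 z^-1 + ... + a_p z^-p
    E = r[0]                        # order-0 prediction error = signal energy
    k = np.zeros(p)                 # reflection coefficients
    for i in range(1, p + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k[i - 1] = -acc / E
        a[1:i] = a[1:i] + k[i - 1] * a[i - 1:0:-1]   # build order i from order i-1
        a[i] = k[i - 1]
        E *= (1.0 - k[i - 1] ** 2)                   # error shrinks at every order
    return a, E, k

# Usage on one (windowed) frame; order 12 is a common choice for narrowband speech
x = np.random.randn(400)
r = np.correlate(x, x, mode='full')[len(x) - 1:len(x) + 12]   # R[0..12]
a, E, k = levinson_durbin(r, 12)
assert np.all(np.abs(k) < 1.0)       # stability guarantee in practice
```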
Applications
- Speech coding (LPC-10, CELP)
- Formant estimation
- Speaker recognition
6. Fundamental Frequency (F0) Estimation
So far we’ve focused on the vocal tract (formants, spectral envelope). But the other critical component is the excitation source—specifically, the fundamental frequency or pitch.
F0 (Fundamental Frequency) is the rate at which the vocal cords vibrate during voiced speech—it determines the perceived pitch. F0 carries prosodic information: intonation, stress, emotion. Estimating it reliably is essential for many applications.
Figure 6.0: Pitch detection methods
Autocorrelation Method
Find first major peak in autocorrelation:
\[R[k] = \sum_n x[n] \cdot x[n+k]\]
Pitch period $T_0$ = lag of first major peak after $R[0]$.
F0 = $f_s / T_0$
Cepstral Method
Peak in cepstrum at quefrency = pitch period.
RAPT / YAAPT / DIO: more robust pitch trackers used in practice; they generate multiple pitch candidates per frame and smooth the track over time on top of the basic autocorrelation/cepstral ideas.
Typical Ranges
- Male: 80-200 Hz
- Female: 150-350 Hz
- Child: 200-500 Hz
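A minimal sketch of the autocorrelation method for a single voiced frame; there is no voicing decision or inter-frame smoothing, and the search range is an assumption:

```python
import numpy as np

def estimate_f0_autocorr(frame, fs, f0_min=60.0, f0_max=500.0):
    """Estimate F0 of one voiced frame via the autocorrelation method."""
    frame = frame - np.mean(frame)
    # Autocorrelation, non-negative lags only
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    # Search only lags corresponding to plausible pitch periods
    lag_min = int(fs / f0_max)
    lag_max = int(fs / f0_min)
    peak_lag = lag_min + np.argmax(r[lag_min:lag_max])
    return fs / peak_lag
```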
7. Modulation and Demodulation
Speech can be viewed as a slowly-varying envelope (amplitude modulation) riding on rapidly-varying carriers (formants). Extracting these modulations provides yet another perspective on the signal—one that connects to neural processing of speech and alternative feature representations.
Figure 7.0: AM, FM, and the analytic signal
Amplitude Modulation
\[y(t) = x(t) \cdot \cos(2\pi f_c t)\]
Envelope: $\lvert x(t) \rvert$
Hilbert Transform and Analytic Signal
\[\hat{x}(t) = \mathcal{H}\{x(t)\} = \frac{1}{\pi} \text{P.V.} \int_{-\infty}^{\infty} \frac{x(\tau)}{t-\tau} d\tau\]
Analytic signal: $z(t) = x(t) + i\hat{x}(t)$
Instantaneous amplitude: $A(t) = \lvert z(t) \rvert$
Instantaneous phase: $\phi(t) = \arg z(t)$. Instantaneous frequency: $f(t) = \frac{1}{2\pi} \frac{d\phi(t)}{dt}$
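These quantities are easy to compute with scipy.signal.hilbert, which returns the analytic signal directly; the amplitude-modulated test tone below is just an illustration:

```python
import numpy as np
from scipy.signal import hilbert

fs = 16000
t = np.arange(fs) / fs
# 440 Hz carrier with a slow 3 Hz amplitude modulation
x = (1.0 + 0.5 * np.sin(2 * np.pi * 3 * t)) * np.cos(2 * np.pi * 440 * t)

z = hilbert(x)                                   # analytic signal x + i*H{x}
envelope = np.abs(z)                             # instantaneous amplitude A(t)
phase = np.unwrap(np.angle(z))                   # instantaneous phase
inst_freq = np.diff(phase) * fs / (2 * np.pi)    # instantaneous frequency in Hz

print(np.median(inst_freq))                      # ~440 Hz for this test tone
```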
Applications in Speech
- Envelope extraction for ASR features
- F0 estimation via instantaneous frequency
- Modulation spectrum analysis
8. Acoustic Model Architectures
The acoustic model maps feature sequences to phoneme sequences. Two main approaches:
Hidden Markov Models (HMMs)
An HMM (Hidden Markov Model) is a statistical model where the system transitions between hidden states, and each state produces observable outputs with some probability. For speech: the hidden states are phoneme sub-units, and the observations are acoustic features.
Traditional approach modeling temporal variability:
- Each phoneme = sequence of HMM states (typically 3: onset, middle, offset)
- Emission probabilities: GMMs (Gaussian Mixture Models) model the probability of observing features in each state. A GMM represents a distribution as a weighted sum of multiple Gaussian (bell-curve) distributions.
- Transition probabilities: Model phoneme duration
Strengths: Interpretable, handles variable-length sequences naturally.
Weaknesses: GMMs assume feature independence, limited modeling capacity.
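A tiny illustration of the 3-state, left-to-right topology described above; the transition probabilities are placeholders, not trained values:

```python
import numpy as np

# Left-to-right 3-state HMM for one phoneme (onset, middle, offset).
# Self-loops model duration; states can only move forward.
A = np.array([
    [0.6, 0.4, 0.0],   # onset  -> onset  | middle
    [0.0, 0.7, 0.3],   # middle -> middle | offset
    [0.0, 0.0, 1.0],   # offset -> offset (exit handled by the decoder)
])
# In a GMM-HMM system, each state also carries an emission distribution
# p(feature | state), modeled as a mixture of Gaussians over 39-dim MFCCs.
```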
Deep Neural Networks (DNNs)
A DNN (Deep Neural Network) is a neural network with multiple hidden layers. Evolution of architectures:
| Era | Architecture | Approach |
|---|---|---|
| 2012+ | DNN-HMM hybrid | DNN replaces GMM for emission probabilities |
| 2015+ | LSTM/GRU | Recurrent networks with CTC loss |
| 2017+ | Transformer | Attention-based, parallel training |
| 2020+ | Self-supervised | Pre-trained representations |
Modern Approach: Self-Supervised Speech Embeddings
Traditional MFCCs are hand-crafted features. Modern systems learn representations directly from raw audio using self-supervised learning.
Wav2Vec 2.0 (Facebook/Meta, 2020): Learns speech representations by predicting masked portions of the audio. Pre-trained on 60k hours of unlabeled speech, then fine-tuned on small labeled datasets.
Wav2Vec 2.0 achieves strong ASR results with just 10 minutes of labeled data—a massive reduction from traditional systems requiring thousands of hours.
HuBERT (Hidden-Unit BERT): Similar approach but uses offline clustering to create pseudo-labels for masked prediction.
Whisper (OpenAI, 2022): Trained on 680k hours of weakly-supervised data. Robust to accents, background noise, and technical language.
Whisper is particularly useful for real-world applications due to its robustness to noise and ability to handle multiple languages without explicit language identification.
These models output embeddings—dense vector representations that capture phonetic, speaker, and linguistic information. They can replace or augment traditional MFCC pipelines:
Traditional: Audio → MFCCs → Acoustic Model → Text
Modern: Audio → Wav2Vec/Whisper → Fine-tuning → Text
Why embeddings work: Self-supervised pre-training on massive unlabeled data learns universal speech representations. Fine-tuning adapts these to specific tasks with minimal labeled data.
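As a sketch of the modern pipeline, the following uses a pre-trained Wav2Vec 2.0 checkpoint via the Hugging Face transformers library; the checkpoint name and 16 kHz mono input are assumptions for illustration:

```python
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

def transcribe(audio, sampling_rate=16000):
    # audio: 1-D float array of raw samples at 16 kHz
    inputs = processor(audio, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits   # (1, frames, vocab)
    ids = torch.argmax(logits, dim=-1)               # greedy CTC decoding
    return processor.batch_decode(ids)[0]
```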
Example: LSTM Acoustic Model
```python
import tensorflow as tf
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras import Input, Model

def build_acoustic_model(num_features, num_hidden, num_phonemes):
    # Variable-length sequence of feature vectors (e.g., 39-dim MFCC frames)
    input_features = Input(shape=(None, num_features))
    # return_sequences=True -> one output per frame, not just the final state
    x = LSTM(num_hidden, return_sequences=True)(input_features)
    # Per-frame probability distribution over the phoneme set
    output_phonemes = Dense(num_phonemes, activation='softmax')(x)
    model = Model(inputs=input_features, outputs=output_phonemes)
    return model

# Typical configuration
num_features = 39   # 13 MFCCs + 13 deltas + 13 delta-deltas
num_hidden = 256    # LSTM units
num_phonemes = 40   # English phoneme set

model = build_acoustic_model(num_features, num_hidden, num_phonemes)
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',  # integer phoneme labels per frame
    metrics=['accuracy']
)
model.summary()
```
Input: (batch, time_steps, 39) — sequence of MFCC frames
Output: (batch, time_steps, 40) — phoneme probabilities per frame
