Chapter 31: Speech & Audio Processing

PART X: Specialized Domains | Reading Time: 3.5 hours | Prerequisites: Ch 19, Ch 20

1. Learning Objectives

2. Introduction

Speech and audio processing sit at the intersection of Digital Signal Processing (DSP) and Deep Learning. For decades, voice-driven human-computer interaction was considered an AI-complete problem. Today, smart assistants like Siri, Alexa, and Google Assistant are ubiquitous.

Unlike spatial image data, audio is a one-dimensional temporal sequence containing a rich superposition of frequencies. The core challenge is translating this highly variable time-domain signal into a robust time-frequency representation (like a Mel-spectrogram) that deep neural networks can process to extract meaning, identity, or emotion.

Modern speech systems often disentangle linguistic content (what is said) from acoustic content (who is saying it). Architectures like Transformers excel at this disentanglement.

3. Historical Background

The 1950s saw the first digit recognizers like Bell Labs' Audrey. The 1980s marked the dominance of Hidden Markov Models (HMMs) combined with Gaussian Mixture Models (GMMs). This GMM-HMM paradigm relied heavily on handcrafted features (MFCCs) and complex phonetic dictionaries.

Around 2012, Deep Neural Networks (DNNs) began replacing GMMs as acoustic models, drastically reducing Word Error Rates (WER). By 2015, end-to-end deep learning architectures such as Baidu's DeepSpeech used Recurrent Neural Networks (RNNs) trained with Connectionist Temporal Classification (CTC) loss, bypassing phonetic dictionaries entirely. Today, large Transformer models dominate the landscape, whether pre-trained with self-supervision (wav2vec 2.0) or trained on large-scale weakly supervised data (Whisper).

4. Conceptual Explanation

Audio Fundamentals

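Sound is a mechanical pressure wave. A microphone converts this to an analog electrical signal, which an analog-to-digital converter (ADC) digitizes by sampling it at a fixed rate and quantizing each sample to a fixed number of bits.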
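The digitization step can be illustrated with a minimal sketch. The 440 Hz test tone, the 16 kHz sampling rate, and the 16-bit depth are assumptions chosen for illustration, and `sample_and_quantize` is a hypothetical helper, not a library function.

```python
import numpy as np

SAMPLE_RATE = 16_000          # samples per second (assumed for illustration)
MAX_INT16 = 2 ** 15 - 1       # largest value of a signed 16-bit sample

def sample_and_quantize(freq_hz, duration_s, sample_rate=SAMPLE_RATE):
    """Simulate an ADC: sample a sine tone, then round to 16-bit integers."""
    n_samples = int(round(duration_s * sample_rate))
    t = np.arange(n_samples) / sample_rate          # sampling instants in seconds
    analog = np.sin(2 * np.pi * freq_hz * t)        # stand-in for the analog signal, in [-1, 1]
    digital = np.round(analog * MAX_INT16).astype(np.int16)
    return t, analog, digital

t, analog, digital = sample_and_quantize(440.0, duration_s=0.01)
nyquist_hz = SAMPLE_RATE / 2                        # highest frequency the samples can represent
max_error = np.max(np.abs(analog - digital / MAX_INT16))

print(f"{len(digital)} samples, Nyquist = {nyquist_hz:.0f} Hz")
print(f"max quantization error = {max_error:.2e}")
```

The quantization error is bounded by half of one integer step, which is why higher bit depths give a lower noise floor.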

Feature Extraction

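Raw audio is high-dimensional: one second sampled at 16 kHz is 16,000 values. We therefore extract features over short, overlapping sliding windows (frames), typically about 25 ms long with a 10 ms hop.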
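A minimal sketch of the framing step, assuming a 25 ms window, a 10 ms hop, and no padding at the signal edges. `frame_signal` is a hypothetical helper, and the input is placeholder noise rather than real speech.

```python
import numpy as np

def frame_signal(y, sr, win_ms=25.0, hop_ms=10.0):
    """Split a 1-D signal into overlapping frames (no padding at the edges)."""
    win_length = int(round(sr * win_ms / 1000))   # samples per frame
    hop_length = int(round(sr * hop_ms / 1000))   # samples between frame starts
    if len(y) < win_length:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (len(y) - win_length) // hop_length
    # Row i holds samples [i * hop_length, i * hop_length + win_length)
    starts = hop_length * np.arange(n_frames)[:, None]
    offsets = np.arange(win_length)[None, :]
    return y[starts + offsets]

sr = 16_000
y = np.random.default_rng(0).standard_normal(sr)   # 1 second of placeholder noise
frames = frame_signal(y, sr)
print(frames.shape)   # (98, 400)
```

Each row is one frame of 400 samples. Libraries such as librosa pad the signal by default, so their frame counts can differ slightly from this unpadded version.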

ASR and CTC Loss

In ASR, the input audio sequence length differs from the output text sequence length. Connectionist Temporal Classification (CTC) loss introduces a "blank" token and marginalizes over all possible alignments between audio and text, allowing training without explicit frame-level alignment.

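When decoding CTC output, remember the rule: merge consecutive identical characters, then remove blanks. Writing the blank as `-`: `hhe-ll-lo` → `he-l-lo` → `hello`. The blank between the two runs of `l` is what preserves the double letter; without it, the repeated `l` characters would collapse into a single `l`.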
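The collapse rule can be sketched in a few lines. The choice of `-` as the blank symbol and the function name `ctc_collapse` are assumptions for illustration; real systems apply the same rule to integer token ids.

```python
from itertools import groupby

BLANK = "-"   # symbol used for the CTC blank in this sketch

def ctc_collapse(path, blank=BLANK):
    """Apply the CTC rule: merge repeated symbols, then drop blanks."""
    merged = [symbol for symbol, _ in groupby(path)]
    return "".join(s for s in merged if s != blank)

print(ctc_collapse("hhe-ll-lo"))   # hello
print(ctc_collapse("hhe-lllo"))    # helo
```

The second call shows why a blank is needed between repeated letters: without one, the run of `l` merges into a single character.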

OpenAI Whisper

Whisper is an Encoder-Decoder Transformer trained on 680,000 hours of weakly supervised data. A single model handles ASR, speech translation, and language identification, with special tokens selecting the task. It does not rely on traditional CTC: the decoder generates text tokens autoregressively while attending to the encoded log-Mel spectrogram via cross-attention.

Text-to-Speech (TTS)

Modern TTS is a two-stage process:

  1. Acoustic Model (Tacotron): Text/Phonemes → Mel-spectrogram.
  2. Vocoder (WaveNet): Mel-spectrogram → Raw audio waveform.

Voice Activity Detection (VAD) & Emotion Recognition

VAD classifies frames as speech or non-speech, acting as a crucial pre-processing gate. Emotion Recognition classifies the affective state (happy, sad, angry) from prosodic features (pitch, energy) and spectral features.
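As a rough illustration of the gating idea, here is a minimal energy-threshold VAD. It is a simple heuristic, not a trained classifier; the -35 dB threshold, the synthetic test clip, and the name `energy_vad` are all assumptions for this sketch.

```python
import numpy as np

def energy_vad(y, sr, win_ms=25.0, hop_ms=10.0, threshold_db=-35.0):
    """Flag frames whose RMS level is less than |threshold_db| below the loudest frame."""
    win = int(round(sr * win_ms / 1000))
    hop = int(round(sr * hop_ms / 1000))
    n_frames = 1 + (len(y) - win) // hop
    rms = np.array([
        np.sqrt(np.mean(y[i * hop : i * hop + win] ** 2))
        for i in range(n_frames)
    ])
    # Level of each frame in dB relative to the loudest frame (0 dB = loudest).
    level_db = 20 * np.log10((rms + 1e-12) / (rms.max() + 1e-12))
    return level_db > threshold_db

# Synthetic test clip: 0.5 s near-silence, 0.5 s tone, 0.5 s near-silence.
sr = 16_000
rng = np.random.default_rng(0)
y = 0.001 * rng.standard_normal(int(1.5 * sr))
t = np.arange(sr // 2) / sr
y[sr // 2 : sr] += 0.5 * np.sin(2 * np.pi * 220 * t)

is_speech = energy_vad(y, sr)
print(is_speech.shape, is_speech[0], is_speech[60], is_speech[-1])
```

Energy alone cannot tell speech from loud non-speech noise, which is why practical VADs classify spectral features rather than thresholding level.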

5. Mathematical Foundation

The Fourier Transform

The Discrete Fourier Transform (DFT) converts a discrete time-domain signal $x[n]$ to the frequency domain $X[k]$:

$$ X[k] = \sum_{n=0}^{N-1} x[n] e^{-j 2\pi k n / N} $$
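The sum can be evaluated directly as a sanity check. This sketch is an $O(N^2)$ matrix form of the definition, compared against NumPy's FFT; the 32-sample cosine is an assumed test signal.

```python
import numpy as np

def dft(x):
    """Direct O(N^2) evaluation of X[k] = sum_n x[n] * exp(-j*2*pi*k*n/N)."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    n = np.arange(N)
    k = n.reshape(-1, 1)
    twiddle = np.exp(-2j * np.pi * k * n / N)   # N x N matrix of complex exponentials
    return twiddle @ x

# A cosine completing 3 cycles over 32 samples concentrates in bins 3 and N - 3.
N = 32
x = np.cos(2 * np.pi * 3 * np.arange(N) / N)
X = dft(x)
peak_bins = np.flatnonzero(np.abs(X) > 1.0)

print(np.allclose(X, np.fft.fft(x)))   # True
print(peak_bins)                       # bins 3 and 29
```

In practice the FFT computes the same values in $O(N \log N)$ time, so the direct form is only useful for understanding the definition.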

The Mel Scale

The Mel scale $m$ relates to frequency $f$ (in Hz):

$$ m = 2595 \log_{10} \left(1 + \frac{f}{700}\right) $$
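A small sketch of this conversion and its inverse, using only the standard library. The function names are illustrative. Note that some libraries default to a different Mel variant; librosa, for example, uses this formula only when `htk=True` is passed to `librosa.hz_to_mel`.

```python
import math

def hz_to_mel(f_hz):
    """Mel value for a frequency in Hz, using the formula above."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse mapping: frequency in Hz for a Mel value."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

for f in (100, 1000, 4000, 8000):
    print(f"{f:>5} Hz -> {hz_to_mel(f):7.1f} mel")
```

The scale is roughly linear below 1 kHz (1000 Hz maps to about 1000 Mels) and logarithmic above it, so equal steps in Mels cover wider and wider bands in Hz.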

CTC Loss

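Given an input sequence $X$ of $T$ frames, the probability of a target sequence $Y$ is the sum of the probabilities of all valid alignment paths $\pi$, where $\mathcal{B}$ is the collapsing function (merge repeats, then remove blanks) and $\mathcal{B}^{-1}(Y)$ is the set of length-$T$ paths that collapse to $Y$: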

$$ P(Y | X) = \sum_{\pi \in \mathcal{B}^{-1}(Y)} \prod_{t=1}^{T} P(\pi_t | x_t) $$

The loss is the negative log-likelihood: $\mathcal{L}_{CTC} = - \ln P(Y | X)$.
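Summing over every path is exponential in $T$, so the sum is normally computed with a forward (dynamic programming) recursion. The sketch below implements that recursion and checks it against brute-force enumeration on a tiny example. The random 5-frame, 3-symbol distribution and the target `[1, 2, 2]` are assumptions for illustration, and the code works in plain probabilities rather than the log domain used in production.

```python
import itertools
import numpy as np

BLANK = 0   # index of the blank symbol in this sketch

def ctc_prob_forward(probs, target):
    """P(Y|X) via the CTC forward recursion.

    probs:  (T, V) array, probs[t, v] = P(symbol v at frame t)
    target: list of non-blank label ids
    """
    T = probs.shape[0]
    ext = [BLANK]
    for label in target:
        ext += [label, BLANK]          # blank-extended target: -, y1, -, y2, -, ...
    S = len(ext)
    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, ext[0]]
    if S > 1:
        alpha[0, 1] = probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            total = alpha[t - 1, s]
            if s >= 1:
                total += alpha[t - 1, s - 1]
            # Skipping over a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                total += alpha[t - 1, s - 2]
            alpha[t, s] = total * probs[t, ext[s]]
    return alpha[T - 1, S - 1] + (alpha[T - 1, S - 2] if S > 1 else 0.0)

def collapse(path):
    """The mapping B: merge repeats, then remove blanks."""
    merged = [k for k, _ in itertools.groupby(path)]
    return [k for k in merged if k != BLANK]

def ctc_prob_brute_force(probs, target):
    """Sum the probability of every length-T path that collapses to target."""
    T, V = probs.shape
    total = 0.0
    for path in itertools.product(range(V), repeat=T):
        if collapse(path) == list(target):
            total += np.prod([probs[t, v] for t, v in enumerate(path)])
    return total

rng = np.random.default_rng(0)
logits = rng.standard_normal((5, 3))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
target = [1, 2, 2]

p_fast = ctc_prob_forward(probs, target)
p_slow = ctc_prob_brute_force(probs, target)
print(np.isclose(p_fast, p_slow))   # True
print(-np.log(p_fast))              # the CTC loss for this example
```

The brute-force version visits all $3^5 = 243$ paths, which is only feasible for toy sizes; the recursion costs $O(T \cdot |Y|)$.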

6. Formula Derivations

Short-Time Fourier Transform (STFT) Windowing

To compute the STFT, we multiply each frame of the signal by a sliding window function (e.g., the Hann window, often called Hanning) that tapers to zero at the frame edges, which reduces spectral leakage:

$$ w[n] = 0.5 \left(1 - \cos\left(\frac{2\pi n}{N-1}\right)\right) $$

The STFT is then:

$$ STFT(m, \omega) = \sum_{n=-\infty}^{\infty} x[n] w[n - mR] e^{-j\omega n} $$

Where $m$ is the frame index and $R$ is the hop length.
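A minimal NumPy sketch of this computation, assuming a 400-sample window, a 160-sample hop, and a 1 kHz test tone at 16 kHz (all illustrative values). Each frame's FFT is taken relative to the frame start, so phases differ from the formula above by a per-frame factor, while magnitudes are unchanged.

```python
import numpy as np

def stft(x, win_length, hop_length):
    """Windowed FFT of each frame: one row per frame m, one column per frequency bin."""
    n = np.arange(win_length)
    window = 0.5 * (1 - np.cos(2 * np.pi * n / (win_length - 1)))   # Hann window w[n]
    n_frames = 1 + (len(x) - win_length) // hop_length
    frames = np.stack([
        x[m * hop_length : m * hop_length + win_length] * window
        for m in range(n_frames)
    ])
    return np.fft.rfft(frames, axis=1)   # keep only non-negative frequencies

sr = 16_000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 1000 * t)         # 1 second of a 1 kHz tone

S = stft(x, win_length=400, hop_length=160)
peak_bin = np.abs(S).mean(axis=0).argmax()
print(S.shape)                # (98, 201)
print(peak_bin * sr / 400)    # 1000.0
```

With a 400-sample window at 16 kHz, adjacent bins are 40 Hz apart. A longer window gives finer frequency resolution at the cost of coarser time resolution.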

MFCC Derivation via DCT

After obtaining the Mel-filterbank energies $E_m$, we take the logarithm $L_m = \log(E_m)$ to mimic human loudness perception. Then we apply a Type-II Discrete Cosine Transform (DCT):

$$ c_k = \sum_{m=1}^{M} L_m \cos\left[ \frac{\pi k}{M} \left( m - 0.5 \right) \right] $$

The lower coefficients $c_k$ capture the smooth spectral envelope (vocal tract formants), while higher coefficients capture fine harmonic structure (pitch) and are typically discarded in speech recognition, which is why only the first 12 to 13 coefficients are usually kept.
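The effect of truncating the DCT can be seen on a toy input. This sketch evaluates the formula above directly; the 40-band "log energies" are synthetic (a slow cosine envelope plus a fast ripple), chosen only so that the result is easy to read. The output should match `scipy.fft.dct(L, type=2)` up to a factor of 2 under that function's default normalization.

```python
import numpy as np

def dct2(log_energies, n_coeffs):
    """c_k = sum_{m=1..M} L_m * cos(pi * k / M * (m - 0.5)) for k = 0..n_coeffs-1."""
    L = np.asarray(log_energies, dtype=float)
    M = len(L)
    m = np.arange(1, M + 1)
    k = np.arange(n_coeffs).reshape(-1, 1)
    basis = np.cos(np.pi * k / M * (m - 0.5))   # (n_coeffs, M) cosine basis
    return basis @ L

# Synthetic log filterbank energies: a smooth envelope plus a fast ripple.
M = 40
m = np.arange(1, M + 1)
envelope = np.cos(np.pi * 2 / M * (m - 0.5))           # slow variation across bands
ripple = 0.3 * np.cos(np.pi * 30 / M * (m - 0.5))      # fast variation across bands

c = dct2(envelope + ripple, n_coeffs=M)
mfcc = c[:13]                                          # keep only the low-order coefficients

print(np.round(c[[2, 30]], 3))   # approximately 20 and 6
print(mfcc.shape)
```

The envelope lands in coefficient 2 and the ripple in coefficient 30, so keeping the first 13 coefficients retains the envelope and drops the ripple.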

7. Worked Numerical Examples

Calculating Mel Frequency

Problem: Convert a frequency of $f = 2100$ Hz to the Mel scale.

Solution:

$$ m = 2595 \log_{10} \left(1 + \frac{2100}{700}\right) $$

$$ m = 2595 \log_{10} (1 + 3) = 2595 \log_{10}(4) $$

$$ m \approx 2595 \times 0.60206 \approx 1562.35 \text{ Mels} $$

Audio Framing Calculation

Problem: You have a 2-second audio file sampled at 16,000 Hz. You use a window size of 25 ms and a hop size of 10 ms. How many frames will you get?

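Solution:

Convert the window and hop to samples: $25 \text{ ms} \times 16{,}000 \text{ Hz} = 400$ samples and $10 \text{ ms} \times 16{,}000 \text{ Hz} = 160$ samples. The file contains $2 \times 16{,}000 = 32{,}000$ samples. Without padding, the number of frames is:

$$ N_{frames} = \left\lfloor \frac{32000 - 400}{160} \right\rfloor + 1 = \lfloor 197.5 \rfloor + 1 = 198 $$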

8. Visual Diagrams

[ Audio Pipeline ]

Raw Waveform (1D)                     Mel-Spectrogram (2D)
     /\/\/\         STFT & Mel             _______
    /  \/  \        ---------->           |       |    Feature
   /        \       Filterbank            |#######|    Extraction
                                          |_______|
       |                                      |
       v                                      v
[ Vocoder (WaveNet) ]              [ Acoustic Model / ASR ]
   Generates Audio                      Generates Text

9. Flowcharts

[ End-to-End ASR with CTC ]

+---------------+       +-----------------+       +----------------+
|  Audio File   | ----> | Feature Extract | ----> | Log-Mel Spects |
+---------------+       +-----------------+       +----------------+
                                                          |
                                                          v
+---------------+       +-----------------+       +----------------+
|  Output Text  | <---- |  CTC Decoding   | <---- | Bi-LSTM / TFMR |
+---------------+       +-----------------+       +----------------+

10. Python Implementation

Let's implement fundamental audio loading and MFCC extraction using librosa.


import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

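# 1. Load Audio
# sr=16000 resamples to 16 kHz; pass sr=None to keep the file's native rate.
# For demo purposes we load librosa's bundled 'trumpet' example clip.
# To use your own recording, replace librosa.ex('trumpet') with its path,
# e.g. 'sample_speech.wav'.
y, sr = librosa.load(librosa.ex('trumpet'), sr=16000)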

# 2. Extract Mel-Spectrogram
mel_spectrogram = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=128
)
# Convert to log scale (dB)
log_mel_spectrogram = librosa.power_to_db(mel_spectrogram, ref=np.max)

# 3. Extract MFCCs
mfccs = librosa.feature.mfcc(S=log_mel_spectrogram, n_mfcc=13)

print(f"Audio Shape: {y.shape}")
print(f"Mel-Spectrogram Shape: {log_mel_spectrogram.shape}")
print(f"MFCC Shape: {mfccs.shape}")
            
Modify the code above to extract Delta and Delta-Delta MFCCs (using librosa.feature.delta), which capture the dynamic transitions of speech.

11. TensorFlow Implementation

Here is a basic 1D Convolutional Neural Network (CNN) for Audio Classification (e.g., distinguishing spoken digits).


import tensorflow as tf
from tensorflow.keras import layers, models

def build_audio_cnn(input_shape, num_classes):
    model = models.Sequential([
        # Input shape expected: (time_steps, mfcc_features)
        layers.Input(shape=input_shape),
        layers.Conv1D(64, kernel_size=3, activation='relu'),
        layers.MaxPooling1D(pool_size=2),
        
        layers.Conv1D(128, kernel_size=3, activation='relu'),
        layers.MaxPooling1D(pool_size=2),
        
        layers.Conv1D(256, kernel_size=3, activation='relu'),
        layers.GlobalAveragePooling1D(),
        
        layers.Dense(128, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation='softmax')
    ])
    
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model

# Example usage: 98 time steps, 13 MFCCs, 10 classes (digits 0-9)
model = build_audio_cnn((98, 13), 10)
model.summary()
            

12. Scikit-Learn Pipeline

For simpler tasks like Voice Activity Detection (VAD) or basic classification, we can flatten features and use traditional ML.


from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
import numpy as np

# Assume X_train is an array of flattened MFCCs: shape (n_samples, n_features)
# Assume y_train is binary (0 for silence, 1 for speech)

vad_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', SVC(kernel='rbf', probability=True))
])

# vad_pipeline.fit(X_train, y_train)
# predictions = vad_pipeline.predict(X_test)
            

13. Indian Case Studies

Bhashini (National Language Translation Mission)

India's Constitution recognizes 22 scheduled languages. The Government of India launched Bhashini to build an open-source, crowdsourced AI platform for translation and speech recognition across Indian languages. It uses ASR to transcribe Hindi, Tamil, Bengali, etc., translates the text, and uses TTS to speak it in the target language.

Kuku FM

An Indian audio content platform providing audiobooks and shows in regional languages. They utilize advanced TTS algorithms and noise-suppression ML models to rapidly scale content creation in Marathi, Gujarati, and Telugu.

Challenge: Code-Switching. Indians frequently mix languages (e.g., "Hinglish"). ASR models trained purely on Hindi or English fail spectacularly when a user says, "Mera flight cancel ho gaya hai." Modern Indian ASR systems require massive code-switched datasets for robust training.

14. Global Case Studies

15. Startup Applications

Otter.ai: Revolutionized meeting transcriptions by combining speaker diarization (who spoke when) with accurate ASR.

Descript: Allows video and audio editing by editing the transcribed text. It uses TTS to generate audio in the speaker's voice to fix misspoken words (Overdub).

Resemble AI / ElevenLabs: Leading startups in voice cloning and highly expressive, emotive TTS for gaming and dubbing.

16. Government Applications

Surveillance & Security: Voice biometrics (Speaker Verification) are used to authenticate identities for secure telephonic access to citizen services.

Parliamentary Proceedings: Automated transcription of Lok Sabha and Rajya Sabha sessions, handling multiple regional accents and fast-paced overlapping speech.

17. Industry Applications

Deepfakes! The rise of high-quality voice cloning creates severe risks for phishing and impersonation. The industry is urgently developing "Audio Deepfake Detection" systems as countermeasures.

18. Mini Projects

Project 1: Voice Command Recognizer

Objective: Build a system to recognize words like "Up", "Down", "Left", "Right".

Steps: Download the Google Speech Commands dataset. Extract MFCCs for each 1-second clip. Train a 1D CNN or LSTM using TensorFlow. Connect it to your microphone using pyaudio to control a simple Python game.

Project 2: Audio Deepfake Detector

Objective: Classify an audio clip as human or AI-generated.

Steps: Use the ASVspoof dataset. Extract Mel-spectrograms. AI-generated speech often lacks high-frequency breath sounds and has unnatural phase consistency. Train a ResNet50 model to classify the spectrograms as Fake/Real.

19. Exercises

Complete the following exercises to solidify your understanding:

  1. Explain the purpose of applying a windowing function before the FFT.
  2. Calculate the Nyquist frequency for standard CD audio sampled at 44.1 kHz.
  3. Why do we use the Mel scale instead of a linear frequency scale?
  4. Describe the steps to extract an MFCC from a raw audio waveform.
  5. How does Connectionist Temporal Classification (CTC) handle unaligned sequences?
  6. What is the difference between Speaker Identification and Speaker Verification?
  7. Write a Python script using librosa to plot the waveform and spectrogram of an audio file.
  8. Explain how the blank token solves the duplication problem in CTC loss.
  9. What are Formants, and which part of the MFCC captures them?
  10. Describe the architecture of the Tacotron TTS system.
  11. How does WaveNet generate audio sample by sample?
  12. What is the role of the Vocoder in a TTS pipeline?
  13. Why is Voice Activity Detection (VAD) crucial for ASR systems?
  14. How do self-attention mechanisms in Transformers improve upon RNNs in speech recognition?
  15. Describe the phenomenon of 'spectral leakage'.
  16. What is the effect of changing the hop length when computing an STFT?
  17. Explain how multilingual models like Whisper handle code-switching.
  18. What are Delta and Delta-Delta MFCCs?
  19. Design a high-level architecture for a real-time speech translation app.
  20. Discuss the ethical implications of voice cloning technology.

20. MCQs

Q1: What is the Nyquist frequency for an audio signal sampled at 16,000 Hz?

  A. 8,000 Hz
  B. 16,000 Hz
  C. 32,000 Hz
  D. 4,000 Hz
Correct Answer: A

Q2: Which feature extraction technique mimics the non-linear human perception of pitch?

  A. Linear Spectrogram
  B. Mel-Spectrogram
  C. Waveform
  D. Phase Spectrum
Correct Answer: B

Q3: In CTC Loss, what is the purpose of the 'blank' token?

  A. To act as a space between words
  B. To allow the model to output nothing for unaligned frames
  C. To represent background noise
  D. To denote end of sentence
Correct Answer: B

Q4: What mathematical operation is used to convert a Log-Mel Spectrogram into MFCCs?

  A. Fast Fourier Transform
  B. Discrete Cosine Transform (DCT)
  C. Wavelet Transform
  D. Inverse Fourier Transform
Correct Answer: B

Q5: Which of the following models is primarily a Vocoder?

  A. Tacotron
  B. DeepSpeech
  C. WaveNet
  D. Whisper
Correct Answer: C

Q6: If a window size is 25ms and hop length is 10ms for a 1-second audio, approximately how many frames are generated?

  A. 40
  B. 100
  C. 25
  D. 10
Correct Answer: B

Q7: Which component of a sound wave corresponds to its perceived pitch?

  A. Amplitude
  B. Phase
  C. Frequency
  D. Timbre
Correct Answer: C

Q8: What does 'Speaker Diarization' refer to?

  A. Translating speech to text
  B. Identifying who spoke when
  C. Synthesizing a new voice
  D. Removing background noise
Correct Answer: B

Q9: Which architecture is OpenAI's Whisper based on?

  A. HMM-GMM
  B. RNN with CTC
  C. Encoder-Decoder Transformer
  D. CNN
Correct Answer: C

Q10: Why is the logarithm applied during MFCC calculation?

  A. To compress the audio file size
  B. To mimic human perception of loudness (decibels)
  C. To make the signal periodic
  D. To remove the phase component
Correct Answer: B

21. Interview Questions

Mastering these questions is essential for roles like Speech Scientist or ML Engineer (Audio).
  1. How would you build an ASR system for a completely new language with only 10 hours of transcribed audio?
  2. Explain the end-to-end forward pass of Tacotron 2.
  3. What are the trade-offs between using MFCCs versus raw Mel-Spectrograms as inputs to a deep neural network?
  4. How do you handle variable-length audio sequences in a batch during training?
  5. Explain the beam search decoding process used with CTC loss.
  6. How does WaveNet achieve such high-quality audio generation, and what is its main drawback?
  7. What techniques would you use to improve ASR performance in highly noisy environments?
  8. Describe how you would evaluate a TTS system. What metrics would you use?
  9. What is the 'Cocktail Party Problem', and how is deep learning used to solve it?
  10. Explain the concept of self-supervised learning in speech, referencing models like wav2vec 2.0.

22. Research Problems

23. Key Takeaways

24. References