Impulse Response — data augmentation for audio deep learning

This article defines the impulse response, shows how to estimate the IR of a room or microphone, and describes its use in data augmentation for deep learning. This article is also published on the BetterMOS blog.

Akash Raj
Level Up Coding

--

In recent years, deep learning for audio has come a long way, with models beating traditional signal processing techniques on many downstream tasks. However, many such solutions are trained on “homogeneous” datasets — datasets with little variability in the recording conditions (noise, accent, language, etc.). These models often do not perform well (especially on audio conversion/synthesis tasks) when used on real-world “audio events,” which can contain short bursts, environmental noise, background speakers, poor microphones, etc. While there are many techniques to address these issues, here we concern ourselves with data augmentation using impulse responses, which can be very powerful since it simulates different recording environments.

Acoustics and Impulse Response (IR)

The impulse response of a dynamic system describes how it reacts when presented with a brief input signal, called the impulse. So if you play and record an impulse in a room, you will hear not only the impulse but also its reflections, and the reflections of those reflections.

The reaction of the system is influenced by its surroundings, such as objects within it or the geometry of the system. An ideal impulse contains all frequencies, so a system’s impulse response characterizes its response at every frequency. In practice, however, a perfect impulse is difficult to produce. A short impulse is instead employed as an approximation (provided it is close to the ideal impulse, the measured IR will be close to the true IR).

An important application of IR is to determine the acoustic characteristics of a room (or any location). Due to the geometry of the room or the position of the speaker or microphone, there can be a lot of reverb. By capturing/computing the IR of the room, the reverb can be applied on any source audio.
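Applying a captured IR to a dry recording amounts to a single convolution. Below is a minimal sketch using NumPy/SciPy; the function name `apply_ir` and the peak-normalization step are illustrative choices, not from a specific library:

```python
import numpy as np
from scipy.signal import fftconvolve

def apply_ir(audio, ir):
    """Convolve a mono signal with a room impulse response (convolutional reverb)."""
    wet = fftconvolve(audio, ir, mode="full")  # length: len(audio) + len(ir) - 1
    # peak-normalize so the reverberant signal does not clip
    return wet / (np.max(np.abs(wet)) + 1e-9)
```

Note that the output is longer than the input by the length of the reverb tail, which is exactly the effect visible in the spectrograms below.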

Mel spectrogram of audio, LJ001–0002.wav from the LJSpeech dataset. The top spectrogram shows the original audio, while the bottom one shows the same audio augmented with a room impulse response, AcademicQuadrangle.wav from the EchoThief dataset (the x-axis denotes time). The augmented spectrogram is longer because of reverberation.

Why is impulse response important in deep learning?

Deep learning models trained on homogeneous audio datasets can fall prey to out-of-domain data points, sometimes with severe performance degradation. Factors such as room geometry and the transfer function of a microphone introduce variability that the model hasn’t seen. One approach is to mitigate the variability itself, e.g., with dereverberation and denoising methods. Another is data augmentation, which is a powerful tool here: using impulse responses (convolutional reverb), we can make audio files sound as if they were recorded in various real or unreal rooms.

In the following figure, 100 audio files are sampled randomly from the LJSpeech dataset. These samples were recorded in a relatively quiet environment. For each data point, two augmentations are generated: a) the Academic Quadrangle IR from the EchoThief dataset, and b) my room’s IR, captured using the method described in the following section. Each original and augmented audio is given as input to fairseq’s pretrained wav2vec2 model, and a t-SNE plot is generated from the wav2vec2 embeddings. We can see clear separation between the clusters of original and IR-augmented audios.

t-SNE plot of original and augmented audios. Color indicates which impulse response was used.

Computing impulse response

This part of the section (and code) is heavily inspired by the following project: DIY Room Impulse Response Capture [1].

Several room impulse responses (real and simulated) can be found in the following repositories: Microsoft DNS challenge, MIT McDermott, EchoThief. In this section, however, we will look at how to compute the impulse response of a room ourselves. In practice, the computed IR describes a complex dynamic system consisting of the room, speaker, and microphone combined. These IRs can be used to make models more robust by fine-tuning them: a cherry on top!

While capturing an IR by directly recording an impulse signal is straightforward, it can be noisy (environmental noise is difficult to get rid of) and unreliable (the reverb tail may not be excited strongly enough, since the impulse signal is very short).

Exponential Sine Sweep (ESS)

Using a known signal, we can compute IR using convolution/deconvolution (see below equations):

Let yᵣ represent the recorded signal. It is a mixture of the room impulse response fₘ and a known signal z: in the time domain, yᵣ = fₘ * z, where * denotes convolution (in the frequency domain, this becomes a multiplication, Yᵣ = Fₘ · Z). The room impulse response can then be recovered by convolving the recorded signal with the inverse of the known signal: fₘ = yᵣ * z⁻¹. Farina, 2000 [2] solved the problem of computing the IR by using an exponential sine sweep, whose inverse filter is easy to construct (the time-reversed ESS with a volume ramp, see [1]).
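For concreteness, here is a sketch of generating the ESS and its inverse filter following Farina’s formulation (parameter names are my own; see [1] for the original recipe):

```python
import numpy as np

def ess_and_inverse(f1=20.0, f2=20000.0, duration=10.0, sr=48000):
    """Exponential sine sweep from f1 to f2 Hz and its inverse filter."""
    t = np.arange(int(duration * sr)) / sr
    R = np.log(f2 / f1)  # log of the frequency ratio
    sweep = np.sin(2 * np.pi * f1 * duration / R * (np.exp(t * R / duration) - 1.0))
    # inverse filter: time-reversed sweep with an exponential amplitude ramp
    # that compensates the sweep's extra energy at low frequencies
    inverse = sweep[::-1] * np.exp(-t * R / duration)
    return sweep, inverse
```

Convolving the sweep with its inverse yields (approximately) a band-limited impulse at a delay of one sweep length, which is what makes the deconvolution work.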

Given below are the steps to compute the impulse response of a room:

  1. Generate the ESS and the inverse ESS signals.
  2. Set up the microphone and speaker in a (relatively) quiet room.
  3. Play the ESS signal on the speaker while recording it with the microphone (continue recording for a few seconds after the ESS signal ends, to capture the reverb tail).
  4. Compute the IR using the script below.
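Steps 1 and 4 above can be sketched as follows (a simplified version of the approach in [1]; the function and variable names are mine). The recorded sweep is deconvolved by convolving it with the inverse ESS, and the linear IR is taken from the main peak onward, since harmonic distortion products land before it:

```python
import numpy as np
from scipy.signal import fftconvolve

def ess_and_inverse(f1, f2, duration, sr):
    """Step 1: exponential sine sweep and its inverse filter (Farina, 2000)."""
    t = np.arange(int(duration * sr)) / sr
    R = np.log(f2 / f1)
    sweep = np.sin(2 * np.pi * f1 * duration / R * (np.exp(t * R / duration) - 1.0))
    return sweep, sweep[::-1] * np.exp(-t * R / duration)

def compute_ir(recording, inverse, ir_length=None):
    """Step 4: deconvolve the recorded sweep to recover the impulse response."""
    full = fftconvolve(recording, inverse, mode="full")
    onset = int(np.argmax(np.abs(full)))  # main peak: start of the linear IR
    ir = full[onset:]
    if ir_length is not None:
        ir = ir[:ir_length]
    return ir / (np.max(np.abs(ir)) + 1e-9)  # peak-normalized IR
```

In a real capture, `recording` would be loaded from the microphone’s WAV file; here the function is agnostic to where the array came from.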

Key differences between a synthetically generated IR (see the Microsoft DNS challenge repository for a dataset of synthetic room IRs) and an IR captured in a real-world room are: 1) background/environmental noise, and 2) corruption of the room IR by the limitations of the speaker and microphone.

Frequency Response: The frequency response of a microphone denotes the range of frequencies it can capture and how sensitive it is to each frequency range.

Microphones convert audio into electrical signals that can be recorded by responding to the sound waves at their diaphragm, essentially converting mechanical wave energy into electrical energy. The frequency response represents how faithfully a mic can recreate the audio (say, in the audible range of 20 Hz to 20 kHz). While a flat response (equal sensitivity across all frequencies) is desirable for measurement, it often doesn’t sound good on voices. Therefore, many microphones (like those on smartphones) have a shaped response that is less sensitive to very high or very low frequencies.

The same audio event recorded on multiple microphones can sound different (assuming a very quiet room and same position). If a model has seen audio from only one device, it can perform poorly when presented with audio from a new device. Enter microphone augmentation.

Computing IR of a Microphone

Let’s assume the following conditions: a perfectly soundproof room (no background noise) that absorbs all sound waves (no reverb), and a speaker with a flat frequency response. Now, any morphing of the recorded signal must be due to the microphone’s sensitivity to different frequencies. An impulse response computed using the ESS method under these conditions represents the frequency response of the microphone.

In practice, however, the above conditions are difficult and expensive to achieve. The Microphone Impulse Response Project (MicIRP) uses a setup similar to the above to capture the IRs of several vintage microphones, which are freely available on their website!

Data augmentation — in real world

In a real-world setting, recorded audio will be corrupted by microphone sensitivity, the acoustics of the environment, background noise (white/pink noise, environmental noise), etc. While it depends on the task at hand, it generally helps robustness to mix several augmentations. Given a corpus, generate a parallel corpus by 1) adding white/pink noise at various SNR levels, 2) adding environmental noise, 3) convolving with a randomly chosen room IR, and 4) convolving with a randomly chosen microphone IR.
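A sketch of such a pipeline is given below. All names, the SNR range, and the random-choice strategy are illustrative assumptions, not a standard API; `room_irs`, `mic_irs`, and `env_noises` would be lists of NumPy arrays loaded from the repositories mentioned earlier:

```python
import numpy as np
from scipy.signal import fftconvolve

rng = np.random.default_rng()

def add_noise(audio, noise, snr_db):
    """Mix `noise` into `audio` at a target signal-to-noise ratio in dB."""
    noise = np.resize(noise, audio.shape)  # loop/trim noise to the audio length
    p_signal = np.mean(audio ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_signal / (p_noise * 10.0 ** (snr_db / 10.0)))
    return audio + scale * noise

def augment(audio, room_irs, mic_irs, env_noises, snr_range=(5.0, 30.0)):
    """Chain the four augmentations: white noise, environmental noise, room IR, mic IR."""
    out = add_noise(audio, rng.standard_normal(audio.shape[0]), rng.uniform(*snr_range))
    out = add_noise(out, env_noises[rng.integers(len(env_noises))], rng.uniform(*snr_range))
    out = fftconvolve(out, room_irs[rng.integers(len(room_irs))])
    out = fftconvolve(out, mic_irs[rng.integers(len(mic_irs))])
    return out / (np.max(np.abs(out)) + 1e-9)  # peak-normalize the result
```

Whether every augmentation should be applied to every file, or sampled with some probability, is a design choice that depends on the task.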

References

  1. DIY Impulse Response Capture. http://tulrich.com/recording/ir_capture
  2. Farina, Angelo. “Simultaneous Measurement of Impulse Response and Distortion with a Swept-Sine Technique”. Audio Engineering Society Convention 108. Audio Engineering Society, 2000.
