In almost all audio applications, clean, undistorted sound is a basic requirement. Any audible “crack”, “pop” or “snap” immediately breaks musical or spatial immersion. One cause of these artifacts is a temporal discontinuity in the digital audio waveform: a steep change (very high-frequency content) in the signal being reproduced by the speaker.
Such a discontinuity arises in a very common scenario in real-time audio processing and virtual auralization: changing the impulse response of a running finite impulse response (FIR) filter implemented with the Fast Fourier Transform (FFT) and block-based convolution. For example, every change in the position of a sound source or listener in a virtual acoustic space requires an update to the filter coefficients. If the filters are implemented with the Overlap-Add (OLA) or Overlap-Save (OLS) algorithm, an instantaneous change of the impulse response causes a discontinuity in the output waveform at the block boundary and, as a result, audible artifacts. The more the new impulse response differs from the old one, the more audible the artifacts are.
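To make the problem concrete, here is a minimal sketch (not code from the plugin) of overlap-save filtering where the impulse response is switched abruptly between two blocks. The function name `ols_filter_switching`, the block size and the two trivial one-tap responses are assumptions chosen only to make the jump easy to see.

```python
import numpy as np

def ols_filter_switching(x, responses, B):
    """Overlap-save filtering where block i uses impulse response responses[i].

    Assumes an FFT size of K = 2 * B and len(h) <= B + 1 for every response.
    """
    K = 2 * B
    history = np.zeros(K)                      # sliding window of input samples
    out = []
    for i, h in enumerate(responses):
        # Shift in the next B input samples.
        history = np.concatenate((history[B:], x[i * B:(i + 1) * B]))
        y = np.fft.irfft(np.fft.rfft(history) * np.fft.rfft(h, K), K)
        out.append(y[B:])                      # keep only the last B (valid) samples
    return np.concatenate(out)

B = 64
n = np.arange(4 * B)
x = np.sin(2 * np.pi * n / 256)                # slow sine, peak at n = 64
h0 = np.array([1.0])                           # "current" response: unity gain
h1 = np.array([0.25])                          # "new" response: quieter

# The response changes instantaneously at the start of the second block.
y = ols_filter_switching(x, [h0, h1, h1, h1], B)
steps = np.abs(np.diff(y))                     # sample-to-sample changes
print("step at block boundary:", steps[B - 1])
print("largest step elsewhere:", np.delete(steps, B - 1).max())
```

The step at the block boundary is far larger than any step the smooth input produces on its own, which is exactly the kind of click the rest of the post tries to remove.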
Moving virtual sound sources and listeners is one important example in the context of virtual auralization, but any scenario where user interaction or automation meets real-time, FFT-based FIR filtering requires the filters to be time-varying, that is, to have an impulse response that changes over time.
In this post I want to present the problem and an implementation of artifact-free, time-varying FIR filtering. It concerns only FFT-based filtering implemented as OLA or OLS block processing. Those methods are widely (if not exclusively) used where efficient real-time processing is required, whereas direct convolution does not suffer from these waveform discontinuities because its filter coefficients can be changed on a sample-by-sample basis.
Solution: Time-domain crossfading
Probably the simplest solution, and one that works out of the box, is to cross-fade in the time domain. We run two filters in parallel and cross-fade between their outputs. One filter has the “new” impulse response, let’s call it h1, and the other has the “current” impulse response, h0. The total output is then:
out(n) = f0(n)*y0(n) + f1(n)*y1(n)
- y0 and y1 are the filter outputs (filtered input block) of filters h0 and h1, respectively
- f0 and f1 are fade-out and fade-in functions (signals), respectively
What are those fade-out and fade-in functions? In general, cross-fading can be done with different curves: linear, sinusoidal, exponential and so on. Here both filters process the same input, so we want the two gains to sum to one at every sample; squared sine and cosine do exactly that (sin^2(x) + cos^2(x) = 1). Therefore, in our case the fade-in function is sin^2 and the fade-out function is cos^2, scaled so that one block covers a quarter of the sine/cosine period, i.e. the argument runs from 0 to π/2. If L is the block length, then f0(0) = 1 and f1(0) = 0, while f0(L-1) = 0 and f1(L-1) = 1.
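A minimal sketch of these fade functions and of the cross-fade formula above. The helper names `crossfade_windows` and `crossfade` are my own, and mapping the last sample of the block exactly onto π/2 is an assumption that follows the f0(L-1) = 0, f1(L-1) = 1 convention.

```python
import numpy as np

def crossfade_windows(L):
    """Return (fade_out, fade_in) ramps of length L based on cos^2 / sin^2."""
    n = np.arange(L)
    phase = 0.5 * np.pi * n / (L - 1)   # runs from 0 to pi/2 across the block
    fade_out = np.cos(phase) ** 2       # f0: 1 -> 0
    fade_in = np.sin(phase) ** 2        # f1: 0 -> 1
    return fade_out, fade_in

def crossfade(y0, y1):
    """out(n) = f0(n) * y0(n) + f1(n) * y1(n) for one block."""
    f0, f1 = crossfade_windows(len(y0))
    return f0 * y0 + f1 * y1

# Example: fade from a block of ones to a block of zeros.
block = crossfade(np.ones(8), np.zeros(8))
print(block)
```

Because the two ramps sum to one everywhere, cross-fading a block with itself returns the block unchanged.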
Now, in the algorithm, all that is needed is to mark one of the filters as current and the other as new, change the impulse response of the new one only, and swap the roles after the cross-fade (so the previous new filter becomes the current one). Note that when the new impulse response is the same as the current one, both filters produce the same output and the cross-fade returns it unchanged. This can be exploited for performance: while no change is pending, only one filter has to run.
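The bookkeeping described above could look roughly like the following sketch. The class `CrossfadedOLS` and its method names are hypothetical, it assumes overlap-save with an FFT size of twice the block size, and it finishes each cross-fade within a single block; it is an illustration, not the code from my plugin.

```python
import numpy as np

class CrossfadedOLS:
    """Overlap-save FIR filter whose impulse response may change between blocks."""

    def __init__(self, h, block_size):
        self.B = block_size
        self.K = 2 * block_size                  # FFT size; needs len(h) <= B + 1
        self.H_current = np.fft.rfft(h, self.K)
        self.H_new = None                        # pending response, if any
        self.history = np.zeros(self.K)          # sliding input window
        phase = 0.5 * np.pi * np.arange(self.B) / (self.B - 1)
        self.fade_out = np.cos(phase) ** 2       # f0: 1 -> 0
        self.fade_in = np.sin(phase) ** 2        # f1: 0 -> 1

    def set_impulse_response(self, h):
        """Schedule a new response; it is faded in during the next block."""
        self.H_new = np.fft.rfft(h, self.K)

    def process(self, x_block):
        # Shift in B new samples and transform the input only once.
        self.history = np.concatenate((self.history[self.B:], x_block))
        X = np.fft.rfft(self.history)
        y0 = np.fft.irfft(X * self.H_current, self.K)[self.B:]
        if self.H_new is None:
            return y0                            # nothing pending: one filter is enough
        # Second filter (second IFFT) is only needed while a change is pending.
        y1 = np.fft.irfft(X * self.H_new, self.K)[self.B:]
        self.H_current, self.H_new = self.H_new, None   # new becomes current
        return self.fade_out * y0 + self.fade_in * y1
```

A caller would invoke `set_impulse_response` whenever, say, a source position changes, and keep calling `process` once per block as usual.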
This approach works well and considerably diminishes audio distortion when one of the impulse responses changes. The drawback is that we have just doubled the computational cost (we have one additional filter) and added the extra operation of cross-fading two signals. The latter is generally negligible compared with the cost of the additional IFFT (inverse FFT) that the second filter introduces. It is possible to remove the need for the second IFFT (at the expense of implementation flexibility, more on that below) by moving the cross-fade to the frequency domain.
Improved solution: Frequency-domain crossfading
By using the convolution theorem (the Fourier transform of a convolution is the point-wise product of the Fourier transforms), or rather its dual (the Fourier transform of a point-wise product is the convolution of the Fourier transforms), we can perform the cross-fade in the frequency domain. In the frequency domain, the output signal from the previous section is given by:
OUT(k)=(F0⊗Y0)(k) + (F1⊗Y1)(k)
where F0 and F1 are the Fourier transforms of the fade-out and fade-in functions respectively, Y0 and Y1 are the Fourier transforms of the filtered signals, and ⊗ denotes (circular) convolution, up to the constant scale factor of the DFT convention in use. This is equal to (omitting the index):
OUT = F0⊗(H0*X) + F1⊗(H1*X)
where X is the Fourier transform of the input signal and H0, H1 are the Fourier transforms of the impulse responses.
From that equation it is clear that we need to compute the Fourier transforms of the fade functions only once (they don’t change) and the Fourier transform of the input signal only once per block. We also need only one IFFT: to transform the final output to the time domain. The problem is the convolution, or rather, computing it efficiently. The good news is that the Fourier transform of our fade functions (sin^2(x) and cos^2(x)) is very simple and has only 3 non-zero components. If K is the FFT size, then (I omit the derivation):
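- F0(k) = -K/4*δ(k+1) + K/2*δ(k) - K/4*δ(k-1), for f0(n) = sin^2(πn/K)
- F1(k) = +K/4*δ(k+1) + K/2*δ(k) + K/4*δ(k-1), for f1(n) = cos^2(πn/K)
These values assume the unnormalized K-point DFT, with the index k taken modulo K. With that convention a point-wise product in time corresponds to the circular convolution divided by K, so the effective weights are 1/2 for the centre bin and ∓1/4 for its two neighbours.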
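The three-component claim is easy to check numerically. The sketch below assumes the fades are sampled over one full FFT frame as sin^2(πn/K) and cos^2(πn/K) and uses NumPy's unnormalized FFT; the helper name `fade_spectra` is hypothetical. The absolute scale of the components depends on the DFT normalization, but the pattern (one centre bin, two neighbours of half its magnitude, opposite signs for the two fades) does not.

```python
import numpy as np

def fade_spectra(K):
    """DFTs of the fade-out and fade-in functions sampled over K points."""
    n = np.arange(K)
    f0 = np.sin(np.pi * n / K) ** 2     # equals 1/2 - 1/2 * cos(2*pi*n/K)
    f1 = np.cos(np.pi * n / K) ** 2     # equals 1/2 + 1/2 * cos(2*pi*n/K)
    return np.fft.fft(f0), np.fft.fft(f1)

K = 16
F0, F1 = fade_spectra(K)

# Bins whose magnitude is not numerically zero.
print(np.flatnonzero(np.abs(F0) > 1e-9))
print(np.round(F0.real, 6))
print(np.round(F1.real, 6))
```

With K = 16 only bins 0, 1 and K-1 survive, where bin K-1 plays the role of k = -1 because DFT indices wrap around.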
Given that, the convolution in our case reduces to a simple 3-point multiply-and-sum per frequency bin. There is, however, one problem that remains: spectral leakage. In the time-domain solution we could generate just the ramp we need (a quarter of the period of the sine or cosine being squared). Here the fade function has to cover a whole period of sin^2 or cos^2 across the K-point frame, otherwise its spectrum leaks beyond those three components. With the Overlap-Save method (OLS) this is easy to arrange: OLS discards the leftmost K – B output samples anyway, where B is the block size and equal to K/2, so the unwanted half of the period is thrown away and only the ramp remains. This cannot be done with the Overlap-Add (OLA) method, since it keeps all output samples. That is the flexibility cost I mentioned earlier.
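A sketch of one overlap-save block with the cross-fade done in the DFT domain, under these assumptions: K = 2B, full complex K-point FFTs from NumPy (unnormalized), a fade-out of sin^2(πn/K) and a fade-in of cos^2(πn/K) over the frame, and a hypothetical function name `ols_crossfade_block`. The 3-point convolution is written with the 1/K factor already folded in, which gives weights of 1/2 and ∓1/4.

```python
import numpy as np

def ols_crossfade_block(window, H0, H1):
    """Cross-fade from H0 to H1 within one overlap-save block, using one IFFT.

    window : the last K = 2B input samples (real)
    H0, H1 : K-point complex FFTs of the current and new impulse responses
    Returns the B valid output samples of the block.
    """
    K = len(window)
    X = np.fft.fft(window)               # input spectrum, computed once
    Y0 = X * H0                          # output spectrum of the current filter
    Y1 = X * H1                          # output spectrum of the new filter
    D = Y1 - Y0
    # (F0 conv Y0 + F1 conv Y1) / K, with
    #   F0 = K/2 at bin 0 and -K/4 at bins +-1  (fade-out),
    #   F1 = K/2 at bin 0 and +K/4 at bins +-1  (fade-in).
    # np.roll implements the circular index shift k -> k -+ 1.
    OUT = 0.5 * (Y0 + Y1) + 0.25 * (np.roll(D, 1) + np.roll(D, -1))
    y = np.fft.ifft(OUT).real            # the single inverse transform
    return y[K // 2:]                    # overlap-save: drop the leftmost K - B samples

# Usage with made-up data: 32-sample blocks and 16-tap responses.
rng = np.random.default_rng(1)
B, K = 32, 64
h0, h1 = rng.standard_normal(16), rng.standard_normal(16)
window = rng.standard_normal(K)
block = ols_crossfade_block(window, np.fft.fft(h0, K), np.fft.fft(h1, K))
```

Over the B samples that are kept, the frame-wide sin^2 falls from 1 towards 0 and cos^2 rises from 0 towards 1, so the result matches a time-domain cross-fade with those ramps. When H0 and H1 are equal, D is zero and the block reduces to ordinary overlap-save filtering.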
I have implemented and used the frequency-domain approach in my binaural VST plugin. This considerably reduced, to the point of being almost inaudible, the sonic artifacts while changing the position of the sound source.
This post is inspired by the paper “Efficient time-varying FIR filtering using crossfading implemented in the DFT domain”, which describes the above solutions in greater detail. All credit where it’s due.