Time-varying FIR filtering without sonic articacts

In almost all audio applications, clean, undistorted sound is essential and a basic requirement. Any audible “crack”, “pop” or “snap” will immediately break music or spatial immersion. One of the causes of these audible artifacts is a temporal discontinuity in the digital audio waveform – a steep change (very high-frequency content) in the signal being reproduced by the speaker.

Such discontinuity arises during a very common scenario in real-time audio processing and virtual auralization: changing an impulse response of a running Finite impulse response (FIR) filter implemented with Fast Fourier Transform (FFT) and block-based convolution algorithms. For example, every change in the position of sound sources/listeners in virtual acoustic space requires update to the filter coefficients. If the filters are implemented using Overlap-Add (OLA) or Overlap-Save (OLS) algorithms, an instantaneous change of the filter impulse response will cause output waveform discontinuity, and as a result, audible artifacts, the more audible, the more “different” the new impulse response is.

Scenario with changing positions of virtual sound sources and listeners is one important example in the context of the virtual auralization, but every other scenario where user interaction or automation meets real-time FIR (FFT-based) filtering requires those filters to be time-varying: with impulse response changing over time.

In this post I want to present the problem and implementation of artifacts-free, time-varying FIR filtering. This concerns only FFT-based filtering implemented as OLA or OLS block processing, as those are widely (if not exclusively) used where real-time and efficient processing is required and because direct convolution algorithms don’t suffer from waveform discontinuities and filter coefficients can be changed on the sample-by-sample basis.

Solution: Time-domain crossfading

Probably the simplest and working out-of-the box solution is to perform cross-fading in time domain. We have two filters running in parallel and cross-fade between those filters outputs. One of those filters has our “new” impulse response, let’s call it, h1, and the other has our “current” impulse response, h0. Our total output is then equal to:

out(n)f0(n)*y0(n) + f1(n)*y1(n)

where:

  • y0 and y1 are the filter outputs (filtered input block) of filters h0 and h1, respectively
  • f0 and f1 are fade-out and fade-in functions (signals), respectively

What are those fade out/in functions? Well, in general cross-fading can be accomplished with different functions: linear, sinusoidal, exponential etc. We would like to preserve the signal amplitude and power so ideal are sine- and cosine-square (sin^2(x) + cos^2(x) = 1). Therefore, in our case fade-in function is sin^2 and fade-out is cos^2 with the period equal to quarter of the block size (so that if L is block-length, then f0(L-1) = 0 and f1(L-1) = 1).

Now, in the algorithm, all that is needed is to mark one of the filters as current and the other as new, and change the impulse response of only one of them – swapping them afterwards (so the previous new, becomes the new current). Note that in the case of the new impulse response being the same, the total output is equal to the output of each of the filters. This property can be utilized for performance.

This approach works well and considerably diminish audio distortions when changing one of the impulse response. The drawback is that we just doubled the computational cost (we have one additional filter) and added the additional operation of cross-fading two signals. The latter is generally negligible compared with the cost of the additional IFFT (Inverse FFT) that the second filter introduce. It is possible to remove the need for the second IFFT (at the expsense of implementation flexibility, more on that below) by moving the crossfading to the frequency-domain.

Improved solution: Frequency-domain crossfading

By utilizing the convolution theorem (Fourier transform of the convolution is a point-wise multiplication of Fourier transforms), or rather, it’s inverse (Fourier transform of multiplication is given by convolution of Fourier transforms) we can perform the crossfading in frequency-domain. In frequency domain, the output signal from the previous section is given by:

OUT(k)=(F0⊗Y0)(k) + (F1⊗Y1)(k)

where F0, F1 are Fourier transforms of the fade-out and fade-in functions respectively and Y0, Y1 are Fourier transforms of the filtered signals and ⊗ denotes convolution. This is equal to (omitting the index):

OUT = F0⊗(H0*X) + F1⊗(H1*X)

where X is the Fourier transform of the input signal and H0, H1 are Fourier transforms of the impulse responses and ⊗ is a convolution sign.

From that equation it’s obvious we need to compute Fourier transforms of the fade in/out functions only once (they don’t change) and Fourier transform of the input signal only once per block. Now we need only one IFFT: to transform the final output to time-domain. The problem is the convolution, or rather, computing it efficiently. Good news is that the Fourier of our fade in/out functions (sin^2(x) and cos^2(x)) is very simple and has only 3 non-zero components – if K is the FFT size then (I omit the derivation):

  • F0(n) = -K/2*δ(k+1) + K*δ(k) -K/2*δ(k-1)
  • F1(n) = +K/2*δ(k+1) + K*δ(k) +K/2*δ(k-1)

Given that the convolution in our case can be realized as a simple 3-point multiplication and summation. There is however one problem that remains: spectral leakage. In our time-domain solution we could just generate one quarter of the period of sine- or cosine- square – here it is required to convolute over the entire period to avoid leakage. This can be quite easily realized using the Overlap-Save method (OLS): by discarding the leftmost K – B output samples, where B is the block-size and equal to K/2. This however cannot be accomplished with the Overlap-Add (OLA) method since it saves all output samples. That is the flexibility cost I mentioned earlier.

I have implemented and used the frequency-domain approach in my binaural VST plugin. This considerably reduced, to the point of being almost inaudible, the sonic artifacts while changing the position of the sound source.

This post is inspired by and above solutions are described in greater detail in the paper: Efficient time-varying FIR filtering using crossfading implemented in the DFT domain. All credit where it’s due.

Third-party audio spatialization plugins – future of the 3D audio in games?

With the upcoming release of the Oculus Rift and overall increased interest in virtual reality (VR) in gaming came also the development of 3D game audio. Finally, sound starts to be recognized as a vital part of the immersion in games, especially those using virtual reality technology. Still, major focus is being put on graphics – this is where 95% of the VR development is happening but at the same time there is an avalanche of free and commercial spatialization engine plugins/libraries that implement 3d (HRTF) panning supported by real-time reverberation with simple room modelling. Examples are: 3Dception by Two Big Ears, Oculus Audio SDK by Oculus VR, Phonom by Impulsonic…you can find many more but all are very similar when it comes to features. The details lay in sound quality, performance and price. Almost all of them allow integration with major game/audio engines: Unity, Unreal (only recently), Wwise, Fmod, which brings me to the topic of this post: are such 3rd-party plugins the future of 3D audio in games?

Currently, there are two dominant game engines on the market: Unity and Unreal Engine 4. Both have very basic built-in audio engines and offer no built-in 3D audio/spatialization (note that stereo panning and distance attenuation is not 3d audio). Neither does any of the two most popular audio middleware: Wwise and Fmod (Both offer 3D audio only via paid plugins). The only free solution are the libraries/plugins i mentioned above.

Let’s look at Unity. The way all spatialization plugins for Unity work is simple: they grab audio samples in OnAudioFilterRead method in a script attached to the sound source to spatialize,  process it on the native (C++) side via .dll that resides in the Plugins folder and return it back to Unity (managed code) to mix. This, however is not ideal and induces a significant overhead. However, the guys behind Unity seem to have recognized the need for 3D audio and the problem with current approach and included the Audio Spatialization SDK on their recently announced roadmap (full link here):

Capture

The feature, however, went from “on track” to “at risk”. Nevertheless it clearly means that Unity is developing a way to support those spatialization plugins instead of working on expanding their own audio engine capabilities in that area. Seems like a smart move. Why invent the wheel again? Let’s see how it looks in the case of Unreal Engine.

While Unity almost from it’s inception provides an easy way to access raw samples, Unreal didn’t offer any way to do it up until few months ago other than modifying the engine itself (it’s open source). In their release that came out few months ago they integrated the Oculus Audio SDK into the engine itself (available as a plugin) although only in the lower quality version and only on Windows 64-bit. What’s important is that while doing that, they created a way to get to raw audio samples by implementing your own audio spatialization plugin. This, however, is no where documented. It basically comes down to implementing the IAudioSpatializationPlugin and IAudioSpatializationAlgorithm interfaces, the latter one being where raw audio samples can be accessed. This is what the original source code comment says about it:

/**
* IAudioSpatializationAlgorithm
*
* This class represents instances of a plugin that will process spatialization for a stream of audio.
* Currently used to process a mono-stream through an HRTF spatialization algorithm into a stereo stream.
* This algorithm contains an audio effect assigned to every VoiceId (playing sound instance). It assumes
* the effect is updated in the audio engine update loop with new position information.
*
*/

While implementing that it’s also necessary to implement own XAPO (XAudio2 custom effect). Other than that it’s an ordinary Unreal plugin (module) so it’s possible to load own .dll and process the audio there, just like in case of Unity plugin. However, the whole process seem hacky and is full of pitfails and the lack of documentation doesn’t help. Most importantly, it’s Windows 64-bit only. It seems like it’s just a byproduct of adding support for Oculus Audio library that is yet to be fully implemented and officially released, but it’s there. Also, Unreal 4 audio side is undergoing heavy redesign at this moment because of the fact that audio in the engine is implemented differently for every platform to the point where they admitted that “every platform has almost it’s own audio engine”. After they sort that out it seems very likely they will provide a way to create custom spatialization effects in a cross-platform manner the way Unity plans to and the aforementioned interfaces are a big hint.

Since Unity and Unreal are the two giants in the game industry and most of the games in the near future will be created with their use, it seems like the future of 3D audio depends on those specialized in spatialization libraries/plugins/SDKs. Hopefully, this time, the audio development won’t get halted by some artificial patents and copyrights and we won’t end up stuck with the 3D counterpart of the Dolby Digital. Most importantly, there is a chance that due to the availability of good 3d audio solutions we will finally start catching up to graphics.

Binaural Panner – VST plugin

I’ve written a Binaural Panner VST plugin using algorithms described in the previous post. I used the popular JUCE C++ library (which is a no-brainer choice for writing VST plugins these days). JUCE doesn’t offer any API for FIR filtering so i needed to write fast convolution myself and for that I used kiss_fft which is a small, but quite fast FFT library. The GUI elements (like head icon) are hard-coded as C arrays. You can choose between 3 different HRTFs from the CIPIC database. There is a lot of room for optimization, for example crossfading between previous and next impulse response could be stopped at some point, interpolation could be done doing adjacent walk…but i wrote it as an experiment (I always wanted to make my own VST plugin someday!) so it’s not meant for any serious use. In the future, I plan to dive into Ambisonics, so I’ll probably include this HRTF panner as part of the ambisonics decoder plugin…I hope to have more knowledge and experience with HRTFs by that time so I could create an average HRTF that sounds reasonable for most people and that’s when I’ll consider this plugin for serious use.
Source is available at my repo on github. Here is the screenshot:
screenshot

Implementing Binaural (HRTF) Panner Node with Web Audio API

In my last post, I’ve talked about Head-Related Transfer Functions and how to use them to create virtual, binaural audio. In this post I’ll show how to write a stereo panner that uses HRTF to position a mono source in 3D space. I’ll implement it in JavaScript with the use of Web Audio API. Although Web Audio API already offers a stereo panner that uses HRTF, it is not of the best quality and provides no way of choosing the Head-Related Transfer Function to use which results in an effect that may not sound realistic for everyone. The ability to choose that particular HRTF which results in a best experience is very important. Another reason that i decided to use Web Audio API for implementation is accessibility – all you need is a web browser. The algorithm I’m about to present can be easily ported to any other language.

Some theory first

The process of using HRTF to create binaural sound from a mono source is simple: We convolute the mono signal with Head-Related Impulse Response (HRIR) for the left and right ear respectively. Since HRIR is different for each direction we need to select appropriate HRIR depending on the position we want our source to be.

Diagram illustrating the basic process of panning using HRTF (taken from Web Audio API)

Diagram illustrating the basic process of panning using HRTF

There are many databases available. Google “HRTF database” and you’ll get many websites from which you can download sets of HRTF impulse responses. The most popular and very well documented one is CIPIC HRTF Database. It provides HRIR for over 40 subjects with respective data about physical properties (detailed head and ear dimensions) and also comes with a Matlab script to load, visualize and listen to all HRTFs.

Regardless of the database chosen, it is necessary to understand the coordinate system our database is using. HRTF (HRIR) is a function of an angle (direction) besides time, obviously. The angle is most of the time specified a a pair of azimuth and elevation. What angles azimuth and elevation are exactly depends on the coordinate system. CPIC database uses so-called Interaural-Polar Coordinate System:

Interaural-Polar Coordinate System used by the CIPIC database compared to the common Spherical coordinate system.

In this coordinate system, elevation is the angle from the horizontal plane to a plane through the source and the x axis (called the interaural axis) and the azimuth is then measured as the angle over from the median plane. So, for example, the point (azimuth = 0, elevation = 0) is the point directly ahead, (azimuth = 90, elevation = 0) is directly to the right, (azimuth = 0, elevation = 180) is behind and so on…HRTFs from the CIPIC database are recorded at elevations from -45 deg to 230.625 deg at constant 5.625 degree intervals and azimuths from -80 to 80 (image below)

Visualization of points at which HRIR from CIPIC are recorded.

Visualization of points at which HRIR from CIPIC are recorded.

There is a fundamental problem that needs to be solved: HRIRs are, naturally, recorded for a finite set of points. On top of that, those points almost never cover the full sphere around the head. It doesn’t make sense to make a panner that allows to position the source only at points for which HRIR we have are recorder. In order to be able to pan a moving source, without audible “jumps” we need some method of interpolation.

One approach could involve taking impulse responses from neighbor points and sum them with weights proportional to distance from the interpolated point. If we restricted those weights to sum to 1, we would have implemented a kind of Vector Base Amplitude Panning (VBAP). This is proven to give good results, especially in the case of surround sound systems, but for HRTF interpolation, a better method exists.

The method is described here: http://scitation.aip.org/content/asa/journal/jasa/134/6/10.1121/1.4828983 Firstly, the 3D triangulation is performed on the original set of points, which are triplets of (azimuth, elevation, distance). As a result the points become vertices of tetrahedrons. To obtain an interpolated HRIR at point X, we need to find a tetrahedron enclosing  that point and sum HRIRs from it’s vertices with weights corresponding to barycentric coordinates of X. Therefore the interpolated HRIR is a barycenter of the four neighbor HRIRs. In case our HRIR data is measured at constant distance ( is only a function of azimuth and elevation) there is one dimension less and we search through triangles instead of tetrahedrons, in a 2D triangulation and sum only three HRIR components. HRTFs from CIPIC are function of only azimuth and elevation so we will be dealing with the 2D case, but the algorithm can very easily be scaled to add another dimension basing on the aforementioned paper.

Implementation

Let’s start by defining a class that will load and store the impulse response and triangulation and also perform the interpolation. I’ll call it “HRTF Container”. Instance of this class is what each panner will refer to.

function HRTFContainer() {

	var hrir =  {};
	var triangulation = {};

	this.loadHrir = function(file, onLoad) {
	        // ...
	}

	this.interpolateHRIR = function(azimuth, elevation) {
	        // ...
	}
}

We’re now going to define it’s two methods, starting with

Loading impulse response

The HRIR from CIPIC database is stored in .mat file format. There are few variables but only one is an actual impulse response, stored as
2[left and right] x 25[azimuths] x 50[elevations] matrix. Those 25 azimuths and 50 elevations form a grid pictured above and as You can see they don’t form a complete sphere around the head. If we are to use the method of interpolation that base on a triangulation, it would be better to not have holes. Also, the panner should be able to pan the source in any desired direction. Therefore we need to extrapolate the missing data. The most important points are directly to the left and right, those are +/- 90 azimuth. Also, points below the head are missing. This obviously comes from the fact that the source would need to be in a person’s body. This is the reason we very rarely, almost never, listen to the sound coming directly from below. It is a question whether it makes sense to create a panner that offers such an option. If it does, the result won’t be realistic anyway. However, i decided to do that and extrapolate the “directly below” direction. Why? Because of practical reasons. It will be easier to have a panner that covers the full sphere since we won’t have to restrict the direction – our panner will work for every possible angle. The effect won’t be as realistic as but it’s still better than none.
Here is the script that exports the hrir data from the .mat file to the binary (.bin) format that I’ll load into JavaScript. The points directly to the left (+/- 90 azimuths) are simply an arithmetic mean of the points for +/- 80 azimuth. Points below (elevation = 270 = -90) are a weighted mean of points for elevation = -45 and 230.625.

The code to load the data from the .bin file is pretty straightforward, as long as the layout of the binary data is known:

this.loadHrir = function(file, onLoad) {
	var oReq = new XMLHttpRequest();
	oReq.open("GET", file, true);
	oReq.responseType = "arraybuffer";
	oReq.onload = function(oEvent) {
		var arrayBuffer = oReq.response;
		if (arrayBuffer) {
			var rawData = new Float32Array(arrayBuffer);
			var ir = {};
			ir.L = {};
			ir.R = {};
			var azimuths = [-90, -80, -65, -55, -45, -40, -35, -30, -25, -20, -15, -10, -5, 0,
                5, 10, 15, 20, 25, 30, 35, 40, 45, 55, 65, 80, 90];
			var points = [];

			var hrirLength = 200;
			var k = 0;
			for (var i = 0; i < azimuths.length; ++i) {
				azi = azimuths[i];
				ir['L'][azi] = {};
				ir['R'][azi] = {};

				// -90 deg elevation
				ir['L'][azi][-90] = rawData.subarray(k, k + hrirLength);
				k += hrirLength;
				ir['R'][azi][-90] = rawData.subarray(k, k + hrirLength);
				k += hrirLength;

				points.push([azi, -90]);
				// 50 elevations: -45 + 5.625 * (0:49)
				for (var j = 0; j < 50; ++j) {
					var elv = -45 + 5.625 * j;
					ir['L'][azi][elv] = rawData.subarray(k, k + hrirLength);
					k += hrirLength;
					ir['R'][azi][elv] = rawData.subarray(k, k + hrirLength);
					k += hrirLength;
					points.push([azi, elv]);
				}

				// 270 deg elevation
				ir['L'][azi][270] = rawData.subarray(k, k + hrirLength);
				k += hrirLength;
				ir['R'][azi][270] = rawData.subarray(k, k + hrirLength);
				k += hrirLength;
				points.push([azi, 270]);
			}
			hrir = ir;
			triangulation.triangles = Delaunay.triangulate(points);
			triangulation.points = points;
			if (typeof onLoad !== "undefined")
				onLoad();
		}
		else {
			throw new Error("Failed to load HRIR");
		}
	};
	oReq.send(null);
}

Hrir is stored as float array, each HRIR is 200 samples long. With each loaded hrir we store the (azimuth, elevation) point and at line 48 the triangulation from those points is performed. I’ve used the Delaunay triangulation by ironwallaby.

Interpolation

This is a direct implementation of the algorithm described in the paper mentioned earlier.
We iterate through all the triangles in the triangulation. The triangle enclosing the point we want to interpolate is found when all the barycentric weights (g1, g2 and g3) are non-negative. The interpolated HRIR is a weighted sum of the 3 HRIRs corresponding to vertices of that triangle:

this.interpolateHRIR = function(azimuth, elevation) {
	var triangles = triangulation.triangles;
	var points = triangulation.points;
	var i = triangles.length - 1;
	var A, B, C, X, T, invT, det, g1, g2, g3;
	while (true) {
		A = points[triangles[i]]; i--;
		B = points[triangles[i]]; i--;
		C = points[triangles[i]]; i--;
		T = [A[0] - C[0], A[1] - C[1],
			 B[0] - C[0], B[1] - C[1]];
		invT = [T[3], -T[1], -T[2], T[0]];
		det = 1 / (T[0] * T[3] - T[1] * T[2]);
		for (var j = 0; j < invT.length; ++j)
			invT[j] *= det;
		X = [azimuth - C[0], elevation - C[1]];
		g1 = invT[0] * X[0] + invT[2] * X[1];
		g2 = invT[1] * X[0] + invT[3] * X[1];
		g3 = 1 - g1 - g2;
		if (g1 >= 0 && g2 >= 0 && g3 >= 0) {
			var hrirL = new Float32Array(200);
			var hrirR = new Float32Array(200);
			for (var i = 0; i < 200; ++i) {
				hrirL[i] = g1 * hrir['L'][A[0]][A[1]][i] +
					g2 * hrir['L'][B[0]][B[1]][i] +
					g3 * hrir['L'][C[0]][C[1]][i];
				hrirR[i] = g1 * hrir['R'][A[0]][A[1]][i] +
					g2 * hrir['R'][B[0]][B[1]][i] +
					g3 * hrir['R'][C[0]][C[1]][i];
			}
			return [hrirL, hrirR];
		}
		else if (i < 0) {
			break;
		}
	}
	return [new Float32Array(200), new Float32Array(200)];
}

Panner

Now that we have a way to load HR impulse response and interpolate it for every direction desired, we can finally code the actual panner. For the rest of this post I’ll assume You have basic knowledge of the Web Audio API, which means You read at least the Introduction section of the official documentation. :)

The heart of the HRTF Panner is the convolution process. Web Audio API offers a ConvolverNode interface for that. Since we need to convolve with HRIR for both ears (left and right channel), we need two convolvers (with one-channel buffer each) or one convolver using a two-channel buffer. For convenience, I’ll go with the latter:

function HRTFConvolver() {
	this.convolver = audioContext.createConvolver();
	this.convolver.normalize = false;
	this.buffer = audioContext.createBuffer(2, 200, audioContext.sampleRate);
	this.convolver.buffer = this.buffer;
	this.gainNode = audioContext.createGain();

	this.convolver.connect(this.gainNode);

	this.fillBuffer = function(hrirLR) {
		var bufferL = this.buffer.getChannelData(0);
		var bufferR = this.buffer.getChannelData(1);
		for (var i = 0; i < this.buffer.length; ++i) {
			bufferL[i] = hrirLR[0][i];
			bufferR[i] = hrirLR[1][i];
		}
		this.convolver.buffer = this.buffer;
	}
}

Nothing extraordinary…we create the two-channel buffer with length of 200 samples (that’s the length of HRIR from CIPIC) and then set the convolver to use that buffer. Before it is done however, we must set the ConvolverNode.normalize attribute to false, so that the Web Audio API algorithms does not normalize our impulse response. It is necessary because HRIRs from the CIPIC database are already normalized. I also create a GainNode and connect the convolver to it. I’ll explain the purpose of that GainNode in a bit. In the fillBuffer method we fill buffer with new impulse response via “pointers” to the underlying arrays (the getChannelData() method). At line 17 we need to set the buffer again because of the fact that call to getChannelData apparently resets the ConvolverNode.buffer attribute.

Now, if we were simply to change the convolver’s impulse response we would get audible glitches/cracks, caused by a rapid change of the filter. To avoid that we would need to add the last 200 (length of the impulse response) filtered samples from the output (so called tail) to the input and that’s simply not possible without creating a processor script, which would deal with raw samples causing a massive drop in performance, defeating the purpose of using Web Audio API in the first place. So the solution is like this: create two convolvers , update only one of them (let’s call it the target convolver) and crossfade between the two. After crossfading completes, swap the references:

function HRTFPanner(audioContext, sourceNode, hrtfContainer) {

	function HRTFConvolver() {
		// ...
	}

	var currentConvolver = new HRTFConvolver();
	var targetConvolver = new HRTFConvolver();

	this.update = function(azimuth, elevation) {
		targetConvolver.fillBuffer(hrtfContainer.interpolateHRIR(azimuth, elevation));
		// start crossfading
		var crossfadeDuration = 25;
		targetConvolver.gainNode.gain.setValueAtTime(0, audioContext.currentTime);
		targetConvolver.gainNode.gain.linearRampToValueAtTime(1, audioContext.currentTime + crossfadeDuration / 1000);
		currentConvolver.gainNode.gain.setValueAtTime(1, audioContext.currentTime);
		currentConvolver.gainNode.gain.linearRampToValueAtTime(0, audioContext.currentTime + crossfadeDuration / 1000);
		// swap convolvers
		var t = targetConvolver;
		targetConvolver = currentConvolver;
		currentConvolver = t;
	}
}

At line 11 we fill the target convolver’s buffer with the interpolated impulse response. At line 15 we schedule the gain of the target convolver’s to raise from 0 to 1 and the opposite for the current convolver at line 17. Then, we swap the references. This is very similar to the Double-Buffer pattern used by GPUs, except for the crossfading part. In the code above the crossfading time is 25 milliseconds. That is enough to avoid audible glitches, at least in theory. I’ve tested it on many signals and this value seems to be enough but it may not always be the case. The longer the time, the more save we are, however at the cost of the delay caused by the transition.

The last thing, which is not necessary but highly desirable, is a crossover. The purpose of the crossover in a HRTF Panner is to don’t spatialize lowest frequencies. Those are in their nature non-directional: hearing a deep bass coming only from one direction is very unnatural to hear. Therefore we can separate the lowest frequencies of the source signal and send them directly to the output and convolute only the frequencies above the cutoff frequency. To do that we can use the BiquadFilterNode interface:

function HRTFPanner(audioContext, sourceNode, hrtfContainer) {

	function HRTFConvolver() {
		// ...
	}

	var currentConvolver = new HRTFConvolver();
	var targetConvolver = new HRTFConvolver();

	var loPass = audioContext.createBiquadFilter();
	var hiPass = audioContext.createBiquadFilter();
	loPass.type = "lowpass";
	loPass.frequency.value = 200;
	hiPass.type = "highpass";
	hiPass.frequency.value = 200;

	var source = sourceNode;
	source.channelCount = 1;
	source.connect(loPass);
	source.connect(hiPass);
	hiPass.connect(currentConvolver.convolver);
	hiPass.connect(targetConvolver.convolver);

This is almost all that there is to it. Last thing we need is to define the methods for connecting the panner to the output (so it can be used kind of like the native AudioNode interface) and some other methods, like changing the source or crossover frequency:

	/* 
		Connects this panner to the destination node.
	*/
	this.connect = function(destination) {
		loPass.connect(destination);
		currentConvolver.gainNode.connect(destination);
		targetConvolver.gainNode.connect(destination);
	}

	/* 
		Connects a new source to this panner and disconnects the previous one.
	*/
	this.setSource = function(newSource) {
		source.disconnect(loPass);
		source.disconnect(hiPass);
		newSource.connect(loPass);
		newSource.connect(hiPass);
		source = newSource;
	}

	/*
		Sets a cut-off frequency below which input signal won't be spatialized.
	*/
	this.setCrossoverFrequency = function(freq) {
		loPass.frequency.value = freq;
		hiPass.frequency.value = freq;
	}

The full code, along with the example can be found at my repo: https://github.com/tmwoz/hrtf-panner-js
I’ve also written a browser application that showcases the whole CIPIC database here: https://github.com/tmwoz/binaural-hrtf-showcase

Creating binaural sound: Head Related Transfer Functions

Have You ever heard about the Virtual Barber Shop?  If not, then listen to this (use headphones!):

Sounds amazing, right? Well, the recording is almost 20 years old. The technique itself is more than two times older. Yet, there are hardly any games that make use of that and if they are, they are not AAA titles…which is strange because (as You could hear) the effect is incomparable with the stereo we’re used to and the method itself (as You’ll read later on) is very easy to implement. Actually, the only downside is that it works only on headphones.

But what is this magic actually?

It’s called…

Binaural Recording

The idea is to record..what our ears really hear. How do You do that? Well, you simply put the microphones in the…ears.That’s it. As for binaural recordings, there isn’t really much more to it. The effect is similar to what You can hear in the video above – You feel like you’re really there.And that is because of the obvious fact that microphones placed in your ears record sound waves almost the same as those that would normally hit your inner ear. As a consequence, such a recording already contains all the positional cues that our brain uses to localize the source of sound and other techniques try to reproduce. Moreover, it includes all the effects like scattering and diffraction of the sound wave on the head, pinna and other parts of the body that are inseparable from the sounds that we hear everyday and are very hard, if not impossible to reproduce.

A dummy head used to create a binaural recording of the sound of sea on the beach.

There is one catch though : Everybody is different. The size and shape of the head, dimensions of the pinna and other parts like torso are naturally different for each person, causing slightly different effects for different wave lenghts – therefore recording made inside one person’s ears may sound unnatural for the other. If you would examine a head as a filter then every person’s would have his own, unique filter characteristic that ultimately shapes the final sound that the brain analyses. This leads us to…

Head Related Transfer Function (HRTF)

If our head (mainly, but other parts of the body too) acts as a filter, then HRTF is nothing more than that filter’s transfer function. HRTF is an information on how your head, including the pinna acts on different frequencies of the incoming sound. That information is crucial to our brain and one of the main mechanism of determining the direction of the incoming sound: it is, most importantly, used to determine the location of sound source lying on the cone of confusion (where the Interaural Time Difference and Interaural Level Difference are the same). HRTF is a rather complicated function of frequency and direction. For every direction, the phase and frequency response “of our filter” is different.

A plot of an example HRTF frequency response for a sound source directly ahead of a listener.

A plot of an example HRTF frequency response for a sound source directly ahead of a listener.

For a sound source directly ahead, HRTF is roughly similar for both the left and right ear because of the symmetry of our body. You can see peaks and notches on the plot which are caused by various wave effects, such as reflection of the pinna, diffraction on the head, etc…

The same person as above, but this time sound source is to the left.

The same person as above, but this time sound source is to the left.

Sound source to the right.

Sound source to the right.

For the sound source not on the axis of symmetry the biggest difference is in attenuation – You can see on the plots above that for the sound source to the one side, the ear on the opposite side of the head receives much less acoustic energy (that is intuitive). Generally, the higher the frequency, the higher the attenuation.

There are many detailed papers on the subject of HRTF – the topic was studied extensively and many measurements were done. I’ve skipped a lot of details because i want to focus on the most interesting use of HRTFs from the game audio point of view, which is…

Using HRTF to create virtual surround sound

Let’s say You have a HRTF for some direction V. What would happen if You filtered a raw (mono) signal with that HRTF for of the ears? You get a signal picked by that ear like it would come from the direction V. Apply this logic for the other ear and You have a stereo sound that is almost exactly the same like it would be a binaural recording. See where this is going? It is possible to position a sound source in any point in 3D space, given only the HRTF corresponding to that point. This is one of the most popular use of HRTF – positioning sound sources in a virtual space: a 3D panning. Note, that with standard panning techniques it is possible to position a sound source only in front, in a 2D plane (no elevation). There is no way to position a sound source behind, above or below. It is possible using HRTF.

Because of that, there exists entire databases of HRTF. CIPIC is probably the most popular one. HRTFs are recorded for a set of points that form a grid around the listener.

Measuring HRTFs in an anechoic chamber. Each speaker corresponds to different sound source direction (angle). Microphones are inside person’s ears.

In practice, HRIR (Head Related Impulse Response) are measured. HRIR is nothing more than an Inverse Fourier Transform of the HRTF. It is the impulse response of “our head”. Since filtering is the process of convolution of the input with filter’s impulse response, the process can be easily reversed and that’s called deconvolution. In this case the filtered sound is recorded by microphones in the ears and the raw signal is used to find the unknown: the impulse response.

HRTF filtering effect.

To sum up, the process of positioning a sound source using HRIR (HRTF) is very easy: It is simply a convolution of the mono signal with HRIR for left and right ear. With this simple process, You can achieve the same effect that You heard in “Virtual Barber Shop”.
However, there is one obvious problem: You would need an infinite number of HRIRs to cover every possible angle. If You want to be able to position a source in ANY position, some kind of interpolation is needed. I’ll talk about this problem and possible solution in the next post, where I will also show how to program a 3D panner using HRIRs. Thank You for reading.

Parametric EQ in python

Lately in my free time i’ve been writing a parametric equalizer in python, mostly as a part of a self-learning process of equalization in digital domain (using IIR filters) and to learn more python. I used pyaudio for sound, scipy for all dsp, pyside for GUI. I initially used pyqtgraph for plotting but it turned out to have too many things not working in it’s current version so i realized i would probably spend less time coding my own plotting…In the and I needed to override paint event for other functionality anyway so that was probably the right decision…

It is not exactly finished and isn’t meant to be: It is a free-time hobby “project” after all , nothing innovative. It does it’s job: opens a wave file and allows You too have fun with the filters in real time. I learned a lot coding this, hopefully someone , someday will learn something from it too.

I intend to write a post or two that talk in details about the biggest challenges i encountered while writing the program.

I created the repository on GitHub with the project. You can get there through sources section of this blog and see the code there. (Please don’t mind the lack of comments in many places and naming inconsistency…i wanted to just make it work first and then clean it :)

Screen of the equalizer:

screenshot

Air apsorption of sound as a digital filter – Part 2: Implementation in Unity (c#)

In the previous post i described the theory on how to design a digital filter that will simulate absorption of sound due to propagation through air, using a standardized acoustic absoprtion model and simple filter design method. In this post i’ll show how to put that theory into practice by writing a script for Unity 3D engine that attached to a sound source will filter audio coming from it, depending on player’s distance from the source. Unity is a very popular, fully-featured game engine and environment – i chose it as an example because it allows for very straight-forward implementation of what we want to achieve here and because the script i’ll write for it (in c-sharp) is simply a set of clsses/methods that can be easily used/rewritten elsewhere.

Unity is a component driven engine, this means that all functionality of a game object is encapsulated in the form of attachable components that have the ability to communicate with each other. This includes all built-in components like transform, mesh, shaders, coliders, audio source etc…and user-written scripts. These scripts are nothing else than custom Components and allow for programmatical way to manipulate objects, control components and the only way to implement game logic, custom behaviour and methods, user input…basically anything essential for more than a simplest game to work.

Scripts for Unity can be written in C#, JavaScript and Boo. I will use C#. Creating a script is very simple. We must provide a definition of a class that inherits from MonoBehaviour, the base class for every script:

using UnityEngine;
using System;

public class SoundAirAbsorption : MonoBehaviour {

	void Start () {
        //initialization
	}

	void Update () {
        // code to be executed every frame
	}

    void OnAudioFilterRead(float[] data, int channels) {
        //filtering here
    }
}

By default , we only have two methods definitions: Start() and Update(). Start() is called during script initialization, Update() is called every frame. We will fill them later, cause we don’t need them to do actual filtering. Custom filters in Unity are implemented using the OnAudioFilterRead method. It’s called everytime audio data from the audio source component (or previous filter in the chain) is ready to be processed. Parameters: data is an array of floats ranging from -1 to +1 and represents interleaved samples; channels is number of channels (left, right speaker and so on…) in the array – we’ll use that information to deinterleave the data, later.

Ok, first, let’s write our air absorption model. As a reminder we’re using this set of equations: LINK. Example class could look like this (public fields/methods highlighted):

class AirModel
{

    double Fr_O; //relaxation frequency of oxygen
    double Fr_N; //relaxation frequency of nitrogen

    const float T0 = 293.15f;
    const float T01 = 273.16f;  

    float T;
    public float Temperature
    {
        get { return T; }
        set
        {
            if ((value >= 0) && (value <= 330))
	    {
                T = value;
            	updateRelaxFs();
            }
        }
    }

    float hr;
    public float Humidity
    {
        get { return hr; }
        set
        {
            if ((value >= 0) && (value <= 100))
            {
                hr = value;
            	updateRelaxFs();
	    }
        }
    }

    float ps;
    public float Pressure
    {
        get { return ps; }
        set
        {
            if ((value >= 0) && (value <= 2))
	    {
                ps = value;
            	updateRelaxFs();
            }
        }
    }

    public AirModel(float temp_kelvin, float h_relative, float atm_pressure = 1f)
    {
        T = temp_kelvin;
        hr = h_relative;
        ps = atm_pressure;
        updateRelaxFs();
    }

    public double getAbsCoeff(float f)
    {
        float F = f/ps;       
        return 20 / Math.Log(10) * F * F * (1.84 * Math.Pow(10,-11) * Math.Sqrt(T/T0) +
            Math.Pow(T/T0,-2.5) * ( 0.01278*Math.Exp(-2239.1/T)/(Fr_O+F*F/Fr_O) + 0.1068*Math.Exp(-3352/T)/(Fr_N+F*F/Fr_N) )) * ps;
    }

    void updateRelaxFs()
    {
        double h = hr * Math.Pow(10, -6.8346 * Math.Pow(T01 / T, 1.261) + 4.6151) / ps;
        Fr_O = 24 + 40400 * h * (0.02 + h) / (0.391 + h) / ps;
        Fr_N = Math.Sqrt(T0 / T) * (9 + 280 * h * Math.Exp(-4.17 * (Math.Pow(T0 / T, 1f / 3) - 1))) / ps;
    }
}

This is rather self-explanatory. Since this model is only valid for specific conditions, changing temperature, humidity or pressure requires bounds checking. I implemented those members as c# properties because they are more convinient to use than standard “setX” and “getX” in such simple case. The updateRelaxFs() method is called each time one of the air properties is set and it’s job is to recalculate oxygen and nitrogen relaxaion frequencies. The getAbsCoeff() method returns an absorption coefficient for given acoustic frequency (and current atmospheric properties) in dB/m.

Next, let’s create a filter class, which we will use in our script to do actual audio filtering. What methods and members should this class have? Starting with the most important ones: an array of impulse response (preferably a private field) and a method (preferably public) that will create/update that impulse response. Such method must take two parameters: the distance to the sound source and desired filter length. In the code below I use an alogirthm from the previous post:

class AirAbsorbFilter
{
    public readonly AirModel m_AirModel;
    Complex[] m_ImpulseResponse;
    int m_SamplingRate;

    public void updateImpulseResponse(float distance, int N)
    {
        // design filter based on distance and current atmospheric properties
        // 'Frequency sampling' method

        Complex[] H = new Complex[N];
        float df = m_SamplingRate / N;

        // sample in (0,fs/2) range
        for (int i = 0; i < H.Length / 2 + 1; ++i)
        {
            float a = (float)m_AirModel.getAbsCoeff(df * i);
            H[i].Re = Mathf.Pow(10, -a * distance / 20);
        }

        //mirror DFT with respect to N/2+1 sample
        for (int i = H.Length / 2 + 1; i < H.Length; ++i)
            H[i] = H[N - i];

        //do IFFT to get impulse response
        FourierTransform.FFT(H, FourierTransform.Direction.Backward);

        //Aforge's FFT/IFFT is not normalized, divide by N
        for (int i = 0; i < H.Length; ++i)
            H[i] /= N;

        //impulse response
        m_ImpulseResponse = H;
        //shift by N/2
        Util.shiftArray<Complex>(m_ImpulseResponse, N / 2);

        //blackman window
        float[] blackman = Util.blackmanWindow(m_ImpulseResponse.Length);
        for (int i = 0; i < m_ImpulseResponse.Length; ++i)
            m_ImpulseResponse[i] *= blackman[i];

    }
}

This method designs a filter, create an impulse response and multiplies it by a blackman window. Parameter N is the filter’s length (filter’s order plus one). Impulse response is stored as m_ImpulseResponse member. The Complex data type and FourierTransform are part of the Aforge.NET library (link at the end). H is the array of sampled (desired) frequency response. At line 18, we get an absorption coefficient for the next sampled frequency and in the next line convert it to a linear scale. We sample only the real part of the spectrum! Parameter FourierTransform.Direction.Backward (line 27) means we perform the inverse Fourier Transform. Dividing the product of the ifft by N (line 31) is necessary if we want it to be consentient with the definition of normalized IDFT. To shift an array in place and apply a window i wrote functions that i put in a class called Util:

class Util
{
    public static void shiftArray<T>(T[] array, int N)
    {
        T[] temp = new T[array.Length];
        System.Array.Copy(array, temp, array.Length);
        for (int i = 0; i < array.Length; ++i)
            array[(i + N) % array.Length] = temp[i];
    }

    public static float[] blackmanWindow(int N)
    {

        int M = (N % 2 == 0) ? N / 2 : (N + 1) / 2;

        float[] win = new float[N];
        win[0] = win[N - 1] = 0f;
        for (int i = 1; i < M; ++i)
            win[i] = win[N - 1 - i] = 0.42f - 0.5f * Mathf.Cos(2 * Mathf.PI * i / (N - 1)) + 0.08f * Mathf.Cos(4 * Mathf.PI * i / (N - 1));

        return win;
    }
}

To generate a Blackman window i use this formula (the same one Matlab uses):
blackmanwhere N is the length of a window and M is equal to N/2 for N even and (N+1)/2 for N odd.

All right, so now that we have code that creates our filter’s impulse response based on a distance, let’s write the code that does the filtering (at last!) Impulse response is all we need to filter any signal. The equation for a FIR filter is:
firwhere y is the output (filtered) signal, x is the input signal, h is an impulse response and N is length of the impulse response (filter’s order plus one). We can see that “filtering” with a FIR filter is basically multiplying delayed samples of the input with filter coefficients (impulse response) and summing them. This equation is very similar to the equation defining a linear discrete convolution. Indeed, FIR filtering means to convolute an input with filter’s impulse response.

Convolution is a costly operation and in the form above its complexity is O(N*M) where N and M are lengths of x and h. This however can be very much reduced using the fact that the Fourier transform of convolution of two signals in time-domain is equal to multiplication of Fourier Transforms of those signals. To be more precise, convolution of x and h is equal to IDFT( DFT(x)*DFT(h) ) [1]. Using FFT/IFFT to compute DFT/IDFT the complexity is reduced to O( M*log(M) + N*log(N) + (M+N)*log(M+N) ). However, there are is an issue with with this approach: the result of [1] is a circular convolution, not linear. Circular convolution is a slightly different operation and simply using [1] would yield erroneous result (if used for filtering) , but there is a condition under circular and linear convolutions are equal and the solution is rather simple one:

Let Nx = length(x) and Nh = length(h). If we zero-pad both x and h to the length of at least Nx+Nh-1 and then perform circular convolution, the result will be equal to the linear convolution for the first Nx+Nh-1 elements.

Simple, isn’t it? To implement very efficient filtering we just need to add zeros to both audio signal and impulse response so that they have length of L >= Nx+Nh-1 , perform FFT on both , multiply, take IFFT and take only first L samples. Preferably, L should be equal to 2^p so that we take advantage of the fastest version of FFT. Many libraries can only perform FFT of length that is a power of 2 and Aforge.NET is one of them. The overhead caused by L being bigger that it’s necessary is negligible when compared with reduction in numbers of computations gained by Radix-2 FFT.

To implement the filter in our case there’s one last thing that needs to be done. We need to know how to filter in real-time, when the input signal is splitted into blocks/chunks. That is how the audio data that is being routed through OnAudioFilterRead function: basically, it is invoked (called) everytime a next chunk of audio is ready to be filtered and that chunk is passed as the data parameter. It’s not guranteed to be the same length everytime but it should consist of samples that come immediately after the previous samples in the audio file (otherwise there would be audible distortions) We can’t just filter every chunk independently: sum of convolutions is not equal to the convolution of the sum (sum in the sense of array merge). The problem is so called edge effect of convolution: there is a start-up transient of length Nh – 1 samples (Nh – length of impulse response) due to the lattency of the filter. If we were to simply perform a convolution in every call then this effect would occur for every chunk, resulting in a distortion in a continous output.

Eliminating this edge effect is possible by overlapping edge samples from each block, either input or output samples. Two , very well-known algorithms exist , named: overlap-save and overlap-add, that overlap input samples and output samples, respectively. In our case, we need to use the overlap-add method: for every block we need to store last K – Nx samples from output (result of the convolution) and add them to the output of the next block.

The code for our filter class with added methods and array that will store the samples between calls to the filter function:

class AirAbsorbFilter
{
    public readonly AirModel m_AirModel;
    public Complex[] m_ImpulseResponse;
    int m_SamplingRate;
    // two channels - stereo 
    float[][] m_buffers = new float[2][];

    public AirAbsorbFilter(int sampling_rate)
    {
        m_SamplingRate = sampling_rate;
        m_AirModel = new AirModel(293.15f, 50);
        for (int i = 0; i < m_buffers.Length; ++i)
            m_buffers[i] = new float[0];
    }

    public void updateImpulseResponse(float distance, int N)
    {
       // ...
    }

    public void filter(float[] data, int num_ch)
    {
        //deinterleave data for filtering
        float[][] channels = Util.deinterleaveData<float>(data, num_ch);
        //filter every channel
        for (int ch = 0; ch < channels.Length; ++ch)
            filterChannel(channels[ch], ch);
        //interleave and copy back
        Util.interleaveData<float>(channels, data);
    }

    void filterChannel(float[] data, int channel)
    {
        //convolution using OVERLAP-ADD

        // get length that arrays will be zero-padded to
        int K = Mathf.NextPowerOfTwo(data.Length + m_ImpulseResponse.Length - 1);

        //create temporary (zero padded to K) arrays
        Complex[] ir_pad = new Complex[K];
        System.Array.Copy(m_ImpulseResponse, ir_pad, m_ImpulseResponse.Length);
        Complex[] data_pad = new Complex[K];
        for (int i = 0; i < data.Length; ++i)
            data_pad[i].Re = data[i];

        //FFT 
        FourierTransform.FFT(data_pad, FourierTransform.Direction.Forward);
        FourierTransform.FFT(ir_pad, FourierTransform.Direction.Forward);
        //convolution
        Complex[] ifft = new Complex[K];
        for (int i = 0; i < ifft.Length; ++i)
            ifft[i] = data_pad[i] * ir_pad[i] * K;
        FourierTransform.FFT(ifft, FourierTransform.Direction.Backward);
        //add from buffer 
        for (int i = 0; i < data.Length; ++i)
        {
            data[i] = (float)ifft[i].Re;
            if (i < m_buffers[channel].Length)
                data[i] += m_buffers[channel][i];
        }
        //buffer last (K - data.length) samples
        m_buffers[channel] = new float[K - data.Length];
        for (int i = 0; i < m_buffers[channel].Length; ++i)
            m_buffers[channel][i] = (float)ifft[i + data.Length].Re;
    }
}

Array m_buffers[][] stores the samples to overlap for each channel. I assume 2 channels, for stereo, because that’s what unity uses for 3D audio sources. I also added the constructor which initializes inner buffer arrays. The necessary thing to do before filtering is to deinterleave audio data and filter each channel separately (it’s impossible to filter deinterleaved samples using fast convolution method based on DFT). I wrote two helper functions for interleaving and deinterleaving for the Util class:

public static T[][] deinterleaveData<T>(T[] data, int num_ch)
{
    T[][] deinterleaved = new T[num_ch][];
    int channel_length = data.Length / num_ch;
    for (int ch = 0; ch < num_ch; ++ch)
    {
        deinterleaved[ch] = new T[channel_length];
        for (int i = 0; i < channel_length; ++i)
            deinterleaved[ch][i] = data[i * num_ch];
    }

    return deinterleaved;
}

public static void interleaveData<T>(T[][] data_in, T[] data_out)
{
    int num_ch = data_in.Length;
    for (int i = 0; i < data_in[0].Length; ++i)
    {
        for (int ch = 0; ch < num_ch; ++ch)
            data_out[i * num_ch + ch] = data_in[ch][i];
    }
}

That would be all for the DSP part! (uff?) The only thing left to do is to integrate our filter class into the script in Unity:

using UnityEngine;
using System;
using AForge.Math;

[RequireComponent(typeof(AudioSource))]
public class SoundAirAbsorption : MonoBehaviour {

    public GameObject AudioListener;

    [Range(0, 50)]
    public float Temperature = 20f;
    [Range(0, 100)]
    public float Humidity = 50f;
    [Range(0, 2)]
    public float Pressure = 1f;
    
    AirAbsorbFilter air_filter;
    const float update_rate = 0.05f;
    const int filter_length = 2048;

	void Start () {
        air_filter = new AirAbsorbFilter(audio.clip.frequency);
        InvokeRepeating("updateFilter", 0, update_rate);
	}
	
	void Update () {

	}

    void OnAudioFilterRead(float[] data, int channels)
    {
        if (!ReferenceEquals(air_filter, null))
            air_filter.filter(data, channels);
    }

    void updateFilter()
    {
        float distance_to_source = UnityEngine.Vector3.Distance(AudioListener.transform.position, transform.position);
        air_filter.m_AirModel.Temperature = Temperature + 273.15f; //Celcius to Kelvin
        air_filter.m_AirModel.Humidity = Humidity;
        air_filter.m_AirModel.Pressure = Pressure;
        air_filter.updateImpulseResponse(distance_to_source, filter_length);
    }


}

The AudioListener member (line 8) stores the reference to the game object that will receive the sound: there is only one in a scene and that’s usually a main camera. We need that reference to calculate the distance between sound source and receiver. Fields Temperature, Humidity and Pressure are public so they can be directly manipulated from the Unity editor (Every public member of the class that inherits from MonoBehaviour can be controlled/assigned within the Unity editor). [Range()] is an attribute that is used for numeric fields and determines the range for the slider in Unity editor that is used to change the value of that field. update_rate (in seconds) determines how often distance to audio listener is updated and new impulse response calculated. We pass that value to InvokeRepeating function, which is a Unity-built way to repeatedly call some function in a given time interval. In this case, it’s the updateFilter function.

This is how our script looks inside the Unity editor:
Capture

This ends a two-part entry about implementing a digital filter simulating air absorption of sound. Hopefully , any of this will be of some help to somebody, cause i’ve sure had fun writing this. Thanks for reading!

—————————
LINKS:

Aforge.NET library: http://www.aforgenet.com/framework/

Fast convolution (overlap-add & save): http://inst.eecs.berkeley.edu/~ee123/sp14/docs/FastConv.pdf

A good entry on wiki about circular convolution (look the example): http://en.wikipedia.org/wiki/Circular_convolution