More thoughts on "neural audio" analysis
First off, I’d like to reference my previous post, which mentions a paper by Pierre-Antoine Manzagol, Thierry Bertin-Mahieux, and Douglas Eck titled “On the Use of Sparse Time-Relative Auditory Codes for Music”.
I already talked about why I like the approach, and now I’d like to draw a semantic distinction. The ISMIR paper builds on a 2005 Nature paper, which describes how the low-level auditory system is “tuned” to pick up speech-related sounds with very high efficiency using gammatone-like wavelet filters. However, labeling the technique “sparse” in general is a bit misleading, because the relative “sparsity” of the signal representation depends on the gammatone basis functions, the encoding routine, and the characteristics of the signal itself. So, respectively: these techniques use biologically observed gammatone-like wavelets, encoded minimally (sparsely) with a matching pursuit algorithm, applied to naturally occurring sounds.
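For concreteness, here is a minimal sketch of the greedy matching pursuit step, assuming you already have a matrix `D` whose columns are unit-norm atoms (time-shifted gammatones, Gabors, whatever). The function name and interface are my own illustration, not the paper’s implementation:

```python
import numpy as np

def matching_pursuit(x, D, n_atoms=10):
    """Greedily approximate x as a sum of a few dictionary atoms.

    x: signal of length N
    D: (N, n_dict) matrix whose columns are unit-norm atoms
    Returns a list of (atom_index, amplitude) "spikes" and the residual.
    """
    residual = x.astype(float).copy()
    code = []
    for _ in range(n_atoms):
        corr = D.T @ residual              # project residual onto every atom
        i = int(np.argmax(np.abs(corr)))   # best-matching atom
        a = corr[i]
        residual -= a * D[:, i]            # strip that component out
        code.append((i, a))
    return code, residual
```

The sparsity of the resulting code is exactly what the post is getting at: it depends on how well the atoms in `D` match the structure of `x`, not on the algorithm alone.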
To characterize their technique more plainly, it’s appropriate to call it a form of neural audio analysis. While the standard matching pursuit encoding method doesn’t completely emulate the mechanism of a neural response (partly because it is usually performed as an offline process), it comes far closer to mimicking what our brain is doing than other techniques such as the short-time Fourier transform.
Furthermore, neural audio analysis differs from other, more general wavelet analyses (using generic Gabor atoms, etc.) in that it uses a predefined, limited, neurologically derived dictionary of wavelet functions. For better or for worse, these chirps and blips are the building blocks of how we perceive sound. They may perform very efficiently for speech, but they are probably lacking (non-sparse) for many other types of signals.
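To show what one atom of such a dictionary looks like, here is a rough gammatone wavelet generator. The ERB bandwidth formula is the standard Glasberg & Moore estimate; the function name and default parameters are my own choices for illustration, not values from the papers discussed here:

```python
import numpy as np

def gammatone_atom(center_freq, sr=16000, duration=0.032, order=4):
    """A unit-norm gammatone wavelet: t^(n-1) * exp(-2*pi*b*t) * cos(2*pi*f*t).

    center_freq: center frequency in Hz
    sr: sample rate in Hz; duration: atom length in seconds
    """
    t = np.arange(int(sr * duration)) / sr
    erb = 24.7 * (4.37 * center_freq / 1000.0 + 1.0)  # Glasberg & Moore ERB
    b = 1.019 * erb                                   # bandwidth scaling
    g = t**(order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * center_freq * t)
    return g / np.linalg.norm(g)

# A toy dictionary: one atom per center frequency, stacked as columns.
freqs = [200, 500, 1000, 2000, 4000]
D = np.stack([gammatone_atom(f) for f in freqs], axis=1)
```

Each column of `D` is a chirp-like burst; a real time-relative code would also include every time shift of each atom, which is where the dictionary gets large.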
How did we get these wavelet filters? The hard way! De Boer et al. literally had to wire a cat’s brain into a signal-processing loop in order to extract the individual neural responses to audio stimuli. Cats were chosen because they have auditory faculties very similar to humans’. Luckily, the Boston Ear Lab provides a database of these filters so that other researchers don’t have to repeat this same experiment on the poor cats anymore.
The point is, using these neurological wavelet filters is important, even if they are inefficient ways of representing certain signals. They come closest to representing how our brains respond to audio stimuli, and therefore provide better building blocks for forming higher-level representations of sound. I think that even though the underlying techniques are not new, the potential of “neural audio analysis” is still fairly untapped.
As a side note, I’m glad that Manzagol et al. wrote that paper; it’s finally putting my “Mind” category to use. 🙂
All plots taken from either De Boer & De Jongh 1978 or Smith & Lewicki 2004, 2005.