ISMIR: Past, Present, and Future
I’m a newcomer to ISMIR and a lot of the music information retrieval scene. However, I’m more or less aware of most of the common approaches. Currently, music is characterized in at least three ways in Music Information Retrieval (MIR):
- As symbolic notation (A composer’s score, or digital MIDI file)
- As waveform data (MP3’s, etc.)
- As a social/cultural phenomenon (Holiday music, pop music, playlists/streams)
The first form of analysis is perhaps the easiest: if we have symbolic representations of notes, events, and other temporal data, then we can do an enormous amount of music-related retrieval. However, not all music is available in this form, and some is impractical or impossible to transcribe into it.
The second form of analysis forms the bulk of the research in MIR. However, most of these techniques rely on the Short-Time Fourier Transform (STFT). The STFT is one of the most important and useful tools for general computation that digital signal processing has ever produced.
People… very bright people… have dedicated a good part of their lives to optimizing and improving these techniques. The method and manner of the transforms are highly tuned and well understood. However, as the plot above shows, an STFT can never offer fine time and fine frequency resolution at once. STFTs work by cutting a signal into segments, then transforming each segment into a set of energies over a range of frequencies.
The problem is that all the frequency content must be extracted at once from a segment. Short segments limit the ability to extract low-frequency information, while long segments “smear” the activity in high-frequency bands across the segment duration.
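The trade-off falls out of simple arithmetic: an STFT segment of N samples at sample rate fs has frequency bins spaced fs / N apart. A small sketch (the sample rate and window lengths are illustrative choices, not from any particular system) makes the tension concrete:

```python
# Illustrative sketch of the STFT time/frequency trade-off.
# Assumed sample rate; any value shows the same effect.
fs = 8000  # Hz

def bin_spacing(window_len):
    """Frequency resolution (Hz) of an STFT segment of window_len samples."""
    return fs / window_len

# A 64-sample window covers only 8 ms of signal, but its bins are
# 125 Hz apart -- low bass notes a semitone apart blur together.
short = bin_spacing(64)    # 125.0 Hz

# A 4096-sample window resolves ~2 Hz, but each frame now "smears"
# half a second of signal, blurring fast high-frequency events.
long_ = bin_spacing(4096)  # 1.953125 Hz

print(short, long_)
```

No window length wins on both axes: shrinking the segment sharpens timing while coarsening the frequency grid, and vice versa.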
On top of the timing issue, the STFT output is not really a good representation of what we “hear”. We are much more sensitive to changes in a limited frequency range. This sensitivity has been modeled with the Mel scale, which underlies the widely used Mel-Frequency Cepstral Coefficients (MFCCs). However, there is still a large divide between these representations and the nuances of musical perception.
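The Mel scale warps frequency to approximate perceived pitch distance. A minimal sketch using one common formulation of the Hz-to-mel mapping (the 2595 · log₁₀(1 + f/700) formula; other variants exist) shows the perceptual compression at high frequencies:

```python
import math

def hz_to_mel(f):
    """Convert frequency in Hz to mels (one common formulation)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

# Equal steps in Hz are not equal steps in mel: the same 500 Hz jump
# spans a large perceptual distance at the low end of the spectrum
# and a small one at the high end.
low_gap = hz_to_mel(1000) - hz_to_mel(500)
high_gap = hz_to_mel(8000) - hz_to_mel(7500)
print(low_gap, high_gap)
```

MFCC pipelines exploit exactly this warp, pooling STFT energies into mel-spaced bands before the cepstral step.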
The (main) point of MIR is to characterize music the way that we hear it. Unfortunately, our brains don’t use STFT, so it’s a struggle to start from STFT and work towards a model of human hearing and cognition.
However, promising work has been done in the realm of speech and hearing. By analyzing the coherent cochlear activity of the inner ear, we’ve found that certain neurons “encode” brief patterns of sound. In essence, our minds have a consistent “dictionary” of these little patterns of sound (wavelets), and they use them to represent every sound that passes into the ears.
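One standard way to compute with such a dictionary is greedy sparse decomposition (matching pursuit): repeatedly pick the atom that best explains what remains of the signal. The toy sketch below uses random unit-norm atoms as stand-ins for the learned sound kernels; the dictionary, sizes, and signal are all illustrative assumptions, not a cochlear model.

```python
import numpy as np

def matching_pursuit(signal, dictionary, n_atoms):
    """Greedily represent `signal` as a sparse sum of dictionary atoms.

    `dictionary` holds one unit-norm atom per row. Returns the chosen
    (atom index, coefficient) pairs and the final residual.
    """
    residual = signal.astype(float).copy()
    code = []
    for _ in range(n_atoms):
        # Project the residual onto every atom; keep the strongest match.
        scores = dictionary @ residual
        k = int(np.argmax(np.abs(scores)))
        code.append((k, scores[k]))
        residual = residual - scores[k] * dictionary[k]
    return code, residual

# Toy dictionary of 16 random unit-norm atoms in a 64-sample space,
# and a signal built from just two of them.
rng = np.random.default_rng(0)
D = rng.standard_normal((16, 64))
D /= np.linalg.norm(D, axis=1, keepdims=True)
x = 3.0 * D[2] + 1.5 * D[9]

code, res = matching_pursuit(x, D, n_atoms=4)
print([k for k, _ in code], np.linalg.norm(res))
```

The point of the analogy: if the auditory system holds a fixed dictionary of brief sound patterns, then hearing a sound amounts to finding a sparse code like the one above, and a good dictionary makes that code short.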
Recent work by Smith and Lewicki shows fascinating evidence that human speech is optimally/efficiently encoded by these neurons. This couldn’t have happened by accident, and it suggests that we (as a culture at large) have fashioned our various spoken languages to take advantage of this neural activity.
The question is then, have we as a culture done something similar with music? I had been interested in digging into this topic for a long time, so I’m glad that Pierre-Antoine Manzagol, Thierry Bertin-Mahieux and Douglas Eck have already done some initial analysis, and received a best paper award for their efforts.
So, as a newcomer to the conference, it’s interesting to see “three waves” converging in Philadelphia: The past, (which includes conventional acoustic analysis and has now become state of the art in industry), the present (which includes the wide array of social analyses that are finding their feet), and the future (the sparse/neural encoding methods that appear to have tremendous potential).