Frustrated with Fourier
Lately I’ve been working towards understanding music from a signal processing perspective. Music, like any sound, can be broken down into its component frequencies using Fourier transforms. The usual process is to split the signal into small blocks of sound, and then calculate the spectrum of each block. The results can be visualized as a spectrogram, which gives a picture of what a certain piece of audio sounds like over time.
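The block-by-block process above can be sketched in a few lines of numpy. This is just a bare-bones illustration (real spectrogram code would usually apply a window function to each block first — more on that below):

```python
import numpy as np

def spectrogram(signal, block_size=1024, hop=512):
    """Split a signal into overlapping blocks and take the
    magnitude spectrum of each one."""
    frames = []
    for start in range(0, len(signal) - block_size + 1, hop):
        block = signal[start:start + block_size]
        # rfft gives the one-sided spectrum of a real-valued block
        frames.append(np.abs(np.fft.rfft(block)))
    return np.array(frames)  # shape: (num_frames, block_size // 2 + 1)

# One second of a 440 Hz tone sampled at 44.1 kHz
sr = 44100
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
spec = spectrogram(tone)
```

Each row of `spec` is one spectral frame; plotting the rows over time gives the spectrogram.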
I’m trying to use the results of this process to drive a new kind of visualization (actually, it’s probably been done before, but I haven’t found many good resources on it). Essentially, the algorithm compares the spectral frames of the sound against one another. It then tries to resolve the differences between the frames by positioning all of them in a lower dimensional space.
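To make the idea concrete, here’s a minimal sketch of one way to do such an embedding, using plain PCA via an SVD. This is an assumption on my part — just one of several standard dimensionality-reduction techniques, not necessarily the algorithm I’m actually using:

```python
import numpy as np

def embed_frames(frames, dims=2):
    """Project spectral frames into a low-dimensional space with PCA,
    so frames with similar spectra land near each other."""
    centered = frames - frames.mean(axis=0)
    # SVD of the centered frame matrix; U * S gives the PCA coordinates
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :dims] * s[:dims]

# Two distinct "spectra" repeated with a little noise
rng = np.random.default_rng(0)
a, b = np.zeros(20), np.zeros(20)
a[3], b[15] = 1.0, 1.0
frames = np.array([a] * 5 + [b] * 5) + rng.normal(0, 0.01, (10, 20))
coords = embed_frames(frames)
```

Frames with similar spectra end up close together in the 2-D map, while dissimilar frames end up far apart — which is exactly the property the visualization relies on.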
The process lets you see the main “features” of the signal. For instance, check out this movie I made of Chris Raphael playing his oboe. You can clearly see the current frame in red as it moves through the lower dimensional space generated by the algorithm. You can also see the relationships between the notes and volume on the map.
You need to listen to the song to figure out what each part of the map represents, but on another level… this isn’t necessary. Songs can be easily transposed, slowed down, sped up, etc., but they’re still the same song. The mapping algorithm does not represent information tied to absolute frequency or time. So, theoretically, you could generate an embedding map that would be pretty much the same for almost any time- or pitch-based variation of a song. This would be nice for correlating songs… but it’s a woefully small step towards establishing a valid correlation benchmark.
Music is a lot more complicated than that, though. As you change the pitch of a song, the frequency characteristics of the instruments will change… sometimes pretty drastically. Brass players will have to blow harder to hit higher notes, which introduces more noise into their corresponding signal. The pitch-shifted song may include or exclude resonant frequencies of certain instruments (ranges where the instrument is naturally louder). Also, percussion causes its own problems: it isn’t pitched, but it generates an enormous amount of noise across a wide range of frequencies… obscuring the signals of the pitched instruments. So… it’s clearly not as simple as I stated in the previous paragraph.
Furthermore, even for simple synthesized tones, the spectral representations are not completely correct. You’ll constantly have what’s called “leakage” in your representation. Basically, this means that even a perfectly pitched tone (a pure sine wave) won’t get resolved entirely into the proper frequency bin of the spectral representation; some of its energy smears into the neighboring bins. This sort of stuff gets technical really quickly, so if you’re interested, check out this page for more details.
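Leakage is easy to demonstrate: a sine wave whose frequency falls exactly on a bin resolves cleanly, while one that falls between two bins smears its energy across many of them. A small experiment (with the sample rate chosen so each bin is exactly 1 Hz wide, purely for convenience):

```python
import numpy as np

n = 1024
sr = 1024.0  # 1 Hz per bin, so bin frequencies are whole numbers
t = np.arange(n) / sr

# On-bin tone: exactly 10 cycles fit in the block -> one clean bin
on_bin = np.abs(np.fft.rfft(np.sin(2 * np.pi * 10 * t)))
# Off-bin tone: 10.5 cycles -> energy leaks into neighboring bins
off_bin = np.abs(np.fft.rfft(np.sin(2 * np.pi * 10.5 * t)))

def leakage(spec):
    """Fraction of spectral energy outside the single strongest bin."""
    energy = spec ** 2
    return 1 - energy.max() / energy.sum()
```

For the on-bin tone the leakage is essentially zero; for the half-bin-offset tone, more than half the energy ends up outside the peak bin.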
The movie shows four separate representations at each point in time. Going clockwise, they are: the low dimensional chroma map of the sound, the spectral representation of the sound at the given frame, the MIDI representation of the given sound (basically a compact version of the spectral representation), and a “chroma” representation of the sound.
The chroma representation essentially encodes the frequency information as a “key” (pitch class) and a “height” (octave). So A’ and A” are identical in chroma space except for the height parameter. The map I generate ignores this height parameter… what I wanted to do was show the frequency range of the test tone “wrap around” in a circle… a pretty simple test that should show how the frequencies of the sound relate to each other as a series of musical “keys”.
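The key-and-height idea can be sketched as a simple folding of the spectrum: map each frequency bin to its nearest semitone, then wrap modulo 12 so the octave (“height”) disappears. This is a rough sketch of the standard chroma-folding trick, not my actual implementation (in particular, it crudely rounds each bin to one pitch class):

```python
import numpy as np

def chroma(spectrum, sr, n_fft, ref=440.0):
    """Fold a magnitude spectrum into 12 pitch classes ("keys"),
    discarding the octave ("height") information."""
    out = np.zeros(12)
    freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
    for f, mag in zip(freqs[1:], spectrum[1:]):  # skip the DC bin
        # semitones above A440, wrapped around to a single octave
        out[int(round(12 * np.log2(f / ref))) % 12] += mag
    return out

# A 440 Hz peak and an 880 Hz peak fold onto the same pitch class
sr, n_fft = 44100, 4410          # 10 Hz per bin
spec = np.zeros(n_fft // 2 + 1)
spec[44] = spec[88] = 1.0        # bins for 440 Hz and 880 Hz
c = chroma(spec, sr, n_fft)
```

The two A’s, an octave apart, land in the same chroma bin — which is exactly why a transposed song looks so similar in this representation.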
However, the representation isn’t a clean circle, as you can clearly see. There are several sharp spikes present in the map. These are caused by the frequency ranges that have less leakage in their spectral representation. Because they resolve so cleanly, they show up as very different from the rest of the frequency ranges, which all have some leakage. (We’re not talking a big difference overall, but it is a big difference relatively speaking.)
I’m trying to tackle this problem… I’d like to come up with a way of smoothing out the spectral representations of sounds so that all frequencies are treated equally… hopefully I can work this out over the semester.
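One crude idea along these lines — purely a sketch of something I might try, not a worked-out solution — would be to blur the spectrum slightly across frequency bins, so that a bin that happens to resolve perfectly gets spread out like its leakier neighbors:

```python
import numpy as np

def smooth_spectrum(spec, width=3):
    """Moving-average blur across frequency bins, so that cleanly
    resolved peaks are spread out like their leakier neighbors."""
    kernel = np.ones(width) / width
    return np.convolve(spec, kernel, mode='same')

# A perfectly resolved peak gets spread across its neighbors
spec = np.zeros(10)
spec[5] = 3.0
smoothed = smooth_spectrum(spec)
```

The total energy is preserved for interior peaks, but no single bin stands out the way a perfectly resolved one does — which is roughly the “treat all frequencies equally” property I’m after.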