Introducing: a Musical Mel Transform

I’m open-sourcing a useful tool from our realtime audio AI toolbox here at VJLab: a musical mel transform.

It’s written in PyTorch and can be made ONNX-compatible with a convolutional FFT (with use_conv_fft=True).

If you’ve ever wanted audio features that directly represent semitones (or quarter tones!) this is the package for you.

Why have a mel transform centered on musical notes?

In general, the mel transform has the following benefits:

And what I’m calling a “musical” mel transform, where the mel bins are aligned to pitch centers, has additional advantages if:

Personally, I have found this MusicalMelTransform beats raw FFTs and standard mels for realtime use cases. The package also has an option learnable_weights="fft" that adds learnable parameters to reweight the incoming FFT bins for loudness, which is important in practice.

The default arguments convert the FFT magnitudes to power (power: int = 2) and then to a dB scale (to_db: bool = True) as well, which is common in audio AI frontend feature extraction.
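Conceptually, those two defaults just mean squaring the magnitudes and then taking a log. Here’s a minimal sketch of the equivalent torch ops (the epsilon and the 10·log10 convention are my assumptions, not necessarily the package’s exact constants):

import torch

# toy magnitude spectrum: (batch, n_fft // 2 + 1)
magnitudes = torch.rand(1, 1025)

power_spec = magnitudes ** 2                      # power: int = 2
db_spec = 10.0 * torch.log10(power_spec + 1e-10)  # to_db: bool = True (eps avoids log(0))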

TL;DR: if you’re working with music in your AI use case, then having features that map directly to musical notes can sometimes help with performance!

How does it work?

Mel scale is just a mapping of FFT bins -> new bins. So each mel bin is just a weighted sum of the linearly-spaced FFT bins. That’s it!
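In code, that’s nothing more than a matrix multiply of a fixed filterbank against the FFT magnitudes. A minimal sketch (the filterbank here is random just to show the shapes; the real one is built from the pitch-centered weightings described below):

import torch

n_fft_bins = 1025   # n_fft // 2 + 1 for a 2048-point FFT
n_mel_bins = 96

# each row of the filterbank holds one mel bin's weights over the FFT bins
filterbank = torch.rand(n_mel_bins, n_fft_bins)

fft_magnitudes = torch.rand(4, n_fft_bins)     # (batch, fft_bins)
mel_features = fft_magnitudes @ filterbank.T   # (batch, mel_bins)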

This code:

  1. Adds adaptive widening (with adaptive=True) that interpolates weighted combinations of FFT bins so pitches are discernible at pitch centers
  2. Gives a configurable way to control the number of high frequency features (with passthrough arguments)
  3. Provides an optional ONNX-compatible FFT operator (a quick export sketch follows this list)
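On point 3: with use_conv_fft=True the FFT is expressed with ops ONNX understands, so a standard torch.onnx.export should work. A rough sketch, assuming the transform takes a raw waveform tensor when it owns the FFT (check the repo for the exact signature):

import torch
# from musical_mel import MusicalMelTransform  # import path is an assumption

transform = MusicalMelTransform(use_conv_fft=True)  # other args left at defaults here
transform.eval()

dummy_waveform = torch.zeros(1, 44100)  # (batch, samples) -- input shape is an assumption

torch.onnx.export(
    transform,
    dummy_waveform,
    "musical_mel.onnx",
    input_names=["waveform"],
    output_names=["mel_features"],
)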

You can also narrow or widen your tone granularity, so semitone or quarter-tone resolution is just a parameter change:

# `interval` is the "number of semitones"
chromatic_transform = MusicalMelTransform(interval=1.0)    # semitone scale
quarter_tone_transform = MusicalMelTransform(interval=0.5) # quarter tone scale

How does it compare to other options?

Here’s a quick comparison between:

  1. Traditional linearly-spaced FFT
  2. torchaudio mel scale transform
  3. MusicalMelTransform (this repo)

I have constrained the two mel transforms (2 & 3) to have the same dimensionality, with f_max set to 16 kHz, to make the comparison fair:

As you can see, especially in the low frequencies, the resolution of MusicalMelTransform is better. This is great for music, and especially for low-frequency heavy music like today’s pop and electronic music. The graph here shows a kick pattern, typical in house or techno music.
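For reference, a setup along these lines would reproduce that comparison. torchaudio’s MelScale is a real transform, but the MusicalMelTransform kwargs beyond interval (and the n_bins attribute) are my guesses at the repo’s API:

import torchaudio

SAMPLE_RATE = 44100
N_FFT = 2048
F_MAX = 16000.0

# musical transform -- sample_rate / n_fft / f_max kwarg names are assumptions
musical = MusicalMelTransform(sample_rate=SAMPLE_RATE, n_fft=N_FFT, f_max=F_MAX, interval=1.0)
n_bins = musical.n_bins  # attribute name is an assumption

# standard torchaudio mel filterbank, constrained to the same output dimensionality
mel_scale = torchaudio.transforms.MelScale(
    n_mels=n_bins, sample_rate=SAMPLE_RATE, f_max=F_MAX, n_stft=N_FFT // 2 + 1
)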

If we pick a number of low-end sub notes and plot the corresponding “filters” from the MusicalMelTransform you can see how this works more concretely:

Low notes are impossibly close to each other, especially under 100 Hz, but that’s life (unless you can stomach the speed of a CQT transform). This package tries to cleverly interpolate FFT bins to mel pitch-center bins so that lower frequencies are “discernible” from each other. But keep in mind we only have what the humble FFT offers us! We are just interpolating.

Contrast this with a normal FFT. The FFT spaces features linearly, so at the top end of the frequency range we end up with many, many features that aren’t as musically relevant.

To illustrate, let’s compare the resulting features across several musically relevant frequency ranges so we can see how the transforms vary:

As you can see:

⚠️ Warning of non-magic ⚠️

It’s important to remember all mel features are derivative of the FFT. If you’re working with a small FFT size, like 128 or whatever, this package won’t work miracles!

Your resolution on the low end will still be crap.

I wouldn’t use this package below an FFT size of 512, tbh. But by cleverly assigning and interpolating the FFT bins you do have, this package is a way to “stretch” your resolution to make discrimination on the low end easier.
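To put numbers on that: at 44.1 kHz, a 128-point FFT gives bins roughly 345 Hz apart, a 512-point FFT about 86 Hz, and 2048 about 21.5 Hz. Even 86 Hz is far wider than the spacing between low notes, which is why small FFTs are a lost cause down there. A quick back-of-envelope check:

SAMPLE_RATE = 44100

for n_fft in (128, 512, 2048):
    bin_spacing_hz = SAMPLE_RATE / n_fft
    print(f"n_fft={n_fft} -> bins every {bin_spacing_hz:.1f} Hz")

# n_fft=128 -> bins every 344.5 Hz
# n_fft=512 -> bins every 86.1 Hz
# n_fft=2048 -> bins every 21.5 Hz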

The main benefit is simply that all the features you have are, by definition, musically relevant.

Characteristics of mel transforms, and some helpful tweaks to make

Here are some plots of mel bins (the x-axis dots + colored lines) as composed of FFT bin centers (the vertical grey lines) as we move up in frequency. We’ll talk through some implications.

If we zoom in to the first (very lowest) filters on MusicalMelTransform at a 2048 FFT size and 44.1 kHz, you can see how related the lowest filters are. Because the FFT bins themselves are ~21.5 Hz apart, the mel bins below are just sliiightly different linear combinations of 2-3 low bins:
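Concretely: the bins sit about 21.5 Hz apart, while neighbouring semitones around A1 (55 Hz) are only ~3 Hz apart, so several adjacent low notes have to be built from the same two or three FFT bins. A quick check of those numbers:

SAMPLE_RATE = 44100
N_FFT = 2048

bin_spacing_hz = SAMPLE_RATE / N_FFT       # ~21.5 Hz between FFT bins
a1_hz = 55.0                               # A1
a_sharp1_hz = a1_hz * 2 ** (1 / 12)        # one semitone up, ~58.3 Hz

print(bin_spacing_hz)                      # ~21.5 Hz
print(a_sharp1_hz - a1_hz)                 # ~3.3 Hz between adjacent low semitones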

The situation, of course, gets much better as we move up in frequency to even the 400-800 Hz range:

And just as with any mel scale, once we get up to the really high frequencies (8th octave), the mels:

  1. Span multiple bins
  2. Ignore bins halfway between mel (pitch) centers

For reference, the top note on an 88-key piano is C8 – these frequencies are all above that! (unless you have a Bösendorfer)

These mostly-ignored bins between filters are usually fine, since at frequencies this high we are generally hearing harmonics, which cluster near each other at harmonic intervals. So throwing out much of the contribution of a few bins matters less.

But as the frequencies climb, the gaps get larger. And if some of that information is important (or you’d rather just pick an arbitrary point above which to keep higher resolution than mels!), you can use MusicalMelTransform’s passthrough_cutoff_hz argument.

Here I show what happens using passthrough_cutoff_hz=5000 and passthrough_grouping_size=3. This effectively means, “after 5 kHz, don’t compute mel bins, just pass through the original FFT bins, grouping every 3 bins together”. This is the result:

Here you can see that after 5 kHz, we simply start grouping every next three FFT bins into a mel bin. While it depends on your cutoff, generally the higher you set passthrough_cutoff_hz, the larger your passthrough_grouping_size should be.

And of course these passthrough bins are no longer directly centered on musical notes.
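As a sketch, that configuration would look something like this (the passthrough argument names come from above; the sample_rate and n_fft kwargs are my assumption about the constructor):

transform = MusicalMelTransform(
    sample_rate=44100,            # kwarg name is an assumption
    n_fft=2048,                   # kwarg name is an assumption
    passthrough_cutoff_hz=5000,   # below 5 kHz: pitch-centered mel bins
    passthrough_grouping_size=3,  # above 5 kHz: group every 3 raw FFT bins
)
# at these 2048 / 44.1 kHz settings, grouping 3 bins gives features ~65 Hz wide above the cutoff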

Scaling & normalization

You will also notice that the weights on each FFT bin going into the mel bins get much smaller than 1.0 as we climb in frequency. This is because pitches are spread across many more bins at high frequencies, and the plots have the norm=True parameter set, which normalizes each filter to a total weight of 1.

Due to all this rescaling, I suggest using learnable_weights="fft", as this inserts a vector of learnable parameters that lets you scale the original FFT magnitudes (or power, depending on your setting for power) for your use case. You probably want norm=False in this case.

Otherwise the MusicalMelTransform has no learnable weights.
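In practice that just means the transform carries one learnable vector, one entry per FFT bin, that your optimizer trains alongside the rest of the network. A rough sketch of how I’d wire it in (the downstream layers and the assumption that the transform composes cleanly in a Sequential are mine):

import torch

model = torch.nn.Sequential(
    MusicalMelTransform(learnable_weights="fft", norm=False),  # learnable per-FFT-bin scaling
    torch.nn.Flatten(),
    torch.nn.LazyLinear(8),  # stand-in downstream network
)

# the learnable FFT weights show up as ordinary parameters,
# so they get trained with everything else
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)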

Don’t ignore the bitter lesson

At some point we should be careful – the temptation to ignore The Bitter Lesson by constantly tweaking f_max, passthrough_cutoff_hz, passthrough_grouping_size, norm, etc. on your transform to make your network perform better is a real one.

At some point we just need the information to flow through to a reasonable network that will learn from it.

While I do think the Bitter Lesson applies less in realtime or resource-constrained scenarios, do think your architecture and data through before spending your days tweaking your mel transform settings.

The gainz you seek are in the former, not the latter.

Summary

Again, to reiterate: a mel transform is not magic! It is a series of linear combinations on the original FFT bins.

But if you’re clever about it, it really does help!

Check out the repo here, make a PR, and open an issue if you spot a problem!

About VJLab.AI

Our realtime stem splitter feeding into GLSL shaders in TouchDesigner

If you’re curious to learn more about what kinds of things we’re doing at VJLab.AI with all this stuff, check out:

To stay up to date with what we’re doing:

Our next generation of realtime audio models for visual artists and live performers is coming soon :)