<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>regulatory-landscape on Will Drevo</title><link>https://willdrevo.com/tags/regulatory-landscape/</link><description>Recent content in regulatory-landscape on Will Drevo</description><generator>Hugo -- gohugo.io</generator><language>en-us</language><lastBuildDate>Tue, 14 Apr 2026 00:00:00 -0500</lastBuildDate><atom:link href="https://willdrevo.com/tags/regulatory-landscape/index.xml" rel="self" type="application/rss+xml"/><item><title>Ibogaine: the unofficial resources list</title><link>https://willdrevo.com/2026/04/14/ibogaine-the-unofficial-resources-list/</link><pubDate>Tue, 14 Apr 2026 00:00:00 -0500</pubDate><guid>https://willdrevo.com/2026/04/14/ibogaine-the-unofficial-resources-list/</guid><description>&lt;p>A compilation of resources about ibogaine.&lt;/p>
&lt;p>If you are seeking treatment or trying to find a facilitator or clinic, please reach out to me directly! I am more than happy to help &amp;amp; share.&lt;/p>
&lt;p>And if I am missing a resource in any category, please reach out as well. Happy to add it!&lt;/p>
&lt;p>I have absolutely zero financial or vested interest in any of these resources: books, clinics, articles, protocols, etc.&lt;/p>
&lt;aside id="toc">
&lt;h4>Table of Contents&lt;/h4>
&lt;nav id="TableOfContents">
&lt;ul>
&lt;li>&lt;a href="#resources">Resources&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#podcasts">Podcasts&lt;/a>&lt;/li>
&lt;li>&lt;a href="#medical-research--pubmed-papers">Medical research / PubMed papers&lt;/a>&lt;/li>
&lt;li>&lt;a href="#clinics">Clinics&lt;/a>&lt;/li>
&lt;li>&lt;a href="#books">Books&lt;/a>&lt;/li>
&lt;li>&lt;a href="#articles">Articles&lt;/a>&lt;/li>
&lt;li>&lt;a href="#documentaries">Documentaries&lt;/a>&lt;/li>
&lt;li>&lt;a href="#academic--policy">Academic &amp;amp; policy&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/nav>
&lt;/aside>
&lt;!-- ## What is ibogaine?
Ibogaine has been known for hundreds of years, if not more.
But today, this psychedelic root extract from Africa is the darling of a most unlikely alliance:
- **The entheogenic community**: practitioners who have been using ibogaine / iboga for decades (much longer in Africa) for spiritual growth and inner wisdom
- **The conservative right in the USA**: led by [Rick Perry]() and [Bryan Hubbard](), who seek to use it as a treatment for veterans and war heroes suffering from PTSD
- **Researchers &amp; pharmaceutical companies**: researchers from Stanford and pharmaceutical startups in Silicon Valley, seeking to commercialize this seemingly miraculous drug
## What does it do?
Ingesting ibogaine delivers a uniquely long and difficult psychedelic experience that has in numerous studies been shown to:
- Completely interrupt substance dependencies for everything from opioids, to alcohol, to nicotine
- Reduce the symptoms of TBI and PTSD; it has been employed especially among veterans seeking relief
Ibogaine induces a long oneiric (waking dreamlike) state in which many people describe reviewing life experiences and an uncanny ability to remember, relive, and heal traumatic memories from childhood, long since lost or repressed.
## Why isn't it more widely used?
The main downside is that ibogaine is not pleasant to experience. The "trip" itself lasts 24-72 hours (depending on your definition).
A full (flood) dose of ibogaine regularly, but not always, induces:
- Cardiac toxicity (raises heart rate, lowers blood pressure, sometimes dangerously so)
- Ataxia (lack of coordination, dizziness)
- Nausea
There is infinitely more to say about this incredible treatment, but I will leave that to a later post.
This post simply tracks the resources I have found on various aspects of ibogaine, for all those curious to learn more. -->
&lt;h2 id="resources">Resources&lt;/h2>
&lt;p>If something is &lt;span class="star-rating" aria-label="0 out of 5 stars">&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;/span>
, it means I haven&amp;rsquo;t read/listened/consumed it yet, but I continually update this list as I find new sources and read them!&lt;/p>
&lt;h3 id="podcasts">Podcasts&lt;/h3>
&lt;ul>
&lt;li>&lt;span class="star-rating" aria-label="5 out of 5 stars">&lt;span class="star filled">★&lt;/span>&lt;span class="star filled">★&lt;/span>&lt;span class="star filled">★&lt;/span>&lt;span class="star filled">★&lt;/span>&lt;span class="star filled">★&lt;/span>&lt;/span>
&lt;a href="https://open.spotify.com/episode/2aBRLTKYpUo5jfYd68LTg4?si=60bfb8cf3e7a445d">Rick Perry &amp;amp; Bryan Hubbard on Joe Rogan: #2477&lt;/a> (2026)
&lt;ul>
&lt;li>Even if you don&amp;rsquo;t like Joe Rogan, this is an incredible introduction&lt;/li>
&lt;li>This is usually the first content I recommend to anyone wanting to learn more&lt;/li>
&lt;li>Rick Perry and Bryan Hubbard run the &lt;a href="https://www.americansforibogaine.org/">Americans for Ibogaine&lt;/a> organization&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;span class="star-rating" aria-label="5 out of 5 stars">&lt;span class="star filled">★&lt;/span>&lt;span class="star filled">★&lt;/span>&lt;span class="star filled">★&lt;/span>&lt;span class="star filled">★&lt;/span>&lt;span class="star filled">★&lt;/span>&lt;/span>
&lt;a href="https://open.spotify.com/episode/7bYLDLiUN3OeHpJKnABByv?si=59c7a66430bf4cc6">Rick Perry &amp;amp; Bryan Hubbard on Joe Rogan: #2251&lt;/a>
&lt;ul>
&lt;li>Similar to the above, but the first episode. I recommend both, but the 2026 one is obviously more up to date with the progress made in 2025-2026&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;span class="star-rating" aria-label="5 out of 5 stars">&lt;span class="star filled">★&lt;/span>&lt;span class="star filled">★&lt;/span>&lt;span class="star filled">★&lt;/span>&lt;span class="star filled">★&lt;/span>&lt;span class="star filled">★&lt;/span>&lt;/span>
&lt;a href="https://www.economist.com/podcasts/2026/03/28/the-red-state-psychedelic">The Economist: The Red State Psychedelic&lt;/a> (2026)
&lt;ul>
&lt;li>A fantastic dive into the unconventional background of Bryan Hubbard and the first signs that the religious right might accept psychedelics as a form of care&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;span class="star-rating" aria-label="0 out of 5 stars">&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;/span>
&lt;a href="https://open.spotify.com/episode/1mmBgNCjEyv3OEFcTPNCjl">One Reporter&amp;rsquo;s Life-Altering Psychadelic Trip&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="medical-research--pubmed-papers">Medical research / PubMed papers&lt;/h3>
&lt;p>Coming soon.&lt;/p>
&lt;h3 id="clinics">Clinics&lt;/h3>
&lt;p>If you contact me I am happy to share more, but I won&amp;rsquo;t post &amp;ldquo;ratings&amp;rdquo; in this section until I gather more information from primary sources.&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://innerrealmscenter.com/">Inner Realms&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://www.ambio.life/">Ambio&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://beondibogaine.com/">Beond&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://www.ibogaquest.com/">IbogaQuest&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="books">Books&lt;/h3>
&lt;ul>
&lt;li>&lt;span class="star-rating" aria-label="4 out of 5 stars">&lt;span class="star filled">★&lt;/span>&lt;span class="star filled">★&lt;/span>&lt;span class="star filled">★&lt;/span>&lt;span class="star filled">★&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;/span>
&lt;a href="https://www.amazon.com/Iboga-Root-Healing-Daniel-Brett/dp/1838446214">Iboga: the Root of All Healing&lt;/a>
&lt;ul>
&lt;li>Good overview of the history and present-day usage of ibogaine. Covers objective facts alongside a survey of subjective experiences during flood doses themselves.&lt;/li>
&lt;li>Great introductory book if you&amp;rsquo;re more of a book person than podcasts or articles&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;span class="star-rating" aria-label="0 out of 5 stars">&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;/span>
&lt;a href="https://www.amazon.ie/Ibogaine-Story-Report-Staten-Project/dp/1570270295">The Ibogaine Story: Report on the Staten Island Project&lt;/a>&lt;/li>
&lt;li>&lt;span class="star-rating" aria-label="0 out of 5 stars">&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;/span>
&lt;a href="https://www.simonandschuster.com/books/Ibogaine-and-the-Bicameral-Mind/Jonathan-Dickinson/9798888504680">Ibogaine and the Bicameral Mind&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="articles">Articles&lt;/h3>
&lt;ul>
&lt;li>&lt;span class="star-rating" aria-label="0 out of 5 stars">&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;/span>
&lt;a href="https://www.nytimes.com/2025/08/11/us/politics/rick-perry-drug-psychedelics-ibogaine.html">The Long, Strange Trip of Rick Perry&lt;/a> — NYTimes&lt;/li>
&lt;li>&lt;span class="star-rating" aria-label="0 out of 5 stars">&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;/span>
&lt;a href="https://www.nytimes.com/2026/03/01/magazine/ibogaine-psychedelic-treatment-trauma-mental-health.html">It’s an Obscure Psychedelic Used to Treat Trauma. Could It Help Me?&lt;/a> — NYTimes&lt;/li>
&lt;li>&lt;span class="star-rating" aria-label="0 out of 5 stars">&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;/span>
&lt;a href="https://avanteibogaine.com/ibogaine-treatment-complete-guide/">Ibogaine Treatment Complete Guide&lt;/a> — Avante Ibogaine&lt;/li>
&lt;li>&lt;span class="star-rating" aria-label="0 out of 5 stars">&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;/span>
&lt;a href="https://getibogaine.com/best-books-to-read-on-iboga-and-ibogaine/">Best Books to Read on Iboga and Ibogaine&lt;/a> — Get Ibogaine&lt;/li>
&lt;li>&lt;span class="star-rating" aria-label="0 out of 5 stars">&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;/span>
&lt;a href="https://www.nationalgeographic.com/animals/article/ibogaine-pschedelic-drug-root-fair-trade-gabon">Ibogaine, Fair Trade, and Gabon&lt;/a> — National Geographic (Michael Pollan)&lt;/li>
&lt;li>&lt;span class="star-rating" aria-label="0 out of 5 stars">&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;/span>
&lt;a href="https://www.theguardian.com/books/2003/sep/20/booksonhealth.lifeandhealth">Book review: ibogaine and addiction&lt;/a> — The Guardian, 2003&lt;/li>
&lt;/ul>
&lt;h3 id="documentaries">Documentaries&lt;/h3>
&lt;ul>
&lt;li>&lt;span class="star-rating" aria-label="4 out of 5 stars">&lt;span class="star filled">★&lt;/span>&lt;span class="star filled">★&lt;/span>&lt;span class="star filled">★&lt;/span>&lt;span class="star filled">★&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;/span>
&lt;a href="https://www.amazon.com/Ibogaine-Fight-Lifetime/dp/B0G8VG59Q2">Ibogaine: Fight of a Lifetime&lt;/a> — Amazon Prime Video
&lt;ul>
&lt;li>Straightforward, heartwarming, and informative. Made by the Americans for Ibogaine initiative &amp;amp; Beond as a push to pass legislation in Texas for clinical trials&lt;/li>
&lt;li>Focuses on the stories of a few US service members suffering from TBI and PTSD and their journey to the Beond treatment facility in Mexico&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;span class="star-rating" aria-label="0 out of 5 stars">&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;/span>
&lt;a href="https://www.youtube.com/watch?v=vt0E8N4FRFY">Ibogaine: Rite of Passage&lt;/a> — YouTube, free&lt;/li>
&lt;li>&lt;span class="star-rating" aria-label="0 out of 5 stars">&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;/span>
&lt;a href="https://www.netflix.com/title/82047468">In Waves and War&lt;/a> — Netflix&lt;/li>
&lt;/ul>
&lt;h3 id="academic--policy">Academic &amp;amp; policy&lt;/h3>
&lt;ul>
&lt;li>&lt;span class="star-rating" aria-label="0 out of 5 stars">&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;/span>
&lt;a href="https://academic.oup.com/book/24744/chapter-abstract/188256487?redirectedFrom=fulltext">Oxford University Press chapter on ibogaine&lt;/a>&lt;/li>
&lt;li>&lt;span class="star-rating" aria-label="0 out of 5 stars">&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;/span>
&lt;a href="https://psychedelicalpha.com/news/what-approval-wont-solve-brian-barnett-on-ketamines-lessons-rvus-and-scaling-psychedelic-care/">What Approval Won&amp;rsquo;t Solve&lt;/a> — Psychedelic Alpha
&lt;ul>
&lt;li>Ketamine&amp;rsquo;s lessons for scaling psychedelic care.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul></description></item><item><title>Training run diagnostic metrics: what I track for when things break down</title><link>https://willdrevo.com/2026/03/04/ai-training-run-diagnostic-metrics-what-i-track-for-when-things-break-down/</link><pubDate>Wed, 04 Mar 2026 01:55:45 -0800</pubDate><guid>https://willdrevo.com/2026/03/04/ai-training-run-diagnostic-metrics-what-i-track-for-when-things-break-down/</guid><description>&lt;figure>
&lt;img src="https://willdrevo.com/static/img/wandb_metrics/learning_diagram.png" alt="" width="1000">
&lt;figcaption>The journey from our loss calculation, to our gradient $G$, to updating our parameters, $P$. &lt;BR>And yes, I didn't use $\theta$ for parameters. Fight me. Also, not to scale.&lt;/figcaption>
&lt;/figure>
&lt;p>This post talks a little about the metrics I track to quickly characterize what is going right or wrong with my runs, to save myself precious time and GPU 💸.&lt;/p>
&lt;p>To be clear, when I say &amp;ldquo;break down&amp;rdquo; I don&amp;rsquo;t mean the run crashed. That&amp;rsquo;s a different sort of debugging. This is for when the model trains, but it&amp;rsquo;s not going the way you want it to.&lt;/p>
&lt;p>I use &lt;a href="https://wandb.ai/">Weights &amp;amp; Biases&lt;/a> (W&amp;amp;B), but this all applies to similar tools like &lt;a href="https://mlflow.org/">MLflow&lt;/a>, &lt;a href="https://www.comet.com/site/">CometML&lt;/a>, and so on.&lt;/p>
&lt;p>These metrics are basic, but over the years I&amp;rsquo;ve picked them up to solve different training run issues. They&amp;rsquo;re much cheaper to collect and log than doing more runs :)&lt;/p>
&lt;p>At the end, I&amp;rsquo;ll also contextualize how and when I log them in a pseudocode loop. Great for throwing right into a coding LLM as scaffolding for your own projects.&lt;/p>
&lt;aside id="toc">
&lt;h4>Table of Contents&lt;/h4>
&lt;nav id="TableOfContents">
&lt;ul>
&lt;li>&lt;a href="#-the-basics-must-haves">📚 The basics, must haves&lt;/a>&lt;/li>
&lt;li>&lt;a href="#-what-is-a-step-exactly">🐾 What is a &amp;ldquo;step&amp;rdquo; exactly?&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#an-aside-logging-under-multiple-processes">An aside: Logging under multiple processes&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#-metric-group-1-grad-norm--grad-norm-per-module">🎓 Metric group #1: Grad norm + grad norm per module&lt;/a>&lt;/li>
&lt;li>&lt;a href="#-metric-group-2-update-norms--effective-lr-ratio">📉 Metric group #2: Update norms + effective LR ratio&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#interpreting-param-norm-update-norms--effective-lr-ratio">Interpreting param norm, update norms &amp;amp; effective LR ratio&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#-metric-group-3-non-loss-test-metrics">📈 Metric group #3: Non-loss test metrics&lt;/a>&lt;/li>
&lt;li>&lt;a href="#-metric-group-4-loss-by-category">🗂️ Metric group #4: Loss by category&lt;/a>&lt;/li>
&lt;li>&lt;a href="#-putting-it-all-together-the-learning-loop-sketch">🔄 Putting it all together: the learning loop sketch&lt;/a>&lt;/li>
&lt;li>&lt;a href="#-summary">🏁 Summary&lt;/a>&lt;/li>
&lt;/ul>
&lt;/nav>
&lt;/aside>
&lt;p>First, let&amp;rsquo;s talk about the non-negotiables.&lt;/p>
&lt;h2 id="-the-basics-must-haves">📚 The basics, must haves&lt;/h2>
&lt;p>Obviously you need to set up Weights &amp;amp; Biases (or whatever tool you&amp;rsquo;re tracking with):&lt;/p>
&lt;div class="highlight">&lt;div style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
&lt;table style="border-spacing:0;padding:0;margin:0;border:0;">&lt;tr>&lt;td style="vertical-align:top;padding:0;margin:0;border:0;">
&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">2
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">3
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">4
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">5
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">6
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">7
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">8
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">9
&lt;/span>&lt;/code>&lt;/pre>&lt;/td>
&lt;td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">if&lt;/span> rank &lt;span style="color:#f92672">==&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span>:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> wandb&lt;span style="color:#f92672">.&lt;/span>init(
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> project&lt;span style="color:#f92672">=&lt;/span>wandb_project_base,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> name&lt;span style="color:#f92672">=&lt;/span>run_name,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> config&lt;span style="color:#f92672">=&lt;/span>checkpoint_config, &lt;span style="color:#75715e"># usually a dict with all my keyword args&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> )
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># save your config somehow! I like saving the YAML&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> wandb&lt;span style="color:#f92672">.&lt;/span>save(config_yaml_path, policy&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;now&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/td>&lt;/tr>&lt;/table>
&lt;/div>
&lt;/div>&lt;p>The simple metrics you MUST track, per step:&lt;/p>
&lt;ol>
&lt;li>Learning rate&lt;/li>
&lt;li>Train loss
&lt;ul>
&lt;li>Per batch&lt;/li>
&lt;li>Per epoch&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Test loss
&lt;ul>
&lt;li>Per epoch&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;p>These are the foundation of what&amp;rsquo;s happening to our model over time.&lt;/p>
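&lt;p>As a minimal sketch of logging these (assuming a standard PyTorch loop where &lt;code>loss&lt;/code>, &lt;code>optimizer&lt;/code>, and &lt;code>step&lt;/code> already exist; &lt;code>avg_test_loss&lt;/code> and the metric names are placeholders of mine):&lt;/p>
&lt;div class="highlight">&lt;pre>&lt;code class="language-python"># once per step:
wandb.log({
    "lr": optimizer.param_groups[0]["lr"],  # current learning rate
    "train/loss_batch": loss.item(),        # this batch's train loss
}, step=step)

# once per epoch, after your eval pass:
wandb.log({"test/loss_epoch": avg_test_loss}, step=step)
&lt;/code>&lt;/pre>&lt;/div>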
&lt;p>Next, we must agree on our x-axis.&lt;/p>
&lt;h2 id="-what-is-a-step-exactly">🐾 What is a &amp;ldquo;step&amp;rdquo; exactly?&lt;/h2>
&lt;p>First off, your x-axis for graphs should be the &amp;ldquo;step&amp;rdquo; count.&lt;/p>
&lt;div class="highlight">&lt;div style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
&lt;table style="border-spacing:0;padding:0;margin:0;border:0;">&lt;tr>&lt;td style="vertical-align:top;padding:0;margin:0;border:0;">
&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">2
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">3
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">4
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">5
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">6
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">7
&lt;/span>&lt;/code>&lt;/pre>&lt;/td>
&lt;td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># Use our custom &amp;#34;step&amp;#34; as the x-axis for all metrics&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># This allows comparing runs at the same training step, even when resuming&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>wandb&lt;span style="color:#f92672">.&lt;/span>define_metric(&lt;span style="color:#e6db74">&amp;#34;step&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>wandb&lt;span style="color:#f92672">.&lt;/span>define_metric(&lt;span style="color:#e6db74">&amp;#34;*&amp;#34;&lt;/span>, step_metric&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;step&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># and then to log each time:&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>wandb&lt;span style="color:#f92672">.&lt;/span>log({ &lt;span style="color:#f92672">...&lt;/span> }, step&lt;span style="color:#f92672">=&lt;/span>step)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/td>&lt;/tr>&lt;/table>
&lt;/div>
&lt;/div>&lt;p>Each step ends with updating your model&amp;rsquo;s parameters. So if you are accumulating gradients over multiple forward passes, I would suggest that block being your &amp;ldquo;step&amp;rdquo;.&lt;/p>
&lt;p>This will smooth out the statistics you report (less noise) and keep all your logic like checkpointing or reporting ticking on the same heartbeat.&lt;/p>
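&lt;p>A minimal sketch of that heartbeat, assuming &lt;code>accum_steps&lt;/code> micro-batches per optimizer update (the names here are placeholders, not from my actual code):&lt;/p>
&lt;div class="highlight">&lt;pre>&lt;code class="language-python">step = 0
optimizer.zero_grad()
for i, batch in enumerate(train_loader):
    loss = compute_loss(model, batch) / accum_steps  # placeholder loss fn
    loss.backward()  # gradients accumulate across micro-batches
    if (i + 1) % accum_steps == 0:
        optimizer.step()  # one parameter update == one "step"
        optimizer.zero_grad()
        step += 1
        if rank == 0:
            # recover the unscaled loss of the last micro-batch
            wandb.log({"train/loss_batch": loss.item() * accum_steps}, step=step)
&lt;/code>&lt;/pre>&lt;/div>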
&lt;h3 id="an-aside-logging-under-multiple-processes">An aside: Logging under multiple processes&lt;/h3>
&lt;p>For a multi-GPU setup, I will often just have a single process reporting back metrics, i.e.:&lt;/p>
&lt;div class="highlight">&lt;div style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
&lt;table style="border-spacing:0;padding:0;margin:0;border:0;">&lt;tr>&lt;td style="vertical-align:top;padding:0;margin:0;border:0;">
&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">2
&lt;/span>&lt;/code>&lt;/pre>&lt;/td>
&lt;td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">if&lt;/span> rank &lt;span style="color:#f92672">==&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span>:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> wandb&lt;span style="color:#f92672">.&lt;/span>log({ &lt;span style="color:#f92672">...&lt;/span> }, step&lt;span style="color:#f92672">=&lt;/span>step)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/td>&lt;/tr>&lt;/table>
&lt;/div>
&lt;/div>&lt;p>For training from multiple machines, the advice is similar, you just have to pick a leader somehow.&lt;/p>
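&lt;p>With &lt;code>torch.distributed&lt;/code>, the global rank makes a natural leader. A minimal sketch (any deterministic choice works):&lt;/p>
&lt;div class="highlight">&lt;pre>&lt;code class="language-python">import torch.distributed as dist

# global rank 0 spans all machines, unlike a per-node local rank
is_leader = dist.get_rank() == 0
if is_leader:
    wandb.log({ ... }, step=step)
&lt;/code>&lt;/pre>&lt;/div>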
&lt;p>The only time you need all processes to participate is if you parallelize test set evaluation (which I do).&lt;/p>
&lt;p>You&amp;rsquo;ll need an all-reduce step to &amp;ldquo;collect&amp;rdquo; the various losses or metrics from each process and combine them on your leader process, which then calls &lt;code>wandb.log()&lt;/code>:&lt;/p>
&lt;div class="highlight">&lt;div style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
&lt;table style="border-spacing:0;padding:0;margin:0;border:0;">&lt;tr>&lt;td style="vertical-align:top;padding:0;margin:0;border:0;">
&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 1
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 2
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 3
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 4
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 5
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 6
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 7
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 8
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 9
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">10
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">11
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">12
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">13
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">14
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">15
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">16
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">17
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">18
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">19
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">20
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">21
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">22
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">23
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">24
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">25
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">26
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">27
&lt;/span>&lt;/code>&lt;/pre>&lt;/td>
&lt;td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># create our tensor we will all reduce sum over, coming&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># from each process in our training process group&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># this runs in ALL PROCESSES&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>loss_t &lt;span style="color:#f92672">=&lt;/span> torch&lt;span style="color:#f92672">.&lt;/span>tensor([
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> total_losses[&lt;span style="color:#e6db74">&amp;#39;total&amp;#39;&lt;/span>],
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> total_losses[&lt;span style="color:#e6db74">&amp;#39;main_task&amp;#39;&lt;/span>],
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> total_losses[&lt;span style="color:#e6db74">&amp;#39;aux_loss1&amp;#39;&lt;/span>],
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> total_losses[&lt;span style="color:#e6db74">&amp;#39;aux_loss2&amp;#39;&lt;/span>],
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> total_losses[&lt;span style="color:#e6db74">&amp;#39;aux_loss3&amp;#39;&lt;/span>],
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> float(num_test_batches_this_process)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> ], device&lt;span style="color:#f92672">=&lt;/span>device
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># add them all together, elementwise&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>dist&lt;span style="color:#f92672">.&lt;/span>all_reduce(loss_t, op&lt;span style="color:#f92672">=&lt;/span>dist&lt;span style="color:#f92672">.&lt;/span>ReduceOp&lt;span style="color:#f92672">.&lt;/span>SUM)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">if&lt;/span> rank &lt;span style="color:#f92672">==&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span>:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># run only on single process: compute averages &amp;amp; log&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> total_test_batches &lt;span style="color:#f92672">=&lt;/span> int(loss_t[&lt;span style="color:#ae81ff">5&lt;/span>]&lt;span style="color:#f92672">.&lt;/span>item())
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> avg_losses &lt;span style="color:#f92672">=&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#39;total&amp;#39;&lt;/span>: loss_t[&lt;span style="color:#ae81ff">0&lt;/span>]&lt;span style="color:#f92672">.&lt;/span>item() &lt;span style="color:#f92672">/&lt;/span> total_test_batches,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#39;main_task&amp;#39;&lt;/span>: loss_t[&lt;span style="color:#ae81ff">1&lt;/span>]&lt;span style="color:#f92672">.&lt;/span>item() &lt;span style="color:#f92672">/&lt;/span> total_test_batches,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#39;aux_loss1&amp;#39;&lt;/span>: loss_t[&lt;span style="color:#ae81ff">2&lt;/span>]&lt;span style="color:#f92672">.&lt;/span>item() &lt;span style="color:#f92672">/&lt;/span> total_test_batches,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#39;aux_loss2&amp;#39;&lt;/span>: loss_t[&lt;span style="color:#ae81ff">3&lt;/span>]&lt;span style="color:#f92672">.&lt;/span>item() &lt;span style="color:#f92672">/&lt;/span> total_test_batches,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#39;aux_loss3&amp;#39;&lt;/span>: loss_t[&lt;span style="color:#ae81ff">4&lt;/span>]&lt;span style="color:#f92672">.&lt;/span>item() &lt;span style="color:#f92672">/&lt;/span> total_test_batches,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> }
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> wandb&lt;span style="color:#f92672">.&lt;/span>log(avg_losses, step&lt;span style="color:#f92672">=&lt;/span>step)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/td>&lt;/tr>&lt;/table>
&lt;/div>
&lt;/div>&lt;p>To be clear, you can have multiple processes reporting back train metrics. But you&amp;rsquo;ll end up with multiple data points per step on your graph and this is noisy.&lt;/p>
&lt;p>Additionally, it becomes harder to compare a metric against a previous run&amp;rsquo;s if each run draws multiple lines.&lt;/p>
&lt;p>With that out of the way, let&amp;rsquo;s get to the metrics.&lt;/p>
&lt;h2 id="-metric-group-1-grad-norm--grad-norm-per-module">🎓 Metric group #1: Grad norm + grad norm per module&lt;/h2>
&lt;p>You likely already track gradient (grad) norm, what I&amp;rsquo;ll write as $\left\lVert{G}\right\rVert_2$ since it&amp;rsquo;s the L2 norm of the gradient before any clipping.&lt;/p>
&lt;p>The norm (size) of our gradient basically answers the question: &amp;ldquo;how large of a change in parameter space is our loss proposing?&amp;rdquo;&lt;/p>
&lt;p>An oversimplification of how the gradient $G$ is applied to your network&amp;rsquo;s parameters $P$ using learning rate scalar $\alpha$ is:&lt;/p>
&lt;p>$$ P_{new} = P_{old} - G_{clipped} * \alpha$$&lt;/p>
&lt;blockquote>
&lt;p>Note: if your optimizer is something like AdamW, this is directionally but not literally true. Many optimizers try to maintain a &amp;ldquo;trajectory&amp;rdquo; of your parameter updates over time (ie: momentum) or other tricks to help you traverse the loss landscape in a faster manner. But this equation is the underlying dynamic.&lt;/p>
&lt;/blockquote>
&lt;p>where $G$ and $P$ are both vectors of length $N$, the number of parameters in your network.&lt;/p>
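&lt;p>To make that concrete, here is a toy version of the clipped update in plain PyTorch (pure SGD, no momentum; a sketch, with &lt;code>max_norm&lt;/code> mirroring the 1.0 clipping threshold I use later):&lt;/p>
&lt;div class="highlight">&lt;pre>&lt;code class="language-python">import torch

alpha, max_norm = 1e-3, 1.0
G = torch.randn(5)      # pretend this is our gradient
P_old = torch.randn(5)  # ...and our parameters

# scale the gradient down only if its L2 norm exceeds max_norm
G_clipped = G * min(1.0, max_norm / (G.norm().item() + 1e-6))

P_new = P_old - alpha * G_clipped
&lt;/code>&lt;/pre>&lt;/div>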
&lt;p>Grad norm takes the accumulated gradient for the step (which could be summed over multiple grad accumulation passes), treats it as one huge, long vector (size $N$), and computes its L2 norm (or length):&lt;/p>
&lt;div class="highlight">&lt;div style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
&lt;table style="border-spacing:0;padding:0;margin:0;border:0;">&lt;tr>&lt;td style="vertical-align:top;padding:0;margin:0;border:0;">
&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 1
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 2
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 3
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 4
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 5
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 6
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 7
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 8
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 9
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">10
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">11
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">12
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">13
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">14
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">15
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">16
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">17
&lt;/span>&lt;/code>&lt;/pre>&lt;/td>
&lt;td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">compute_grad_norm&lt;/span>(parameters, norm_type&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">2&lt;/span>):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#34;&amp;#34;&amp;#34;
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> Compute the norm of the gradients of the parameters.
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> This implementation computes norms per parameter for memory
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> efficiency reasons, rather than concatenating to one giant
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> vector and computing the norm on it. The result is mathematically
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> equivalent.
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> &amp;#34;&amp;#34;&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> total_norm &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">0.0&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">for&lt;/span> p &lt;span style="color:#f92672">in&lt;/span> parameters:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> p&lt;span style="color:#f92672">.&lt;/span>grad &lt;span style="color:#f92672">is&lt;/span> &lt;span style="color:#f92672">not&lt;/span> &lt;span style="color:#66d9ef">None&lt;/span>:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> param_norm &lt;span style="color:#f92672">=&lt;/span> p&lt;span style="color:#f92672">.&lt;/span>grad&lt;span style="color:#f92672">.&lt;/span>data&lt;span style="color:#f92672">.&lt;/span>norm(norm_type)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> total_norm &lt;span style="color:#f92672">+=&lt;/span> param_norm&lt;span style="color:#f92672">.&lt;/span>item() &lt;span style="color:#f92672">**&lt;/span> norm_type
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> total_norm &lt;span style="color:#f92672">**&lt;/span> (&lt;span style="color:#ae81ff">1.0&lt;/span> &lt;span style="color:#f92672">/&lt;/span> norm_type)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>grad_norm &lt;span style="color:#f92672">=&lt;/span> compute_grad_norm(model&lt;span style="color:#f92672">.&lt;/span>parameters())
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/td>&lt;/tr>&lt;/table>
&lt;/div>
&lt;/div>&lt;p>What I propose tracking are additional per-module norms, so for each module of your torch network, you&amp;rsquo;d compute the subgraph&amp;rsquo;s grad norm, and also plot that:&lt;/p>
&lt;div class="highlight">&lt;div style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
&lt;table style="border-spacing:0;padding:0;margin:0;border:0;">&lt;tr>&lt;td style="vertical-align:top;padding:0;margin:0;border:0;">
&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 1
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 2
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 3
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 4
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 5
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 6
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 7
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 8
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 9
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">10
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">11
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">12
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">13
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">14
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">15
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">16
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">17
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">18
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">19
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">20
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">21
&lt;/span>&lt;/code>&lt;/pre>&lt;/td>
&lt;td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">compute_model_grad_norm_per_module&lt;/span>(model):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#34;&amp;#34;&amp;#34;
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> Compute the norm of the gradients of the parameters
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> for each module in the model
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> Returns a wandb-loggable dict with mapping:
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> module name: str -&amp;gt; float
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> &amp;#34;&amp;#34;&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> grad_norms &lt;span style="color:#f92672">=&lt;/span> {}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> grad_norms[&lt;span style="color:#e6db74">&amp;#34;grad_norm/overall&amp;#34;&lt;/span>] &lt;span style="color:#f92672">=&lt;/span> compute_grad_norm(
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> model&lt;span style="color:#f92672">.&lt;/span>parameters()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> )
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">for&lt;/span> name, module &lt;span style="color:#f92672">in&lt;/span> model&lt;span style="color:#f92672">.&lt;/span>named_modules():
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> name &lt;span style="color:#f92672">and&lt;/span> name&lt;span style="color:#f92672">.&lt;/span>strip():
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># Only consider modules with trainable parameters&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> any(p&lt;span style="color:#f92672">.&lt;/span>requires_grad &lt;span style="color:#66d9ef">for&lt;/span> p &lt;span style="color:#f92672">in&lt;/span> module&lt;span style="color:#f92672">.&lt;/span>parameters()):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> grad_norms[&lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;grad_norm/&lt;/span>&lt;span style="color:#e6db74">{&lt;/span>name&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>] &lt;span style="color:#f92672">=&lt;/span> \
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> compute_grad_norm(module&lt;span style="color:#f92672">.&lt;/span>parameters())
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> grad_norms
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>grad_norm_per_module &lt;span style="color:#f92672">=&lt;/span> compute_model_grad_norm_per_module(model)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/td>&lt;/tr>&lt;/table>
&lt;/div>
&lt;/div>&lt;p>Why do this?&lt;/p>
&lt;p>Well, if you track grad norm, it&amp;rsquo;s because you want to know if network updates are going haywire, either getting too big or too small over time. And if that is the case, then you&amp;rsquo;re going to want to know &lt;em>why&lt;/em>.&lt;/p>
&lt;p>You could easily chalk it up to &amp;ldquo;oh the learning rate must be too high&amp;rdquo; or &amp;ldquo;must be too much regularization&amp;rdquo; (and it very well might be), but before you go and kick off another expensive run, checking the per-module grad norm can help save you time.&lt;/p>
&lt;blockquote>
&lt;p>And remember, if you have gradient clipping on, it&amp;rsquo;s important to track the value &lt;strong>pre-clip&lt;/strong> as that&amp;rsquo;s the pure signal your learning process is working with before clipping tries to tame it.&lt;/p>
&lt;/blockquote>
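&lt;p>Conveniently, PyTorch&amp;rsquo;s clipping utility returns exactly this pre-clip value, so you can log it without a second pass over the gradients:&lt;/p>
&lt;div class="highlight">&lt;pre>&lt;code class="language-python"># clip_grad_norm_ returns the total norm computed BEFORE clipping
pre_clip_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
if rank == 0:
    wandb.log({"grad_norm/pre_clip": pre_clip_norm.item()}, step=step)
&lt;/code>&lt;/pre>&lt;/div>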
&lt;p>Let&amp;rsquo;s go through a real-world example.&lt;/p>
&lt;p>In the example below, I was training a small but decently complex transformer network (~11M parameters) for realtime audio. I had just added a number of improvements on the data and architecture side, and kicked off another run.&lt;/p>
&lt;p>I started to notice the issue with the (pre-clipping) grad norm graph:&lt;/p>
&lt;figure>
&lt;img src="https://willdrevo.com/static/img/wandb_metrics/grad_norm_explode.png" alt="" width="800">
&lt;figcaption>&lt;/figcaption>
&lt;/figure>
&lt;p>Ouch. This run was not going to converge anytime soon.&lt;/p>
&lt;p>And the beginning of the grad norm explosion upwards did coincide with the peak of the learning rate, after the warmup window:&lt;/p>
&lt;figure>
&lt;img src="https://willdrevo.com/static/img/wandb_metrics/explode_lr.png" alt="" width="800">
&lt;figcaption>&lt;/figcaption>
&lt;/figure>
&lt;p>So with the fairly aggressive learning rate of &lt;code>1e-3&lt;/code>, it &lt;em>would&lt;/em> be a valid conclusion that the learning rate was too high.&lt;/p>
&lt;p>But this didn&amp;rsquo;t seem right. Even with a bunch of changes, I&amp;rsquo;d been training this network previously and &lt;code>1e-3&lt;/code> had proven aggressive, but stable. I hadn&amp;rsquo;t completely changed the size of the network or regularization in a drastic enough way for this much of a deviation.&lt;/p>
&lt;p>Luckily, I had per module grad norm logged!&lt;/p>
&lt;p>I began to notice a pattern. The gradient norm at later layers seemed high, but not crazy:&lt;/p>
&lt;figure>
&lt;img src="https://willdrevo.com/static/img/wandb_metrics/grad_norm_layer8.png" alt="" width="800">
&lt;figcaption>The 8th layer's LayerNorm grad norms over time&lt;/figcaption>
&lt;/figure>
&lt;p>But they steadily got worse the closer they were to the front of the network:&lt;/p>
&lt;figure>
&lt;img src="https://willdrevo.com/static/img/wandb_metrics/grad_norm_layer7.png" alt="" width="800">
&lt;figcaption>Getting slightly worse in the 7th layer&lt;/figcaption>
&lt;/figure>
&lt;p>And wild by the first layer (check the y-axis):&lt;/p>
&lt;figure>
&lt;img src="https://willdrevo.com/static/img/wandb_metrics/grad_norm_layer1.png" alt="" width="800">
&lt;figcaption>Getting pretty crazy&lt;/figcaption>
&lt;/figure>
&lt;p>But things were totally insane by the frontend conv layers, with peaks in the thousands! For reference, I had gradient clipping on for any gradient norm &amp;gt; 1.0. Clipping prevented the weights from exploding outright, but didn&amp;rsquo;t fix the underlying problem: the gradient &lt;em>direction&lt;/em> was dominated by the unstable parameter, starving the rest of the network of useful gradient signal.&lt;/p>
&lt;figure>
&lt;img src="https://willdrevo.com/static/img/wandb_metrics/grad_norm_conv1.png" alt="" width="800">
&lt;figcaption>Insanity at the first conv layer&lt;/figcaption>
&lt;/figure>
&lt;p>But my &lt;code>conv&lt;/code> layers&amp;rsquo; random init values seemed completely reasonable. So a dead end there.&lt;/p>
&lt;p>But then it hit me.&lt;/p>
&lt;p>I had recently hypothesized the model might need to reweight the mel bins based on a loudness curve, sort of like humans have our own auditory perceptual curve (see: &lt;a href="https://en.wikipedia.org/wiki/Equal-loudness_contour">Fletcher–Munson equal loudness curve&lt;/a>). And in terms of parameters/FLOPs it&amp;rsquo;s stupidly cheap.&lt;/p>
&lt;p>So I added a simple scaling of my mel frames at the start of the forward pass:&lt;/p>
&lt;div class="highlight">&lt;div style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
&lt;table style="border-spacing:0;padding:0;margin:0;border:0;">&lt;tr>&lt;td style="vertical-align:top;padding:0;margin:0;border:0;">
&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 1
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 2
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 3
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 4
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 5
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 6
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 7
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 8
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 9
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">10
&lt;/span>&lt;/code>&lt;/pre>&lt;/td>
&lt;td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">class&lt;/span> &lt;span style="color:#a6e22e">Model&lt;/span>(nn&lt;span style="color:#f92672">.&lt;/span>Module):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">def&lt;/span> __init__(self, &lt;span style="color:#f92672">...&lt;/span>):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># ...&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> self&lt;span style="color:#f92672">.&lt;/span>mel_scale &lt;span style="color:#f92672">=&lt;/span> nn&lt;span style="color:#f92672">.&lt;/span>Parameter(torch&lt;span style="color:#f92672">.&lt;/span>randn(self&lt;span style="color:#f92672">.&lt;/span>num_mels))
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># ...&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">forward&lt;/span>(self, x, &lt;span style="color:#f92672">...&lt;/span>):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># x is tensor sized: (batch, time, num_mels)&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> x &lt;span style="color:#f92672">*=&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>mel_scale &lt;span style="color:#75715e"># hint: don&amp;#39;t do this 🤣&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># ...&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/td>&lt;/tr>&lt;/table>
&lt;/div>
&lt;/div>&lt;p>You might see the train wreck coming.&lt;/p>
&lt;p>This had multiple problems:&lt;/p>
&lt;ol>
&lt;li>Initialization doesn&amp;rsquo;t start at identity
&lt;ul>
&lt;li>&lt;code>torch.randn&lt;/code> draws from $N(0, 1)$ (a Gaussian centered at 0 with standard deviation 1)&lt;/li>
&lt;li>So at initialization, in expectation:
&lt;ul>
&lt;li>half our values will be negative (flipping the sign of our features)&lt;/li>
&lt;li>many are near zero (killing bins entirely)&lt;/li>
&lt;li>almost none are near 1.0 (identity, passing through original features untouched); the quick check after this list makes these numbers concrete.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Negative values are particularly bad when elementwise-scaling log-valued features
&lt;ul>
&lt;li>Imagine a mel bin with a value of -80 dB. This is virtually silent.&lt;/li>
&lt;li>Multiplying it by -1 is disastrous: our quietest bin is now at +80 dB, INSANELY loud&lt;/li>
&lt;li>This is exactly why &lt;code>nn.LayerNorm&lt;/code> (and every other normalization layer) initializes its multiplicative &lt;code>weight&lt;/code> parameter to &lt;strong>ones&lt;/strong> and its additive &lt;code>bias&lt;/code> parameter to &lt;strong>zeros&lt;/strong>. Those are the identity elements for their respective operations.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
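&lt;p>To make point 1 concrete, here&amp;rsquo;s the quick check (a standalone sketch, not from the training code):&lt;/p>
&lt;pre>&lt;code class="language-python">import torch

torch.manual_seed(0)
w = torch.randn(100_000)  # same init as the buggy mel_scale

print((w &amp;lt; 0).float().mean().item())                 # ~0.50: half the bins get sign-flipped
print((w.abs() &amp;lt; 0.1).float().mean().item())          # ~0.08: many bins nearly zeroed out
print(((w - 1.0).abs() &amp;lt; 0.1).float().mean().item())  # ~0.05: almost none start near identity
&lt;/code>&lt;/pre>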
&lt;p>The fix is very simple.&lt;/p>
&lt;p>Multiplying in linear space is &lt;em>addition&lt;/em> in log space, and our mel frames are already log-scaled, so we learn a per-bin bias instead:&lt;/p>
&lt;div class="highlight">&lt;div style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
&lt;table style="border-spacing:0;padding:0;margin:0;border:0;">&lt;tr>&lt;td style="vertical-align:top;padding:0;margin:0;border:0;">
&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 1
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 2
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 3
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 4
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 5
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 6
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 7
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 8
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 9
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">10
&lt;/span>&lt;/code>&lt;/pre>&lt;/td>
&lt;td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">class&lt;/span> &lt;span style="color:#a6e22e">Model&lt;/span>(nn&lt;span style="color:#f92672">.&lt;/span>Module):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">def&lt;/span> __init__(self, &lt;span style="color:#f92672">...&lt;/span>):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># ...&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>        self&lt;span style="color:#f92672">.&lt;/span>mel_bias &lt;span style="color:#f92672">=&lt;/span> nn&lt;span style="color:#f92672">.&lt;/span>Parameter(torch&lt;span style="color:#f92672">.&lt;/span>zeros(self&lt;span style="color:#f92672">.&lt;/span>num_mels))
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># ...&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">forward&lt;/span>(self, x, &lt;span style="color:#f92672">...&lt;/span>):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># x is log-mel, sized: (batch, time, num_mels)&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> x &lt;span style="color:#f92672">+=&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>mel_bias
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># ...&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/td>&lt;/tr>&lt;/table>
&lt;/div>
&lt;/div>&lt;p>And we init at 0.0, so this starts as a no-op.&lt;/p>
&lt;p>An additional benefit is that because we &lt;em>add&lt;/em> &lt;code>self.mel_bias&lt;/code> (instead of &lt;em>multiply&lt;/em>), its local gradient is 1.0 instead of the input magnitude, so our gradients (and thus our updates to &lt;code>self.mel_bias&lt;/code>) are much more stable.&lt;/p>
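&lt;p>Concretely, comparing the two parameterizations:&lt;/p>
&lt;p>$$ \frac{\partial (x \cdot s)}{\partial s} = x \qquad \text{vs.} \qquad \frac{\partial (x + b)}{\partial b} = 1 $$&lt;/p>
&lt;p>With the multiplicative form, that -80 dB bin feeds a gradient scaled by a factor of 80 in magnitude into its scale parameter; with the additive form, the local gradient is always 1.&lt;/p>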
&lt;p>This completely fixed the issue:&lt;/p>
&lt;figure>
&lt;img src="https://willdrevo.com/static/img/wandb_metrics/grad_norm_after_fix.png" alt="" width="800">
&lt;figcaption>The green line is after the fix. Nice, slow, steady decline of grad norm after LR peak&lt;/figcaption>
&lt;/figure>
&lt;p>You might also have noticed that because the &lt;code>self.mel_scale&lt;/code> scaling tensor was just an &lt;code>nn.Parameter&lt;/code>, we wouldn&amp;rsquo;t get the per-module grad norm computed with the code above. The fix would be to make an &lt;code>nn.Module&lt;/code> wrapper for it:&lt;/p>
&lt;div class="highlight">&lt;div style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
&lt;table style="border-spacing:0;padding:0;margin:0;border:0;">&lt;tr>&lt;td style="vertical-align:top;padding:0;margin:0;border:0;">
&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">2
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">3
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">4
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">5
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">6
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">7
&lt;/span>&lt;/code>&lt;/pre>&lt;/td>
&lt;td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">class&lt;/span> &lt;span style="color:#a6e22e">LearnableBias&lt;/span>(nn&lt;span style="color:#f92672">.&lt;/span>Module):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">def&lt;/span> __init__(self, n_channels: int):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> super()&lt;span style="color:#f92672">.&lt;/span>__init__()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> self&lt;span style="color:#f92672">.&lt;/span>bias &lt;span style="color:#f92672">=&lt;/span> nn&lt;span style="color:#f92672">.&lt;/span>Parameter(torch&lt;span style="color:#f92672">.&lt;/span>zeros(n_channels))
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">forward&lt;/span>(self, x: torch&lt;span style="color:#f92672">.&lt;/span>Tensor) &lt;span style="color:#f92672">-&amp;gt;&lt;/span> torch&lt;span style="color:#f92672">.&lt;/span>Tensor:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> x &lt;span style="color:#f92672">+&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>bias
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/td>&lt;/tr>&lt;/table>
&lt;/div>
&lt;/div>&lt;p>and then &lt;code>compute_model_grad_norm_per_module()&lt;/code> would have computed and reported this in the key &lt;code>grad_norm/mel_bias&lt;/code>.&lt;/p>
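&lt;p>For completeness, the swap inside the model would look something like this (a minimal sketch, reusing &lt;code>LearnableBias&lt;/code> from above):&lt;/p>
&lt;pre>&lt;code class="language-python">import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self, num_mels: int):
        super().__init__()
        # a proper submodule now, so per-module grad norm logging picks it up
        self.mel_bias = LearnableBias(num_mels)

    def forward(self, x: torch.Tensor) -&amp;gt; torch.Tensor:
        # x is log-mel, sized: (batch, time, num_mels)
        return self.mel_bias(x)
&lt;/code>&lt;/pre>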
&lt;p>Either way, per-module grad norm logging led me to the issue. But without this, I might have wasted another run or two guessing lower learning rates.&lt;/p>
&lt;p>And as you know, when you lower the learning rate, &lt;em>it takes you longer to find the issue&lt;/em> because the learning process is slowed.&lt;/p>
&lt;p>So obviously grad norm per-module is a valuable metric in your toolbox.&lt;/p>
&lt;p>Let&amp;rsquo;s talk about a related measure, the update norm.&lt;/p>
&lt;h2 id="-metric-group-2-update-norms--effective-lr-ratio">📉 Metric group #2: Update norms + effective LR ratio&lt;/h2>
&lt;p>To properly introduce this family of metrics, I drew a diagram:&lt;/p>
&lt;figure>
&lt;img src="https://willdrevo.com/static/img/wandb_metrics/learning_diagram.png" alt="" width="1000">
&lt;figcaption>The journey from loss to update. &lt;BR>Vector sizes would definitely not be to scale for a typical training run 😆 &lt;/figcaption>
&lt;/figure>
&lt;p>First, the magic of backprop turns our single scalar loss into a large set of numbers: a gradient associated with each parameter of the network.&lt;/p>
&lt;p>We can group the gradient by module into smaller vectors (the colored arrows), which we can characterize for debugging (more on this later).&lt;/p>
&lt;p>Then we concatenate (not add) them all into a single, much longer vector, $G$ (the gradient vector).&lt;/p>
&lt;p>Next, we clip $G$ if necessary, scaling it down to $G_{clipped}$. Note that $G_{clipped}$ points in the same direction as $G$; clipping only changes the magnitude.&lt;/p>
&lt;p>Finally, a bunch of things happen:&lt;/p>
&lt;ul>
&lt;li>Optimizer modifies $G_{clipped}$ (via momentum, adaptive scaling, etc)&lt;/li>
&lt;li>Learning rate multiplier is applied&lt;/li>
&lt;/ul>
&lt;p>These can change both scale and rotation, and give us the value we &lt;em>actually use&lt;/em> to update our parameters. We&amp;rsquo;ll call it the update, $U$.&lt;/p>
&lt;p>And to update the parameters in our network, we apply the standard:&lt;/p>
&lt;p>$$ P_{new} = P_{old} - U$$&lt;/p>
&lt;p>So from this, we can define a few new metrics:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Update norm&lt;/strong>: $\left\lVert{U}\right\rVert_2$
&lt;ul>
&lt;li>&amp;ldquo;Size&amp;rdquo; of the actual update in parameter space&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Param norm&lt;/strong>: $\left\lVert{P_{new}}\right\rVert_2$
&lt;ul>
&lt;li>&amp;ldquo;Size&amp;rdquo; of the new model in parameter space&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Relative update norm&lt;/strong>: ratio of $\left\lVert{U}\right\rVert_2$ / $\left\lVert{P_{new}}\right\rVert_2$
&lt;ul>
&lt;li>How much of the entire network we are changing per-step&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Effective LR ratio&lt;/strong>: The ratio between the actual step size (update norm) and the gradient norm after clipping: $\left\lVert{U}\right\rVert_2$ / $\left\lVert{G_{clipped}}\right\rVert_2$&lt;/li>
&lt;/ul>
&lt;p>Easy and simple. Here&amp;rsquo;s how we calculate them:&lt;/p>
&lt;div class="highlight">&lt;div style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
&lt;table style="border-spacing:0;padding:0;margin:0;border:0;">&lt;tr>&lt;td style="vertical-align:top;padding:0;margin:0;border:0;">
&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 1
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 2
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 3
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 4
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 5
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 6
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 7
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 8
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 9
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">10
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">11
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">12
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">13
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">14
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">15
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">16
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">17
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">18
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">19
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">20
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">21
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">22
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">23
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">24
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">25
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">26
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">27
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">28
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">29
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">30
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">31
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">32
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">33
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">34
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">35
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">36
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">37
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">38
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">39
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">40
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">41
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">42
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">43
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">44
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">45
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">46
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">47
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">48
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">49
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">50
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">51
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">52
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">53
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">54
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">55
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">56
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">57
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">58
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">59
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">60
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">61
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">62
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">63
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">64
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">65
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">66
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">67
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">68
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">69
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">70
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">71
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">72
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">73
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">74
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">75
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">76
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">77
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">78
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">79
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">80
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">81
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">82
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">83
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">84
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">85
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">86
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">87
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">88
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">89
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">90
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">91
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">92
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">93
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">94
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">95
&lt;/span>&lt;/code>&lt;/pre>&lt;/td>
&lt;td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">snapshot_params_to_cpu&lt;/span>(model):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#34;&amp;#34;&amp;#34;
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> Snapshot all trainable parameters to CPU memory.
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> Use this before optimizer.step() to later compute update norms.
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> Copying to CPU avoids GPU VRAM spikes from doubling parameter memory.
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> Args:
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> model: PyTorch model (can be wrapped in DDP)
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> Returns:
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> dict: {param_name: param_tensor_on_cpu} for all requires_grad
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> parameters
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> &amp;#34;&amp;#34;&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> name: p&lt;span style="color:#f92672">.&lt;/span>detach()&lt;span style="color:#f92672">.&lt;/span>clone()&lt;span style="color:#f92672">.&lt;/span>cpu()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">for&lt;/span> name, p &lt;span style="color:#f92672">in&lt;/span> model&lt;span style="color:#f92672">.&lt;/span>named_parameters() &lt;span style="color:#66d9ef">if&lt;/span> p&lt;span style="color:#f92672">.&lt;/span>requires_grad
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> }
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">compute_update_norms&lt;/span>(model, old_params_cpu, grad_norm_after_clip&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#66d9ef">None&lt;/span>):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#34;&amp;#34;&amp;#34;
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> Compute update norms after an optimizer step.
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> Measures the actual parameter changes made by the optimizer, which reflects
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> the combined effect of gradients, learning rate, momentum, and adaptive
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> scaling (e.g., Adam&amp;#39;s second moment).
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> Args:
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> model: PyTorch model after optimizer.step()
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> old_params_cpu: dict from snapshot_params_to_cpu() taken before step
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> grad_norm_after_clip: optional gradient norm after clipping, used to
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> compute effective learning rate ratio
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> Returns:
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> dict with keys:
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> - update_norm:
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> L2 norm of all parameter changes
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> - param_norm:
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> L2 norm of all current parameters
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> - relative_update_norm:
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> update_norm / param_norm (stability metric)
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> - effective_lr_ratio:
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> update_norm / grad_norm_after_clip (if provided)
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> &amp;#34;&amp;#34;&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> update_deltas &lt;span style="color:#f92672">=&lt;/span> []
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> param_flatcats &lt;span style="color:#f92672">=&lt;/span> []
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># iterate through new parameters, compare to old&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">for&lt;/span> name, p &lt;span style="color:#f92672">in&lt;/span> model&lt;span style="color:#f92672">.&lt;/span>named_parameters():
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> p&lt;span style="color:#f92672">.&lt;/span>requires_grad &lt;span style="color:#f92672">and&lt;/span> name &lt;span style="color:#f92672">in&lt;/span> old_params_cpu:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> p_cpu &lt;span style="color:#f92672">=&lt;/span> p&lt;span style="color:#f92672">.&lt;/span>detach()&lt;span style="color:#f92672">.&lt;/span>cpu()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> delta &lt;span style="color:#f92672">=&lt;/span> p_cpu &lt;span style="color:#f92672">-&lt;/span> old_params_cpu[name]
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> update_deltas&lt;span style="color:#f92672">.&lt;/span>append(delta&lt;span style="color:#f92672">.&lt;/span>flatten())
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> param_flatcats&lt;span style="color:#f92672">.&lt;/span>append(p_cpu&lt;span style="color:#f92672">.&lt;/span>flatten())
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> &lt;span style="color:#f92672">not&lt;/span> update_deltas:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> &lt;span style="color:#66d9ef">None&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> update_norm &lt;span style="color:#f92672">=&lt;/span> torch&lt;span style="color:#f92672">.&lt;/span>linalg&lt;span style="color:#f92672">.&lt;/span>vector_norm(torch&lt;span style="color:#f92672">.&lt;/span>cat(update_deltas))&lt;span style="color:#f92672">.&lt;/span>item()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> param_norm &lt;span style="color:#f92672">=&lt;/span> torch&lt;span style="color:#f92672">.&lt;/span>linalg&lt;span style="color:#f92672">.&lt;/span>vector_norm(torch&lt;span style="color:#f92672">.&lt;/span>cat(param_flatcats))&lt;span style="color:#f92672">.&lt;/span>item()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> relative_update_norm &lt;span style="color:#f92672">=&lt;/span> update_norm &lt;span style="color:#f92672">/&lt;/span> (param_norm &lt;span style="color:#f92672">+&lt;/span> &lt;span style="color:#ae81ff">1e-12&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> result &lt;span style="color:#f92672">=&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#34;update_norm&amp;#34;&lt;/span>: update_norm,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#34;param_norm&amp;#34;&lt;/span>: param_norm,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#34;relative_update_norm&amp;#34;&lt;/span>: relative_update_norm,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> }
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># Effective LR ratio: shows actual step size relative to gradient&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> grad_norm_after_clip &lt;span style="color:#f92672">is&lt;/span> &lt;span style="color:#f92672">not&lt;/span> &lt;span style="color:#66d9ef">None&lt;/span> &lt;span style="color:#f92672">and&lt;/span> grad_norm_after_clip &lt;span style="color:#f92672">&amp;gt;&lt;/span> &lt;span style="color:#ae81ff">1e-12&lt;/span>:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> result[&lt;span style="color:#e6db74">&amp;#34;effective_lr_ratio&amp;#34;&lt;/span>] &lt;span style="color:#f92672">=&lt;/span> update_norm &lt;span style="color:#f92672">/&lt;/span> grad_norm_after_clip
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> result
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>old_params_cpu &lt;span style="color:#f92672">=&lt;/span> snapshot_params_to_cpu(model)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>grad_clip &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">1.5&lt;/span> &lt;span style="color:#75715e"># just an example&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>grad_norm_before &lt;span style="color:#f92672">=&lt;/span> torch&lt;span style="color:#f92672">.&lt;/span>nn&lt;span style="color:#f92672">.&lt;/span>utils&lt;span style="color:#f92672">.&lt;/span>clip_grad_norm_(
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> model&lt;span style="color:#f92672">.&lt;/span>parameters(), grad_clip
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>)&lt;span style="color:#f92672">.&lt;/span>item()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>grad_norm_after &lt;span style="color:#f92672">=&lt;/span> torch&lt;span style="color:#f92672">.&lt;/span>nn&lt;span style="color:#f92672">.&lt;/span>utils&lt;span style="color:#f92672">.&lt;/span>clip_grad_norm_(
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> model&lt;span style="color:#f92672">.&lt;/span>parameters(), float(&lt;span style="color:#e6db74">&amp;#39;inf&amp;#39;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>)&lt;span style="color:#f92672">.&lt;/span>item()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>grad_clip_ratio &lt;span style="color:#f92672">=&lt;/span> grad_norm_before &lt;span style="color:#f92672">/&lt;/span> grad_clip &lt;span style="color:#66d9ef">if&lt;/span> grad_clip &lt;span style="color:#f92672">&amp;gt;&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span> &lt;span style="color:#66d9ef">else&lt;/span> &lt;span style="color:#ae81ff">0.0&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># ... etc&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>optimizer&lt;span style="color:#f92672">.&lt;/span>step()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>optimizer&lt;span style="color:#f92672">.&lt;/span>zero_grad()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># .. etc&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>update_norms_result &lt;span style="color:#f92672">=&lt;/span> compute_update_norms(
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> model, old_params_cpu, grad_norm_after_clip&lt;span style="color:#f92672">=&lt;/span>grad_norm_after
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/td>&lt;/tr>&lt;/table>
&lt;/div>
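&lt;/div>&lt;p>And then ship these to your tracker. A minimal sketch of the logging call (the key names are just my convention; this assumes an active wandb run, a &lt;code>step&lt;/code> counter, and the variables from the snippet above):&lt;/p>
&lt;pre>&lt;code class="language-python">import wandb

metrics = {
    &amp;#34;grad_norm/total_before_clip&amp;#34;: grad_norm_before,
    &amp;#34;grad_norm/total_after_clip&amp;#34;: grad_norm_after,
    &amp;#34;grad_norm/clip_ratio&amp;#34;: grad_clip_ratio,
}
if update_norms_result is not None:
    # adds update_norm, param_norm, relative_update_norm, effective_lr_ratio
    metrics.update(update_norms_result)
wandb.log(metrics, step=step)
&lt;/code>&lt;/pre>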
&lt;h3 id="interpreting-param-norm-update-norms--effective-lr-ratio">Interpreting param norm, update norms &amp;amp; effective LR ratio&lt;/h3>
&lt;p>Reading these metrics together gives a complete picture of training dynamics beyond loss and gradients: &lt;em>where&lt;/em> the model is in parameter space, &lt;em>how fast&lt;/em> it&amp;rsquo;s moving, and how much the optimizer is amplifying or dampening the raw gradient signal.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Metric&lt;/th>
&lt;th>Range / trajectory&lt;/th>
&lt;th>Guidance&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Param norm $\left\lVert{P_{new}}\right\rVert_2$&lt;/td>
&lt;td>Steady, sub-linear growth&lt;/td>
&lt;td>Healthy. Growth rate should slow as LR decays.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;/td>
&lt;td>Exponential / super-linear growth&lt;/td>
&lt;td>Weights growing fast. Could mean you&amp;rsquo;re diverging. &lt;BR>&lt;BR>Generally here you&amp;rsquo;ll decrease LR or increase regularization, unless something egregious is going wrong in your network. In that case, fix it.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;/td>
&lt;td>Shrinking&lt;/td>
&lt;td>Underfitting? Check you aren&amp;rsquo;t regularizing too much (weight decay, etc)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;/td>
&lt;td>Flat while loss is decreasing&lt;/td>
&lt;td>Likely good. Probably later in training.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;/td>
&lt;td>Sudden jumps or drops&lt;/td>
&lt;td>Check per-module grad norm; param norm is mostly redundant to that signal in this case.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Relative update norm &lt;BR>$\left\lVert{U}\right\rVert_2$ / $\left\lVert{P_{new}}\right\rVert_2$&lt;/td>
&lt;td>≈ 1e-4 to 1e-3&lt;/td>
&lt;td>Healthy range for most architectures&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;/td>
&lt;td>&lt;code>&amp;gt;&amp;gt;&lt;/code> 1e-2&lt;/td>
&lt;td>Updates might be too large relative to params. Risk of instability.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;/td>
&lt;td>&lt;code>&amp;lt;&amp;lt;&lt;/code> 1e-6&lt;/td>
&lt;td>Updates are vanishingly small :/ learning has likely stalled&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;/td>
&lt;td>Rising late in training while loss is flat&lt;/td>
&lt;td>Optimizer may be overshooting a flat basin&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Effective LR ratio $\left\lVert{U}\right\rVert_2$ / $\left\lVert{G_{clipped}}\right\rVert_2$&lt;/td>
&lt;td>≈ nominal LR&lt;/td>
&lt;td>Your optimizer&amp;rsquo;s effective gradient multipliers are ~1.0, which happens early in training. Or for some reason you&amp;rsquo;re using vanilla SGD (why??)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;/td>
&lt;td>&lt;code>&amp;gt;&amp;gt;&lt;/code> nominal LR&lt;/td>
&lt;td>Your optimizer is amplifying gradients&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;/td>
&lt;td>&lt;code>&amp;lt;&amp;lt;&lt;/code> nominal LR&lt;/td>
&lt;td>Your optimizer is dampening gradients. &lt;BR>&lt;BR>It could be protecting you from oscillations in weight space, but I would refer back to grad norm, LR, and other ways to diagnose instability in this case.&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
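&lt;p>If you want the table above to work for you automatically, one cheap trick is to flag metrics that leave their healthy band. A hypothetical helper using the relative update norm thresholds above:&lt;/p>
&lt;pre>&lt;code class="language-python">from typing import Optional

def flag_relative_update_norm(relative_update_norm: float) -&amp;gt; Optional[str]:
    # thresholds from the table above; tune them for your architecture
    if relative_update_norm &amp;gt; 1e-2:
        return &amp;#34;updates large relative to params: risk of instability&amp;#34;
    if relative_update_norm &amp;lt; 1e-6:
        return &amp;#34;updates vanishingly small: learning may have stalled&amp;#34;
    return None
&lt;/code>&lt;/pre>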
&lt;h2 id="-metric-group-3-non-loss-test-metrics">📈 Metric group #3: Non-loss test metrics&lt;/h2>
&lt;p>This might seem obvious, but I recommend plotting your non-loss test metrics as well. These might be:&lt;/p>
&lt;ul>
&lt;li>Accuracy&lt;/li>
&lt;li>Precision / recall / F1 score&lt;/li>
&lt;li>FID score&lt;/li>
&lt;li>&amp;hellip; etc&lt;/li>
&lt;/ul>
&lt;p>The list goes on.&lt;/p>
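&lt;p>As a trivial example, batch accuracy for a classifier is nearly a one-liner (a sketch; your task&amp;rsquo;s metrics will differ):&lt;/p>
&lt;pre>&lt;code class="language-python">import torch

@torch.no_grad()
def batch_accuracy(logits: torch.Tensor, targets: torch.Tensor) -&amp;gt; float:
    # logits: (batch, num_classes), targets: (batch,) of class indices
    return (logits.argmax(dim=-1) == targets).float().mean().item()
&lt;/code>&lt;/pre>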
&lt;p>There are a number of reasons you might want these. After all, the whole point of training the model isn&amp;rsquo;t the loss value itself; it&amp;rsquo;s the actual outcomes the model enables!&lt;/p>
&lt;p>The other practical reason is that if you change your loss formulation midway through training or between runs, you need something objective to judge model performance by, in lieu of a comparable loss curve.&lt;/p>
&lt;p>Changing your loss formulation can change both the scale and the shape of your loss curve over the course of training.&lt;/p>
&lt;p>So yeah, duh. Do it.&lt;/p>
&lt;h2 id="-metric-group-4-loss-by-category">🗂️ Metric group #4: Loss by category&lt;/h2>
&lt;p>Another fairly obvious one, but if you can break out your average loss per batch, per epoch, or per test evaluation &lt;em>by the type of sample&lt;/em>, you might be able to find data quality or model parameterization issues.&lt;/p>
&lt;p>For example, for a language model you may have different types of queries or chat requests that the model struggles on.&lt;/p>
&lt;p>For us, in the music domain, we have found that different genres, stems, or even different bucketed BPM ranges give our models trouble.&lt;/p>
&lt;p>So if something is going haywire in a particular category, it can inspire you to do one of the healthiest things you can do in a model training project: &lt;strong>actually look at the data&lt;/strong>!&lt;/p>
&lt;p>The solutions for loss discrepancies between categories can range from:&lt;/p>
&lt;ul>
&lt;li>Correcting data quality issues in those categories&lt;/li>
&lt;li>Adjusting per-sample or per-category weights in some way to bias the model to perform better on those samples (note: do this at the data sampling time, not at loss calculation time if you can)&lt;/li>
&lt;li>Learning that those examples are harder than you thought, and accepting it!&lt;/li>
&lt;/ul>
&lt;blockquote>
&lt;p>⚠️ Note: if you want to compare loss by category you NEED to scale your loss so that, when this measurement is taken, each sample&amp;rsquo;s loss has the same weight, regardless of sample length (for sequence models) or label count. &lt;BR>&lt;BR> If you&amp;rsquo;re just getting less loss because some samples are shorter or have fewer labels, that&amp;rsquo;s not telling you anything useful about how hard the model finds one category of sample versus another.&lt;/p>
&lt;/blockquote>
&lt;p>One useful pattern for this is using torch&amp;rsquo;s &lt;code>reduction='none'&lt;/code> option when it is available:&lt;/p>
&lt;div class="highlight">&lt;div style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
&lt;table style="border-spacing:0;padding:0;margin:0;border:0;">&lt;tr>&lt;td style="vertical-align:top;padding:0;margin:0;border:0;">
&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 1
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 2
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 3
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 4
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 5
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 6
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 7
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 8
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 9
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">10
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">11
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">12
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">13
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">14
&lt;/span>&lt;/code>&lt;/pre>&lt;/td>
&lt;td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># step 1: calculate per-sample loss for your batch&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>my_cool_bce_loss &lt;span style="color:#f92672">=&lt;/span> F&lt;span style="color:#f92672">.&lt;/span>binary_cross_entropy_with_logits(
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> predictions, targets,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> reduction&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#39;none&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># step 2: measure raw loss, cut by category, etc&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># ...&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># step 3: reduction via .mean() to get the actual loss to backprop over &lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>loss &lt;span style="color:#f92672">=&lt;/span> my_cool_bce_loss&lt;span style="color:#f92672">.&lt;/span>mean()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># step 4: backprop!&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>loss&lt;span style="color:#f92672">.&lt;/span>backward()
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/td>&lt;/tr>&lt;/table>
&lt;/div>
&lt;/div>&lt;p>Again, remember to normalize loss for length &amp;amp; label count.&lt;/p>
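&lt;p>For a sequence model, that normalization might look like the following (a sketch, assuming &lt;code>predictions&lt;/code> and &lt;code>targets&lt;/code> are shaped &lt;code>(batch, time)&lt;/code> with a 0/1 padding &lt;code>mask&lt;/code> of the same shape):&lt;/p>
&lt;pre>&lt;code class="language-python">import torch.nn.functional as F

# per-token loss: (batch, time)
per_token = F.binary_cross_entropy_with_logits(
    predictions, targets, reduction=&amp;#39;none&amp;#39;
)

# zero out padding, then average over *valid* tokens only,
# so short and long samples carry equal weight per sample
per_sample = (per_token * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
&lt;/code>&lt;/pre>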
&lt;p>You may also not be able to usefully report per-batch per-category losses if the number of categories is high and you don&amp;rsquo;t encounter them all every batch. This requires accumulating and reporting these losses every &lt;code>M&lt;/code> training steps, every epoch, or every test loop. It&amp;rsquo;s up to you.&lt;/p>
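&lt;p>A minimal accumulator for that might look like this (a sketch; &lt;code>batch_categories&lt;/code> is hypothetical and would come from your dataset, and &lt;code>per_sample&lt;/code> is from the sketch above):&lt;/p>
&lt;pre>&lt;code class="language-python">from collections import defaultdict

cat_loss_sum = defaultdict(float)
cat_count = defaultdict(int)

# inside the training loop, after computing per-sample losses:
for sample_loss, category in zip(per_sample.tolist(), batch_categories):
    cat_loss_sum[category] += sample_loss
    cat_count[category] += 1

# every M steps (or every epoch / test loop), report and reset:
per_category_mean = {c: cat_loss_sum[c] / cat_count[c] for c in cat_count}
&lt;/code>&lt;/pre>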
&lt;p>Alright, we&amp;rsquo;ve covered them all!&lt;/p>
&lt;p>Let&amp;rsquo;s look at a rough sketch of our training loop with respect to all of these metrics.&lt;/p>
&lt;h2 id="-putting-it-all-together-the-learning-loop-sketch">🔄 Putting it all together: the learning loop sketch&lt;/h2>
&lt;p>An example of how all this might come together and be structured in a classic train/test loop:&lt;/p>
&lt;div class="highlight">&lt;div style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
&lt;table style="border-spacing:0;padding:0;margin:0;border:0;">&lt;tr>&lt;td style="vertical-align:top;padding:0;margin:0;border:0;">
&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 1
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 2
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 3
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 4
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 5
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 6
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 7
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 8
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 9
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">10
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">11
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">12
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">13
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">14
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">15
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">16
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">17
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">18
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">19
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">20
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">21
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">22
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">23
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">24
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">25
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">26
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">27
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">28
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">29
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">30
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">31
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">32
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">33
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">34
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">35
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">36
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">37
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">38
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">39
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">40
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">41
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">42
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">43
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">44
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">45
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">46
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">47
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">48
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">49
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">50
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">51
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">52
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">53
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">54
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">55
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">56
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">57
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">58
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">59
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">60
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">61
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">62
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">63
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">64
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">65
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">66
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">67
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">68
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">69
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">70
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">71
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">72
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">73
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">74
&lt;/span>&lt;/code>&lt;/pre>&lt;/td>
&lt;td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>step &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">1&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">for&lt;/span> epoch &lt;span style="color:#f92672">in&lt;/span> range(num_epochs):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># ── TRAIN ──────────────────────────────────────────────────&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> model&lt;span style="color:#f92672">.&lt;/span>train()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> epoch_loss_sum, epoch_steps &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">0.0&lt;/span>, &lt;span style="color:#ae81ff">0&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> accum_loss &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">0.0&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">for&lt;/span> batch_idx, batch &lt;span style="color:#f92672">in&lt;/span> enumerate(train_loader):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> pred &lt;span style="color:#f92672">=&lt;/span> model(batch)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> per_sample_losses &lt;span style="color:#f92672">=&lt;/span> compute_per_sample_losses(pred, batch)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> loss &lt;span style="color:#f92672">=&lt;/span> reduce_losses(per_sample_losses, loss_weights)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> (loss &lt;span style="color:#f92672">/&lt;/span> grad_accum_steps)&lt;span style="color:#f92672">.&lt;/span>backward()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> accum_loss &lt;span style="color:#f92672">+=&lt;/span> loss&lt;span style="color:#f92672">.&lt;/span>item()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># ── Optimizer step (every grad_accum_steps batches) ──&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> (batch_idx &lt;span style="color:#f92672">+&lt;/span> &lt;span style="color:#ae81ff">1&lt;/span>) &lt;span style="color:#f92672">%&lt;/span> grad_accum_steps &lt;span style="color:#f92672">==&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span>:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># Gradient norms (before clip)&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> grad_norms_per_module &lt;span style="color:#f92672">=&lt;/span> compute_grad_norm_per_module(model)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
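&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>            &lt;span style="color:#75715e"># note: clip_grad_norm_ clips in place and returns the total norm measured *before* clipping,&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>            &lt;span style="color:#75715e"># so a second call with max_norm=inf is a no-op that measures the post-clip norm&lt;/span>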
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> grad_norm_before &lt;span style="color:#f92672">=&lt;/span> clip_grad_norm_(params, max_norm)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> grad_norm_after &lt;span style="color:#f92672">=&lt;/span> clip_grad_norm_(params, float(&lt;span style="color:#e6db74">&amp;#39;inf&amp;#39;&lt;/span>))
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> grad_clip_ratio &lt;span style="color:#f92672">=&lt;/span> grad_norm_before &lt;span style="color:#f92672">/&lt;/span> max_norm
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> old_params &lt;span style="color:#f92672">=&lt;/span> snapshot_params_to_cpu(model)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> optimizer&lt;span style="color:#f92672">.&lt;/span>step()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> scheduler&lt;span style="color:#f92672">.&lt;/span>step()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> optimizer&lt;span style="color:#f92672">.&lt;/span>zero_grad()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># Update norms (after step)&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> update_norms &lt;span style="color:#f92672">=&lt;/span> compute_update_norms(
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> model, old_params, grad_norm_after_clip&lt;span style="color:#f92672">=&lt;/span>grad_norm_after
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> )
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> step_loss &lt;span style="color:#f92672">=&lt;/span> accum_loss &lt;span style="color:#f92672">/&lt;/span> grad_accum_steps
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> epoch_loss_sum &lt;span style="color:#f92672">+=&lt;/span> step_loss
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> epoch_steps &lt;span style="color:#f92672">+=&lt;/span> &lt;span style="color:#ae81ff">1&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> accum_loss &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">0.0&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> rank &lt;span style="color:#f92672">==&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span>:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> wandb&lt;span style="color:#f92672">.&lt;/span>log({
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#34;lr&amp;#34;&lt;/span>: scheduler&lt;span style="color:#f92672">.&lt;/span>get_last_lr()[&lt;span style="color:#ae81ff">0&lt;/span>],
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#34;train/loss&amp;#34;&lt;/span>: step_loss,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#34;grad_clip_ratio&amp;#34;&lt;/span>: grad_clip_ratio,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">**&lt;/span>grad_norms_per_module,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">**&lt;/span>update_norms,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> }, step&lt;span style="color:#f92672">=&lt;/span>step)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> step &lt;span style="color:#f92672">+=&lt;/span> &lt;span style="color:#ae81ff">1&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> rank &lt;span style="color:#f92672">==&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span>:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> wandb&lt;span style="color:#f92672">.&lt;/span>log({
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#34;train/loss_epoch&amp;#34;&lt;/span>: epoch_loss_sum &lt;span style="color:#f92672">/&lt;/span> epoch_steps,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> }, step&lt;span style="color:#f92672">=&lt;/span>step)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># ── TEST ───────────────────────────────────────────────────&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> model&lt;span style="color:#f92672">.&lt;/span>eval()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">with&lt;/span> torch&lt;span style="color:#f92672">.&lt;/span>no_grad():
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">for&lt;/span> batch &lt;span style="color:#f92672">in&lt;/span> test_loader:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> pred &lt;span style="color:#f92672">=&lt;/span> model(batch)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> test_losses &lt;span style="color:#f92672">=&lt;/span> compute_per_sample_losses(pred, batch)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> task_metrics &lt;span style="color:#f92672">=&lt;/span> compute_task_metrics(pred, batch)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># all_reduce test metrics across processes here (see above)&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> rank &lt;span style="color:#f92672">==&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span>:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> wandb&lt;span style="color:#f92672">.&lt;/span>log({
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#34;test/loss&amp;#34;&lt;/span>: avg_test_loss,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">**&lt;/span>per_category_test_losses,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">**&lt;/span>avg_task_metrics,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> }, step&lt;span style="color:#f92672">=&lt;/span>step)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> save_checkpoint(&lt;span style="color:#f92672">...&lt;/span>)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/td>&lt;/tr>&lt;/table>
&lt;/div>
&lt;/div>&lt;p>Or something like that. Every train/test loop will be different.&lt;/p>
&lt;h2 id="-summary">🏁 Summary&lt;/h2>
&lt;p>Instrumenting metrics for your run takes time up front, but it will save you far more time when things go wrong!&lt;/p>
&lt;img src="https://willdrevo.com/static/img/vjlab/cogs.jpg" alt="live concert setting">
&lt;figcaption>VJLab audio AI models driving visuals in a live, realtime setting&lt;/figcaption>
&lt;/figure>
&lt;p>This post is pretty specific, but I haven&amp;rsquo;t seen anyone else really write about it. So I hope this is helpful to the 20 people in the world that need it (ha!).&lt;/p>
&lt;p>At &lt;a href="https://vjlab.ai/">VJLab&lt;/a>, we train realtime (causal) audio models that understand and listen to music like a human does for use by visual artists in live concert settings.&lt;/p>
&lt;p>This means our models have to be very fast, robust to noise, and accurate.&lt;/p>
&lt;p>I&amp;rsquo;ll share some best practices we&amp;rsquo;ve found for setting up our environment and avoiding disaster.&lt;/p>
&lt;aside id="toc">
&lt;h4>Table of Contents&lt;/h4>
&lt;nav id="TableOfContents">
&lt;ul>
&lt;li>&lt;a href="#-our-development--training-environments">💻 Our development &amp;amp; training environments&lt;/a>&lt;/li>
&lt;li>&lt;a href="#what-is-causality-and-why-do-we-care">❓What is causality, and why do we care?&lt;/a>&lt;/li>
&lt;li>&lt;a href="#-the-danger-batch-vs-realtime">⚠️ The danger: Batch vs realtime&lt;/a>&lt;/li>
&lt;li>&lt;a href="#-timing-constraints">⏱️ Timing constraints&lt;/a>&lt;/li>
&lt;li>&lt;a href="#-examples-of-snags">😭 Examples of snags&lt;/a>&lt;/li>
&lt;li>&lt;a href="#-things-you-must-test">🧪 Things you MUST test&lt;/a>&lt;/li>
&lt;li>&lt;a href="#-platform-differences">𝍔 Platform differences&lt;/a>&lt;/li>
&lt;li>&lt;a href="#-a-few-other-tips">📓 A few other tips&lt;/a>&lt;/li>
&lt;li>&lt;a href="#-in-closing">🔊 In closing&lt;/a>&lt;/li>
&lt;/ul>
&lt;/nav>
&lt;/aside>
&lt;h2 id="-our-development--training-environments">💻 Our development &amp;amp; training environments&lt;/h2>
&lt;p>My development process looks like:&lt;/p>
&lt;ol>
&lt;li>Develop locally on MacBook&lt;/li>
&lt;li>Do tiny training tests/runs on my Ubuntu machine w/ 4090 card (if needed)&lt;/li>
&lt;li>Training run on cheap cloud GPU machines&lt;/li>
&lt;li>Larger training in cloud (moar GPUs)&lt;/li>
&lt;/ol>
&lt;p>This is, if nothing else, a way to keep things super economical! We are 100% bootstrapped and don&amp;rsquo;t have VC money to burn on GPUs.&lt;/p>
&lt;figure>
&lt;img src="https://willdrevo.com/static/img/audio_ai_envs/economical_scaling.png" alt="scaling environments">
&lt;figcaption>How Nano Banana pictures our different environments&lt;/figcaption>
&lt;/figure>
&lt;p>Luckily our models are not huge (some run in realtime on CPU), but doing data transformations, scraping, and ablations can really add up if you aren&amp;rsquo;t careful.&lt;/p>
&lt;p>Most of our models run in the 5-20 ms range and are generally under 10M parameters, though we have a couple of beefier exceptions.&lt;/p>
&lt;p>I also can&amp;rsquo;t recommend Cursor&amp;rsquo;s Remote SSH feature highly enough. For the cloud-based environments, being able to fire up a coding LLM to puzzle through NCCL errors or whatever derailed your latest training run is absolutely priceless.&lt;/p>
&lt;h2 id="what-is-causality-and-why-do-we-care">❓What is causality, and why do we care?&lt;/h2>
&lt;p>Causal models exist in time, at a time &lt;code>t&lt;/code>. Quite simply, they use only the past data they&amp;rsquo;ve seen (any $x_{t_i}$ with $t_i \le t$) and no data from the future (no $x_{t_i}$ with $t_i &amp;gt; t$).&lt;/p>
&lt;p>So if you need a model to operate in realtime, you aren&amp;rsquo;t allowed to &amp;ldquo;cheat&amp;rdquo; by looking at information from the future.&lt;/p>
&lt;p>The difference is stark: VJLab&amp;rsquo;s &lt;a href="https://vjlab.ai/p/audioslice-realtime-stem-splitter-for-touchdesigner/">realtime stem splitter&lt;/a> operating at ~90Hz is operating in a much different regime than an offline splitter like &lt;a href="https://huggingface.co/spaces/abidlabs/music-separation">Demucs&lt;/a>, which has access to the entire track and can take minutes to respond.&lt;/p>
&lt;p>&lt;em>The truth is that most pretrained models are either for use in offline/batch situations, or simply aren&amp;rsquo;t performant enough for realtime audio, especially on CPU.&lt;/em>&lt;/p>
&lt;p>Thus we almost exclusively adapt or train new architectures from scratch.&lt;/p>
&lt;p>But training your own causal models from scratch or adapting batch models comes with risks.&lt;/p>
&lt;h2 id="-the-danger-batch-vs-realtime">⚠️ The danger: Batch vs realtime&lt;/h2>
&lt;p>One of the banes of your existence if you train lots of these models will be causality. If your model has to operate like a human does (and cannot see the upcoming audio offline), you have to run inference and respond in time. Without seeing the future.&lt;/p>
&lt;p>This becomes tricky when you want to train such a model, because you will have to train the model in batch (unless you have infinite patience and also infinite money).&lt;/p>
&lt;p>This creates a dangerous situation where your training necessarily differs from your serving.&lt;/p>
&lt;p>I have trained models that looked incredible performance-wise at train/test time in a batch setting, but fell apart once I fixed a causality bug or we finally deployed them to a live inference setting. It&amp;rsquo;s an upsetting experience.&lt;/p>
&lt;p>Remember: if performance looks too good to be true, it probably is.&lt;/p>
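&lt;p>One cheap sanity check I like: corrupt the &amp;ldquo;future&amp;rdquo; part of an input and verify the outputs before that point don&amp;rsquo;t change. A minimal sketch &amp;ndash; the &lt;code>model&lt;/code>, shapes, and hop size here are all hypothetical:&lt;/p>
&lt;pre>&lt;code class="language-python">import torch

# causality probe (sketch): if the model is truly causal, randomizing
# samples after time t must not change any output frame before t
x = torch.randn(1, 16000)   # 1 second of audio @ 16khz (hypothetical)
t = 8000                    # probe point, in samples

y_ref = model(x)

x_corrupt = x.clone()
x_corrupt[..., t:] = torch.randn_like(x_corrupt[..., t:])
y_corrupt = model(x_corrupt)

# map the sample index to an output frame index for your hop size (hypothetical)
frame_t = t // 512
assert torch.allclose(y_ref[..., :frame_t], y_corrupt[..., :frame_t], atol=1e-5)
&lt;/code>&lt;/pre>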
&lt;h2 id="-timing-constraints">⏱️ Timing constraints&lt;/h2>
&lt;p>Not only is causality tricky, but simple timing performance can be too.&lt;/p>
&lt;p>If your model operates on new buffers of 512 samples, sampled at 44.1kHz, guess what, you can NEVER take longer than &lt;code>512 samples / 44100 Hz ~= 11.6ms&lt;/code> to respond! In fact a good rule of thumb is to keep your full buffer processing time to half your budget (ie: &lt;code>~5.8 ms&lt;/code>).&lt;/p>
&lt;p>Note that this time budget includes your forward pass and whatever pre/post-processing in C++ you need to do.&lt;/p>
&lt;p>Even if your model has a lookahead period (ie: the model purposefully outputs values lagged slightly into the past), you still have a latency budget because new frames will just keep coming.&lt;/p>
&lt;p>This is more vital on-device where you are pulling from an audio driver buffer, but in the cloud you don&amp;rsquo;t want to fall behind either.&lt;/p>
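&lt;p>The arithmetic is trivial, but worth writing down once with your own buffer size and sample rate (numbers below match the example above):&lt;/p>
&lt;pre>&lt;code class="language-python"># latency budget for a 512-sample buffer at 44.1khz
buffer_samples = 512
sample_rate_hz = 44100

hard_budget_ms = 1000 * buffer_samples / sample_rate_hz  # ~11.6 ms between buffers
target_ms = hard_budget_ms / 2                           # rule of thumb: stay under ~5.8 ms
&lt;/code>&lt;/pre>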
&lt;h2 id="-examples-of-snags">😭 Examples of snags&lt;/h2>
&lt;p>Can it really be that bad?&lt;/p>
&lt;p>What kinds of things might befall me, you might ask?&lt;/p>
&lt;p>A few fun examples that definitely have never, ever happened to me:&lt;/p>
&lt;ol>
&lt;li>A bug in your training script reveals future labels to earlier frames because your convolutions&amp;rsquo; receptive field was large enough to reach into future frames&lt;/li>
&lt;li>Your mean pooling operation aggregates over the time dimension (and is therefore not causal)&lt;/li>
&lt;li>You realize your model trains in batch on precomputed mels, but your streaming model has to compute them live (which puts you over your latency budget)&lt;/li>
&lt;li>Because ONNX doesn&amp;rsquo;t support the FFT, you hand-rolled your own convolutional FFT, but realized it&amp;rsquo;s too slow in realtime at the frame size you&amp;rsquo;ve chosen&lt;/li>
&lt;li>The SOTA pretrained model you blindly fine-tuned, which is supposedly causal and realtime according to the paper authors &amp;hellip; totally isn&amp;rsquo;t. You have to fix the architecture and completely retrain&lt;/li>
&lt;/ol>
&lt;p>In short, a lot can go haywire if you aren&amp;rsquo;t careful.&lt;/p>
&lt;p>&lt;strong>The really awful part is: if you don&amp;rsquo;t realize a mistake like this until after you&amp;rsquo;ve finished your 3 day long training run, then you&amp;rsquo;ve just literally burned money.&lt;/strong>&lt;/p>
&lt;p>To save yourself an immense amount of time, money, and sanity, I highly recommend you have a consistent test for your dev &amp;amp; training environments.&lt;/p>
&lt;h2 id="-things-you-must-test">🧪 Things you MUST test&lt;/h2>
&lt;p>You need an integration test of your model&amp;rsquo;s entire lifecycle:&lt;/p>
&lt;!-- &lt;figure>
&lt;img src="https://willdrevo.com/static/img/audio_ai_envs/integration_testing.png" alt="integration testing workflow">
&lt;figcaption>Yet another hilarious but very visually pleasing diagram of our testing flow from Nano Banana&lt;/figcaption>
&lt;/figure> -->
&lt;p>Yes, really. Even if you&amp;rsquo;re just a researcher.&lt;/p>
&lt;p>Even if your idea of MLOps is SSHing into your beautifully managed Slurm cluster with Weka FS access and running a script with &lt;code>accelerate&lt;/code>.&lt;/p>
&lt;p>Our integration test runs in any environment (MacBook, Linux single GPU, Linux multi-GPU), in this order:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Model latency test&lt;/strong>
&lt;ul>
&lt;li>Runs the batch model against a batch_size=1 input&lt;/li>
&lt;li>Ensures the non-accelerated Python version is close enough to the latency budget&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Generating training dataset/metadata, if applicable&lt;/strong>
&lt;ul>
&lt;li>Only tiny subset of data&lt;/li>
&lt;li>Generate sample outputs of data augmentation and labels, especially if your outputs are subjective and require a human sanity check&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Training + checkpointing&lt;/strong>
&lt;ul>
&lt;li>In batch, of course&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Loading from checkpoint + resuming&lt;/strong>
&lt;ul>
&lt;li>Can also add loading older checkpoints if backwards compatibility is desired&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Exporting model to accelerated format (ie: TorchScript, ONNX, or TensorRT)&lt;/strong>&lt;/li>
&lt;li>&lt;strong>Running batch vs online equivalence test&lt;/strong>
&lt;ul>
&lt;li>&lt;em>The MOST important step&lt;/em>&lt;/li>
&lt;li>Match the outputs of your batch running alongside your realtime (streaming) accelerated model&lt;/li>
&lt;li>If you use mels in training and audio in realtime, yes, you must test the realtime with audio and do the mel transforms. &lt;strong>Don&amp;rsquo;t be lazy!&lt;/strong>&lt;/li>
&lt;li>Ensure that the output is the same to a tolerance, ie: 1e-2 or whatever is necessary for your output domain&lt;/li>
&lt;li>Keep in mind the acceleration process will often introduce floating point or numerical differences, and that&amp;rsquo;s okay&lt;/li>
&lt;li>Outputting visual or auditory examples that can be manually inspected is really helpful&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
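&lt;p>To make step 6 concrete, here&amp;rsquo;s a minimal sketch of the batch vs streaming equivalence check. The &lt;code>batch_model&lt;/code> and &lt;code>streaming_model&lt;/code> interfaces are hypothetical (yours will differ), but the shape of the test is the point:&lt;/p>
&lt;pre>&lt;code class="language-python">import torch

def check_batch_vs_streaming(batch_model, streaming_model, audio, buffer_size, tol=1e-2):
    # hypothetical interfaces: the batch model sees the whole clip at once,
    # the streaming model consumes one buffer at a time and hands state back in
    with torch.no_grad():
        batch_out = batch_model(audio.unsqueeze(0))[0]

        outs, state = [], streaming_model.init_state()
        for start in range(0, audio.shape[-1], buffer_size):
            buf = audio[..., start:start + buffer_size]
            out, state = streaming_model(buf.unsqueeze(0), state)
            outs.append(out[0])
        stream_out = torch.cat(outs, dim=-1)

    # export/acceleration will introduce small numerical differences; pick a
    # tolerance that is meaningful for your output domain
    assert torch.allclose(batch_out, stream_out, atol=tol)
&lt;/code>&lt;/pre>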
&lt;p>All of this logs to Weights &amp;amp; Biases and reports back a link where you can check the results.&lt;/p>
&lt;p>And trust me, if you can run it each time your node starts up, or as a pre-commit hook, you will thank me later. Or your manager will.&lt;/p>
&lt;p>And in the age of coding agents, there really is no excuse not to ship this testing code, even if it&amp;rsquo;s quite a few LOC.&lt;/p>
&lt;p>You could literally feed this post in as input and probably get a decent starting point!&lt;/p>
&lt;h2 id="-platform-differences">𝍔 Platform differences&lt;/h2>
&lt;p>One obvious callout is that you won&amp;rsquo;t be able (or need) to run every step the same way in every environment.&lt;/p>
&lt;p>Some example differences:&lt;/p>
&lt;ul>
&lt;li>Local development machines may run smaller toy versions of the model due to RAM, VRAM, or even MPS-specific constraints&lt;/li>
&lt;li>Exporting accelerators depends on the platform: exporting your torch model to TensorRT won&amp;rsquo;t happen on Mac OS X, for example&lt;/li>
&lt;li>Environments with different compute scales: running single GPU (&lt;code>python train.py&lt;/code>) vs multi-GPU (&lt;code>torchrun&lt;/code>) vs multi-instance-multi-GPU (&lt;code>torchrun&lt;/code>, &lt;code>ray&lt;/code>, etc)&lt;/li>
&lt;/ul>
&lt;p>None of this is particularly surprising or revolutionary.&lt;/p>
&lt;h2 id="-a-few-other-tips">📓 A few other tips&lt;/h2>
&lt;ol>
&lt;li>&lt;strong>Unify your training and realtime model&lt;/strong>
&lt;ul>
&lt;li>Do this by keeping input tensors in batch format at all points in the graph&lt;/li>
&lt;li>This allows you to make your realtime (exportable) torch module a simple wrapper around the batch training model, where you set batch_size=1 and also handle state input/output (see the sketch after this list)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Think about stateless inference&lt;/strong>
&lt;ul>
&lt;li>Remember most all accelerated model formats and serving techniques are stateless&lt;/li>
&lt;li>You&amp;rsquo;ll need to hand state back in manually - previous mel frames, KV caches, LSTM hidden states, etc - since your model can&amp;rsquo;t use internal logic to update state.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Just ask&lt;/strong>
&lt;ul>
&lt;li>Asking a top-tier coding agent to try to poke holes in your testing strategy or model architecture to find causality issues ahead of time is well worth your money and effort, even if the true positive rate is 10-20%.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Avoid &lt;code>BatchNorm&lt;/code>!&lt;/strong>
&lt;ul>
&lt;li>&lt;code>LayerNorm&lt;/code>, &lt;code>GroupNorm&lt;/code>, or &lt;code>InstanceNorm&lt;/code> are your (causal) friends!&lt;/li>
&lt;li>If your batch contains temporal frames, &lt;code>BatchNorm&lt;/code> technically &amp;ldquo;cheats&amp;rdquo; in training: it computes mean/variance stats across the whole batch, so future frames influence how inputs in earlier frames get normalized
&lt;ul>
&lt;li>However. This violation of causality isn&amp;rsquo;t actually terrible for deploy-time inference, per se. This is because in a model frozen for inference (same as &lt;code>.eval()&lt;/code> mode), the mean/variance stored in the &lt;code>BatchNorm&lt;/code> op are frozen.&lt;/li>
&lt;li>So your model will work just fine in production! But it will derail you when you run your streaming test to verify batch and streaming are the same, because they won&amp;rsquo;t be!&lt;/li>
&lt;li>And if you write off the difference as &amp;ldquo;oh that&amp;rsquo;s just &lt;code>BatchNorm&lt;/code>, let&amp;rsquo;s ignore the batch vs streaming discrepancy&amp;rdquo;, you might miss a real causality issue. This is the true danger of &lt;code>BatchNorm&lt;/code>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Start with a single script&lt;/strong> 😱
&lt;ul>
&lt;li>Sometimes in the research phase I will keep the entire new model in a single &lt;code>train.py&lt;/code> as long as I can. &lt;em>Horrendous, I know.&lt;/em>&lt;/li>
&lt;li>Coding LLMs seem to do quite well with this, as a bonus&lt;/li>
&lt;li>Anything re-usable I factor out as soon as I can + add a unit test, so other models in future can benefit&lt;/li>
&lt;li>Once the model is working end to end from train to accelerated realtime, then I move models into proper Python modules for reusability, class composition, and so on&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>If your model needs an FFT during inference, think carefully about how you train&lt;/strong>
&lt;ul>
&lt;li>For example, ONNX doesn&amp;rsquo;t support &lt;code>torch&lt;/code>&amp;rsquo;s FFT or iFFT operation&lt;/li>
&lt;li>The platform you deploy to (OS X, Ubuntu, Windows) will determine the fastest way to compute the FFT, but beware, not all FFT routines have the same scaling. For this reason, we usually choose &lt;code>libtorch&lt;/code>, which supports &lt;code>torch&lt;/code>&amp;rsquo;s FFT routines&lt;/li>
&lt;li>You should always try to transform your model to an accelerated format/method that keeps your train/deploy equivalence intact or you will run into problems&lt;/li>
&lt;li>Some FFT libraries, while fast, are not licensed well for commercial use&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
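&lt;p>For tips 1 and 2, here&amp;rsquo;s roughly what that wrapper can look like. A sketch only &amp;ndash; the &lt;code>batch_model&lt;/code> interface and the state contents (here just a window of past input) are hypothetical:&lt;/p>
&lt;pre>&lt;code class="language-python">import torch

class StreamingWrapper(torch.nn.Module):
    # wraps a batch-format training model for stateless, exportable streaming
    # inference: all state goes in and comes back out explicitly
    def __init__(self, batch_model):
        super().__init__()
        self.batch_model = batch_model

    def forward(self, frame, prev_frames):
        # frame: (1, frame_len) new audio; prev_frames: (1, context_len) past audio
        context = torch.cat([prev_frames, frame], dim=-1)
        y = self.batch_model(context)               # batch_size=1 throughout
        new_prev = context[..., frame.shape[-1]:]   # slide the context window
        return y, new_prev
&lt;/code>&lt;/pre>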
&lt;h2 id="-in-closing">🔊 In closing&lt;/h2>
&lt;p>Truly realtime audio is tricky! And often your first question should be: does this even need to be realtime?&lt;/p>
&lt;p>Many features you might imagine can simply be computed quickly (but in batch), saving you the headache.&lt;/p>
&lt;p>But when you do truly need it, make sure to keep your eyes open for issues like these, and put a strong integration testing framework in place to prevent you from wasting time and money.&lt;/p>
&lt;p>Happy training &amp;amp; testing :)&lt;/p></description></item><item><title>Introducing: a Musical Mel Transform</title><link>https://willdrevo.com/2025/09/09/introducing-a-musical-mel-transform-in-pytorch/</link><pubDate>Tue, 09 Sep 2025 22:54:15 -0700</pubDate><guid>https://willdrevo.com/2025/09/09/introducing-a-musical-mel-transform-in-pytorch/</guid><description>&lt;figure>
&lt;img src="https://willdrevo.com/static/img/musical_mel/lowest_filters.png" alt="" width=800>
&lt;figcaption>&lt;/figcaption>
&lt;/figure>
&lt;p>I&amp;rsquo;m open sourcing a useful tool in our realtime audio AI toolbox here at &lt;a href="https://vjlab.ai/">VJLab&lt;/a>, a &lt;a href="https://github.com/worldveil/musical_mel_transform_torch">Musical mel transform&lt;/a>.&lt;/p>
&lt;p>It&amp;rsquo;s written in PyTorch and can be made ONNX-compatible with a convolutional FFT (with &lt;code>use_conv_fft=True&lt;/code>).&lt;/p>
&lt;p>If you&amp;rsquo;ve ever wanted audio features that directly represent semitones (or quarter tones!) this is the package for you.&lt;/p>
&lt;a href="https://github.com/worldveil/musical_mel_transform_torch">&lt;img src="https://gh-card.dev/repos/worldveil/musical_mel_transform_torch.svg">&lt;/a>
&lt;h3 id="why-have-a-mel-transform-centered-on-musical-notes">Why have a mel transform centered on musical notes?&lt;/h3>
&lt;p>In general, the mel transform has the following benefits:&lt;/p>
&lt;ul>
&lt;li>Better featurization for perceptually relevant frequencies for human ears&lt;/li>
&lt;li>Dimensionality reduction&lt;/li>
&lt;li>Some noise robustness (since mel transforms average or smooth over multiple FFT bins)&lt;/li>
&lt;/ul>
&lt;p>And what I&amp;rsquo;m calling a &amp;ldquo;musical&amp;rdquo; mel transform, where the mel bins are aligned to pitch centers, has additional advantages if:&lt;/p>
&lt;ul>
&lt;li>Your task is transcription or musical note related&lt;/li>
&lt;li>Your usecase is realtime/speed-critical and you care about low-end discrimination (vs say, a CQT that would do well on low frequencies but is very slow)&lt;/li>
&lt;li>You&amp;rsquo;re comparing against a completely learned filterbank, or that approach isn&amp;rsquo;t working&lt;/li>
&lt;/ul>
&lt;p>Personally I have found this &lt;code>MusicalMelTransform&lt;/code> beats raw FFTs and standard mels for realtime usecases. The package also has an option &lt;code>learnable_weights=&amp;quot;fft&amp;quot;&lt;/code> to add learnable parameters to reweight the incoming FFT bins for loudness, which is important.&lt;/p>
&lt;p>The default arguments convert the FFT magnitudes to power (&lt;code>power: int = 2&lt;/code>) and then to a dB scale (&lt;code>to_db: bool = True&lt;/code>) as well, which is common in audio AI frontend feature extraction.&lt;/p>
&lt;p>TL;DR - if you&amp;rsquo;re working with music in your AI usecase, then having features that map directly to musical notes can sometimes help with performance!&lt;/p>
&lt;h3 id="how-does-it-work">How does it work?&lt;/h3>
&lt;p>Mel scale is just a mapping of FFT bins -&amp;gt; new bins. So each mel bin is just a weighted sum of the linearly-spaced FFT bins. That&amp;rsquo;s it!&lt;/p>
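&lt;p>In code, that&amp;rsquo;s one matrix multiply. A toy sketch (shapes are illustrative, not the repo&amp;rsquo;s API):&lt;/p>
&lt;pre>&lt;code class="language-python">import torch

n_fft_bins, n_mels = 1025, 128           # e.g. a 2048-point FFT
fft_mag = torch.rand(1, n_fft_bins)      # FFT magnitudes for one frame
fbank = torch.rand(n_mels, n_fft_bins)   # each row: weights over the FFT bins

mel = fft_mag @ fbank.T                  # (1, n_mels) mel features
&lt;/code>&lt;/pre>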
&lt;p>This code:&lt;/p>
&lt;ol>
&lt;li>Adds some adaptive widening (with &lt;code>adaptive=True&lt;/code>) that interpolates weighted combinations of FFT bins to make pitches discernible at pitch centers&lt;/li>
&lt;li>Gives a configurable way to control the number of high frequency features (with &lt;code>passthrough&lt;/code> arguments)&lt;/li>
&lt;li>Provides an optional ONNX compatible FFT operator&lt;/li>
&lt;/ol>
&lt;p>You can also narrow or widen your tone granularity &amp;ndash; semi- or quarter-tones are just a parameter change:&lt;/p>
&lt;div class="highlight">&lt;div style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
&lt;table style="border-spacing:0;padding:0;margin:0;border:0;">&lt;tr>
&lt;td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># `interval` is the &amp;#34;number of semitones&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>chromatic_transform &lt;span style="color:#f92672">=&lt;/span> MusicalMelTransform(interval&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">1.0&lt;/span>) &lt;span style="color:#75715e"># semitone scale&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>quarter_tone_transform &lt;span style="color:#f92672">=&lt;/span> MusicalMelTransform(interval&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">0.5&lt;/span>) &lt;span style="color:#75715e"># quarter tone scale&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/td>&lt;/tr>&lt;/table>
&lt;/div>
&lt;/div>&lt;h3 id="how-does-it-compare-to-other-options">How does it compare to other options?&lt;/h3>
&lt;p>Here&amp;rsquo;s a quick comparison between:&lt;/p>
&lt;ol>
&lt;li>Traditional linearly-spaced FFT&lt;/li>
&lt;li>&lt;code>torchaudio&lt;/code> mel scale transform&lt;/li>
&lt;li>MusicalMelTransform (this repo)&lt;/li>
&lt;/ol>
&lt;p>I have constrained the two mel transforms (2 &amp;amp; 3) to have the same dimensionality, and with &lt;code>f_max&lt;/code> at 16khz to make the comparison fair:&lt;/p>
&lt;figure>
&lt;img src="https://willdrevo.com/static/img/musical_mel/specs/fft.png" alt="" width=800>
&lt;figcaption>&lt;/figcaption>
&lt;/figure>
&lt;figure>
&lt;img src="https://willdrevo.com/static/img/musical_mel/specs/torchaudio_mel.png" alt="" width=800>
&lt;figcaption>&lt;/figcaption>
&lt;/figure>
&lt;figure>
&lt;img src="https://willdrevo.com/static/img/musical_mel/specs/musical_mel.png" alt="" width=800>
&lt;figcaption>&lt;/figcaption>
&lt;/figure>
&lt;p>As you can see, especially in the low frequencies, the resolution of MusicalMelTransform is better. This is great for music, and especially for low-frequency heavy music like today&amp;rsquo;s pop and electronic music. The graph here shows a kick pattern, typical in house or techno music.&lt;/p>
&lt;p>If we pick a number of low-end sub notes and plot the corresponding &amp;ldquo;filters&amp;rdquo; from the &lt;code>MusicalMelTransform&lt;/code> you can see how this works more concretely:&lt;/p>
&lt;figure>
&lt;img src="https://willdrevo.com/static/img/musical_mel/low_freq_filters.png" alt="" width=800>
&lt;figcaption>&lt;/figcaption>
&lt;/figure>
&lt;p>Low notes are impossibly close to each other, especially under 100hz, but that&amp;rsquo;s life (unless you can stomach &lt;a href="https://dsp.stackexchange.com/a/46657/">the speed of a CQT transform&lt;/a>). This package tries to cleverly interpolate FFT bins to mel pitch center bins so that lower frequencies are &amp;ldquo;discernible&amp;rdquo; from each other. But keep in mind we only have what the humble FFT offers us! We are just interpolating.&lt;/p>
&lt;p>Contrast this to a normal FFT. The FFT spaces features linearly, so at the top of the frequency range we end up with many, many features that aren&amp;rsquo;t as musically relevant.&lt;/p>
&lt;p>To illustrate, let&amp;rsquo;s compare the resulting features for different transforms across different musically-relevant frequency ranges so we can see how different transforms vary:&lt;/p>
&lt;figure>
&lt;img src="https://willdrevo.com/static/img/musical_mel/freq_bin_distribution.png" alt="" width=800>
&lt;figcaption>&lt;/figcaption>
&lt;/figure>
&lt;p>As you can see:&lt;/p>
&lt;ul>
&lt;li>The vanilla FFT has a huge number of features, most of which are on veryyy high frequencies &amp;gt;6khz, which is non-ideal&lt;/li>
&lt;li>Under 150hz, where low or sub-&amp;ldquo;bass&amp;rdquo; lives, &lt;code>MusicalMelTransform&lt;/code> smoothly interpolates, giving a model better features to work with&lt;/li>
&lt;li>Under 500hz, the &lt;code>MusicalMelTransform&lt;/code> still has the best coverage &amp;ndash; where most all the bass, root notes, and fundamental frequencies reside&lt;/li>
&lt;li>For a transform with the exact same number of features, the torchaudio transform has ~1.5x as many features from 1khz and up&lt;/li>
&lt;li>But if we&amp;rsquo;re willing to spend a few more features, an optimized &lt;code>MusicalMelTransform&lt;/code> with passthrough @ 5khz to let the FFT bins come through &amp;ldquo;covers&amp;rdquo; the torchaudio mel transform pretty much everywhere! So we can (except for the 1-3khz band) have our cake and eat it too.&lt;/li>
&lt;/ul>
&lt;h3 id="-warning-of-non-magic-">⚠️ Warning of non-magic ⚠️&lt;/h3>
&lt;p>It&amp;rsquo;s important to remember all mel features are derivative of the FFT. If you&amp;rsquo;re working with a small FFT of, like 128 or whatever, this package won&amp;rsquo;t work miracles!&lt;/p>
&lt;p>Your resolution on low end will still be crap.&lt;/p>
&lt;p>I wouldn&amp;rsquo;t use this package below FFT size of 512, tbh. But by cleverly assigning and interpolating those FFT bins you do have, this package is a way to &amp;ldquo;stretch&amp;rdquo; the resolution you do have to make discrimination on the low end easier.&lt;/p>
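&lt;p>A quick back-of-envelope shows why:&lt;/p>
&lt;pre>&lt;code class="language-python"># FFT bin spacing vs semitone spacing down low (44.1khz, 2048-point FFT)
bin_spacing_hz = 44100 / 2048                   # ~21.5 hz between FFT bins

a1_hz = 55.0                                    # the note A1
semitone_gap_hz = a1_hz * (2 ** (1 / 12) - 1)   # ~3.3 hz to the next semitone
# several neighboring low notes land inside a single FFT bin
&lt;/code>&lt;/pre>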
&lt;p>The main benefit is simply that all the features you have are, by definition, musically relevant.&lt;/p>
&lt;h3 id="characteristics-of-mel-transforms-and-some-helpful-tweaks-to-make">Characteristics of mel transforms, and some helpful tweaks to make&lt;/h3>
&lt;p>Here are some plots of mel bins (the x-axis dots + colored lines) as composed of FFT bin centers (the vertical grey lines) as we move up in frequency. We&amp;rsquo;ll talk through some implications.&lt;/p>
&lt;p>If we zoom in to the first (very lowest) filters on &lt;code>MusicalMelTransform&lt;/code> @ 2048 FFT size, 44.1khz you can see how related the lowest filters are. Because the FFT bins themselves are ~20hz apart, the mel bins below are just sliiightly different linear combinations of 2-3 low bins:&lt;/p>
&lt;figure>
&lt;img src="https://willdrevo.com/static/img/musical_mel/lowest_filters.png" alt="" width=800>
&lt;figcaption>&lt;/figcaption>
&lt;/figure>
&lt;p>The situation, of course, gets much better as we move up in frequency to even 400-800Hz range:&lt;/p>
&lt;figure>
&lt;img src="https://willdrevo.com/static/img/musical_mel/400_800hz_filters.png" alt="" width=800>
&lt;figcaption>&lt;/figcaption>
&lt;/figure>
&lt;p>And just as with any mel scale, once we get up to the really high frequencies (8th octave), the mels:&lt;/p>
&lt;ol>
&lt;li>Span multiple bins&lt;/li>
&lt;li>Ignore bins halfway between mel (pitch) centers&lt;/li>
&lt;/ol>
&lt;figure>
&lt;img src="https://willdrevo.com/static/img/musical_mel/high_filters.png" alt="" width=800>
&lt;figcaption>&lt;/figcaption>
&lt;/figure>
&lt;p>For reference, the top note on an 88-key piano is C8 &amp;ndash; these frequencies are all above that! (unless you have a &lt;a href="https://en.wikipedia.org/wiki/B%C3%B6sendorfer">Bösendorfer&lt;/a>)&lt;/p>
&lt;p>These mostly-ignored bins between filters are usually fine, since at such high frequencies we are generally hearing harmonics, which cluster in neighborhoods around each other at harmonic intervals. So throwing out much of the contribution of a few bins is less important.&lt;/p>
&lt;p>But as the frequencies continue the gaps get larger. And if some of that information is important (or you&amp;rsquo;d rather just pick an arbitrary point to have higher resolution than mels!), you can use &lt;code>MusicalMelTransform&lt;/code>&amp;rsquo;s &lt;code>passthrough_cutoff_hz&lt;/code> argument.&lt;/p>
&lt;p>Here I show what happens using &lt;code>passthrough_cutoff_hz=5000&lt;/code> and &lt;code>passthrough_grouping_size=3&lt;/code>. This effectively means, &amp;ldquo;after 5khz, don&amp;rsquo;t compute mel bins, just pass through the original FFT bins, grouping every 3 bins together&amp;rdquo;. This is the result:&lt;/p>
&lt;figure>
&lt;img src="https://willdrevo.com/static/img/musical_mel/passthrough_5khz_3_bins.png" alt="" width=800>
&lt;figcaption>&lt;/figcaption>
&lt;/figure>
&lt;p>Here you can see that after 5khz, we simply start grouping every three consecutive FFT bins into a mel bin. While it depends on your cutoff, generally the higher you set &lt;code>passthrough_cutoff_hz&lt;/code>, the larger your &lt;code>passthrough_grouping_size&lt;/code> should be.&lt;/p>
&lt;p>And of course these passthrough bins are no longer directly centered on musical notes.&lt;/p>
&lt;h3 id="scaling--normalization">Scaling &amp;amp; normalization&lt;/h3>
&lt;p>You will also notice that the magnitudes of each FFT bin going into the mel bins get much smaller than 1.0 as we climb frequencies. This is because pitches are spread across many more bins at high frequencies, and the plots have the &lt;code>norm=True&lt;/code> parameter set, which normalizes each filter to a total weight of 1.&lt;/p>
&lt;p>Due to all this rescaling, I suggest using &lt;code>learnable_weights=&amp;quot;fft&amp;quot;&lt;/code> as this inserts a vector of learnable parameters that helps you scale the original FFT magnitudes (or power, depending on your setting for &lt;code>power&lt;/code>) for your usecase. You probably want to have &lt;code>norm=False&lt;/code> in this case.&lt;/p>
&lt;p>Otherwise the &lt;code>MusicalMelTransform&lt;/code> has no learnable weights.&lt;/p>
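&lt;p>Putting the pieces from this post together, a configuration might look like the sketch below. The argument names are the ones discussed above; check the repo for exact signatures and defaults:&lt;/p>
&lt;pre>&lt;code class="language-python">transform = MusicalMelTransform(
    interval=1.0,                  # semitone-centered bins
    passthrough_cutoff_hz=5000,    # above 5khz, pass FFT bins straight through...
    passthrough_grouping_size=3,   # ...grouping every 3 bins together
    learnable_weights=&amp;quot;fft&amp;quot;,      # learnable rescaling of the FFT magnitudes
    norm=False,                    # let the learned weights handle scaling
)
&lt;/code>&lt;/pre>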
&lt;h3 id="dont-ignore-the-bitter-lesson">Don&amp;rsquo;t ignore the bitter lesson&lt;/h3>
&lt;p>We should be careful here &amp;ndash; the temptation to ignore &lt;a href="https://www.cs.utexas.edu/~eunsol/courses/data/bitter_lesson.pdf">The Bitter Lesson&lt;/a> by constantly tweaking the &lt;code>f_max&lt;/code>, &lt;code>passthrough_cutoff_hz&lt;/code>, &lt;code>passthrough_grouping_size&lt;/code>, &lt;code>norm&lt;/code>, etc with your transform to make your network perform better is real.&lt;/p>
&lt;p>At some point we just need the information to flow through to a reasonable network that will learn from it.&lt;/p>
&lt;p>While I do think the Bitter Lesson applies less in a realtime or resource-constrained scenario, do think your architecture and data through before spending your days tweaking your mel transform settings.&lt;/p>
&lt;p>The gainz you seek are in the former, not the latter.&lt;/p>
&lt;h3 id="summary">Summary&lt;/h3>
&lt;p>Again, to reiterate: a mel transform is not magic! It is a series of linear combinations on the original FFT bins.&lt;/p>
&lt;p>But if you&amp;rsquo;re clever about it, it really does help!&lt;/p>
&lt;p>Check out the repo here, make a PR, and open an issue if you spot a problem!&lt;/p>
&lt;a href="https://github.com/worldveil/musical_mel_transform_torch">&lt;img src="https://gh-card.dev/repos/worldveil/musical_mel_transform_torch.svg">&lt;/a>
&lt;h3 id="about-vjlabai">About VJLab.AI&lt;/h3>
&lt;figure>
&lt;img src="https://willdrevo.com/static/img/vjlab/audioslice_td.png" alt="" width=800>
&lt;figcaption>Our realtime stem splitter feeding into GLSL shaders in TouchDesigner&lt;/figcaption>
&lt;/figure>
&lt;p>If you&amp;rsquo;re curious to learn more about what kinds of things we&amp;rsquo;re doing at &lt;a href="https://vjlab.ai">VJLab.AI&lt;/a> with all this stuff, check out:&lt;/p>
&lt;ul>
&lt;li>A &lt;a href="https://youtu.be/colb1meAr-M?feature=shared&amp;amp;t=474">video&lt;/a> showcasing our tool, &lt;a href="https://vjlab.ai/p/audioslice-realtime-stem-splitter-for-touchdesigner/">AudioSlice&lt;/a>, that &lt;a href="https://www.instagram.com/stories/highlights/17962725260622555/">I have personally used to perform visuals&lt;/a> for acts like John Summit, Dom Dolla, Gorgon City, Benny Benassi, GriZ and many more&lt;/li>
&lt;li>Our &lt;a href="https://vjlab.ai/p/beatsage/">beat tracker&lt;/a>, BeatSage, for live concert VJs&lt;/li>
&lt;/ul>
&lt;p>To stay up to date with what we&amp;rsquo;re doing:&lt;/p>
&lt;ul>
&lt;li>You can follow my &lt;a href="https://www.youtube.com/@its-drevo">YouTube&lt;/a> account for tutorials&lt;/li>
&lt;li>or &lt;a href="https://www.instagram.com/its.drevo/">Instagram&lt;/a> for tutorials and teasers&lt;/li>
&lt;li>Or our new &lt;a href="https://docs.google.com/forms/d/e/1FAIpQLSe1ZFUTfdiJ-W563tILt-F9KBo75PgvgPTlAUBxDEFzfUaHGA/viewform?usp=dialog">email list&lt;/a> for updates on new tools, models, repos, and updates to our existing apps&lt;/li>
&lt;/ul>
&lt;p>Our next generation of realtime audio models for visual artists and live performers are coming soon :)&lt;/p></description></item><item><title>Music AI state of the union: an ISMIR '24 summary</title><link>https://willdrevo.com/2024/12/05/music-ai-state-of-the-union-an-ismir-24-summary/</link><pubDate>Thu, 05 Dec 2024 01:20:03 -0500</pubDate><guid>https://willdrevo.com/2024/12/05/music-ai-state-of-the-union-an-ismir-24-summary/</guid><description>&lt;figure>
&lt;img src="https://willdrevo.com/static/img/ismir/crowd.jpeg" alt="ISMIR '24">
&lt;figcaption>ISMIR '24 held in San Francisco&lt;/figcaption>
&lt;/figure>
&lt;p>&lt;a href="https://ismir2024.ismir.net/">ISMIR&lt;/a> &amp;lsquo;24 (the conference for the International Society for Music Information Retrieval) this year was fantastic. I had an absolute blast getting to meet up with the brightest minds in the music AI space.&lt;/p>
&lt;p>The pace of innovation in music AI is absolutely breathtaking.&lt;/p>
&lt;p>For this post I chose a few themes I noticed at the conference. In each section I&amp;rsquo;ll describe my favorite paper and mention a few other papers to check out. You can see a full list of ISMIR &amp;lsquo;24 papers &lt;a href="https://ismir2024program.ismir.net/papers.html">here&lt;/a>.&lt;/p>
&lt;aside id="toc">
&lt;h4>Table of Contents&lt;/h4>
&lt;nav id="TableOfContents">
&lt;ul>
&lt;li>
&lt;ul>
&lt;li>&lt;a href="#theme-1-latent-spaces---discrete-and-continuous">Theme #1: Latent spaces - discrete and continuous&lt;/a>&lt;/li>
&lt;li>&lt;a href="#theme-2-diffusion-for-audio-generation">Theme #2: Diffusion for audio generation&lt;/a>&lt;/li>
&lt;li>&lt;a href="#theme-3-self-supervised-learning-ssl-techniques-gaining-steam">Theme #3: Self-supervised learning (SSL) techniques gaining steam&lt;/a>&lt;/li>
&lt;li>&lt;a href="#theme-4-music-stem-separation-mss-separation-by-query">Theme #4: Music stem separation (MSS): separation by query&lt;/a>&lt;/li>
&lt;li>&lt;a href="#theme-5-better-transcription-data-better-transcription-models">Theme #5: Better transcription data, better transcription models&lt;/a>&lt;/li>
&lt;li>&lt;a href="#theme-6-attribution">Theme #6: Attribution&lt;/a>&lt;/li>
&lt;li>&lt;a href="#summary">Summary&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/nav>
&lt;/aside>
&lt;p>Finally, if you’re in the music AI space and want to be friends or grab a coffee, hit me up on &lt;a href="https://x.com/itsdrevo">twitter&lt;/a> or shoot me a &lt;a href="https://www.linkedin.com/in/willdrevo">message&lt;/a>!&lt;/p>
&lt;p>I&amp;rsquo;m currently working on a new stealth project building realtime models that make audioreactive light shows &lt;a href="https://www.instagram.com/stories/highlights/17962725260622555/">like Coachella&lt;/a> possible &amp;ndash; a perfect fit for ISMIR.&lt;/p>
&lt;h3 id="theme-1-latent-spaces---discrete-and-continuous">Theme #1: Latent spaces - discrete and continuous&lt;/h3>
&lt;p>A recent trend in audio is training better latent space representations. They help with both compression and generation tasks. The two are somewhat related — audio is an extremely information dense modality, and bottlenecking information is playing out much like we saw in the image world once diffusion started happening in latent space rather than pixel space.&lt;/p>
&lt;p>Neural codecs using RVQ (ie: &lt;a href="https://github.com/facebookresearch/encodec">Encodec&lt;/a>, &lt;a href="https://github.com/descriptinc/descript-audio-codec">DAC&lt;/a>) or continuous autoencoders are the two preferred types of information bottlenecks today.&lt;/p>
&lt;p>Codecs are better at high quality reconstruction and phase coherence, but reconstruction falls apart if you shift them in time. The codebook vectors can be used as discrete tokens, or the last layer before quantization can be used as a continuous latent.&lt;/p>
&lt;p>Continuous latents are wonderful for downstream tasks and are quite good at capturing lower frequency or harmonic components, though often at the expense of phase when decoded.&lt;/p>
&lt;p>My favorite ISMIR &amp;lsquo;24 paper on this theme was:&lt;/p>
&lt;p>📚 &lt;em>&lt;strong>Music2Latent: Consistency Autoencoders for Latent Audio Compression&lt;/strong>&lt;/em> [&lt;a href="https://arxiv.org/abs/2408.06500">paper&lt;/a>] [&lt;a href="https://github.com/SonyCSLParis/music2latent">github&lt;/a>], the PhD work of &lt;a href="https://x.com/marco_ppasini">Marco Pasini&lt;/a> in partnership with Sony Paris.&lt;/p>
&lt;p>&lt;em>Music2Latent&lt;/em> broke the mold of difficult-to-train audio autoencoders (no GAN!) and trains with a single loss term. Most interestingly, as Marco revealed in the poster session, if one takes two latent embeddings from &lt;em>Music2Latent&lt;/em>, interpolates between them, and then decodes, you get audio that sounds like the two original waveforms mixed together in equal proportion. Extremely cool, and a huge step forward towards a better latent space for generative models.&lt;/p>
&lt;p>The only disappointing part is that the model will not be released, and that the code is under CC BY-NC 4.0 :/ but the code is on &lt;a href="https://github.com/SonyCSLParis/music2latent">Github&lt;/a>!&lt;/p>
&lt;p>Other noteworthy papers:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://arxiv.org/abs/2406.10970">Joint Audio and Symbolic Conditioning for Temporally Controlled Text-to-Music Generation&lt;/a>
&lt;ul>
&lt;li>An excellent example of a hybrid approach using Encodec: &lt;code>&amp;quot;...we use the continuous tensor z as the latent representation, while leveraging the discrete representation q for audio conditioning.&amp;quot;&lt;/code>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="https://arxiv.org/abs/2407.12563">Audio Conditioning for Music Generation via Discrete Bottleneck Features&lt;/a>
&lt;ul>
&lt;li>This is another FAIR paper, and so they also use Encodec, but as tokens in an autoregressive model&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="theme-2-diffusion-for-audio-generation">Theme #2: Diffusion for audio generation&lt;/h3>
&lt;p>Increasingly diffusion is being used for audio generation. It has a few nice properties:&lt;/p>
&lt;ul>
&lt;li>Inference can happen in parallel (not autoregressive)&lt;/li>
&lt;li>We can borrow a lot of techniques from image diffusion generation&lt;/li>
&lt;li>We don&amp;rsquo;t have to think about tokenization&lt;/li>
&lt;/ul>
&lt;p>The star paper here for me was another from both Sony and &lt;a href="https://x.com/c4dm">Queen Mary University of London&lt;/a>:&lt;/p>
&lt;p>📚 &lt;em>&lt;strong>Diff-A-Riff: Musical Accompaniment Co-creation via Latent Diffusion Models&lt;/strong>&lt;/em> [&lt;a href="https://arxiv.org/abs/2406.08384">paper&lt;/a>].&lt;/p>
&lt;p>And coincidentally, it was trained using &lt;em>Music2Latent&lt;/em>! So this is a nice segue from the last theme.&lt;/p>
&lt;p>First off, the &lt;em>Diff-A-Riff&lt;/em> generation quality is incredible. &lt;a href="https://sonycslparis.github.io/diffariff-companion/">Take a listen for yourself&lt;/a>.&lt;/p>
&lt;p>&lt;em>Diff-A-Riff&lt;/em> generates audio, conditioned by other stems, to create a target stem (the &amp;ldquo;accompaniment&amp;rdquo;). So you give it a guitar and a bass line and tell it to create a drum stem of the same length, and it will. You can even guide the accompaniment creation by conditioning with either an audio snippet or a text prompt.&lt;/p>
&lt;figure>
&lt;img src="https://willdrevo.com/static/img/ismir/diff-a-riff.png" alt="Diff-A-Riff">
&lt;figcaption>Diff-A-Riff allows conditioning with either text or audio&lt;/figcaption>
&lt;/figure>
&lt;p>As you might expect CLAP is used to achieve a shared text-audio space, but they have several other clever ways of handling the conditioning. Though the code and models will not be open sourced, it’s a really fascinating paper and some stellar work by the Sony Paris team.&lt;/p>
&lt;p>Back on theme: while it’s tempting to say that diffusion looks like the winning approach for audio generation, I don’t think we can quite be sure.&lt;/p>
&lt;p>We know &lt;a href="https://suno.com/">Suno&lt;/a> uses an autoregressive architecture (at least in v2-3) (&lt;a href="https://open.spotify.com/episode/2c1yL8hlttlkCs6nPysVi0?si=9e378b7d0fdf47fb">see this podcast&lt;/a> with their CTO, &lt;a href="https://x.com/mikeyshulman">Mikey Shulman&lt;/a>), and their generation quality is the best in the world for full-length tracks. And I don&amp;rsquo;t actually know what &lt;a href="https://www.udio.com/">Udio&lt;/a> uses, but if you do let me know!&lt;/p>
&lt;p>Other noteworthy papers:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://arxiv.org/abs/2404.10301">Long-form music generation with latent diffusion&lt;/a> (Stable Audio paper)&lt;/li>
&lt;li>&lt;a href="https://arxiv.org/abs/2408.00196">Combining audio control and style transfer using latent diffusion&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://arxiv.org/abs/2406.10970">Joint Audio and Symbolic Conditioning for Temporally Controlled Text-to-Music Generation&lt;/a> (this one uses flow matching, but it&amp;rsquo;s just a great paper)&lt;/li>
&lt;/ul>
&lt;h3 id="theme-3-self-supervised-learning-ssl-techniques-gaining-steam">Theme #3: Self-supervised learning (SSL) techniques gaining steam&lt;/h3>
&lt;p>Across a number of tasks like &lt;a href="https://arxiv.org/abs/2411.04152">beat tracking&lt;/a>, &lt;a href="https://arxiv.org/pdf/2408.02514">stem-affinity&lt;/a>, and &lt;a href="https://arxiv.org/abs/2407.07408">tonality estimation&lt;/a>, self-supervised learning (SSL) techniques started to shine this year at ISMIR.&lt;/p>
&lt;p>These techniques are especially important in the music space where labeled data is far more limited than in the text, image, or video domains.&lt;/p>
&lt;p>Favorite SSL paper:&lt;/p>
&lt;p>📚 &lt;em>&lt;strong>Stem-JEPA: A Joint-Embedding Predictive Architecture for Musical Stem Compatibility Estimation&lt;/strong>&lt;/em> [&lt;a href="https://arxiv.org/abs/2408.02514">paper&lt;/a>] [&lt;a href="https://github.com/SonyCSLParis/Stem-JEPA">github&lt;/a>] by &lt;a href="https://x.com/howariou">Alain Riou&lt;/a> et al, from Institut Polytechnique de Paris and Sony.&lt;/p>
&lt;p>Basically the gist here is that, given a few stems aligned in time, you can train a model to output the &lt;em>latent representation&lt;/em> of yet another stem that best fits the existing stem mixture (and any conditioning signals you supply).&lt;/p>
&lt;figure>
&lt;img src="https://willdrevo.com/static/img/ismir/stem-jepa.png" alt="Stem JEPA">
&lt;figcaption>The Stem-JEPA architecture&lt;/figcaption>
&lt;/figure>
&lt;p>So, why is this nice?&lt;/p>
&lt;p>Well, if you want to generate, say a bassline for your jazzy vocal, what are your options?&lt;/p>
&lt;ul>
&lt;li>Manually swap in and out stems, listening for compatibility
&lt;ul>
&lt;li>Time consuming&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Generate the missing stem
&lt;ul>
&lt;li>Expensive FLOPS-wise&lt;/li>
&lt;li>Requires you to have such a generative model in the first place with extremely high quality&lt;/li>
&lt;li>You&amp;rsquo;d still need to score the generated stem against your existing stems to make sure it&amp;rsquo;s compatible, or have a model that uses the existing mix stems as conditioning&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Train some sort of model to score your existing stems for compatibility
&lt;ul>
&lt;li>Expensive computationally &amp;ndash; we’d have to score each existing stem in your database against your currently active stem mixture to calculate a ranking by score&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>The JEPA approach here lets us instead generate the &amp;ldquo;idea&amp;rdquo; of which kind of stem would fit best.&lt;/p>
&lt;p>With this, we can then query a database of stems (with precomputed JEPA embeddings) to find which are the most compatible, using a simple nearest-neighbor approach. This does require precomputing embeddings for all the stems in your dataset, but that’s easily done ahead of time. At inference time, the JEPA system can be much faster. For that reason, &lt;em>Stem-JEPA&lt;/em> is a wonderfully clever piece of work.&lt;/p>
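&lt;p>To make the retrieval step concrete, here&amp;rsquo;s a minimal sketch (names and shapes are my own, not from the paper): with a precomputed embedding matrix for your stem library, compatibility lookup reduces to a cosine-similarity nearest-neighbor search.&lt;/p>
&lt;pre>&lt;code>import numpy as np

def top_k_compatible(query, library, k=5):
    # Normalize rows so dot products become cosine similarities
    q = query / np.linalg.norm(query)
    lib = library / np.linalg.norm(library, axis=1, keepdims=True)
    sims = lib @ q
    # Indices of the k most compatible stems, best match first
    return np.argsort(-sims)[:k]

# library: one precomputed JEPA embedding per stem in your database;
# query: the predicted embedding for the missing stem
library = np.random.randn(10_000, 512).astype(np.float32)
query = np.random.randn(512).astype(np.float32)
print(top_k_compatible(query, library))
&lt;/code>&lt;/pre>
&lt;p>In practice you&amp;rsquo;d reach for an approximate nearest-neighbor index rather than this brute-force scan, but the principle is the same.&lt;/p>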
&lt;p>A downside: when training, &lt;em>Stem-JEPA&lt;/em> does require split stems (which are less plentiful in the world than mixed audio). Luckily, it appears this model is quite data efficient!&lt;/p>
&lt;p>With ~100x less data, downstream tasks using this learned embedding space are on par with representations generated with &lt;a href="https://github.com/PandoraMedia/music-audio-representations">MULE&lt;/a> (trained on 117k hours) and &lt;a href="https://github.com/openai/jukebox">Jukebox&lt;/a> (1.7M songs). &lt;em>Stem-JEPA&lt;/em> was trained on Sony&amp;rsquo;s 20k multitracks (only ~1,350 hours by comparison).&lt;/p>
&lt;p>A few other papers I enjoyed in the self-supervised realm at ISMIR &amp;lsquo;24:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://arxiv.org/abs/2407.07408">STONE: Self-supervised Tonality Estimator&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://hal.science/hal-04733487/document">SKY: Self-supervised Learning of Major and Minor Keys from Audio&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://arxiv.org/abs/2411.04152">A Contrastive Self-Supervised Learning scheme for beat tracking amenable to few-shot learning&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="theme-4-music-stem-separation-mss-separation-by-query">Theme #4: Music stem separation (MSS): separation by query&lt;/h3>
&lt;p>We’re all familiar with the traditional Vocals, Drums, Bass &amp;amp; Other (VDBO) separation — you input an audio mix, and a model like Demucs or RoFormer outputs an estimate of each stem in this fixed set.&lt;/p>
&lt;p>Today, &lt;em>fine-tuned, offline, single-stem&lt;/em> MSS extraction models can reach ~8-12 dB SDR against the ground-truth stems, which is very impressive.&lt;/p>
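&lt;p>For reference, the basic (non-scale-invariant) SDR is just the energy ratio between the ground-truth stem and the estimation error, in dB &amp;ndash; a quick sketch:&lt;/p>
&lt;pre>&lt;code>import numpy as np

def sdr(reference, estimate):
    # Signal-to-distortion ratio in dB between a ground-truth stem
    # and a model's estimate (both 1-D arrays of samples)
    noise = reference - estimate
    return 10 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))
&lt;/code>&lt;/pre>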
&lt;p>However, offline SDR gains on those fronts show diminishing returns, and the field is increasingly moving towards:&lt;/p>
&lt;ul>
&lt;li>Extracting a larger set of stems (e.g. piano, acoustic guitar, electric guitar)&lt;/li>
&lt;li>Extracting a stem by a text or audio &amp;ldquo;query&amp;rdquo;&lt;/li>
&lt;li>Making the separation process more efficient&lt;/li>
&lt;/ul>
&lt;p>&lt;em>Realtime MSS&lt;/em> is a different story (largely ignored at ISMIR &amp;lsquo;24), and you can contact me if you want to chat about this :)&lt;/p>
&lt;p>But for the offline MSS theme, my favorite paper was led by the indomitable &lt;a href="https://x.com/appoggiaturaaa">Karn Watcharasupat&lt;/a>, who has a number of papers on this topic.&lt;/p>
&lt;p>📚 &lt;em>&lt;strong>A Stem-Agnostic Single-Decoder System for Music Source Separation Beyond Four Stems&lt;/strong>&lt;/em> [&lt;a href="https://arxiv.org/abs/2406.18747">paper&lt;/a>] [&lt;a href="https://github.com/kwatcharasupat/query-bandit">github&lt;/a>].&lt;/p>
&lt;p>Their &lt;em>Banquet&lt;/em> model architecture lets you train effectively infinite &amp;ldquo;decoders&amp;rdquo; for different stems. Each decoder is simply a FiLM query embedding paired with a known, fixed stem type.&lt;/p>
&lt;p>The benefit, of course, is that unlike other architectures, the number of parameters dedicated to decoding a single stem drops from millions to a few hundred &amp;ndash; the cost of a single vector!&lt;/p>
&lt;p>So in the last step of the model at inference time, you use the FiLM query vector to extract the stem you&amp;rsquo;re after &amp;ndash; a sort of latent space &amp;ldquo;mask&amp;rdquo;.&lt;/p>
&lt;p>Even better, there&amp;rsquo;s nothing inherently keeping the query set (and therefore the stems that are extractable) fixed. In their paper, the FiLM query vector set was, in fact, frozen per stem due to training instability, but it feels like a similar architecture could support arbitrary embeddings being used to extract arbitrary stems. This is the next frontier of MSS in my opinion.&lt;/p>
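&lt;p>For readers unfamiliar with FiLM: a FiLM layer simply scales and shifts intermediate features using a per-query conditioning vector. A toy version of the idea (my own sketch, not the &lt;em>Banquet&lt;/em> code) looks like:&lt;/p>
&lt;pre>&lt;code>import numpy as np

class FiLMQuery:
    """Toy FiLM conditioning: each stem type owns a small (gamma, beta)
    pair that modulates the shared separator's hidden features."""
    def __init__(self, num_stems, feat_dim, seed=0):
        rng = np.random.default_rng(seed)
        # A few hundred parameters per stem -- just two vectors
        self.gamma = 1.0 + 0.01 * rng.standard_normal((num_stems, feat_dim))
        self.beta = np.zeros((num_stems, feat_dim))

    def __call__(self, features, stem_id):
        # features: (time, feat_dim) activations from the shared encoder
        return features * self.gamma[stem_id] + self.beta[stem_id]

film = FiLMQuery(num_stems=16, feat_dim=256)
hidden = np.random.randn(1000, 256)
vocals_branch = film(hidden, stem_id=3)  # "decode" stem number 3
&lt;/code>&lt;/pre>
&lt;p>The point is that swapping decoders is just swapping a vector, which is why the per-stem parameter count collapses so dramatically.&lt;/p>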
&lt;p>Being able to query a mix for a bass guitar, for example, using an audio snippet of a similar stem (or an isolated snippet of said bass guitar from another part of the track) feels like the correct UI for MSS to extract the exact stem you&amp;rsquo;re after.&lt;/p>
&lt;p>As a final note, MSS is close to my heart as the area I work most heavily in. At VJ Labs we work (among other realtime techniques) on realtime MSS — something in which we proudly surpass the SOTA :) But alas! No papers about realtime MSS this year at ISMIR!&lt;/p>
&lt;p>Other stem separation (MSS) related papers from this year&amp;rsquo;s ISMIR:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://arxiv.org/abs/2409.04702">Mel-RoFormer for Vocal Separation and Vocal Melody Transcription&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://www.audiolabs-erlangen.de/resources/MIR/2024-ISMIR-PianoSepEval2">Notewise Evaluation of Source Separation: A Case Study For Separated Piano Tracks&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://drive.google.com/file/d/1hmHE0nv8wZsj51ajCsSdN_UehrQKrMDt/view">Classical Guitar Duet Separation using GuitarDuets - a Dataset of Real and Synthesized Guitar Recordings&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="theme-5-better-transcription-data-better-transcription-models">Theme #5: Better transcription data, better transcription models&lt;/h3>
&lt;p>While models are of course getting better, the transcription fight seems to be won largely on the data front, with both natural and synthetic data.&lt;/p>
&lt;p>My favorite paper in this space was undoubtedly for guitar transcription:&lt;/p>
&lt;p>📚 &lt;em>&lt;strong>GAPS: A Large and Diverse Classical Guitar Dataset and Benchmark Transcription Model&lt;/strong>&lt;/em> [&lt;a href="https://arxiv.org/abs/2408.08653">paper&lt;/a>], from first authors &lt;a href="https://x.com/xavriley">Xavier Riley&lt;/a> &amp;amp; &lt;a href="https://x.com/nicolasguozixun">Nicolas Guo&lt;/a> from &lt;a href="https://x.com/c4dm">C4DM&lt;/a>.&lt;/p>
&lt;p>I highly encourage you to &lt;a href="https://youtu.be/xifkG2tTEwU?feature=shared&amp;amp;t=56">watch the video showing the played vs transcribed MIDI side by side&lt;/a>. The results are stunning.&lt;/p>
&lt;figure>
&lt;a href="https://youtu.be/xifkG2tTEwU?feature=shared&amp;t=56">
&lt;img src="https://willdrevo.com/static/img/ismir/gaps.png" alt="GAPS dataset presentation">
&lt;/a>
&lt;figcaption>The GAPS guitar transcription dataset&lt;/figcaption>
&lt;/figure>
&lt;p>Piano transcription datasets (&lt;a href="https://magenta.tensorflow.org/datasets/maestro">MAESTRO&lt;/a>, &lt;a href="https://inria.hal.science/inria-00544155/en">MAPS&lt;/a>, etc.) are much larger today; guitar has no comparable corpus. So Xavier &amp;amp; team created their own dataset and used it to fine-tune a &lt;a href="https://arxiv.org/pdf/2010.01815">piano transcription model from Bytedance&lt;/a>.&lt;/p>
&lt;p>Transcription data pipelines are no joke (extensive alignment and quality checking), so even though the dataset is on the smaller side, it&amp;rsquo;s quite impressive that ~14 hours of guitar was so effective.&lt;/p>
&lt;p>Notably, the model is a &amp;ldquo;simple&amp;rdquo; (ie: non-Transformer) CRNN (log mel frontend + convolutional features + bidirectional RNN) operating at roughly 10ms granularity.&lt;/p>
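&lt;p>To give a sense of scale, a toy version of that kind of stack (my own sketch in PyTorch, not the GAPS code; layer sizes are invented) is only a few layers deep:&lt;/p>
&lt;pre>&lt;code>import torch
import torch.nn as nn

class ToyCRNN(nn.Module):
    """Sketch of a log-mel CRNN transcriber: conv features, a
    bidirectional GRU, then per-frame (~10 ms) pitch logits."""
    def __init__(self, n_mels=229, n_pitches=88, hidden=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.rnn = nn.GRU(16 * n_mels, hidden, batch_first=True,
                          bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_pitches)

    def forward(self, log_mel):
        # log_mel: (batch, time, n_mels)
        x = self.conv(log_mel.unsqueeze(1))   # (batch, 16, time, n_mels)
        x = x.permute(0, 2, 1, 3).flatten(2)  # (batch, time, 16 * n_mels)
        x, _ = self.rnn(x)
        return self.head(x)                   # per-frame pitch logits

model = ToyCRNN()
logits = model(torch.randn(1, 500, 229))  # 500 frames at ~10 ms = 5 seconds
&lt;/code>&lt;/pre>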
&lt;p>Other transcription papers:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://researchdiscovery.drexel.edu/view/pdfCoverPage?instCode=01DRXU_INST&amp;amp;filePid=13549920770004721&amp;amp;download=true">Leveraging Unlabeled Data to Improve Automatic Guitar Tablature Transcription&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://drive.google.com/file/d/1CHhf2YqFLE4yhviOEnoiWPkquE-MqLBy/view">Semi-Supervised Piano Transcription Using Pseudo-Labeling Techniques&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://repositori.upf.edu/bitstream/handle/10230/61103/kim_ismir_meth.pdf?sequence=1&amp;amp;isAllowed=y">A Method for MIDI Velocity Estimation for Piano Performance by a U-Net with Attention and FiLM&lt;/a>&lt;/li>
&lt;/ul>
&lt;!-- * Streaming Piano Transcription Based on Consistent Onset and Offset Decoding with Sustain Pedal Detection
* Scoring Time Intervals Using Non-Hierarchical Transformer for Automatic Piano Transcription
* Robust and Accurate Audio Synchronization Using Raw Features From Transcription Models -->
&lt;h3 id="theme-6-attribution">Theme #6: Attribution&lt;/h3>
&lt;p>Having a trail of provenance for which music samples, ideas, models, or styles inspired or created a given piece of music was also clearly a theme at this year&amp;rsquo;s ISMIR, though more so in conversations and panels than papers.&lt;/p>
&lt;p>As you might imagine, there’s a huge storm coming in terms of the rights holders of the world (record labels, copyright holders, artists) wanting their piece of the generative AI pie.&lt;/p>
&lt;p>The big questions fall at the input and output:&lt;/p>
&lt;ul>
&lt;li>➡️ On the &lt;em>input&lt;/em> side: is training models on copyrighted audio “fair use”?&lt;/li>
&lt;li>⬅️ On the &lt;em>output&lt;/em> side: by what metric is a new piece of audio deemed to “copy” another, and to what extent?&lt;/li>
&lt;/ul>
&lt;p>To be fair, papers aren’t the place to tackle these issues. Likely the US Supreme Court will have that honor. But the various technical approaches being explored reflect these thorny issues.&lt;/p>
&lt;p>The only paper really worth mentioning was:&lt;/p>
&lt;p>📚 &lt;em>&lt;strong>Exploring Musical Roots: Applying Audio Embeddings to Empower Influence Attribution for a Generative Music Model&lt;/strong>&lt;/em> [&lt;a href="https://arxiv.org/pdf/2401.14542">paper&lt;/a>] [&lt;a href="https://exploring-musical-roots.notion.site/Exploring-musical-roots-an-audio-walkthrough-83da76f6311b46198b992d372b37e70f">examples&lt;/a>]&lt;/p>
&lt;p>The paper basically amounts to &amp;ldquo;fingerprinting&amp;rdquo; a dataset of audio using CLAP and CLMR embeddings, then querying that dataset nearest-neighbor style and measuring how similar the retrieved audio is to the query audio.&lt;/p>
&lt;p>If you listen to the examples (or just think about what&amp;rsquo;s being done), it&amp;rsquo;s clear that this is not a good approach.&lt;/p>
&lt;p>This falls on the &amp;ldquo;output&amp;rdquo; side of the attribution question. And querying based on feel or vibe (basically what CLAP and CLMR are good at) is just going to return matches that are of a similar style or genre, not musical infringement.&lt;/p>
&lt;p>To me, the more sensible approaches center around a couple of &amp;ldquo;output&amp;rdquo;-based infringements:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Sampling (copying audio)&lt;/strong>
&lt;ul>
&lt;li>&lt;em>&amp;ldquo;Did the artist literally copy and paste another piece of audio?&amp;rdquo;&lt;/em>&lt;/li>
&lt;li>Spectral (traditional) audio fingerprinting is a much better approach here, with far less computation, fewer learnable parameters, and far fewer false positives&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Structural similarity (copying structure)&lt;/strong>
&lt;ul>
&lt;li>&lt;em>&amp;ldquo;Did the artist directly rip off the chords or melody or lyrics?&amp;rdquo;&lt;/em>&lt;/li>
&lt;li>This likely is solved technically with transcription models and some kind of MIDI similarity metric&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>For either approach, the thorny issue remains: &amp;ldquo;to what extent&amp;rdquo; is a piece of music considered to be a copy? And if such a determination is made, what are the monetary and access consequences for the creators, the original rights holders, and the public?&lt;/p>
&lt;p>This line of thinking merits an entire post (or book) of its own, so I&amp;rsquo;ll stop here.&lt;/p>
&lt;p>Interested readers, artists, or music AI researchers should check out my favorite book on the subject: &lt;a href="https://lessig.org/product/free-culture/">Free Culture&lt;/a> by the famous &lt;a href="https://hls.harvard.edu/faculty/lawrence-lessig/">Lawrence Lessig&lt;/a>, founder of &lt;a href="https://creativecommons.org/">Creative Commons&lt;/a> (yes, that one!). It&amp;rsquo;s an indispensable read.&lt;/p>
&lt;!-- ### Bonus section: my favorite paper overall was...
📚 *ST-ITO: Controlling Audio Effects for Style Transfer with Inference-Time Optimization* [[paper](https://arxiv.org/abs/2410.21233)] [[github](https://github.com/csteinmetz1/st-ito)], led by [Chris Steinmetz](https://x.com/csteinmetz1) from [Suno](https://suno.com/).
This paper was also one of the three winners of the ISMIR best paper award.
The problem statement is simple: could we transfer the style of one audio segment to another, using the (non-differentiable) VSTs availiable in your DAW?
The approach is quite clever.
&lt;figure>
&lt;img src="https://willdrevo.com/static/img/ismir/st-ito.png" alt="ST-ITO">
&lt;figcaption>ST-ITO&lt;/figcaption>
&lt;/figure> -->
&lt;h3 id="summary">Summary&lt;/h3>
&lt;p>ISMIR &amp;lsquo;24 was a blast. I&amp;rsquo;m already looking forward to next year!&lt;/p>
&lt;p>See you all in Korea in &amp;lsquo;25 🇰🇷&lt;/p></description></item><item><title>Audio Fingerprinting</title><link>https://willdrevo.com/fingerprinting-and-audio-recognition-with-python/</link><pubDate>Fri, 15 Nov 2013 18:07:10 -0700</pubDate><guid>https://willdrevo.com/fingerprinting-and-audio-recognition-with-python/</guid><description>&lt;blockquote>
&lt;p>Note: this post was authored way, way back in my grad school days (in 2013!) but continues to be quite popular and cited in a number of papers. So I&amp;rsquo;ve copied and pasted it over to this newish blog site. Please keep in mind there are more advanced and scalable fingerprinting systems out there these days, but this is an excellent introduction and example codebase to start from. Enjoy!&lt;/p>
&lt;/blockquote>
&lt;p>The first day I tried out Shazam, I was blown away. Next to GPS and surviving the fall down a flight of stairs, being able to recognize a song from a vast corpus of audio was the most incredible thing I&amp;rsquo;d ever seen my phone do. This recognition works through a process called &lt;a href="http://en.wikipedia.org/wiki/Acoustic_fingerprint">audio fingerprinting&lt;/a>. Examples include:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="http://www.ee.columbia.edu/~dpwe/papers/Wang03-shazam.pdf">Shazam&lt;/a>&lt;/li>
&lt;li>&lt;a href="http://www.midomi.com/">SoundHound / Midomi&lt;/a>&lt;/li>
&lt;li>&lt;a href="http://acoustid.org/chromaprint">Chromaprint&lt;/a>&lt;/li>
&lt;li>&lt;a href="http://echoprint.me/">Echoprint&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>After a few weekends of puzzling through academic papers and writing code, I came up with the Dejavu Project, an open-source audio fingerprinting project in Python. You can &lt;a href="https://github.com/worldveil/dejavu">see it here on Github&lt;/a>.&lt;/p>
&lt;a href="https://github.com/worldveil/dejavu">&lt;img src="https://gh-card.dev/repos/worldveil/dejavu.svg">&lt;/a>
&lt;p>On my testing dataset, Dejavu exhibits 100% recall when reading an unknown wave file from disk or listening to a recording for at least 5 seconds.&lt;/p>
&lt;p>Following is all the knowledge you need to understand audio fingerprinting and recognition, starting from the basics. Those with signals experience should skip to &amp;ldquo;Peak Finding&amp;rdquo;.&lt;/p>
&lt;aside id="toc">
&lt;h4>Table of Contents&lt;/h4>
&lt;nav id="TableOfContents">
&lt;ul>
&lt;li>&lt;a href="#music-as-a-signal">Music as a signal&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#sampling">Sampling&lt;/a>&lt;/li>
&lt;li>&lt;a href="#spectrograms">Spectrograms&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#peak-finding">Peak Finding&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#fingerprint-hashing">Fingerprint hashing&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#learning-a-song-database-structure">Learning a Song: Database structure&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#fingerprints-table">Fingerprints table&lt;/a>&lt;/li>
&lt;li>&lt;a href="#songs-table">Songs table&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#fingerprint-alignment">Fingerprint Alignment&lt;/a>&lt;/li>
&lt;li>&lt;a href="#how-well-it-works">How well it works&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#1-reading-from-disk">1. Reading from Disk&lt;/a>&lt;/li>
&lt;li>&lt;a href="#2-audio-over-laptop-microphone">2. Audio over laptop microphone&lt;/a>&lt;/li>
&lt;li>&lt;a href="#3-compressed-streamed-music-played-on-my-iphone">3. Compressed streamed music played on my iPhone&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#performance-speed">Performance: Speed&lt;/a>&lt;/li>
&lt;li>&lt;a href="#performance-storage">Performance: Storage&lt;/a>&lt;/li>
&lt;li>&lt;a href="#conclusion">Conclusion&lt;/a>&lt;/li>
&lt;/ul>
&lt;/nav>
&lt;/aside>
&lt;h2 id="music-as-a-signal">Music as a signal&lt;/h2>
&lt;p>As a computer scientist, my familiarity with the &lt;a href="http://en.wikipedia.org/wiki/Fast_Fourier_transform">Fast Fourier Transform (FFT)&lt;/a> was only that it was a cool way to multiply polynomials in &lt;code>O(nlog(n))&lt;/code> time. Luckily it is much cooler for doing signal processing, its canonical usage.&lt;/p>
&lt;p>Music, it turns out, is digitally encoded as just a long list of numbers. In an uncompressed .wav file, there are a lot of these numbers - 44100 per second per channel. This means a 3 minute long song has almost 16 million samples.&lt;/p>
&lt;blockquote>
&lt;p>3 min * 60 sec * 44100 samples per sec * 2 channels = 15,876,000 samples&lt;/p>
&lt;/blockquote>
&lt;p>A channel is a separate sequence of samples that a speaker can play. Think of having two earbuds - this is a &amp;ldquo;stereo&amp;rdquo;, or two channel, setup. A single channel is called &amp;ldquo;mono&amp;rdquo;. Today, modern surround sound systems can support many more channels. But unless the sound is recorded or mixed with the same number of channels, the extra speakers are redundant and some speakers will just play the same stream of samples as other speakers.&lt;/p>
&lt;h3 id="sampling">Sampling&lt;/h3>
&lt;p>Why 44100 samples per second? The choice seems quite arbitrary, but it relates to the &lt;a href="http://en.wikipedia.org/wiki/Nyquist%E2%80%93Shannon_sampling_theorem">Nyquist-Shannon Sampling Theorem&lt;/a>. This is a long, mathematical way to say that there is a theoretical limit on the maximum frequency we can capture accurately when recording. This maximum frequency is based on how &lt;em>fast&lt;/em> we sample the signal.&lt;/p>
&lt;p>If this doesn&amp;rsquo;t make sense, think about watching a fan blade that makes a full revolution at a rate of exactly once per second (1 Hz). Now imagine keeping your eyes closed, but opening them briefly once per second. If the fan still happens to be making exactly a full revolution every 1 second as well, it will appear as though the fan blade hasn&amp;rsquo;t moved! Each time you open your eyes, the blade happens to be in the same spot. But there&amp;rsquo;s a problem. In fact, as far as you know, the fan blade could be making 0, 1, 2, 3, 10, 100, or even 1 million spins per second and you would never know - it would still appear stationary! Thus in order to be assured you are correctly sampling (or &amp;ldquo;seeing&amp;rdquo;) higher frequencies (or &amp;ldquo;spins&amp;rdquo;), you need to sample (or &amp;ldquo;open your eyes&amp;rdquo;) more frequently. To be exact, we need to sample twice as frequently as the frequency we want to see to make sure we&amp;rsquo;re detecting it.&lt;/p>
&lt;p>In the case of recording audio, the accepted rule is that we&amp;rsquo;re OK missing out on frequencies above 22050 Hz since humans can&amp;rsquo;t even hear frequencies above 20,000 Hz. Thus by Nyquist, we have to sample &lt;em>twice&lt;/em> that:&lt;/p>
&lt;blockquote>
&lt;p>Samples per sec needed = Highest-Frequency * 2 = 22050 * 2 = 44100&lt;/p>
&lt;/blockquote>
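&lt;p>You can see aliasing happen in a few lines of numpy (my own toy demo): sample a tone above the Nyquist frequency and it becomes indistinguishable from one below it.&lt;/p>
&lt;pre>&lt;code>import numpy as np

fs = 100                                 # sample rate: 100 Hz, Nyquist = 50 Hz
t = np.arange(0, 1, 1 / fs)
tone_40 = np.sin(2 * np.pi * 40 * t)     # below Nyquist: captured faithfully
tone_60 = np.sin(2 * np.pi * 60 * t)     # above Nyquist: aliases down to 40 Hz

# The 60 Hz tone, sampled at 100 Hz, is exactly the 40 Hz tone (flipped)
print(np.allclose(tone_60, -tone_40))    # True
&lt;/code>&lt;/pre>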
&lt;p>The MP3 format compresses this in order to 1) save space on your hard drive, and 2) irritate audiophiles, but a pure .wav formatted file on your computer is just a list of 16 bit integers (with a small header).&lt;/p>
&lt;h3 id="spectrograms">Spectrograms&lt;/h3>
&lt;p>Since these samples are a signal of sorts, we can repeatedly use an FFT over small windows of time in the song&amp;rsquo;s samples to create a &lt;a href="http://en.wikipedia.org/wiki/Spectrogram">spectrogram&lt;/a> of the song. Here&amp;rsquo;s a spectrogram of the first few seconds of &amp;ldquo;Blurred Lines&amp;rdquo; by Robin Thicke.&lt;/p>
&lt;p>&lt;img src="https://willdrevo.com/static/img/dejavu/spectrogram_no_peaks.png" alt="Blurred Lines">&lt;/p>
&lt;p>As you can see, it&amp;rsquo;s just a 2D array with amplitude as a function of time and frequency. Each FFT gives us the strength (amplitude) of the signal at each frequency for one small window of time &amp;ndash; a single column. Sliding the window along the song and stacking the columns gives us the 2D spectrogram.&lt;/p>
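&lt;p>If you want to try this yourself, &lt;code>scipy&lt;/code> will compute a spectrogram in a couple of lines (a sketch, assuming a wav file on disk; window sizes are illustrative):&lt;/p>
&lt;pre>&lt;code>import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

fs, samples = wavfile.read("song.wav")  # fs = 44100 for CD-quality audio
if samples.ndim == 2:
    samples = samples.mean(axis=1)      # mix stereo down to mono

# 4096-sample FFT windows with 50% overlap
freqs, times, amps = spectrogram(samples, fs=fs, nperseg=4096, noverlap=2048)
amps = 10 * np.log10(amps + 1e-10)      # log scale, like the plots here
&lt;/code>&lt;/pre>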
&lt;p>It&amp;rsquo;s important to note that the frequency and time values are discretized, each representing a &amp;ldquo;bin&amp;rdquo;, while the amplitudes are real valued. The color shows the real value (red -&amp;gt; higher, green -&amp;gt; lower) of the amplitude at the discretized (time, frequency) coordinate.&lt;/p>
&lt;p>As a thought experiment, if we were to record and create a spectrogram of a single tone, we&amp;rsquo;d get a straight horizontal line at the frequency of the tone. This is because the frequency does not vary from window to window.&lt;/p>
&lt;p>Great. So how does this help us recognize audio? Well, we&amp;rsquo;d like to use this spectrogram to identify this song uniquely. The problem is that if you have your phone in your car and you try to recognize the song on the radio, you&amp;rsquo;ll get noise - someone is talking in the background, another car honking its horn, etc. We have to find a robust way to capture unique &amp;ldquo;fingerprints&amp;rdquo; from the audio signal.&lt;/p>
&lt;h2 id="peak-finding">Peak Finding&lt;/h2>
&lt;p>Now that we&amp;rsquo;ve got a spectrogram of our audio signal, we can start by finding &amp;ldquo;peaks&amp;rdquo; in amplitude. We define a peak as a (time, frequency) pair corresponding to an amplitude value which is the greatest in a local &amp;ldquo;neighborhood&amp;rdquo; around it. Other (time, frequency) pairs around it are lower in amplitude, and thus less likely to survive noise.&lt;/p>
&lt;p>Finding peaks is an entire problem itself. I ended up treating the spectrogram as an image and using the image processing toolkit and techniques from &lt;code>scipy&lt;/code> to find peaks. A combination of a high pass filter (accentuating high amplitudes) and &lt;code>scipy&lt;/code> local maxima structs did the trick.&lt;/p>
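&lt;p>A minimal version of that idea with &lt;code>scipy.ndimage&lt;/code> (not Dejavu&amp;rsquo;s exact parameters): a cell is a peak if it equals the maximum of its neighborhood and clears an amplitude threshold.&lt;/p>
&lt;pre>&lt;code>import numpy as np
from scipy.ndimage import maximum_filter

def find_peaks(amps, neighborhood=20, min_amp=10):
    # A cell is a local peak if it equals the max over its neighborhood...
    local_max = maximum_filter(amps, size=neighborhood) == amps
    # ...and is loud enough to plausibly survive noise
    peaks = np.logical_and(local_max, np.greater(amps, min_amp))
    freq_idx, time_idx = np.nonzero(peaks)
    return list(zip(time_idx, freq_idx))  # (time bin, frequency bin) pairs
&lt;/code>&lt;/pre>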
&lt;p>Once we&amp;rsquo;ve extracted these noise-resistant peaks, we have found points of interest in a song that identify it. We are effectively &amp;ldquo;squashing&amp;rdquo; the spectrogram down once we&amp;rsquo;ve found the peaks. The amplitudes have served their purpose, and are no longer needed.&lt;/p>
&lt;p>Let&amp;rsquo;s plot them to see what it looks like:&lt;/p>
&lt;p>&lt;img src="https://willdrevo.com/static/img/dejavu/spectrogram_peaks.png" alt="Blurred Lines">&lt;/p>
&lt;p>You&amp;rsquo;ll notice there are a lot of these. Tens of thousands per song, actually. The beauty is that since we&amp;rsquo;ve done away with amplitude, we only have two things, time and frequency, which we&amp;rsquo;ve conveniently made into discrete, integer values. We&amp;rsquo;ve binned them, essentially.&lt;/p>
&lt;p>We have a somewhat double-edged situation: on one hand, we have a system that will bin peaks from a signal into discrete (time, frequency) pairs, giving us some leeway to survive noise. On the other hand, since we&amp;rsquo;ve discretized, we&amp;rsquo;ve reduced the information of the peaks from infinite to finite, meaning that peaks found in one song can (hint: will!) collide with peaks extracted from other songs. Different songs can and most likely will emit some of the same peaks! So what now?&lt;/p>
&lt;h3 id="fingerprint-hashing">Fingerprint hashing&lt;/h3>
&lt;p>So we might have similar peaks. No problem, let&amp;rsquo;s combine peaks into fingerprints! We&amp;rsquo;ll do this by using a hash function.&lt;/p>
&lt;p>A &lt;a href="http://en.wikipedia.org/wiki/Hash_function">hash function&lt;/a> takes an integer input and returns another integer as output. The beauty is that a good hash function will not only return the &lt;em>same&lt;/em> output integer each time the input is the same, but also that very few different inputs will have the same output.&lt;/p>
&lt;p>By looking at our spectrogram peaks and combining pairs of peak frequencies along with the time difference between them, we can create a hash, representing a unique fingerprint for this song.&lt;/p>
&lt;pre>&lt;code>hash(frequencies of peaks, time difference between peaks) = fingerprint hash value
&lt;/code>&lt;/pre>
&lt;p>There are lots of different ways to do this: Shazam has their own, SoundHound another, and so on. You can peruse my source to see how I do it, but the point is that by taking into account more than a single peak&amp;rsquo;s values you create fingerprints that have more entropy and therefore contain more information. Thus they are more powerful identifiers of songs, since they will collide less.&lt;/p>
&lt;p>You can visualize what is going on with the zoomed-in annotated spectrogram snippet below:&lt;/p>
&lt;p>&lt;img src="https://willdrevo.com/static/img/dejavu/spectrogram_zoomed.png" alt="Blurred Lines">&lt;/p>
&lt;p>The Shazam whitepaper likens these groups of peaks to a sort of &amp;ldquo;constellation&amp;rdquo; used to identify the song. In reality they use pairs of peaks along with the time delta between them. You can imagine lots of different ways to group points into fingerprints. On one hand, more peaks in a fingerprint means a rarer fingerprint that more strongly identifies a song. But more peaks also means a fingerprint less robust in the face of noise.&lt;/p>
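&lt;p>As an illustration only (Dejavu&amp;rsquo;s real parameters live in the repo), pairing each peak with a handful of the peaks that follow it, and hashing the pair of frequencies plus their time delta, might look like:&lt;/p>
&lt;pre>&lt;code>import hashlib

def generate_fingerprints(peaks, fan_value=5):
    """peaks: list of (time_bin, freq_bin) pairs, sorted by time."""
    for i, (t1, f1) in enumerate(peaks):
        # Pair each peak with the next few peaks ahead of it in time
        for (t2, f2) in peaks[i + 1 : i + 1 + fan_value]:
            dt = t2 - t1
            key = f"{f1}|{f2}|{dt}".encode()
            # Emit (hash, offset); the offset anchors the pair in the track
            yield hashlib.sha1(key).hexdigest(), t1
&lt;/code>&lt;/pre>
&lt;p>Raising &lt;code>fan_value&lt;/code> emits more fingerprints per song (more storage, better recall) &amp;ndash; the same trade-off discussed in the storage section below.&lt;/p>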
&lt;h2 id="learning-a-song-database-structure">Learning a Song: Database structure&lt;/h2>
&lt;p>Now we can get into how such a system works. An audio fingerprinting system has two tasks:&lt;/p>
&lt;ol>
&lt;li>Learn new songs by fingerprinting them&lt;/li>
&lt;li>Recognize unknown songs by searching for them in the database of learned songs&lt;/li>
&lt;/ol>
&lt;p>For this, we&amp;rsquo;ll use our knowledge thus far and MySQL for the database functionality. Our database schema will contain two tables:&lt;/p>
&lt;ul>
&lt;li>fingerprints&lt;/li>
&lt;li>songs&lt;/li>
&lt;/ul>
&lt;h3 id="fingerprints-table">Fingerprints table&lt;/h3>
&lt;p>The fingerprints table will have the following fields:&lt;/p>
&lt;div class="highlight">&lt;div style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
&lt;table style="border-spacing:0;padding:0;margin:0;border:0;">&lt;tr>&lt;td style="vertical-align:top;padding:0;margin:0;border:0;">
&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">2
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">3
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">4
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">5
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">6
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">7
&lt;/span>&lt;/code>&lt;/pre>&lt;/td>
&lt;td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">CREATE&lt;/span> &lt;span style="color:#66d9ef">TABLE&lt;/span> fingerprints (
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> hash binary(&lt;span style="color:#ae81ff">10&lt;/span>) &lt;span style="color:#66d9ef">not&lt;/span> &lt;span style="color:#66d9ef">null&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> song_id mediumint unsigned &lt;span style="color:#66d9ef">not&lt;/span> &lt;span style="color:#66d9ef">null&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">offset&lt;/span> int unsigned &lt;span style="color:#66d9ef">not&lt;/span> &lt;span style="color:#66d9ef">null&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">INDEX&lt;/span>(hash),
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">UNIQUE&lt;/span>(song_id, &lt;span style="color:#66d9ef">offset&lt;/span>, hash)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>);
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/td>&lt;/tr>&lt;/table>
&lt;/div>
&lt;/div>&lt;p>First, notice we have not only a hash and a song ID, but an offset. This corresponds to the time window in the spectrogram where the hash originated. This will come into play later when we need to filter through our matching hashes. Only the hashes that &amp;ldquo;align&amp;rdquo; will be from the true signal we want to identify (more on this in the &amp;ldquo;Fingerprint Alignment&amp;rdquo; section below).&lt;/p>
&lt;p>Second, we&amp;rsquo;ve made an &lt;code>INDEX&lt;/code> on our hash - with good reason. All of the queries will need to match that, so we need a really quick retrieval there.&lt;/p>
&lt;p>Next, the &lt;code>UNIQUE&lt;/code> index just ensures we don&amp;rsquo;t have duplicates. No need to waste space or unduly weight matching of audio by having duplicates lying around.&lt;/p>
&lt;p>If you&amp;rsquo;re scratching your head on why I used a &lt;code>binary(10)&lt;/code> field for the hash, the reason is that we&amp;rsquo;ll have a &lt;em>lot&lt;/em> of these hashes and cutting down space is imperative. Below is a graph of the number of fingerprints for each song:&lt;/p>
&lt;p>&lt;img src="https://willdrevo.com/static/img/dejavu/num_fingerprints.png" alt="Fingerprint counts">&lt;/p>
&lt;p>At the front of the pack is &amp;ldquo;Mirrors&amp;rdquo; by Justin Timberlake, with over 240k fingerprints, followed by &amp;ldquo;Blurred Lines&amp;rdquo; by Robin Thicke with 180k. At the bottom is the acapella &amp;ldquo;Cups&amp;rdquo;, which is a sparsely instrumented song - just voice and literally a cup. In contrast, listen to &amp;ldquo;Mirrors&amp;rdquo;. You&amp;rsquo;ll notice the obvious &amp;ldquo;wall of noise&amp;rdquo; instrumentation and arranging that fills out the frequency spectrum from high to low, meaning that the spectrogram abounds with peaks in high and low frequencies alike. The average is well over 100k fingerprints per song for this dataset.&lt;/p>
&lt;p>With this many fingerprints, we need to cut down on unnecessary disk storage at the hash value level. For our fingerprint hash, we&amp;rsquo;ll start by using a &lt;code>SHA-1&lt;/code> hash and then cutting it down to half its size (just the first 20 characters). This cuts our byte usage per hash in half:&lt;/p>
&lt;blockquote>
&lt;p>char(40) =&amp;gt; char(20) goes from 40 bytes to 20 bytes&lt;/p>
&lt;/blockquote>
&lt;p>Next we&amp;rsquo;ll take this hex encoding and convert it to binary, once again cutting the space down considerably:&lt;/p>
&lt;blockquote>
&lt;p>char(20) =&amp;gt; binary(10) goes from 20 bytes to 10 bytes&lt;/p>
&lt;/blockquote>
&lt;p>Much better. We went from 320 bits down to 80 bits for the &lt;code>hash&lt;/code> field, a reduction of 75%.&lt;/p>
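&lt;p>Concretely, the truncation and packing in Python look like this (a sketch, not the exact Dejavu code; the hashed key is a made-up peak pair):&lt;/p>
&lt;pre>&lt;code>import hashlib

full_hex = hashlib.sha1(b"523|1402|17").hexdigest()  # 40 hex chars = 160 bits
short_hex = full_hex[:20]        # keep the first 20 hex chars = 80 bits
blob = bytes.fromhex(short_hex)  # 10 raw bytes, fits a binary(10) column
assert len(blob) == 10
&lt;/code>&lt;/pre>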
&lt;p>My first try at the system, I used a &lt;code>char(40)&lt;/code> field for each hash - this resulted in over 1 GB of space for fingerprints alone. With &lt;code>binary(10)&lt;/code> field, we cut down the table size to just 377 MB for 5.2 million fingerprints.&lt;/p>
&lt;p>We do lose some of the information - our hashes will, statistically speaking, collide much more often now. We&amp;rsquo;ve reduced the &amp;ldquo;entropy&amp;rdquo; of the hash considerably. However, it&amp;rsquo;s important to remember that our entropy (or information) also includes the &lt;code>offset&lt;/code> field, which is 4 bytes. This brings the total entropy of each of our fingerprints to:&lt;/p>
&lt;blockquote>
&lt;p>10 bytes (hash) + 4 bytes (offset) = 14 bytes = 112 bits = 2^112 ~= 5.2+e33 possible fingerprints&lt;/p>
&lt;/blockquote>
&lt;p>Not too shabby. We&amp;rsquo;ve saved ourselves 75% of the space and still managed to have an unimaginably large fingerprint space to work with. Guarantees on the distribution of keys are a hard argument to make, but we certainly have enough entropy to go around.&lt;/p>
&lt;h3 id="songs-table">Songs table&lt;/h3>
&lt;p>The songs table will be pretty vanilla; essentially we&amp;rsquo;ll just use it to hold information about songs. We&amp;rsquo;ll need it to pair a &lt;code>song_id&lt;/code> to the song&amp;rsquo;s string name.&lt;/p>
&lt;div class="highlight">&lt;div style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
&lt;table style="border-spacing:0;padding:0;margin:0;border:0;">&lt;tr>&lt;td style="vertical-align:top;padding:0;margin:0;border:0;">
&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">2
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">3
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">4
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">5
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">6
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">7
&lt;/span>&lt;/code>&lt;/pre>&lt;/td>
&lt;td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">CREATE&lt;/span> &lt;span style="color:#66d9ef">TABLE&lt;/span> songs (
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> song_id mediumint unsigned &lt;span style="color:#66d9ef">not&lt;/span> &lt;span style="color:#66d9ef">null&lt;/span> auto_increment,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> song_name varchar(&lt;span style="color:#ae81ff">250&lt;/span>) &lt;span style="color:#66d9ef">not&lt;/span> &lt;span style="color:#66d9ef">null&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> fingerprinted tinyint &lt;span style="color:#66d9ef">default&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">PRIMARY&lt;/span> &lt;span style="color:#66d9ef">KEY&lt;/span> (song_id),
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">UNIQUE&lt;/span> &lt;span style="color:#66d9ef">KEY&lt;/span> song_id (song_id)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>);
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/td>&lt;/tr>&lt;/table>
&lt;/div>
&lt;/div>&lt;p>The &lt;code>fingerprinted&lt;/code> flag is used by Dejavu internally to decide whether or not to fingerprint a file. We set the bit to 0 initially and only set it to 1 after the fingerprinting process (usually two channels) is complete.&lt;/p>
&lt;h2 id="fingerprint-alignment">Fingerprint Alignment&lt;/h2>
&lt;p>Great, so now we&amp;rsquo;ve listened to an audio track, performed FFT in overlapping windows over the length of the song, extracted peaks, and formed fingerprints. Now what?&lt;/p>
&lt;p>Assuming we&amp;rsquo;ve already performed this fingerprinting on known tracks, ie we have already inserted our fingerprints into the database labeled with song IDs, we can simply match.&lt;/p>
&lt;p>Our pseudocode looks something like this:&lt;/p>
&lt;pre>&lt;code>channels = capture_audio()
fingerprints_matching = []
for channel_samples in channels:
    # fingerprint what we heard, then look the hashes up in the database
    hashes = process_audio(channel_samples)
    fingerprints_matching += find_database_matches(hashes)
predicted_song = align_matches(fingerprints_matching)
&lt;/code>&lt;/pre>
&lt;p>What does it mean for hashes to be aligned? Let&amp;rsquo;s think about the sample that we are listening to as a subsegment of the original audio track. Once we do this, the hashes we extract out of the sample will have an &lt;code>offset&lt;/code> that is &lt;em>relative&lt;/em> to the start of the sample.&lt;/p>
&lt;p>The problem of course, is that when we originally fingerprinted, we recorded the &lt;em>absolute&lt;/em> offset of the hash. The relative hashes from the sample and the absolute hashes from the database won&amp;rsquo;t ever match unless we started recording a sample from exactly the start of the song. Pretty unlikely.&lt;/p>
&lt;p>But while they may not be the same, we do know something about the matches from the real signal behind the noise. We know all the relative offsets will be the same distance apart. This requires the assumption that the track is being played and sampled at the same speed it was recorded and released in the studio. Actually, we&amp;rsquo;d be out of luck anyway if the playback speed were different, since this would affect the frequency of the playback and therefore the peaks in the spectrogram. At any rate, the playback speed assumption is a good (and important) one.&lt;/p>
&lt;p>Under this assumption, for each match we calculate a difference between the offsets:&lt;/p>
&lt;blockquote>
&lt;p>difference = database offset from original track - sample offset from recording&lt;/p>
&lt;/blockquote>
&lt;p>which will always yield a positive integer, since the database track will always be at least the length of the sample. All of the true matches will have this same difference. Thus our matches from the database are altered to look like:&lt;/p>
&lt;blockquote>
&lt;p>(song_id, difference)&lt;/p>
&lt;/blockquote>
&lt;p>Now we simply look over all of the matches and predict the song ID whose most common difference has the largest count. This is easy to imagine if you visualize it as a histogram.&lt;/p>
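&lt;p>That histogram vote is nearly a one-liner with &lt;code>collections.Counter&lt;/code> (a sketch; the tuple layout is illustrative):&lt;/p>
&lt;pre>&lt;code>from collections import Counter

def align_matches(matches):
    """matches: iterable of (song_id, db_offset, sample_offset) tuples
    returned by the database lookup."""
    votes = Counter(
        (song_id, db_offset - sample_offset)
        for song_id, db_offset, sample_offset in matches
    )
    (song_id, diff), count = votes.most_common(1)[0]
    return song_id, diff, count  # winning song, its offset, vote count
&lt;/code>&lt;/pre>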
&lt;p>And that&amp;rsquo;s all there is to it!&lt;/p>
&lt;h2 id="how-well-it-works">How well it works&lt;/h2>
&lt;p>To truly get the benefit of an audio fingerprinting system, recognition can&amp;rsquo;t take a long time. It&amp;rsquo;s a bad user experience, and furthermore, a user may decide to try to match the song with only a few precious seconds of audio left before the radio station goes to a commercial break.&lt;/p>
&lt;p>To test Dejavu&amp;rsquo;s speed and accuracy, I fingerprinted a list of 45 songs from the US VA Top 40 from July 2013 (I know, their counting is off somewhere). I tested in three ways:&lt;/p>
&lt;ol>
&lt;li>Reading the raw mp3 -&amp;gt; wav data from disk,&lt;/li>
&lt;li>Playing the song over the speakers with Dejavu listening on the laptop microphone, and&lt;/li>
&lt;li>Playing compressed, streamed music over my iPhone&amp;rsquo;s speakers.&lt;/li>
&lt;/ol>
&lt;p>Below are the results.&lt;/p>
&lt;h3 id="1-reading-from-disk">1. Reading from Disk&lt;/h3>
&lt;p>Reading from disk was an overwhelming 100% recall - no mistakes were made over the 45 songs I fingerprinted. Since Dejavu gets all of the samples from the song (without noise), it would be a nasty surprise if reading the same file from disk didn&amp;rsquo;t work every time!&lt;/p>
&lt;h3 id="2-audio-over-laptop-microphone">2. Audio over laptop microphone&lt;/h3>
&lt;p>Here I wrote a script to randomly choose &lt;code>n&lt;/code> seconds of audio from the original mp3 file to play and have Dejavu listen over the microphone. To be fair, I only allowed segments of audio that were more than 10 seconds from the start/end of the track to avoid listening to silence.&lt;/p>
&lt;p>Additionally my friend was even talking and I was humming along a bit during the whole process, just to throw in some noise.&lt;/p>
&lt;p>Here are the results for different values of listening time (&lt;code>n&lt;/code>):&lt;/p>
&lt;p>&lt;img src="https://willdrevo.com/static/img/dejavu/accuracy.png" alt="Matching time">&lt;/p>
&lt;p>This is pretty rad. For the percentages:&lt;/p>
&lt;table border="1" align="center" cellpadding="10">
&lt;tr align="center">
&lt;th>Number of Seconds&lt;/th>
&lt;th>Number Correct&lt;/th>
&lt;th>Percentage Accuracy&lt;/th>
&lt;/tr>
&lt;tr align="center">
&lt;td>1&lt;/td>
&lt;td>27 / 45&lt;/td>
&lt;td>60.0%&lt;/td>
&lt;/tr>
&lt;tr align="center">
&lt;td>2&lt;/td>
&lt;td>43 / 45&lt;/td>
&lt;td>95.6%&lt;/td>
&lt;/tr>
&lt;tr align="center">
&lt;td>3&lt;/td>
&lt;td>44 / 45&lt;/td>
&lt;td>97.8%&lt;/td>
&lt;/tr>
&lt;tr align="center">
&lt;td>4&lt;/td>
&lt;td>44 / 45&lt;/td>
&lt;td>97.8%&lt;/td>
&lt;/tr>
&lt;tr align="center">
&lt;td>5&lt;/td>
&lt;td>45 / 45&lt;/td>
&lt;td>100.0%&lt;/td>
&lt;/tr>
&lt;tr align="center">
&lt;td>6&lt;/td>
&lt;td>45 / 45&lt;/td>
&lt;td>100.0%&lt;/td>
&lt;/tr>
&lt;/table>
&lt;p>Even with only a single second, randomly chosen from anywhere in the song, Dejavu is getting 60%! Going from 1 to 2 seconds gets us to around 96%, while a perfect score required 5 or more seconds. Honestly, when I was testing this myself, I found Dejavu beat me - identifying a song from only 1-2 seconds heard out of context is pretty hard. I had even been listening to these same songs for two days straight while debugging&amp;hellip;&lt;/p>
&lt;p>In conclusion, Dejavu works amazingly well, even with next to nothing to work with.&lt;/p>
&lt;h3 id="3-compressed-streamed-music-played-on-my-iphone">3. Compressed streamed music played on my iPhone&lt;/h3>
&lt;p>Just to try it out, I tried playing music from my Spotify account (160 kbit/s compressed) through my iPhone&amp;rsquo;s speakers with Dejavu again listening on my MacBook mic. I saw no degradation in performance; 1-2 seconds was enough to recognize any of the songs.&lt;/p>
&lt;h2 id="performance-speed">Performance: Speed&lt;/h2>
&lt;p>On my MacBook Pro, matching was done at 3x listening speed with a small constant overhead. To test, I tried different recording times and plotted the recording time plus the time to match. Since the speed is mostly invariant of the particular song and more dependent on the length of the spectrogram created, I tested on a single song, &amp;ldquo;Get Lucky&amp;rdquo; by Daft Punk:&lt;/p>
&lt;p>&lt;img src="https://willdrevo.com/static/img/dejavu/matching_time.png" alt="Matching time">&lt;/p>
&lt;p>As you can see, the relationship is quite linear. The line you see is a least-squares linear regression fit to the data, with the corresponding line equation:&lt;/p>
&lt;blockquote>
&lt;p>1.364757 * record time - 0.034373 = time to match&lt;/p>
&lt;/blockquote>
&lt;p>Notice that since the matching itself is single threaded, the total time includes the recording time. This makes sense with the 3x speed in purely matching, as:&lt;/p>
&lt;blockquote>
&lt;p>1 (recording) + 1/3 (matching) = 4/3 ~= 1.33, which is close to the fitted slope of 1.364757&lt;/p>
&lt;/blockquote>
&lt;p>if we disregard the minuscule constant term.&lt;/p>
&lt;p>The overhead of peak finding is the bottleneck - I experimented with multithreading and realtime matching, and alas, it wasn&amp;rsquo;t meant to be in Python. An equivalent Java or C/C++ implementation would most likely have little trouble keeping up, applying FFT and peak finding in realtime.&lt;/p>
&lt;p>An important caveat is, of course, the round trip time (RTT) for making matches. Since my MySQL instance was local, I didn&amp;rsquo;t have to deal with the latency penalty of transferring fingerprint matches over the air. This would add RTT to the constant term in the overall calculation, but would not affect the matching process.&lt;/p>
&lt;h2 id="performance-storage">Performance: Storage&lt;/h2>
&lt;p>For the 45 songs I fingerprinted, the database used 377 MB of space for 5.4 million fingerprints. In comparison, the disk usage is given below:&lt;/p>
&lt;table border="1" align="center" cellpadding="10">
&lt;tr align="center">
&lt;th>Audio Information Type&lt;/th>&lt;th>Storage in MB&lt;/th>
&lt;/tr>
&lt;tr align="center">
&lt;td>mp3&lt;/td>&lt;td>339&lt;/td>
&lt;/tr>
&lt;tr align="center">
&lt;td>wav&lt;/td>&lt;td>1885&lt;/td>
&lt;/tr>
&lt;tr align="center">
&lt;td>fingerprints&lt;/td>&lt;td>377&lt;/td>
&lt;/tr>
&lt;/table>
&lt;p>There&amp;rsquo;s a pretty direct trade-off between the necessary record time and the amount of storage needed. Adjusting the amplitude threshold for peaks and the fan value for fingerprinting will add more fingerprints and bolster the accuracy at the expense of more space.&lt;/p>
&lt;p>It&amp;rsquo;s true, the fingerprints take up a surprising amount of space (slightly more than raw MP3 files). This seems alarming until you consider there are tens and sometimes hundreds of thousands of hashes per song. We&amp;rsquo;ve traded off the pure information of the entire audio signal in the wave files for about 20% of that storage in fingerprints. We&amp;rsquo;ve also enabled matching songs very reliably in five seconds, so our space / speed tradeoff appears to have paid off.&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>Audio fingerprinting seemed magical the first time I saw it. But with a small amount of knowledge about signal processing and basic math, it&amp;rsquo;s a fairly accessible field.&lt;/p>
&lt;p>My hope is that anyone reading this will check out the Dejavu Project and drop a few stars on me or, better yet, fork it! Check out Dejavu here:&lt;/p>
&lt;blockquote>
&lt;p>&lt;a href="https://github.com/worldveil/dejavu">https://github.com/worldveil/dejavu&lt;/a>&lt;/p>
&lt;/blockquote></description></item></channel></rss>