<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>regulatory-landscape on Will Drevo</title><link>https://willdrevo.com/tags/regulatory-landscape/</link><description>Recent content in regulatory-landscape on Will Drevo</description><generator>Hugo -- gohugo.io</generator><language>en-us</language><lastBuildDate>Tue, 14 Apr 2026 00:00:00 -0500</lastBuildDate><atom:link href="https://willdrevo.com/tags/regulatory-landscape/index.xml" rel="self" type="application/rss+xml"/><item><title>Ibogaine: the unofficial resources list</title><link>https://willdrevo.com/2026/04/14/ibogaine-the-unofficial-resources-list/</link><pubDate>Tue, 14 Apr 2026 00:00:00 -0500</pubDate><guid>https://willdrevo.com/2026/04/14/ibogaine-the-unofficial-resources-list/</guid><description>&lt;p>A compilation of resources about ibogaine.&lt;/p>
&lt;p>If you are seeking treatment or trying to find a facilitator or clinic, please reach out to me directly! I am more than happy to help &amp;amp; share.&lt;/p>
&lt;p>And if I am missing a resource in any category, please reach out as well. Happy to add it!&lt;/p>
&lt;p>I have absolutely zero financial or vested interest in any of these resources: books, clinics, articles, protocols, etc.&lt;/p>
&lt;aside id="toc">
&lt;h4>Table of Contents&lt;/h4>
&lt;nav id="TableOfContents">
&lt;ul>
&lt;li>&lt;a href="#resources">Resources&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#podcasts">Podcasts&lt;/a>&lt;/li>
&lt;li>&lt;a href="#medical-research--pubmed-papers">Medical research / PubMed papers&lt;/a>&lt;/li>
&lt;li>&lt;a href="#clinics">Clinics&lt;/a>&lt;/li>
&lt;li>&lt;a href="#books">Books&lt;/a>&lt;/li>
&lt;li>&lt;a href="#articles">Articles&lt;/a>&lt;/li>
&lt;li>&lt;a href="#documentaries">Documentaries&lt;/a>&lt;/li>
&lt;li>&lt;a href="#academic--policy">Academic &amp;amp; policy&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/nav>
&lt;/aside>
&lt;!-- ## What is ibogaine?
Ibogaine has been known for hundreds of years, if not more.
But today, this psychedelic root extract from Africa is the darling of a most unlikely alliance:
- **The entheogenic community**: practitioners who have been using ibogaine / iboga for decades (much longer in Africa) for spiritual growth and inner wisdom
- **The conservative right in the USA**: led by [Rick Perry]() and [Bryan Hubbard](), who seek to use it as a treatment for veterans and war heroes suffering from PTSD
- **Researchers &amp; pharmaceutical companies**: researchers from Stanford and pharmaceutical startups in Silicon Valley, seeking to commercialize this seemingly miraculous drug
## What does it do?
Ingesting ibogaine delivers a uniquely long and difficult psychedelic experience that has in numerous studies been shown to:
- Completely interrupt substance dependencies for everything from opioids, to alcohol, to nicotine
- Reduce the symptoms of TBI and PTSD; it has been employed especially among veterans seeking relief
Ibogaine induces a long oneiric (waking dreamlike) state in which many people describe reviewing life experiences and an uncanny ability to remember, relive, and heal traumatic memories from childhood, long since lost or repressed.
## Why isn't it more widely used?
The main downside is that ibogaine is not pleasant to experience. The "trip" itself lasts 24-72 hours (depending on your definition).
A full (flood) dose of ibogaine regularly, but not always, induces:
- Cardiac toxicity (raises heart rate, lowers blood pressure, sometimes dangerously so)
- Ataxia (lack of coordination, dizziness)
- Nausea
There is infinitely more to say about this incredible treatment, but I will leave that to a later post.
This post simply tracks the resources I have found on various aspects of ibogaine, for all those curious to learn more. -->
&lt;h2 id="resources">Resources&lt;/h2>
&lt;p>If something is &lt;span class="star-rating" aria-label="0 out of 5 stars">&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;/span>
, it means I haven&amp;rsquo;t read/listened/consumed it yet, but I continually update this list as I find new sources and read them!&lt;/p>
&lt;h3 id="podcasts">Podcasts&lt;/h3>
&lt;ul>
&lt;li>&lt;span class="star-rating" aria-label="5 out of 5 stars">&lt;span class="star filled">★&lt;/span>&lt;span class="star filled">★&lt;/span>&lt;span class="star filled">★&lt;/span>&lt;span class="star filled">★&lt;/span>&lt;span class="star filled">★&lt;/span>&lt;/span>
&lt;a href="https://open.spotify.com/episode/2aBRLTKYpUo5jfYd68LTg4?si=60bfb8cf3e7a445d">Rick Perry &amp;amp; Bryan Hubbard on Joe Rogan: #2477&lt;/a> (2026)
&lt;ul>
&lt;li>Even if you don&amp;rsquo;t like Joe Rogan, this is an incredible introduction&lt;/li>
&lt;li>This is usually the first content I recommend to anyone wanting to learn more&lt;/li>
&lt;li>Rick Perry and Bryan Hubbard run the &lt;a href="https://www.americansforibogaine.org/">Americans for Ibogaine&lt;/a> organization&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;span class="star-rating" aria-label="5 out of 5 stars">&lt;span class="star filled">★&lt;/span>&lt;span class="star filled">★&lt;/span>&lt;span class="star filled">★&lt;/span>&lt;span class="star filled">★&lt;/span>&lt;span class="star filled">★&lt;/span>&lt;/span>
&lt;a href="https://open.spotify.com/episode/7bYLDLiUN3OeHpJKnABByv?si=59c7a66430bf4cc6">Rick Perry &amp;amp; Bryan Hubbard on Joe Rogan: #2251&lt;/a>
&lt;ul>
&lt;li>Similar to the above, but the first episode. I recommend both, but the 2026 one is obviously more up to date with the progress made in 2025-2026&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;span class="star-rating" aria-label="5 out of 5 stars">&lt;span class="star filled">★&lt;/span>&lt;span class="star filled">★&lt;/span>&lt;span class="star filled">★&lt;/span>&lt;span class="star filled">★&lt;/span>&lt;span class="star filled">★&lt;/span>&lt;/span>
&lt;a href="https://www.economist.com/podcasts/2026/03/28/the-red-state-psychedelic">The Economist: The Red State Psychedelic&lt;/a> (2026)
&lt;ul>
&lt;li>A fantastic dive into the unconventional background of Bryan Hubbard and the first signs that the religious right might accept psychedelics as a form of care&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;span class="star-rating" aria-label="0 out of 5 stars">&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;/span>
&lt;a href="https://open.spotify.com/episode/1mmBgNCjEyv3OEFcTPNCjl">One Reporter&amp;rsquo;s Life-Altering Psychadelic Trip&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="medical-research--pubmed-papers">Medical research / PubMed papers&lt;/h3>
&lt;p>Coming soon.&lt;/p>
&lt;h3 id="clinics">Clinics&lt;/h3>
&lt;p>If you contact me I am happy to share more, but I won&amp;rsquo;t post &amp;ldquo;ratings&amp;rdquo; in this section until I gather more information from primary sources.&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://innerrealmscenter.com/">Inner Realms&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://www.ambio.life/">Ambio&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://beondibogaine.com/">Beond&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://www.ibogaquest.com/">IbogaQuest&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="books">Books&lt;/h3>
&lt;ul>
&lt;li>&lt;span class="star-rating" aria-label="4 out of 5 stars">&lt;span class="star filled">★&lt;/span>&lt;span class="star filled">★&lt;/span>&lt;span class="star filled">★&lt;/span>&lt;span class="star filled">★&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;/span>
&lt;a href="https://www.amazon.com/Iboga-Root-Healing-Daniel-Brett/dp/1838446214">Iboga: the Root of All Healing&lt;/a>
&lt;ul>
&lt;li>Good overview of the history and present-day usage of ibogaine. Covers objective facts alongside a survey of subjective experiences during flood doses themselves.&lt;/li>
&lt;li>Great introductory book if you&amp;rsquo;re more of a book person than podcasts or articles&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;span class="star-rating" aria-label="0 out of 5 stars">&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;/span>
&lt;a href="https://www.amazon.ie/Ibogaine-Story-Report-Staten-Project/dp/1570270295">The Ibogaine Story: Report on the Staten Island Project&lt;/a>&lt;/li>
&lt;li>&lt;span class="star-rating" aria-label="0 out of 5 stars">&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;/span>
&lt;a href="https://www.simonandschuster.com/books/Ibogaine-and-the-Bicameral-Mind/Jonathan-Dickinson/9798888504680">Ibogaine and the Bicameral Mind&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="articles">Articles&lt;/h3>
&lt;ul>
&lt;li>&lt;span class="star-rating" aria-label="0 out of 5 stars">&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;/span>
&lt;a href="https://www.nytimes.com/2025/08/11/us/politics/rick-perry-drug-psychedelics-ibogaine.html">The Long, Strange Trip of Rick Perry&lt;/a> — NYTimes&lt;/li>
&lt;li>&lt;span class="star-rating" aria-label="0 out of 5 stars">&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;/span>
&lt;a href="https://www.nytimes.com/2026/03/01/magazine/ibogaine-psychedelic-treatment-trauma-mental-health.html">It’s an Obscure Psychedelic Used to Treat Trauma. Could It Help Me?&lt;/a> — NYTimes&lt;/li>
&lt;li>&lt;span class="star-rating" aria-label="0 out of 5 stars">&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;/span>
&lt;a href="https://avanteibogaine.com/ibogaine-treatment-complete-guide/">Ibogaine Treatment Complete Guide&lt;/a> — Avante Ibogaine&lt;/li>
&lt;li>&lt;span class="star-rating" aria-label="0 out of 5 stars">&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;/span>
&lt;a href="https://getibogaine.com/best-books-to-read-on-iboga-and-ibogaine/">Best Books to Read on Iboga and Ibogaine&lt;/a> — Get Ibogaine&lt;/li>
&lt;li>&lt;span class="star-rating" aria-label="0 out of 5 stars">&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;/span>
&lt;a href="https://www.nationalgeographic.com/animals/article/ibogaine-pschedelic-drug-root-fair-trade-gabon">Ibogaine, Fair Trade, and Gabon&lt;/a> — National Geographic (Michael Pollan)&lt;/li>
&lt;li>&lt;span class="star-rating" aria-label="0 out of 5 stars">&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;/span>
&lt;a href="https://www.theguardian.com/books/2003/sep/20/booksonhealth.lifeandhealth">Book review: ibogaine and addiction&lt;/a> — The Guardian, 2003&lt;/li>
&lt;/ul>
&lt;h3 id="documentaries">Documentaries&lt;/h3>
&lt;ul>
&lt;li>&lt;span class="star-rating" aria-label="4 out of 5 stars">&lt;span class="star filled">★&lt;/span>&lt;span class="star filled">★&lt;/span>&lt;span class="star filled">★&lt;/span>&lt;span class="star filled">★&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;/span>
&lt;a href="https://www.amazon.com/Ibogaine-Fight-Lifetime/dp/B0G8VG59Q2">Ibogaine: Fight of a Lifetime&lt;/a> — Amazon Prime Video
&lt;ul>
&lt;li>Straightforward, heartwarming, and informative. Made by the Americans for Ibogaine initiative &amp;amp; Beond as a push to pass legislation in Texas for clinical trials&lt;/li>
&lt;li>Focuses on the stories of a few US service members suffering from TBI and PTSD and their journey to the Beond treatment facility in Mexico&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;span class="star-rating" aria-label="0 out of 5 stars">&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;/span>
&lt;a href="https://www.youtube.com/watch?v=vt0E8N4FRFY">Ibogaine: Rite of Passage&lt;/a> — YouTube, free&lt;/li>
&lt;li>&lt;span class="star-rating" aria-label="0 out of 5 stars">&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;/span>
&lt;a href="https://www.netflix.com/title/82047468">In Waves and War&lt;/a> — Netflix&lt;/li>
&lt;/ul>
&lt;h3 id="academic--policy">Academic &amp;amp; policy&lt;/h3>
&lt;ul>
&lt;li>&lt;span class="star-rating" aria-label="0 out of 5 stars">&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;/span>
&lt;a href="https://academic.oup.com/book/24744/chapter-abstract/188256487?redirectedFrom=fulltext">Oxford University Press chapter on ibogaine&lt;/a>&lt;/li>
&lt;li>&lt;span class="star-rating" aria-label="0 out of 5 stars">&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;span class="star empty">☆&lt;/span>&lt;/span>
&lt;a href="https://psychedelicalpha.com/news/what-approval-wont-solve-brian-barnett-on-ketamines-lessons-rvus-and-scaling-psychedelic-care/">What Approval Won&amp;rsquo;t Solve&lt;/a> — Psychedelic Alpha
&lt;ul>
&lt;li>Ketamine&amp;rsquo;s lessons for scaling psychedelic care.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul></description></item><item><title>Training run diagnostic metrics: what I track for when things break down</title><link>https://willdrevo.com/2026/03/04/ai-training-run-diagnostic-metrics-what-i-track-for-when-things-break-down/</link><pubDate>Wed, 04 Mar 2026 01:55:45 -0800</pubDate><guid>https://willdrevo.com/2026/03/04/ai-training-run-diagnostic-metrics-what-i-track-for-when-things-break-down/</guid><description>&lt;figure>
&lt;img src="https://willdrevo.com/static/img/wandb_metrics/learning_diagram.png" alt="" width="1000">
&lt;figcaption>The journey from our loss calculation, to our gradient $G$, to updating our parameters, $P$. &lt;BR>And yes, I didn't use $\theta$ for parameters. Fight me. Also, not to scale.&lt;/figcaption>
&lt;/figure>
&lt;p>This post talks a little about the metrics I track to quickly characterize what is going right or wrong with my runs, to save myself precious time and GPU 💸.&lt;/p>
&lt;p>To be clear, when I say &amp;ldquo;break down&amp;rdquo; I don&amp;rsquo;t mean the run crashed. That&amp;rsquo;s a different sort of debugging. This is for when the model trains, but it&amp;rsquo;s not going the way you want it to.&lt;/p>
&lt;p>I use &lt;a href="https://wandb.ai/">Weights &amp;amp; Biases&lt;/a> (W&amp;amp;B), but this all applies to similar tools like &lt;a href="https://mlflow.org/">MLflow&lt;/a>, &lt;a href="https://www.comet.com/site/">CometML&lt;/a>, and so on.&lt;/p>
&lt;p>These metrics are basic, but over the years I&amp;rsquo;ve picked them up to solve different training run issues. They&amp;rsquo;re much cheaper to collect and log than doing more runs :)&lt;/p>
&lt;p>At the end, I&amp;rsquo;ll also contextualize how and when I log them in a pseudocode loop. Great for throwing right into a coding LLM as scaffolding for your own projects.&lt;/p>
&lt;aside id="toc">
&lt;h4>Table of Contents&lt;/h4>
&lt;nav id="TableOfContents">
&lt;ul>
&lt;li>&lt;a href="#-the-basics-must-haves">📚 The basics, must haves&lt;/a>&lt;/li>
&lt;li>&lt;a href="#-what-is-a-step-exactly">🐾 What is a &amp;ldquo;step&amp;rdquo; exactly?&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#an-aside-logging-under-multiple-processes">An aside: Logging under multiple processes&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#-metric-group-1-grad-norm--grad-norm-per-module">🎓 Metric group #1: Grad norm + grad norm per module&lt;/a>&lt;/li>
&lt;li>&lt;a href="#-metric-group-2-update-norms--effective-lr-ratio">📉 Metric group #2: Update norms + effective LR ratio&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#interpreting-param-norm-update-norms--effective-lr-ratio">Interpreting param norm, update norms &amp;amp; effective LR ratio&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#-metric-group-3-non-loss-test-metrics">📈 Metric group #3: Non-loss test metrics&lt;/a>&lt;/li>
&lt;li>&lt;a href="#-metric-group-4-loss-by-category">🗂️ Metric group #4: Loss by category&lt;/a>&lt;/li>
&lt;li>&lt;a href="#-putting-it-all-together-the-learning-loop-sketch">🔄 Putting it all together: the learning loop sketch&lt;/a>&lt;/li>
&lt;li>&lt;a href="#-summary">🏁 Summary&lt;/a>&lt;/li>
&lt;/ul>
&lt;/nav>
&lt;/aside>
&lt;p>First, let&amp;rsquo;s talk about the non-negotiables.&lt;/p>
&lt;h2 id="-the-basics-must-haves">📚 The basics, must haves&lt;/h2>
&lt;p>Obviously you need to set up Weights &amp;amp; Biases (or whatever tool you&amp;rsquo;re tracking with):&lt;/p>
&lt;div class="highlight">&lt;div style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
&lt;table style="border-spacing:0;padding:0;margin:0;border:0;">&lt;tr>&lt;td style="vertical-align:top;padding:0;margin:0;border:0;">
&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">2
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">3
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">4
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">5
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">6
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">7
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">8
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">9
&lt;/span>&lt;/code>&lt;/pre>&lt;/td>
&lt;td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">if&lt;/span> rank &lt;span style="color:#f92672">==&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span>:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> wandb&lt;span style="color:#f92672">.&lt;/span>init(
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> project&lt;span style="color:#f92672">=&lt;/span>wandb_project_base,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> name&lt;span style="color:#f92672">=&lt;/span>run_name,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> config&lt;span style="color:#f92672">=&lt;/span>checkpoint_config, &lt;span style="color:#75715e"># usually a dict with all my keyword args&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> )
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># save your config somehow! I like saving the YAML&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> wandb&lt;span style="color:#f92672">.&lt;/span>save(config_yaml_path, policy&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;now&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/td>&lt;/tr>&lt;/table>
&lt;/div>
&lt;/div>&lt;p>The simple metrics you MUST track, per step:&lt;/p>
&lt;ol>
&lt;li>Learning rate&lt;/li>
&lt;li>Train loss
&lt;ul>
&lt;li>Per batch&lt;/li>
&lt;li>Per epoch&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Test loss
&lt;ul>
&lt;li>Per epoch&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;p>These are the foundation of what&amp;rsquo;s happening to our model over time.&lt;/p>
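&lt;p>As a minimal sketch of logging these (assuming a standard PyTorch loop where &lt;code>loss&lt;/code>, &lt;code>optimizer&lt;/code>, and &lt;code>step&lt;/code> already exist; &lt;code>avg_test_loss&lt;/code> and the metric names are placeholders of mine):&lt;/p>
&lt;div class="highlight">&lt;pre>&lt;code class="language-python"># once per step:
wandb.log({
    "lr": optimizer.param_groups[0]["lr"],  # current learning rate
    "train/loss_batch": loss.item(),        # this batch's train loss
}, step=step)

# once per epoch, after your eval pass:
wandb.log({"test/loss_epoch": avg_test_loss}, step=step)
&lt;/code>&lt;/pre>&lt;/div>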
&lt;p>Next, we must agree on our x-axis.&lt;/p>
&lt;h2 id="-what-is-a-step-exactly">🐾 What is a &amp;ldquo;step&amp;rdquo; exactly?&lt;/h2>
&lt;p>First off, your x-axis for graphs should be the &amp;ldquo;step&amp;rdquo; count.&lt;/p>
&lt;div class="highlight">&lt;div style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
&lt;table style="border-spacing:0;padding:0;margin:0;border:0;">&lt;tr>&lt;td style="vertical-align:top;padding:0;margin:0;border:0;">
&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">2
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">3
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">4
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">5
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">6
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">7
&lt;/span>&lt;/code>&lt;/pre>&lt;/td>
&lt;td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># Use our custom &amp;#34;step&amp;#34; as the x-axis for all metrics&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># This allows comparing runs at the same training step, even when resuming&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>wandb&lt;span style="color:#f92672">.&lt;/span>define_metric(&lt;span style="color:#e6db74">&amp;#34;step&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>wandb&lt;span style="color:#f92672">.&lt;/span>define_metric(&lt;span style="color:#e6db74">&amp;#34;*&amp;#34;&lt;/span>, step_metric&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;step&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># and then to log each time:&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>wandb&lt;span style="color:#f92672">.&lt;/span>log({ &lt;span style="color:#f92672">...&lt;/span> }, step&lt;span style="color:#f92672">=&lt;/span>step)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/td>&lt;/tr>&lt;/table>
&lt;/div>
&lt;/div>&lt;p>Each step ends with updating your model&amp;rsquo;s parameters. So if you are accumulating gradients over multiple forward passes, I would suggest that block being your &amp;ldquo;step&amp;rdquo;.&lt;/p>
&lt;p>This will smooth out the statistics you report (less noise) and keep all your logic like checkpointing or reporting ticking on the same heartbeat.&lt;/p>
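&lt;p>A minimal sketch of that heartbeat, assuming &lt;code>accum_steps&lt;/code> micro-batches per optimizer update (the names here are placeholders, not from my actual code):&lt;/p>
&lt;div class="highlight">&lt;pre>&lt;code class="language-python">step = 0
optimizer.zero_grad()
for i, batch in enumerate(train_loader):
    loss = compute_loss(model, batch) / accum_steps  # placeholder loss fn
    loss.backward()  # gradients accumulate across micro-batches
    if (i + 1) % accum_steps == 0:
        optimizer.step()  # one parameter update == one "step"
        optimizer.zero_grad()
        step += 1
        if rank == 0:
            # recover the unscaled loss of the last micro-batch
            wandb.log({"train/loss_batch": loss.item() * accum_steps}, step=step)
&lt;/code>&lt;/pre>&lt;/div>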
&lt;h3 id="an-aside-logging-under-multiple-processes">An aside: Logging under multiple processes&lt;/h3>
&lt;p>For a multi-GPU setup, I will often just have a single process reporting back metrics, i.e.:&lt;/p>
&lt;div class="highlight">&lt;div style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
&lt;table style="border-spacing:0;padding:0;margin:0;border:0;">&lt;tr>&lt;td style="vertical-align:top;padding:0;margin:0;border:0;">
&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">2
&lt;/span>&lt;/code>&lt;/pre>&lt;/td>
&lt;td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">if&lt;/span> rank &lt;span style="color:#f92672">==&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span>:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> wandb&lt;span style="color:#f92672">.&lt;/span>log({ &lt;span style="color:#f92672">...&lt;/span> }, step&lt;span style="color:#f92672">=&lt;/span>step)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/td>&lt;/tr>&lt;/table>
&lt;/div>
&lt;/div>&lt;p>For training from multiple machines, the advice is similar, you just have to pick a leader somehow.&lt;/p>
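&lt;p>With &lt;code>torch.distributed&lt;/code>, the global rank makes a natural leader. A minimal sketch (any deterministic choice works):&lt;/p>
&lt;div class="highlight">&lt;pre>&lt;code class="language-python">import torch.distributed as dist

# global rank 0 spans all machines, unlike a per-node local rank
is_leader = dist.get_rank() == 0
if is_leader:
    wandb.log({ ... }, step=step)
&lt;/code>&lt;/pre>&lt;/div>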
&lt;p>The only time you need all processes to participate is if you parallelize test set evaluation (which I do).&lt;/p>
&lt;p>You&amp;rsquo;ll need an all-reduce step to &amp;ldquo;collect&amp;rdquo; the various losses or metrics from each process and combine them on your leader process, which then calls &lt;code>wandb.log()&lt;/code>:&lt;/p>
&lt;div class="highlight">&lt;div style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
&lt;table style="border-spacing:0;padding:0;margin:0;border:0;">&lt;tr>&lt;td style="vertical-align:top;padding:0;margin:0;border:0;">
&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 1
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 2
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 3
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 4
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 5
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 6
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 7
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 8
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 9
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">10
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">11
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">12
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">13
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">14
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">15
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">16
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">17
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">18
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">19
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">20
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">21
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">22
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">23
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">24
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">25
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">26
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">27
&lt;/span>&lt;/code>&lt;/pre>&lt;/td>
&lt;td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># create our tensor we will all reduce sum over, coming&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># from each process in our training process group&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># this runs in ALL PROCESSES&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>loss_t &lt;span style="color:#f92672">=&lt;/span> torch&lt;span style="color:#f92672">.&lt;/span>tensor([
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> total_losses[&lt;span style="color:#e6db74">&amp;#39;total&amp;#39;&lt;/span>],
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> total_losses[&lt;span style="color:#e6db74">&amp;#39;main_task&amp;#39;&lt;/span>],
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> total_losses[&lt;span style="color:#e6db74">&amp;#39;aux_loss1&amp;#39;&lt;/span>],
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> total_losses[&lt;span style="color:#e6db74">&amp;#39;aux_loss2&amp;#39;&lt;/span>],
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> total_losses[&lt;span style="color:#e6db74">&amp;#39;aux_loss3&amp;#39;&lt;/span>],
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> float(num_test_batches_this_process)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> ], device&lt;span style="color:#f92672">=&lt;/span>device
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># add them all together, elementwise&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>dist&lt;span style="color:#f92672">.&lt;/span>all_reduce(loss_t, op&lt;span style="color:#f92672">=&lt;/span>dist&lt;span style="color:#f92672">.&lt;/span>ReduceOp&lt;span style="color:#f92672">.&lt;/span>SUM)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">if&lt;/span> rank &lt;span style="color:#f92672">==&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span>:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># run only on single process: compute averages &amp;amp; log&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> total_test_batches &lt;span style="color:#f92672">=&lt;/span> int(loss_t[&lt;span style="color:#ae81ff">5&lt;/span>]&lt;span style="color:#f92672">.&lt;/span>item())
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> avg_losses &lt;span style="color:#f92672">=&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#39;total&amp;#39;&lt;/span>: loss_t[&lt;span style="color:#ae81ff">0&lt;/span>]&lt;span style="color:#f92672">.&lt;/span>item() &lt;span style="color:#f92672">/&lt;/span> total_test_batches,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#39;main_task&amp;#39;&lt;/span>: loss_t[&lt;span style="color:#ae81ff">1&lt;/span>]&lt;span style="color:#f92672">.&lt;/span>item() &lt;span style="color:#f92672">/&lt;/span> total_test_batches,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#39;aux_loss1&amp;#39;&lt;/span>: loss_t[&lt;span style="color:#ae81ff">2&lt;/span>]&lt;span style="color:#f92672">.&lt;/span>item() &lt;span style="color:#f92672">/&lt;/span> total_test_batches,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#39;aux_loss2&amp;#39;&lt;/span>: loss_t[&lt;span style="color:#ae81ff">3&lt;/span>]&lt;span style="color:#f92672">.&lt;/span>item() &lt;span style="color:#f92672">/&lt;/span> total_test_batches,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#39;aux_loss3&amp;#39;&lt;/span>: loss_t[&lt;span style="color:#ae81ff">4&lt;/span>]&lt;span style="color:#f92672">.&lt;/span>item() &lt;span style="color:#f92672">/&lt;/span> total_test_batches,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> }
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> wandb&lt;span style="color:#f92672">.&lt;/span>log(avg_losses, step&lt;span style="color:#f92672">=&lt;/span>step)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/td>&lt;/tr>&lt;/table>
&lt;/div>
&lt;/div>&lt;p>To be clear, you can have multiple processes reporting back train metrics. But you&amp;rsquo;ll end up with multiple data points per step on your graph and this is noisy.&lt;/p>
&lt;p>Additionally, it becomes harder to compare a metric against a previous run&amp;rsquo;s if each run draws multiple lines.&lt;/p>
&lt;p>With that out of the way, let&amp;rsquo;s get to the metrics.&lt;/p>
&lt;h2 id="-metric-group-1-grad-norm--grad-norm-per-module">🎓 Metric group #1: Grad norm + grad norm per module&lt;/h2>
&lt;p>You likely already track gradient (grad) norm, what I&amp;rsquo;ll write as $\left\lVert{G}\right\rVert_2$ since it&amp;rsquo;s the L2 norm of the gradient before any clipping.&lt;/p>
&lt;p>The norm (size) of our gradient basically answers the question: &amp;ldquo;how large of a change in parameter space is our loss proposing?&amp;rdquo;&lt;/p>
&lt;p>An oversimplification of how the gradient $G$ is applied to your network&amp;rsquo;s parameters $P$ using learning rate scalar $\alpha$ is:&lt;/p>
&lt;p>$$ P_{new} = P_{old} - G_{clipped} * \alpha$$&lt;/p>
&lt;blockquote>
&lt;p>Note: if your optimizer is something like AdamW, this is directionally but not literally true. Many optimizers try to maintain a &amp;ldquo;trajectory&amp;rdquo; of your parameter updates over time (ie: momentum) or other tricks to help you traverse the loss landscape in a faster manner. But this equation is the underlying dynamic.&lt;/p>
&lt;/blockquote>
&lt;p>where $G$ and $P$ are both vectors of length $N$, the number of parameters in your network.&lt;/p>
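&lt;p>To make that concrete, here is a toy version of the clipped update in plain PyTorch (pure SGD, no momentum; a sketch, with &lt;code>max_norm&lt;/code> mirroring the 1.0 clipping threshold I use later):&lt;/p>
&lt;div class="highlight">&lt;pre>&lt;code class="language-python">import torch

alpha, max_norm = 1e-3, 1.0
G = torch.randn(5)      # pretend this is our gradient
P_old = torch.randn(5)  # ...and our parameters

# scale the gradient down only if its L2 norm exceeds max_norm
G_clipped = G * min(1.0, max_norm / (G.norm().item() + 1e-6))

P_new = P_old - alpha * G_clipped
&lt;/code>&lt;/pre>&lt;/div>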
&lt;p>Grad norm takes the accumulated gradient for the step (which could be summed over multiple grad accumulation passes), treats it as one huge, long vector (size $N$), and computes its L2 norm (or length):&lt;/p>
&lt;div class="highlight">&lt;div style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
&lt;table style="border-spacing:0;padding:0;margin:0;border:0;">&lt;tr>&lt;td style="vertical-align:top;padding:0;margin:0;border:0;">
&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 1
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 2
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 3
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 4
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 5
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 6
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 7
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 8
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 9
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">10
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">11
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">12
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">13
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">14
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">15
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">16
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">17
&lt;/span>&lt;/code>&lt;/pre>&lt;/td>
&lt;td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">compute_grad_norm&lt;/span>(parameters, norm_type&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">2&lt;/span>):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#34;&amp;#34;&amp;#34;
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> Compute the norm of the gradients of the parameters.
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> This implementation computes norms per parameter for memory
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> efficiency reasons, rather than concatenating to one giant
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> vector and computing the norm on it. The result is mathematically
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> equivalent.
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> &amp;#34;&amp;#34;&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> total_norm &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">0.0&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">for&lt;/span> p &lt;span style="color:#f92672">in&lt;/span> parameters:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> p&lt;span style="color:#f92672">.&lt;/span>grad &lt;span style="color:#f92672">is&lt;/span> &lt;span style="color:#f92672">not&lt;/span> &lt;span style="color:#66d9ef">None&lt;/span>:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> param_norm &lt;span style="color:#f92672">=&lt;/span> p&lt;span style="color:#f92672">.&lt;/span>grad&lt;span style="color:#f92672">.&lt;/span>data&lt;span style="color:#f92672">.&lt;/span>norm(norm_type)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> total_norm &lt;span style="color:#f92672">+=&lt;/span> param_norm&lt;span style="color:#f92672">.&lt;/span>item() &lt;span style="color:#f92672">**&lt;/span> norm_type
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> total_norm &lt;span style="color:#f92672">**&lt;/span> (&lt;span style="color:#ae81ff">1.0&lt;/span> &lt;span style="color:#f92672">/&lt;/span> norm_type)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>grad_norm &lt;span style="color:#f92672">=&lt;/span> compute_grad_norm(model&lt;span style="color:#f92672">.&lt;/span>parameters())
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/td>&lt;/tr>&lt;/table>
&lt;/div>
&lt;/div>&lt;p>What I propose tracking are additional per-module norms, so for each module of your torch network, you&amp;rsquo;d compute the subgraph&amp;rsquo;s grad norm, and also plot that:&lt;/p>
&lt;div class="highlight">&lt;div style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
&lt;table style="border-spacing:0;padding:0;margin:0;border:0;">&lt;tr>&lt;td style="vertical-align:top;padding:0;margin:0;border:0;">
&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 1
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 2
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 3
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 4
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 5
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 6
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 7
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 8
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 9
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">10
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">11
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">12
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">13
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">14
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">15
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">16
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">17
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">18
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">19
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">20
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">21
&lt;/span>&lt;/code>&lt;/pre>&lt;/td>
&lt;td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">compute_model_grad_norm_per_module&lt;/span>(model):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#34;&amp;#34;&amp;#34;
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> Compute the norm of the gradients of the parameters
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> for each module in the model
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> Returns a wandb-loggable dict with mapping:
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> module name: str -&amp;gt; float
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> &amp;#34;&amp;#34;&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> grad_norms &lt;span style="color:#f92672">=&lt;/span> {}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> grad_norms[&lt;span style="color:#e6db74">&amp;#34;grad_norm/overall&amp;#34;&lt;/span>] &lt;span style="color:#f92672">=&lt;/span> compute_grad_norm(
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> model&lt;span style="color:#f92672">.&lt;/span>parameters()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> )
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">for&lt;/span> name, module &lt;span style="color:#f92672">in&lt;/span> model&lt;span style="color:#f92672">.&lt;/span>named_modules():
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> name &lt;span style="color:#f92672">and&lt;/span> name&lt;span style="color:#f92672">.&lt;/span>strip():
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># Only consider modules with trainable parameters&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> any(p&lt;span style="color:#f92672">.&lt;/span>requires_grad &lt;span style="color:#66d9ef">for&lt;/span> p &lt;span style="color:#f92672">in&lt;/span> module&lt;span style="color:#f92672">.&lt;/span>parameters()):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> grad_norms[&lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;grad_norm/&lt;/span>&lt;span style="color:#e6db74">{&lt;/span>name&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>] &lt;span style="color:#f92672">=&lt;/span> \
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> compute_grad_norm(module&lt;span style="color:#f92672">.&lt;/span>parameters())
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> grad_norms
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>grad_norm_per_module &lt;span style="color:#f92672">=&lt;/span> compute_model_grad_norm_per_module(model)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/td>&lt;/tr>&lt;/table>
&lt;/div>
&lt;/div>&lt;p>Why do this?&lt;/p>
&lt;p>Well, if you track grad norm, it&amp;rsquo;s because you want to know if network updates are going haywire, either getting too big or too small over time. And if that is the case, then you&amp;rsquo;re going to want to know &lt;em>why&lt;/em>.&lt;/p>
&lt;p>You could easily chalk it up to &amp;ldquo;oh the learning rate must be too high&amp;rdquo; or &amp;ldquo;must be too much regularization&amp;rdquo; (and it very well might be), but before you go and kick off another expensive run, checking the per-module grad norm can help save you time.&lt;/p>
&lt;blockquote>
&lt;p>And remember, if you have gradient clipping on, it&amp;rsquo;s important to track the value &lt;strong>pre-clip&lt;/strong> as that&amp;rsquo;s the pure signal your learning process is working with before clipping tries to tame it.&lt;/p>
&lt;/blockquote>
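&lt;p>Conveniently, PyTorch&amp;rsquo;s clipping utility returns exactly this pre-clip value, so you can log it without a second pass over the gradients:&lt;/p>
&lt;div class="highlight">&lt;pre>&lt;code class="language-python"># clip_grad_norm_ returns the total norm computed BEFORE clipping
pre_clip_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
if rank == 0:
    wandb.log({"grad_norm/pre_clip": pre_clip_norm.item()}, step=step)
&lt;/code>&lt;/pre>&lt;/div>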
&lt;p>Let&amp;rsquo;s go through a real-world example.&lt;/p>
&lt;p>In the example below, I was training a small but decently complex transformer network (~11M parameters) for realtime audio. I had just added a number of improvements on the data and architecture side, and kicked off another run.&lt;/p>
&lt;p>I started to notice the issue with the (pre-clipping) grad norm graph:&lt;/p>
&lt;figure>
&lt;img src="https://willdrevo.com/static/img/wandb_metrics/grad_norm_explode.png" alt="" width="800">
&lt;figcaption>&lt;/figcaption>
&lt;/figure>
&lt;p>Ouch. This run was not going to converge anytime soon.&lt;/p>
&lt;p>And the beginning of the grad norm explosion upwards did coincide with the peak of the learning rate, after the warmup window:&lt;/p>
&lt;figure>
&lt;img src="https://willdrevo.com/static/img/wandb_metrics/explode_lr.png" alt="" width="800">
&lt;figcaption>&lt;/figcaption>
&lt;/figure>
&lt;p>So with the fairly aggressive learning rate of &lt;code>1e-3&lt;/code>, it &lt;em>would&lt;/em> be a valid conclusion that the learning rate was too high.&lt;/p>
&lt;p>But this didn&amp;rsquo;t seem right. Even with a bunch of changes, I&amp;rsquo;d been training this network previously and &lt;code>1e-3&lt;/code> had proven aggressive, but stable. I hadn&amp;rsquo;t completely changed the size of the network or regularization in a drastic enough way for this much of a deviation.&lt;/p>
&lt;p>Luckily, I had per module grad norm logged!&lt;/p>
&lt;p>I began to notice a pattern. The gradient norm at later layers seemed high, but not crazy:&lt;/p>
&lt;figure>
&lt;img src="https://willdrevo.com/static/img/wandb_metrics/grad_norm_layer8.png" alt="" width="800">
&lt;figcaption>The 8th layer's LayerNorm grad norms over time&lt;/figcaption>
&lt;/figure>
&lt;p>But they steadily got worse the closer they were to the front of the network:&lt;/p>
&lt;figure>
&lt;img src="https://willdrevo.com/static/img/wandb_metrics/grad_norm_layer7.png" alt="" width="800">
&lt;figcaption>Getting slightly worse in the 7th layer&lt;/figcaption>
&lt;/figure>
&lt;p>And wild by the first layer (check the y-axis):&lt;/p>
&lt;figure>
&lt;img src="https://willdrevo.com/static/img/wandb_metrics/grad_norm_layer1.png" alt="" width="800">
&lt;figcaption>Getting pretty crazy&lt;/figcaption>
&lt;/figure>
&lt;p>But things were totally insane by the frontend conv layers, with peaks in the thousands! For reference, I had gradient clipping on for any gradient norm &amp;gt; 1.0. Clipping prevented the weights from exploding outright, but didn&amp;rsquo;t fix the underlying problem: the gradient &lt;em>direction&lt;/em> was dominated by the unstable parameter, starving the rest of the network of useful gradient signal.&lt;/p>
&lt;figure>
&lt;img src="https://willdrevo.com/static/img/wandb_metrics/grad_norm_conv1.png" alt="" width="800">
&lt;figcaption>Insanity at the first conv layer&lt;/figcaption>
&lt;/figure>
&lt;p>But my &lt;code>conv&lt;/code> layers&amp;rsquo; random init values seemed completely reasonable. So a dead end there.&lt;/p>
&lt;p>But then it hit me.&lt;/p>
&lt;p>I had recently hypothesized the model might need to reweight the mel bins based on a loudness curve, sort of like humans have our own auditory perceptual curve (see: &lt;a href="https://en.wikipedia.org/wiki/Equal-loudness_contour">Fletcher–Munson equal loudness curve&lt;/a>). And in terms of parameters/FLOPs it&amp;rsquo;s stupidly cheap.&lt;/p>
&lt;p>So I added a simple scaling of my mel frames at the start of the forward pass:&lt;/p>
&lt;div class="highlight">&lt;div style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
&lt;table style="border-spacing:0;padding:0;margin:0;border:0;">&lt;tr>&lt;td style="vertical-align:top;padding:0;margin:0;border:0;">
&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 1
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 2
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 3
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 4
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 5
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 6
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 7
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 8
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 9
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">10
&lt;/span>&lt;/code>&lt;/pre>&lt;/td>
&lt;td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">class&lt;/span> &lt;span style="color:#a6e22e">Model&lt;/span>(nn&lt;span style="color:#f92672">.&lt;/span>Module):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">def&lt;/span> __init__(self, &lt;span style="color:#f92672">...&lt;/span>):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># ...&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> self&lt;span style="color:#f92672">.&lt;/span>mel_scale &lt;span style="color:#f92672">=&lt;/span> nn&lt;span style="color:#f92672">.&lt;/span>Parameter(torch&lt;span style="color:#f92672">.&lt;/span>randn(self&lt;span style="color:#f92672">.&lt;/span>num_mels))
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># ...&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">forward&lt;/span>(self, x, &lt;span style="color:#f92672">...&lt;/span>):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># x is tensor sized: (batch, time, num_mels)&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> x &lt;span style="color:#f92672">*=&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>mel_scale &lt;span style="color:#75715e"># hint: don&amp;#39;t do this 🤣&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># ...&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/td>&lt;/tr>&lt;/table>
&lt;/div>
&lt;/div>&lt;p>You might see the train wreck coming.&lt;/p>
&lt;p>This had multiple problems:&lt;/p>
&lt;ol>
&lt;li>Initialization doesn&amp;rsquo;t start at identity
&lt;ul>
&lt;li>&lt;code>torch.randn&lt;/code> draws from $N(0, 1)$ (a Gaussian centered at 0 with standard deviation 1)&lt;/li>
&lt;li>So at initialization, in expectation:
&lt;ul>
&lt;li>half our values will be negative (flipping the sign of our features)&lt;/li>
&lt;li>many are near zero (killing bins entirely)&lt;/li>
&lt;li>almost none are near 1.0 (identity, passing through original features untouched); the quick check after this list makes these numbers concrete.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Negative values are particularly bad when elementwise-scaling log-valued features
&lt;ul>
&lt;li>Imagine a mel bin with a value of -80 dB. This is virtually silent.&lt;/li>
&lt;li>Multiplying it by -1 is disastrous: our quietest bin is now at +80 dB, INSANELY loud&lt;/li>
&lt;li>This is exactly why &lt;code>nn.LayerNorm&lt;/code> (and every other normalization layer) initializes its multiplicative &lt;code>weight&lt;/code> parameter to &lt;strong>ones&lt;/strong> and its additive &lt;code>bias&lt;/code> parameter to &lt;strong>zeros&lt;/strong>. Those are the identity elements for their respective operations.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
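&lt;p>To make point 1 concrete, here&amp;rsquo;s the quick check (a standalone sketch, not from the training code):&lt;/p>
&lt;pre>&lt;code class="language-python">import torch

torch.manual_seed(0)
w = torch.randn(100_000)  # same init as the buggy mel_scale

print((w &amp;lt; 0).float().mean().item())                 # ~0.50: half the bins get sign-flipped
print((w.abs() &amp;lt; 0.1).float().mean().item())          # ~0.08: many bins nearly zeroed out
print(((w - 1.0).abs() &amp;lt; 0.1).float().mean().item())  # ~0.05: almost none start near identity
&lt;/code>&lt;/pre>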
&lt;p>The fix is very simple.&lt;/p>
&lt;p>Multiplying in linear space is &lt;em>addition&lt;/em> in log space, and our mel frames are already log-scaled, so we learn a per-bin bias instead:&lt;/p>
&lt;div class="highlight">&lt;div style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
&lt;table style="border-spacing:0;padding:0;margin:0;border:0;">&lt;tr>&lt;td style="vertical-align:top;padding:0;margin:0;border:0;">
&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 1
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 2
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 3
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 4
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 5
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 6
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 7
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 8
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 9
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">10
&lt;/span>&lt;/code>&lt;/pre>&lt;/td>
&lt;td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">class&lt;/span> &lt;span style="color:#a6e22e">Model&lt;/span>(nn&lt;span style="color:#f92672">.&lt;/span>Module):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">def&lt;/span> __init__(self, &lt;span style="color:#f92672">...&lt;/span>):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># ...&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>        self&lt;span style="color:#f92672">.&lt;/span>mel_bias &lt;span style="color:#f92672">=&lt;/span> nn&lt;span style="color:#f92672">.&lt;/span>Parameter(torch&lt;span style="color:#f92672">.&lt;/span>zeros(self&lt;span style="color:#f92672">.&lt;/span>num_mels))
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># ...&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">forward&lt;/span>(self, x, &lt;span style="color:#f92672">...&lt;/span>):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># x is log-mel, sized: (batch, time, num_mels)&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> x &lt;span style="color:#f92672">+=&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>mel_bias
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># ...&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/td>&lt;/tr>&lt;/table>
&lt;/div>
&lt;/div>&lt;p>And we init at 0.0, so this starts as a no-op.&lt;/p>
&lt;p>An additional benefit is that because we &lt;em>add&lt;/em> &lt;code>self.mel_bias&lt;/code> (instead of &lt;em>multiply&lt;/em>), its local gradient is 1.0 instead of the input magnitude, so our gradients (and thus our updates to &lt;code>self.mel_bias&lt;/code>) are much more stable.&lt;/p>
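&lt;p>Concretely, comparing the two parameterizations:&lt;/p>
&lt;p>$$ \frac{\partial (x \cdot s)}{\partial s} = x \qquad \text{vs.} \qquad \frac{\partial (x + b)}{\partial b} = 1 $$&lt;/p>
&lt;p>With the multiplicative form, that -80 dB bin feeds a gradient scaled by a factor of 80 in magnitude into its scale parameter; with the additive form, the local gradient is always 1.&lt;/p>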
&lt;p>This completely fixed the issue:&lt;/p>
&lt;figure>
&lt;img src="https://willdrevo.com/static/img/wandb_metrics/grad_norm_after_fix.png" alt="" width="800">
&lt;figcaption>The green line is after the fix. Nice, slow, steady decline of grad norm after LR peak&lt;/figcaption>
&lt;/figure>
&lt;p>You might also have noticed that because the &lt;code>self.mel_scale&lt;/code> scaling tensor was just an &lt;code>nn.Parameter&lt;/code>, we wouldn&amp;rsquo;t get the per-module grad norm computed with the code above. The fix would be to make an &lt;code>nn.Module&lt;/code> wrapper for it:&lt;/p>
&lt;div class="highlight">&lt;div style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
&lt;table style="border-spacing:0;padding:0;margin:0;border:0;">&lt;tr>&lt;td style="vertical-align:top;padding:0;margin:0;border:0;">
&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">2
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">3
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">4
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">5
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">6
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">7
&lt;/span>&lt;/code>&lt;/pre>&lt;/td>
&lt;td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">class&lt;/span> &lt;span style="color:#a6e22e">LearnableBias&lt;/span>(nn&lt;span style="color:#f92672">.&lt;/span>Module):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">def&lt;/span> __init__(self, n_channels: int):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> super()&lt;span style="color:#f92672">.&lt;/span>__init__()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> self&lt;span style="color:#f92672">.&lt;/span>bias &lt;span style="color:#f92672">=&lt;/span> nn&lt;span style="color:#f92672">.&lt;/span>Parameter(torch&lt;span style="color:#f92672">.&lt;/span>zeros(n_channels))
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">forward&lt;/span>(self, x: torch&lt;span style="color:#f92672">.&lt;/span>Tensor) &lt;span style="color:#f92672">-&amp;gt;&lt;/span> torch&lt;span style="color:#f92672">.&lt;/span>Tensor:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> x &lt;span style="color:#f92672">+&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>bias
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/td>&lt;/tr>&lt;/table>
&lt;/div>
&lt;/div>&lt;p>and then &lt;code>compute_model_grad_norm_per_module()&lt;/code> would have computed and reported this in the key &lt;code>grad_norm/mel_bias&lt;/code>.&lt;/p>
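&lt;p>For completeness, the swap inside the model would look something like this (a minimal sketch, reusing &lt;code>LearnableBias&lt;/code> from above):&lt;/p>
&lt;pre>&lt;code class="language-python">import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self, num_mels: int):
        super().__init__()
        # a proper submodule now, so per-module grad norm logging picks it up
        self.mel_bias = LearnableBias(num_mels)

    def forward(self, x: torch.Tensor) -&amp;gt; torch.Tensor:
        # x is log-mel, sized: (batch, time, num_mels)
        return self.mel_bias(x)
&lt;/code>&lt;/pre>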
&lt;p>Either way, per-module grad norm logging led me to the issue. But without this, I might have wasted another run or two guessing lower learning rates.&lt;/p>
&lt;p>And as you know, when you lower the learning rate, &lt;em>it takes you longer to find the issue&lt;/em> because the learning process is slowed.&lt;/p>
&lt;p>So obviously grad norm per-module is a valuable metric in your toolbox.&lt;/p>
&lt;p>Let&amp;rsquo;s talk about a related measure, the update norm.&lt;/p>
&lt;h2 id="-metric-group-2-update-norms--effective-lr-ratio">📉 Metric group #2: Update norms + effective LR ratio&lt;/h2>
&lt;p>To properly introduce this family of metrics, I drew a diagram:&lt;/p>
&lt;figure>
&lt;img src="https://willdrevo.com/static/img/wandb_metrics/learning_diagram.png" alt="" width="1000">
&lt;figcaption>The journey from loss to update. &lt;BR>Vector sizes would definitely not be to scale for a typical training run 😆 &lt;/figcaption>
&lt;/figure>
&lt;p>First, the magic of backprop turns our single scalar loss into a large set of numbers: a gradient associated with each parameter of the network.&lt;/p>
&lt;p>We can group the gradient by module into smaller vectors (the colored arrows), which we can characterize for debugging (more on this later).&lt;/p>
&lt;p>Then we concatenate (not add) them all into a single, much longer vector, $G$ (the gradient vector).&lt;/p>
&lt;p>Next, we clip $G$ if necessary, scaling it down to $G_{clipped}$. Note that $G_{clipped}$ points in the same direction as $G$; clipping only changes the magnitude.&lt;/p>
&lt;p>Finally, a bunch of things happen:&lt;/p>
&lt;ul>
&lt;li>Optimizer modifies $G_{clipped}$ (via momentum, adaptive scaling, etc)&lt;/li>
&lt;li>Learning rate multiplier is applied&lt;/li>
&lt;/ul>
&lt;p>These can change both scale and rotation, and give us the value we &lt;em>actually use&lt;/em> to update our parameters. We&amp;rsquo;ll call it the update, $U$.&lt;/p>
&lt;p>And to update the parameters in our network, we apply the standard:&lt;/p>
&lt;p>$$ P_{new} = P_{old} - U$$&lt;/p>
&lt;p>So from this, we can define a few new metrics:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Update norm&lt;/strong>: $\left\lVert{U}\right\rVert_2$
&lt;ul>
&lt;li>&amp;ldquo;Size&amp;rdquo; of the actual update in parameter space&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Param norm&lt;/strong>: $\left\lVert{P_{new}}\right\rVert_2$
&lt;ul>
&lt;li>&amp;ldquo;Size&amp;rdquo; of the new model in parameter space&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Relative update norm&lt;/strong>: ratio of $\left\lVert{U}\right\rVert_2$ / $\left\lVert{P_{new}}\right\rVert_2$
&lt;ul>
&lt;li>How much of the entire network we are changing per-step&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Effective LR ratio&lt;/strong>: The ratio between the actual step size (update norm) and the gradient norm after clipping: $\left\lVert{U}\right\rVert_2$ / $\left\lVert{G_{clipped}}\right\rVert_2$&lt;/li>
&lt;/ul>
&lt;p>Easy and simple. Here&amp;rsquo;s how we calculate them:&lt;/p>
&lt;div class="highlight">&lt;div style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
&lt;table style="border-spacing:0;padding:0;margin:0;border:0;">&lt;tr>&lt;td style="vertical-align:top;padding:0;margin:0;border:0;">
&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 1
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 2
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 3
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 4
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 5
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 6
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 7
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 8
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 9
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">10
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">11
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">12
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">13
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">14
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">15
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">16
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">17
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">18
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">19
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">20
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">21
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">22
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">23
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">24
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">25
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">26
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">27
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">28
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">29
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">30
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">31
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">32
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">33
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">34
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">35
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">36
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">37
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">38
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">39
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">40
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">41
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">42
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">43
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">44
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">45
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">46
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">47
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">48
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">49
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">50
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">51
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">52
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">53
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">54
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">55
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">56
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">57
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">58
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">59
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">60
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">61
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">62
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">63
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">64
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">65
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">66
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">67
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">68
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">69
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">70
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">71
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">72
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">73
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">74
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">75
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">76
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">77
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">78
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">79
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">80
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">81
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">82
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">83
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">84
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">85
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">86
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">87
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">88
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">89
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">90
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">91
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">92
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">93
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">94
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">95
&lt;/span>&lt;/code>&lt;/pre>&lt;/td>
&lt;td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">snapshot_params_to_cpu&lt;/span>(model):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#34;&amp;#34;&amp;#34;
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> Snapshot all trainable parameters to CPU memory.
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> Use this before optimizer.step() to later compute update norms.
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> Copying to CPU avoids GPU VRAM spikes from doubling parameter memory.
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> Args:
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> model: PyTorch model (can be wrapped in DDP)
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> Returns:
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> dict: {param_name: param_tensor_on_cpu} for all requires_grad
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> parameters
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> &amp;#34;&amp;#34;&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> name: p&lt;span style="color:#f92672">.&lt;/span>detach()&lt;span style="color:#f92672">.&lt;/span>clone()&lt;span style="color:#f92672">.&lt;/span>cpu()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">for&lt;/span> name, p &lt;span style="color:#f92672">in&lt;/span> model&lt;span style="color:#f92672">.&lt;/span>named_parameters() &lt;span style="color:#66d9ef">if&lt;/span> p&lt;span style="color:#f92672">.&lt;/span>requires_grad
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> }
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">compute_update_norms&lt;/span>(model, old_params_cpu, grad_norm_after_clip&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#66d9ef">None&lt;/span>):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#34;&amp;#34;&amp;#34;
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> Compute update norms after an optimizer step.
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> Measures the actual parameter changes made by the optimizer, which reflects
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> the combined effect of gradients, learning rate, momentum, and adaptive
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> scaling (e.g., Adam&amp;#39;s second moment).
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> Args:
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> model: PyTorch model after optimizer.step()
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> old_params_cpu: dict from snapshot_params_to_cpu() taken before step
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> grad_norm_after_clip: optional gradient norm after clipping, used to
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> compute effective learning rate ratio
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74">
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> Returns:
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> dict with keys:
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> - update_norm:
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> L2 norm of all parameter changes
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> - param_norm:
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> L2 norm of all current parameters
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> - relative_update_norm:
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> update_norm / param_norm (stability metric)
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> - effective_lr_ratio:
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> update_norm / grad_norm_after_clip (if provided)
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> &amp;#34;&amp;#34;&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> update_deltas &lt;span style="color:#f92672">=&lt;/span> []
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> param_flatcats &lt;span style="color:#f92672">=&lt;/span> []
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># iterate through new parameters, compare to old&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">for&lt;/span> name, p &lt;span style="color:#f92672">in&lt;/span> model&lt;span style="color:#f92672">.&lt;/span>named_parameters():
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> p&lt;span style="color:#f92672">.&lt;/span>requires_grad &lt;span style="color:#f92672">and&lt;/span> name &lt;span style="color:#f92672">in&lt;/span> old_params_cpu:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> p_cpu &lt;span style="color:#f92672">=&lt;/span> p&lt;span style="color:#f92672">.&lt;/span>detach()&lt;span style="color:#f92672">.&lt;/span>cpu()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> delta &lt;span style="color:#f92672">=&lt;/span> p_cpu &lt;span style="color:#f92672">-&lt;/span> old_params_cpu[name]
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> update_deltas&lt;span style="color:#f92672">.&lt;/span>append(delta&lt;span style="color:#f92672">.&lt;/span>flatten())
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> param_flatcats&lt;span style="color:#f92672">.&lt;/span>append(p_cpu&lt;span style="color:#f92672">.&lt;/span>flatten())
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> &lt;span style="color:#f92672">not&lt;/span> update_deltas:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> &lt;span style="color:#66d9ef">None&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> update_norm &lt;span style="color:#f92672">=&lt;/span> torch&lt;span style="color:#f92672">.&lt;/span>linalg&lt;span style="color:#f92672">.&lt;/span>vector_norm(torch&lt;span style="color:#f92672">.&lt;/span>cat(update_deltas))&lt;span style="color:#f92672">.&lt;/span>item()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> param_norm &lt;span style="color:#f92672">=&lt;/span> torch&lt;span style="color:#f92672">.&lt;/span>linalg&lt;span style="color:#f92672">.&lt;/span>vector_norm(torch&lt;span style="color:#f92672">.&lt;/span>cat(param_flatcats))&lt;span style="color:#f92672">.&lt;/span>item()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> relative_update_norm &lt;span style="color:#f92672">=&lt;/span> update_norm &lt;span style="color:#f92672">/&lt;/span> (param_norm &lt;span style="color:#f92672">+&lt;/span> &lt;span style="color:#ae81ff">1e-12&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> result &lt;span style="color:#f92672">=&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#34;update_norm&amp;#34;&lt;/span>: update_norm,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#34;param_norm&amp;#34;&lt;/span>: param_norm,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#34;relative_update_norm&amp;#34;&lt;/span>: relative_update_norm,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> }
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># Effective LR ratio: shows actual step size relative to gradient&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> grad_norm_after_clip &lt;span style="color:#f92672">is&lt;/span> &lt;span style="color:#f92672">not&lt;/span> &lt;span style="color:#66d9ef">None&lt;/span> &lt;span style="color:#f92672">and&lt;/span> grad_norm_after_clip &lt;span style="color:#f92672">&amp;gt;&lt;/span> &lt;span style="color:#ae81ff">1e-12&lt;/span>:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> result[&lt;span style="color:#e6db74">&amp;#34;effective_lr_ratio&amp;#34;&lt;/span>] &lt;span style="color:#f92672">=&lt;/span> update_norm &lt;span style="color:#f92672">/&lt;/span> grad_norm_after_clip
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> result
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>old_params_cpu &lt;span style="color:#f92672">=&lt;/span> snapshot_params_to_cpu(model)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>grad_clip &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">1.5&lt;/span> &lt;span style="color:#75715e"># just an example&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>grad_norm_before &lt;span style="color:#f92672">=&lt;/span> torch&lt;span style="color:#f92672">.&lt;/span>nn&lt;span style="color:#f92672">.&lt;/span>utils&lt;span style="color:#f92672">.&lt;/span>clip_grad_norm_(
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> model&lt;span style="color:#f92672">.&lt;/span>parameters(), grad_clip
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>)&lt;span style="color:#f92672">.&lt;/span>item()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>grad_norm_after &lt;span style="color:#f92672">=&lt;/span> torch&lt;span style="color:#f92672">.&lt;/span>nn&lt;span style="color:#f92672">.&lt;/span>utils&lt;span style="color:#f92672">.&lt;/span>clip_grad_norm_(
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> model&lt;span style="color:#f92672">.&lt;/span>parameters(), float(&lt;span style="color:#e6db74">&amp;#39;inf&amp;#39;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>)&lt;span style="color:#f92672">.&lt;/span>item()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>grad_clip_ratio &lt;span style="color:#f92672">=&lt;/span> grad_norm_before &lt;span style="color:#f92672">/&lt;/span> grad_clip &lt;span style="color:#66d9ef">if&lt;/span> grad_clip &lt;span style="color:#f92672">&amp;gt;&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span> &lt;span style="color:#66d9ef">else&lt;/span> &lt;span style="color:#ae81ff">0.0&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># ... etc&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>optimizer&lt;span style="color:#f92672">.&lt;/span>step()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>optimizer&lt;span style="color:#f92672">.&lt;/span>zero_grad()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># .. etc&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>update_norms_result &lt;span style="color:#f92672">=&lt;/span> compute_update_norms(
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> model, old_params_cpu, grad_norm_after_clip&lt;span style="color:#f92672">=&lt;/span>grad_norm_after
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/td>&lt;/tr>&lt;/table>
&lt;/div>
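&lt;/div>&lt;p>And then ship these to your tracker. A minimal sketch of the logging call (the key names are just my convention; this assumes an active wandb run, a &lt;code>step&lt;/code> counter, and the variables from the snippet above):&lt;/p>
&lt;pre>&lt;code class="language-python">import wandb

metrics = {
    &amp;#34;grad_norm/total_before_clip&amp;#34;: grad_norm_before,
    &amp;#34;grad_norm/total_after_clip&amp;#34;: grad_norm_after,
    &amp;#34;grad_norm/clip_ratio&amp;#34;: grad_clip_ratio,
}
if update_norms_result is not None:
    # adds update_norm, param_norm, relative_update_norm, effective_lr_ratio
    metrics.update(update_norms_result)
wandb.log(metrics, step=step)
&lt;/code>&lt;/pre>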
&lt;h3 id="interpreting-param-norm-update-norms--effective-lr-ratio">Interpreting param norm, update norms &amp;amp; effective LR ratio&lt;/h3>
&lt;p>Reading these metrics together gives a complete picture of training dynamics beyond loss and gradients: &lt;em>where&lt;/em> the model is in parameter space, &lt;em>how fast&lt;/em> it&amp;rsquo;s moving, and how much the optimizer is amplifying or dampening the raw gradient signal.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Metric&lt;/th>
&lt;th>Range / trajectory&lt;/th>
&lt;th>Guidance&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Param norm $\left\lVert{P_{new}}\right\rVert_2$&lt;/td>
&lt;td>Steady, sub-linear growth&lt;/td>
&lt;td>Healthy. Growth rate should slow as LR decays.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;/td>
&lt;td>Exponential / super-linear growth&lt;/td>
&lt;td>Weights growing fast. Could mean you&amp;rsquo;re diverging. &lt;BR>&lt;BR>Generally here you&amp;rsquo;ll decrease LR or increase regularization, unless something egregious is going wrong in your network. In that case, fix it.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;/td>
&lt;td>Shrinking&lt;/td>
&lt;td>Underfitting? Check you aren&amp;rsquo;t regularizing too much (weight decay, etc)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;/td>
&lt;td>Flat while loss is decreasing&lt;/td>
&lt;td>Likely good. Probably later in training.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;/td>
&lt;td>Sudden jumps or drops&lt;/td>
&lt;td>Check per-module grad norm; param norm is mostly redundant to that signal in this case.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Relative update norm &lt;BR>$\left\lVert{U}\right\rVert_2$ / $\left\lVert{P_{new}}\right\rVert_2$&lt;/td>
&lt;td>≈ 1e-4 to 1e-3&lt;/td>
&lt;td>Healthy range for most architectures&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;/td>
&lt;td>&lt;code>&amp;gt;&amp;gt;&lt;/code> 1e-2&lt;/td>
&lt;td>Updates might be too large relative to params. Risk of instability.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;/td>
&lt;td>&lt;code>&amp;lt;&amp;lt;&lt;/code> 1e-6&lt;/td>
&lt;td>Updates are vanishingly small :/ learning has likely stalled&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;/td>
&lt;td>Rising late in training while loss is flat&lt;/td>
&lt;td>Optimizer may be overshooting a flat basin&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Effective LR ratio $\left\lVert{U}\right\rVert_2$ / $\left\lVert{G_{clipped}}\right\rVert_2$&lt;/td>
&lt;td>≈ nominal LR&lt;/td>
&lt;td>Your optimizer&amp;rsquo;s effective gradient multipliers are ~1.0, which happens early in training. Or for some reason you&amp;rsquo;re using vanilla SGD (why??)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;/td>
&lt;td>&lt;code>&amp;gt;&amp;gt;&lt;/code> nominal LR&lt;/td>
&lt;td>Your optimizer is amplifying gradients&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;/td>
&lt;td>&lt;code>&amp;lt;&amp;lt;&lt;/code> nominal LR&lt;/td>
&lt;td>Your optimizer is dampening gradients. &lt;BR>&lt;BR>It could be protecting you from oscillations in weight space, but I would refer back to grad norm, LR, and other ways to diagnose instability in this case.&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
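&lt;p>If you want the table above to work for you automatically, one cheap trick is to flag metrics that leave their healthy band. A hypothetical helper using the relative update norm thresholds above:&lt;/p>
&lt;pre>&lt;code class="language-python">from typing import Optional

def flag_relative_update_norm(relative_update_norm: float) -&amp;gt; Optional[str]:
    # thresholds from the table above; tune them for your architecture
    if relative_update_norm &amp;gt; 1e-2:
        return &amp;#34;updates large relative to params: risk of instability&amp;#34;
    if relative_update_norm &amp;lt; 1e-6:
        return &amp;#34;updates vanishingly small: learning may have stalled&amp;#34;
    return None
&lt;/code>&lt;/pre>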
&lt;h2 id="-metric-group-3-non-loss-test-metrics">📈 Metric group #3: Non-loss test metrics&lt;/h2>
&lt;p>This might seem obvious, but I recommend plotting your non-loss test metrics as well. These might be:&lt;/p>
&lt;ul>
&lt;li>Accuracy&lt;/li>
&lt;li>Precision / recall / F1 score&lt;/li>
&lt;li>FID score&lt;/li>
&lt;li>&amp;hellip; etc&lt;/li>
&lt;/ul>
&lt;p>The list goes on.&lt;/p>
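&lt;p>As a trivial example, batch accuracy for a classifier is nearly a one-liner (a sketch; your task&amp;rsquo;s metrics will differ):&lt;/p>
&lt;pre>&lt;code class="language-python">import torch

@torch.no_grad()
def batch_accuracy(logits: torch.Tensor, targets: torch.Tensor) -&amp;gt; float:
    # logits: (batch, num_classes), targets: (batch,) of class indices
    return (logits.argmax(dim=-1) == targets).float().mean().item()
&lt;/code>&lt;/pre>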
&lt;p>There are a number of reasons you might want these. After all, the whole point of training the model isn&amp;rsquo;t the loss value itself; it&amp;rsquo;s the actual outcomes the model enables!&lt;/p>
&lt;p>The other practical reason is that if you change your loss formulation midway through training or between runs, you need something objective to judge model performance by, in lieu of a comparable loss curve.&lt;/p>
&lt;p>Changing your loss formulation can change both the scale and the shape of your loss curve over the course of training.&lt;/p>
&lt;p>So yeah, duh. Do it.&lt;/p>
&lt;h2 id="-metric-group-4-loss-by-category">🗂️ Metric group #4: Loss by category&lt;/h2>
&lt;p>Another fairly obvious one, but if you can break out your average loss per batch, per epoch, or per test evaluation &lt;em>by the type of sample&lt;/em>, you might be able to find data quality or model parameterization issues.&lt;/p>
&lt;p>For example, for a language model you may have different types of queries or chat requests that the model struggles on.&lt;/p>
&lt;p>For us, in the music domain, we have found that different genres, stems, or even different bucketed BPM ranges give our models trouble.&lt;/p>
&lt;p>So if something is going haywire in a particular category, it can inspire you to do one of the healthiest things you can do in a model training project: &lt;strong>actually look at the data&lt;/strong>!&lt;/p>
&lt;p>The solutions for loss discrepancies between categories can range from:&lt;/p>
&lt;ul>
&lt;li>Correcting data quality issues in those categories&lt;/li>
&lt;li>Adjusting per-sample or per-category weights in some way to bias the model to perform better on those samples (note: do this at the data sampling time, not at loss calculation time if you can)&lt;/li>
&lt;li>Learning that those examples are harder than you thought, and accepting it!&lt;/li>
&lt;/ul>
&lt;blockquote>
&lt;p>⚠️ Note: if you want to compare loss by category you NEED to scale your loss so that, when this measurement is taken, each sample&amp;rsquo;s loss has the same weight, regardless of sample length (for sequence models) or label count. &lt;BR>&lt;BR> If you&amp;rsquo;re just getting less loss because some samples are shorter or have fewer labels, that&amp;rsquo;s not telling you anything useful about how hard the model finds one category of sample versus another.&lt;/p>
&lt;/blockquote>
&lt;p>One useful pattern for this is using torch&amp;rsquo;s &lt;code>reduction='none'&lt;/code> option when it is available:&lt;/p>
&lt;div class="highlight">&lt;div style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
&lt;table style="border-spacing:0;padding:0;margin:0;border:0;">&lt;tr>&lt;td style="vertical-align:top;padding:0;margin:0;border:0;">
&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 1
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 2
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 3
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 4
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 5
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 6
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 7
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 8
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 9
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">10
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">11
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">12
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">13
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">14
&lt;/span>&lt;/code>&lt;/pre>&lt;/td>
&lt;td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># step 1: calculate per-sample loss for your batch&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>my_cool_bce_loss &lt;span style="color:#f92672">=&lt;/span> F&lt;span style="color:#f92672">.&lt;/span>binary_cross_entropy_with_logits(
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> predictions, targets,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> reduction&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#39;none&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># step 2: measure raw loss, cut by category, etc&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># ...&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># step 3: reduction via .mean() to get the actual loss to backprop over &lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>loss &lt;span style="color:#f92672">=&lt;/span> my_cool_bce_loss&lt;span style="color:#f92672">.&lt;/span>mean()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># step 4: backprop!&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>loss&lt;span style="color:#f92672">.&lt;/span>backward()
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/td>&lt;/tr>&lt;/table>
&lt;/div>
&lt;/div>&lt;p>Again, remember to normalize loss for length &amp;amp; label count.&lt;/p>
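&lt;p>For a sequence model, that normalization might look like the following (a sketch, assuming &lt;code>predictions&lt;/code> and &lt;code>targets&lt;/code> are shaped &lt;code>(batch, time)&lt;/code> with a 0/1 padding &lt;code>mask&lt;/code> of the same shape):&lt;/p>
&lt;pre>&lt;code class="language-python">import torch.nn.functional as F

# per-token loss: (batch, time)
per_token = F.binary_cross_entropy_with_logits(
    predictions, targets, reduction=&amp;#39;none&amp;#39;
)

# zero out padding, then average over *valid* tokens only,
# so short and long samples carry equal weight per sample
per_sample = (per_token * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
&lt;/code>&lt;/pre>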
&lt;p>You may also not be able to usefully report per-batch per-category losses if the number of categories is high and you don&amp;rsquo;t encounter them all every batch. This requires accumulating and reporting these losses every &lt;code>M&lt;/code> training steps, every epoch, or every test loop. It&amp;rsquo;s up to you.&lt;/p>
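&lt;p>A minimal accumulator for that might look like this (a sketch; &lt;code>batch_categories&lt;/code> is hypothetical and would come from your dataset, and &lt;code>per_sample&lt;/code> is from the sketch above):&lt;/p>
&lt;pre>&lt;code class="language-python">from collections import defaultdict

cat_loss_sum = defaultdict(float)
cat_count = defaultdict(int)

# inside the training loop, after computing per-sample losses:
for sample_loss, category in zip(per_sample.tolist(), batch_categories):
    cat_loss_sum[category] += sample_loss
    cat_count[category] += 1

# every M steps (or every epoch / test loop), report and reset:
per_category_mean = {c: cat_loss_sum[c] / cat_count[c] for c in cat_count}
&lt;/code>&lt;/pre>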
&lt;p>Alright, we&amp;rsquo;ve covered them all!&lt;/p>
&lt;p>Let&amp;rsquo;s look at a rough sketch of our training loop with respect to all of these metrics.&lt;/p>
&lt;h2 id="-putting-it-all-together-the-learning-loop-sketch">🔄 Putting it all together: the learning loop sketch&lt;/h2>
&lt;p>An example of how all this might come together and be structured in a classic train/test loop:&lt;/p>
&lt;div class="highlight">&lt;div style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
&lt;table style="border-spacing:0;padding:0;margin:0;border:0;">&lt;tr>&lt;td style="vertical-align:top;padding:0;margin:0;border:0;">
&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 1
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 2
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 3
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 4
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 5
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 6
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 7
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 8
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 9
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">10
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">11
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">12
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">13
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">14
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">15
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">16
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">17
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">18
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">19
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">20
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">21
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">22
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">23
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">24
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">25
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">26
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">27
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">28
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">29
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">30
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">31
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">32
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">33
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">34
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">35
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">36
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">37
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">38
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">39
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">40
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">41
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">42
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">43
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">44
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">45
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">46
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">47
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">48
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">49
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">50
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">51
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">52
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">53
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">54
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">55
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">56
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">57
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">58
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">59
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">60
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">61
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">62
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">63
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">64
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">65
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">66
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">67
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">68
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">69
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">70
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">71
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">72
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">73
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">74
&lt;/span>&lt;/code>&lt;/pre>&lt;/td>
&lt;td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>step &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">1&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">for&lt;/span> epoch &lt;span style="color:#f92672">in&lt;/span> range(num_epochs):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># ── TRAIN ──────────────────────────────────────────────────&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> model&lt;span style="color:#f92672">.&lt;/span>train()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> epoch_loss_sum, epoch_steps &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">0.0&lt;/span>, &lt;span style="color:#ae81ff">0&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> accum_loss &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">0.0&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">for&lt;/span> batch_idx, batch &lt;span style="color:#f92672">in&lt;/span> enumerate(train_loader):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> pred &lt;span style="color:#f92672">=&lt;/span> model(batch)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> per_sample_losses &lt;span style="color:#f92672">=&lt;/span> compute_per_sample_losses(pred, batch)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> loss &lt;span style="color:#f92672">=&lt;/span> reduce_losses(per_sample_losses, loss_weights)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> (loss &lt;span style="color:#f92672">/&lt;/span> grad_accum_steps)&lt;span style="color:#f92672">.&lt;/span>backward()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> accum_loss &lt;span style="color:#f92672">+=&lt;/span> loss&lt;span style="color:#f92672">.&lt;/span>item()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># ── Optimizer step (every grad_accum_steps batches) ──&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> (batch_idx &lt;span style="color:#f92672">+&lt;/span> &lt;span style="color:#ae81ff">1&lt;/span>) &lt;span style="color:#f92672">%&lt;/span> grad_accum_steps &lt;span style="color:#f92672">==&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span>:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># Gradient norms (before clip)&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> grad_norms_per_module &lt;span style="color:#f92672">=&lt;/span> compute_grad_norm_per_module(model)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
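&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>            &lt;span style="color:#75715e"># note: clip_grad_norm_ clips in place and returns the total norm measured *before* clipping,&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>            &lt;span style="color:#75715e"># so a second call with max_norm=inf is a no-op that measures the post-clip norm&lt;/span>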
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> grad_norm_before &lt;span style="color:#f92672">=&lt;/span> clip_grad_norm_(params, max_norm)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> grad_norm_after &lt;span style="color:#f92672">=&lt;/span> clip_grad_norm_(params, float(&lt;span style="color:#e6db74">&amp;#39;inf&amp;#39;&lt;/span>))
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> grad_clip_ratio &lt;span style="color:#f92672">=&lt;/span> grad_norm_before &lt;span style="color:#f92672">/&lt;/span> max_norm
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> old_params &lt;span style="color:#f92672">=&lt;/span> snapshot_params_to_cpu(model)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> optimizer&lt;span style="color:#f92672">.&lt;/span>step()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> scheduler&lt;span style="color:#f92672">.&lt;/span>step()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> optimizer&lt;span style="color:#f92672">.&lt;/span>zero_grad()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># Update norms (after step)&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> update_norms &lt;span style="color:#f92672">=&lt;/span> compute_update_norms(
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> model, old_params, grad_norm_after_clip&lt;span style="color:#f92672">=&lt;/span>grad_norm_after
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> )
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> step_loss &lt;span style="color:#f92672">=&lt;/span> accum_loss &lt;span style="color:#f92672">/&lt;/span> grad_accum_steps
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> epoch_loss_sum &lt;span style="color:#f92672">+=&lt;/span> step_loss
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> epoch_steps &lt;span style="color:#f92672">+=&lt;/span> &lt;span style="color:#ae81ff">1&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> accum_loss &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">0.0&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> rank &lt;span style="color:#f92672">==&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span>:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> wandb&lt;span style="color:#f92672">.&lt;/span>log({
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#34;lr&amp;#34;&lt;/span>: scheduler&lt;span style="color:#f92672">.&lt;/span>get_last_lr()[&lt;span style="color:#ae81ff">0&lt;/span>],
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#34;train/loss&amp;#34;&lt;/span>: step_loss,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#34;grad_clip_ratio&amp;#34;&lt;/span>: grad_clip_ratio,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">**&lt;/span>grad_norms_per_module,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">**&lt;/span>update_norms,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> }, step&lt;span style="color:#f92672">=&lt;/span>step)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> step &lt;span style="color:#f92672">+=&lt;/span> &lt;span style="color:#ae81ff">1&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> rank &lt;span style="color:#f92672">==&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span>:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> wandb&lt;span style="color:#f92672">.&lt;/span>log({
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#34;train/loss_epoch&amp;#34;&lt;/span>: epoch_loss_sum &lt;span style="color:#f92672">/&lt;/span> epoch_steps,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> }, step&lt;span style="color:#f92672">=&lt;/span>step)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># ── TEST ───────────────────────────────────────────────────&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> model&lt;span style="color:#f92672">.&lt;/span>eval()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">with&lt;/span> torch&lt;span style="color:#f92672">.&lt;/span>no_grad():
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">for&lt;/span> batch &lt;span style="color:#f92672">in&lt;/span> test_loader:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> pred &lt;span style="color:#f92672">=&lt;/span> model(batch)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> test_losses &lt;span style="color:#f92672">=&lt;/span> compute_per_sample_losses(pred, batch)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> task_metrics &lt;span style="color:#f92672">=&lt;/span> compute_task_metrics(pred, batch)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># all_reduce test metrics across processes here (see above)&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> rank &lt;span style="color:#f92672">==&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span>:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> wandb&lt;span style="color:#f92672">.&lt;/span>log({
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#34;test/loss&amp;#34;&lt;/span>: avg_test_loss,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">**&lt;/span>per_category_test_losses,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">**&lt;/span>avg_task_metrics,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> }, step&lt;span style="color:#f92672">=&lt;/span>step)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> save_checkpoint(&lt;span style="color:#f92672">...&lt;/span>)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/td>&lt;/tr>&lt;/table>
&lt;/div>
&lt;/div>&lt;p>Or something like that. Every train/test loop will be different.&lt;/p>
&lt;h2 id="-summary">🏁 Summary&lt;/h2>
&lt;p>Instrumenting metrics for your run takes time up front, but it will save you far more time when things go wrong!&lt;/p>
&lt;img src="https://willdrevo.com/static/img/vjlab/cogs.jpg" alt="live concert setting">
&lt;figcaption>VJLab audio AI models driving visuals in a live, realtime setting&lt;/figcaption>
&lt;/figure>
&lt;p>This post is pretty specific, but I haven&amp;rsquo;t seen anyone else really write about it. So I hope this is helpful to the 20 people in the world that need it (ha!).&lt;/p>
&lt;p>At &lt;a href="https://vjlab.ai/">VJLab&lt;/a>, we train realtime (causal) audio models that understand and listen to music like a human does for use by visual artists in live concert settings.&lt;/p>
&lt;p>This means our models have to be very fast, robust to noise, and accurate.&lt;/p>
&lt;p>I&amp;rsquo;ll share some best practices we&amp;rsquo;ve found for setting up our environment and avoiding disaster.&lt;/p>
&lt;aside id="toc">
&lt;h4>Table of Contents&lt;/h4>
&lt;nav id="TableOfContents">
&lt;ul>
&lt;li>&lt;a href="#-our-development--training-environments">💻 Our development &amp;amp; training environments&lt;/a>&lt;/li>
&lt;li>&lt;a href="#what-is-causality-and-why-do-we-care">❓What is causality, and why do we care?&lt;/a>&lt;/li>
&lt;li>&lt;a href="#-the-danger-batch-vs-realtime">⚠️ The danger: Batch vs realtime&lt;/a>&lt;/li>
&lt;li>&lt;a href="#-timing-constraints">⏱️ Timing constraints&lt;/a>&lt;/li>
&lt;li>&lt;a href="#-examples-of-snags">😭 Examples of snags&lt;/a>&lt;/li>
&lt;li>&lt;a href="#-things-you-must-test">🧪 Things you MUST test&lt;/a>&lt;/li>
&lt;li>&lt;a href="#-platform-differences">𝍔 Platform differences&lt;/a>&lt;/li>
&lt;li>&lt;a href="#-a-few-other-tips">📓 A few other tips&lt;/a>&lt;/li>
&lt;li>&lt;a href="#-in-closing">🔊 In closing&lt;/a>&lt;/li>
&lt;/ul>
&lt;/nav>
&lt;/aside>
&lt;h2 id="-our-development--training-environments">💻 Our development &amp;amp; training environments&lt;/h2>
&lt;p>My development process looks like:&lt;/p>
&lt;ol>
&lt;li>Develop locally on MacBook&lt;/li>
&lt;li>Do tiny training tests/runs on my Ubuntu machine w/ 4090 card (if needed)&lt;/li>
&lt;li>Training run on cheap cloud GPU machines&lt;/li>
&lt;li>Larger training in cloud (moar GPUs)&lt;/li>
&lt;/ol>
&lt;p>This is, if nothing else, a way to keep things super economical! We are 100% bootstrapped and don&amp;rsquo;t have VC money to burn on GPUs.&lt;/p>
&lt;figure>
&lt;img src="https://willdrevo.com/static/img/audio_ai_envs/economical_scaling.png" alt="scaling environments">
&lt;figcaption>How Nano Banana pictures our different environments&lt;/figcaption>
&lt;/figure>
&lt;p>Luckily our models are not huge (some run in realtime on CPU), but doing data transformations, scraping, and ablations can really add up if you aren&amp;rsquo;t careful.&lt;/p>
&lt;p>Most of our models run in the 5-20 ms range and are generally under 10M parameters, though we have a couple of beefier exceptions.&lt;/p>
&lt;p>I also can&amp;rsquo;t recommend Cursor&amp;rsquo;s Remote SSH feature highly enough. For the cloud-based environments, being able to fire up a coding LLM to puzzle through NCCL errors or whatever derailed your latest training run is absolutely priceless.&lt;/p>
&lt;h2 id="what-is-causality-and-why-do-we-care">❓What is causality, and why do we care?&lt;/h2>
&lt;p>Causal models exist in time, at a time &lt;code>t&lt;/code>. Quite simply, they use only the past data they&amp;rsquo;ve seen (any $x_{t_i}$ with $t_i \le t$) and no data from the future (no $x_{t_i}$ with $t_i &amp;gt; t$).&lt;/p>
&lt;p>So if you need a model to operate in realtime, you aren&amp;rsquo;t allowed to &amp;ldquo;cheat&amp;rdquo; by looking at information from the future.&lt;/p>
&lt;p>The difference is stark: VJLab&amp;rsquo;s &lt;a href="https://vjlab.ai/p/audioslice-realtime-stem-splitter-for-touchdesigner/">realtime stem splitter&lt;/a> operating at ~90Hz is operating in a much different regime than an offline splitter like &lt;a href="https://huggingface.co/spaces/abidlabs/music-separation">Demucs&lt;/a>, which has access to the entire track and can take minutes to respond.&lt;/p>
&lt;p>&lt;em>The truth is that most pretrained models are either for use in offline/batch situations, or simply aren&amp;rsquo;t performant enough for realtime audio, especially on CPU.&lt;/em>&lt;/p>
&lt;p>Thus we almost exclusively adapt or train new architectures from scratch.&lt;/p>
&lt;p>But training your own causal models from scratch or adapting batch models comes with risks.&lt;/p>
&lt;h2 id="-the-danger-batch-vs-realtime">⚠️ The danger: Batch vs realtime&lt;/h2>
&lt;p>One of the banes of your existence if you train lots of these models will be causality. If your model has to operate like a human does (and cannot see the upcoming audio offline), you have to run inference and respond in time. Without seeing the future.&lt;/p>
&lt;p>This becomes tricky when you want to train such a model, because you will have to train the model in batch (unless you have infinite patience and also infinite money).&lt;/p>
&lt;p>This creates a dangerous situation where your training necessarily differs from your serving.&lt;/p>
&lt;p>I have trained models that looked incredible performance-wise at train/test time in a batch setting, but fell apart once I fixed a causality bug or we finally deployed them to a live inference setting. It&amp;rsquo;s an upsetting experience.&lt;/p>
&lt;p>Remember: if performance looks too good to be true, it probably is.&lt;/p>
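&lt;p>One cheap sanity check I like: corrupt the &amp;ldquo;future&amp;rdquo; part of an input and verify the outputs before that point don&amp;rsquo;t change. A minimal sketch &amp;ndash; the &lt;code>model&lt;/code>, shapes, and hop size here are all hypothetical:&lt;/p>
&lt;pre>&lt;code class="language-python">import torch

# causality probe (sketch): if the model is truly causal, randomizing
# samples after time t must not change any output frame before t
x = torch.randn(1, 16000)   # 1 second of audio @ 16khz (hypothetical)
t = 8000                    # probe point, in samples

y_ref = model(x)

x_corrupt = x.clone()
x_corrupt[..., t:] = torch.randn_like(x_corrupt[..., t:])
y_corrupt = model(x_corrupt)

# map the sample index to an output frame index for your hop size (hypothetical)
frame_t = t // 512
assert torch.allclose(y_ref[..., :frame_t], y_corrupt[..., :frame_t], atol=1e-5)
&lt;/code>&lt;/pre>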
&lt;h2 id="-timing-constraints">⏱️ Timing constraints&lt;/h2>
&lt;p>Not only is causality tricky, but simple timing performance can be too.&lt;/p>
&lt;p>If your model operates on new buffers of 512 samples, sampled at 44.1kHz, guess what, you can NEVER take longer than &lt;code>512 samples / 44100 Hz ~= 11.6ms&lt;/code> to respond! In fact a good rule of thumb is to keep your full buffer processing time to half your budget (ie: &lt;code>~5.8 ms&lt;/code>).&lt;/p>
&lt;p>Note that this time budget includes your forward pass and whatever pre/post-processing in C++ you need to do.&lt;/p>
&lt;p>Even if your model has a lookahead period (ie: the model purposefully outputs values lagged slightly into the past), you still have a latency budget because new frames will just keep coming.&lt;/p>
&lt;p>This is more vital on-device where you are pulling from an audio driver buffer, but in the cloud you don&amp;rsquo;t want to fall behind either.&lt;/p>
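&lt;p>The arithmetic is trivial, but worth writing down once with your own buffer size and sample rate (numbers below match the example above):&lt;/p>
&lt;pre>&lt;code class="language-python"># latency budget for a 512-sample buffer at 44.1khz
buffer_samples = 512
sample_rate_hz = 44100

hard_budget_ms = 1000 * buffer_samples / sample_rate_hz  # ~11.6 ms between buffers
target_ms = hard_budget_ms / 2                           # rule of thumb: stay under ~5.8 ms
&lt;/code>&lt;/pre>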
&lt;h2 id="-examples-of-snags">😭 Examples of snags&lt;/h2>
&lt;p>Can it really be that bad?&lt;/p>
&lt;p>What kinds of things might befall me, you might ask?&lt;/p>
&lt;p>A few fun examples that definitely have never, ever happened to me:&lt;/p>
&lt;ol>
&lt;li>A bug in your training script reveals future labels to earlier frames because your convolutions&amp;rsquo; receptive field was large enough to reach into future frames&lt;/li>
&lt;li>Your mean pooling operation aggregates over the time dimension (and is therefore not causal)&lt;/li>
&lt;li>You realize your model trains in batch on precomputed mels, but your streaming model has to compute them live (which puts you over your latency budget)&lt;/li>
&lt;li>Because ONNX doesn&amp;rsquo;t support the FFT, you hand-rolled your own convolutional FFT, but realized it&amp;rsquo;s too slow in realtime at the frame size you&amp;rsquo;ve chosen&lt;/li>
&lt;li>The SOTA pretrained model you blindly fine-tuned, which is supposedly causal and realtime according to the paper authors &amp;hellip; totally isn&amp;rsquo;t. You have to fix the architecture and completely retrain&lt;/li>
&lt;/ol>
&lt;p>In short, a lot can go haywire if you aren&amp;rsquo;t careful.&lt;/p>
&lt;p>&lt;strong>The really awful part is: if you don&amp;rsquo;t realize a mistake like this until after you&amp;rsquo;ve finished your 3 day long training run, then you&amp;rsquo;ve just literally burned money.&lt;/strong>&lt;/p>
&lt;p>To save yourself an immense amount of time, money, and sanity, I highly recommend you have a consistent test for your dev &amp;amp; training environments.&lt;/p>
&lt;h2 id="-things-you-must-test">🧪 Things you MUST test&lt;/h2>
&lt;p>You need an integration test of your model&amp;rsquo;s entire lifecycle:&lt;/p>
&lt;!-- &lt;figure>
&lt;img src="https://willdrevo.com/static/img/audio_ai_envs/integration_testing.png" alt="integration testing workflow">
&lt;figcaption>Yet another hilarious but very visually pleasing diagram of our testing flow from Nano Banana&lt;/figcaption>
&lt;/figure> -->
&lt;p>Yes, really. Even if you&amp;rsquo;re just a researcher.&lt;/p>
&lt;p>Even if your idea of MLOps is SSHing into your beautifully managed Slurm cluster with Weka FS access and running a script with &lt;code>accelerate&lt;/code>.&lt;/p>
&lt;p>Our integration test runs in any environment (MacBook, Linux single GPU, Linux multi-GPU), in this order:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Model latency test&lt;/strong>
&lt;ul>
&lt;li>Runs the batch model against a batch_size=1 input&lt;/li>
&lt;li>Ensures the non-accelerated Python version is close enough to the latency budget&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Generating training dataset/metadata, if applicable&lt;/strong>
&lt;ul>
&lt;li>Only tiny subset of data&lt;/li>
&lt;li>Generate sample outputs of data augmentation and labels, especially if your outputs are subjective and require a human sanity check&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Training + checkpointing&lt;/strong>
&lt;ul>
&lt;li>In batch, of course&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Loading from checkpoint + resuming&lt;/strong>
&lt;ul>
&lt;li>Can also add loading older checkpoints if backwards compatibility is desired&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Exporting model to accelerated format (ie: TorchScript, ONNX, or TensorRT)&lt;/strong>&lt;/li>
&lt;li>&lt;strong>Running batch vs online equivalence test&lt;/strong>
&lt;ul>
&lt;li>&lt;em>The MOST important step&lt;/em>&lt;/li>
&lt;li>Match the outputs of your batch running alongside your realtime (streaming) accelerated model&lt;/li>
&lt;li>If you use mels in training and audio in realtime, yes, you must test the realtime with audio and do the mel transforms. &lt;strong>Don&amp;rsquo;t be lazy!&lt;/strong>&lt;/li>
&lt;li>Ensure that the output is the same to a tolerance, ie: 1e-2 or whatever is necessary for your output domain&lt;/li>
&lt;li>Keep in mind the acceleration process will often introduce floating point or numerical differences, and that&amp;rsquo;s okay&lt;/li>
&lt;li>Outputting visual or auditory examples that can be manually inspected is really helpful&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
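&lt;p>To make step 6 concrete, here&amp;rsquo;s a minimal sketch of the batch vs streaming equivalence check. The &lt;code>batch_model&lt;/code> and &lt;code>streaming_model&lt;/code> interfaces are hypothetical (yours will differ), but the shape of the test is the point:&lt;/p>
&lt;pre>&lt;code class="language-python">import torch

def check_batch_vs_streaming(batch_model, streaming_model, audio, buffer_size, tol=1e-2):
    # hypothetical interfaces: the batch model sees the whole clip at once,
    # the streaming model consumes one buffer at a time and hands state back in
    with torch.no_grad():
        batch_out = batch_model(audio.unsqueeze(0))[0]

        outs, state = [], streaming_model.init_state()
        for start in range(0, audio.shape[-1], buffer_size):
            buf = audio[..., start:start + buffer_size]
            out, state = streaming_model(buf.unsqueeze(0), state)
            outs.append(out[0])
        stream_out = torch.cat(outs, dim=-1)

    # export/acceleration will introduce small numerical differences; pick a
    # tolerance that is meaningful for your output domain
    assert torch.allclose(batch_out, stream_out, atol=tol)
&lt;/code>&lt;/pre>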
&lt;p>All of this logs to Weights &amp;amp; Biases and reports back a link where you can check the results.&lt;/p>
&lt;p>And trust me, if you can run it each time your node starts up, or as a pre-commit hook, you will thank me later. Or your manager will.&lt;/p>
&lt;p>And in the age of coding agents, there really is no excuse not to ship this testing code, even if it&amp;rsquo;s quite a few LOC.&lt;/p>
&lt;p>You could literally feed this post in as input and probably get a decent starting point!&lt;/p>
&lt;h2 id="-platform-differences">𝍔 Platform differences&lt;/h2>
&lt;p>One obvious callout is that you won&amp;rsquo;t be able (or need) to run every step the same way in every environment.&lt;/p>
&lt;p>Some example differences:&lt;/p>
&lt;ul>
&lt;li>Local development machines may run smaller toy versions of the model due to RAM, VRAM, or even MPS-specific constraints&lt;/li>
&lt;li>Exporting accelerators depends on the platform: exporting your torch model to TensorRT won&amp;rsquo;t happen on Mac OS X, for example&lt;/li>
&lt;li>Environments with different compute scales: running single GPU (&lt;code>python train.py&lt;/code>) vs multi-GPU (&lt;code>torchrun&lt;/code>) vs multi-instance-multi-GPU (&lt;code>torchrun&lt;/code>, &lt;code>ray&lt;/code>, etc)&lt;/li>
&lt;/ul>
&lt;p>None of this is particularly surprising or revolutionary.&lt;/p>
&lt;h2 id="-a-few-other-tips">📓 A few other tips&lt;/h2>
&lt;ol>
&lt;li>&lt;strong>Unify your training and realtime model&lt;/strong>
&lt;ul>
&lt;li>Do this by keeping input tensors in batch format at all points in the graph&lt;/li>
&lt;li>This allows you to make your realtime (exportable) torch module a simple wrapper around the batch training model, where you set batch_size=1 and also handle state input/output (see the sketch after this list)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Think about stateless inference&lt;/strong>
&lt;ul>
&lt;li>Remember most all accelerated model formats and serving techniques are stateless&lt;/li>
&lt;li>You&amp;rsquo;ll need to hand state back in manually - previous mel frames, KV caches, LSTM hidden states, etc - since your model can&amp;rsquo;t use internal logic to update state.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Just ask&lt;/strong>
&lt;ul>
&lt;li>Asking a top-tier coding agent to try to poke holes in your testing strategy or model architecture to find causality issues ahead of time is well worth your money and effort, even if the true positive rate is 10-20%.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Avoid &lt;code>BatchNorm&lt;/code>!&lt;/strong>
&lt;ul>
&lt;li>&lt;code>LayerNorm&lt;/code>, &lt;code>GroupNorm&lt;/code>, or &lt;code>InstanceNorm&lt;/code> are your (causal) friends!&lt;/li>
&lt;li>If your batch contains temporal frames, &lt;code>BatchNorm&lt;/code> technically &amp;ldquo;cheats&amp;rdquo; in training: it computes mean/variance stats across the whole batch, so future frames influence how inputs in earlier frames get normalized
&lt;ul>
&lt;li>However. This violation of causality isn&amp;rsquo;t actually terrible for deploy-time inference, per se. This is because in a model frozen for inference (same as &lt;code>.eval()&lt;/code> mode), the mean/variance stored in the &lt;code>BatchNorm&lt;/code> op are frozen.&lt;/li>
&lt;li>So your model will work just fine in production! But it will derail you when you run your streaming test to verify batch and streaming are the same, because they won&amp;rsquo;t be!&lt;/li>
&lt;li>And if you write off the difference as &amp;ldquo;oh that&amp;rsquo;s just &lt;code>BatchNorm&lt;/code>, let&amp;rsquo;s ignore the batch vs streaming discrepancy&amp;rdquo;, you might miss a real causality issue. This is the true danger of &lt;code>BatchNorm&lt;/code>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Start with a single script&lt;/strong> 😱
&lt;ul>
&lt;li>Sometimes in the research phase I will keep the entire new model in a single &lt;code>train.py&lt;/code> as long as I can. &lt;em>Horrendous, I know.&lt;/em>&lt;/li>
&lt;li>Coding LLMs seem to do quite well with this, as a bonus&lt;/li>
&lt;li>Anything re-usable I factor out as soon as I can + add a unit test, so other models in future can benefit&lt;/li>
&lt;li>Once the model is working end to end from train to accelerated realtime, then I move models into proper Python modules for reusability, class composition, and so on&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>If your model needs an FFT during inference, think carefully about how you train&lt;/strong>
&lt;ul>
&lt;li>For example, ONNX doesn&amp;rsquo;t support &lt;code>torch&lt;/code>&amp;rsquo;s FFT or iFFT operation&lt;/li>
&lt;li>The platform you deploy to (OS X, Ubuntu, Windows) will determine the fastest way to compute the FFT, but beware, not all FFT routines have the same scaling. For this reason, we usually choose &lt;code>libtorch&lt;/code>, which supports &lt;code>torch&lt;/code>&amp;rsquo;s FFT routines&lt;/li>
&lt;li>You should always try to transform your model to an accelerated format/method that keeps your train/deploy equivalence intact or you will run into problems&lt;/li>
&lt;li>Some FFT libraries, while fast, are not licensed well for commercial use&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
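&lt;p>For tips 1 and 2, here&amp;rsquo;s roughly what that wrapper can look like. A sketch only &amp;ndash; the &lt;code>batch_model&lt;/code> interface and the state contents (here just a window of past input) are hypothetical:&lt;/p>
&lt;pre>&lt;code class="language-python">import torch

class StreamingWrapper(torch.nn.Module):
    # wraps a batch-format training model for stateless, exportable streaming
    # inference: all state goes in and comes back out explicitly
    def __init__(self, batch_model):
        super().__init__()
        self.batch_model = batch_model

    def forward(self, frame, prev_frames):
        # frame: (1, frame_len) new audio; prev_frames: (1, context_len) past audio
        context = torch.cat([prev_frames, frame], dim=-1)
        y = self.batch_model(context)               # batch_size=1 throughout
        new_prev = context[..., frame.shape[-1]:]   # slide the context window
        return y, new_prev
&lt;/code>&lt;/pre>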
&lt;h2 id="-in-closing">🔊 In closing&lt;/h2>
&lt;p>Truly realtime audio is tricky! And often your first question should be: does this even need to be realtime?&lt;/p>
&lt;p>Many features you might imagine can simply be computed quickly (but in batch), saving you the headache.&lt;/p>
&lt;p>But when you do truly need it, make sure to keep your eyes open for issues like these, and put a strong integration testing framework in place to prevent you from wasting time and money.&lt;/p>
&lt;p>Happy training &amp;amp; testing :)&lt;/p></description></item><item><title>Introducing: a Musical Mel Transform</title><link>https://willdrevo.com/2025/09/09/introducing-a-musical-mel-transform-in-pytorch/</link><pubDate>Tue, 09 Sep 2025 22:54:15 -0700</pubDate><guid>https://willdrevo.com/2025/09/09/introducing-a-musical-mel-transform-in-pytorch/</guid><description>&lt;figure>
&lt;img src="https://willdrevo.com/static/img/musical_mel/lowest_filters.png" alt="" width=800>
&lt;figcaption>&lt;/figcaption>
&lt;/figure>
&lt;p>I&amp;rsquo;m open sourcing a useful tool in our realtime audio AI toolbox here at &lt;a href="https://vjlab.ai/">VJLab&lt;/a>, a &lt;a href="https://github.com/worldveil/musical_mel_transform_torch">Musical mel transform&lt;/a>.&lt;/p>
&lt;p>It&amp;rsquo;s written in PyTorch and can be made ONNX-compatible with a convolutional FFT (with &lt;code>use_conv_fft=True&lt;/code>).&lt;/p>
&lt;p>If you&amp;rsquo;ve ever wanted audio features that directly represent semitones (or quarter tones!) this is the package for you.&lt;/p>
&lt;a href="https://github.com/worldveil/musical_mel_transform_torch">&lt;img src="https://gh-card.dev/repos/worldveil/musical_mel_transform_torch.svg">&lt;/a>
&lt;h3 id="why-have-a-mel-transform-centered-on-musical-notes">Why have a mel transform centered on musical notes?&lt;/h3>
&lt;p>In general, the mel transform has the following benefits:&lt;/p>
&lt;ul>
&lt;li>Better featurization for perceptually relevant frequencies for human ears&lt;/li>
&lt;li>Dimensionality reduction&lt;/li>
&lt;li>Some noise robustness (since mel transforms average or smooth over multiple FFT bins)&lt;/li>
&lt;/ul>
&lt;p>And what I&amp;rsquo;m calling a &amp;ldquo;musical&amp;rdquo; mel transform, where the mel bins are aligned to pitch centers, has additional advantages if:&lt;/p>
&lt;ul>
&lt;li>Your task is transcription or musical note related&lt;/li>
&lt;li>Your usecase is realtime/speed-critical and you care about low-end discrimination (vs say, a CQT that would do well on low frequencies but is very slow)&lt;/li>
&lt;li>You&amp;rsquo;re comparing against a completely learned filterbank, or that approach isn&amp;rsquo;t working&lt;/li>
&lt;/ul>
&lt;p>Personally I have found this &lt;code>MusicalMelTransform&lt;/code> beats raw FFTs and standard mels for realtime usecases. The package also has an option &lt;code>learnable_weights=&amp;quot;fft&amp;quot;&lt;/code> to add learnable parameters to reweight the incoming FFT bins for loudness, which is important.&lt;/p>
&lt;p>The default arguments convert the FFT magnitudes to power (&lt;code>power: int = 2&lt;/code>) and then to a dB scale (&lt;code>to_db: bool = True&lt;/code>) as well, which is common in audio AI frontend feature extraction.&lt;/p>
&lt;p>TL;DR - if you&amp;rsquo;re working with music in your AI usecase, then having features that map directly to musical notes can sometimes help with performance!&lt;/p>
&lt;h3 id="how-does-it-work">How does it work?&lt;/h3>
&lt;p>Mel scale is just a mapping of FFT bins -&amp;gt; new bins. So each mel bin is just a weighted sum of the linearly-spaced FFT bins. That&amp;rsquo;s it!&lt;/p>
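&lt;p>In code, that&amp;rsquo;s one matrix multiply. A toy sketch (shapes are illustrative, not the repo&amp;rsquo;s API):&lt;/p>
&lt;pre>&lt;code class="language-python">import torch

n_fft_bins, n_mels = 1025, 128           # e.g. a 2048-point FFT
fft_mag = torch.rand(1, n_fft_bins)      # FFT magnitudes for one frame
fbank = torch.rand(n_mels, n_fft_bins)   # each row: weights over the FFT bins

mel = fft_mag @ fbank.T                  # (1, n_mels) mel features
&lt;/code>&lt;/pre>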
&lt;p>This code:&lt;/p>
&lt;ol>
&lt;li>Adds some adaptive widening (with &lt;code>adaptive=True&lt;/code>) that interpolates weighted combinations of FFT bins to make pitches discernible at pitch centers&lt;/li>
&lt;li>Gives a configurable way to control the number of high frequency features (with &lt;code>passthrough&lt;/code> arguments)&lt;/li>
&lt;li>Provides an optional ONNX compatible FFT operator&lt;/li>
&lt;/ol>
&lt;p>You can also narrow or widen your tone granularity &amp;ndash; semi- or quarter-tones are just a parameter change:&lt;/p>
&lt;div class="highlight">&lt;div style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
&lt;table style="border-spacing:0;padding:0;margin:0;border:0;">&lt;tr>
&lt;td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># `interval` is the &amp;#34;number of semitones&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>chromatic_transform &lt;span style="color:#f92672">=&lt;/span> MusicalMelTransform(interval&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">1.0&lt;/span>) &lt;span style="color:#75715e"># semitone scale&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>quarter_tone_transform &lt;span style="color:#f92672">=&lt;/span> MusicalMelTransform(interval&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">0.5&lt;/span>) &lt;span style="color:#75715e"># quarter tone scale&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/td>&lt;/tr>&lt;/table>
&lt;/div>
&lt;/div>&lt;h3 id="how-does-it-compare-to-other-options">How does it compare to other options?&lt;/h3>
&lt;p>Here&amp;rsquo;s a quick comparison between:&lt;/p>
&lt;ol>
&lt;li>Traditional linearly-spaced FFT&lt;/li>
&lt;li>&lt;code>torchaudio&lt;/code> mel scale transform&lt;/li>
&lt;li>MusicalMelTransform (this repo)&lt;/li>
&lt;/ol>
&lt;p>I have constrained the two mel transforms (2 &amp;amp; 3) to have the same dimensionality, and with &lt;code>f_max&lt;/code> at 16khz to make the comparison fair:&lt;/p>
&lt;figure>
&lt;img src="https://willdrevo.com/static/img/musical_mel/specs/fft.png" alt="" width=800>
&lt;figcaption>&lt;/figcaption>
&lt;/figure>
&lt;figure>
&lt;img src="https://willdrevo.com/static/img/musical_mel/specs/torchaudio_mel.png" alt="" width=800>
&lt;figcaption>&lt;/figcaption>
&lt;/figure>
&lt;figure>
&lt;img src="https://willdrevo.com/static/img/musical_mel/specs/musical_mel.png" alt="" width=800>
&lt;figcaption>&lt;/figcaption>
&lt;/figure>
&lt;p>As you can see, especially in the low frequencies, the resolution of MusicalMelTransform is better. This is great for music, and especially for low-frequency heavy music like today&amp;rsquo;s pop and electronic music. The graph here shows a kick pattern, typical in house or techno music.&lt;/p>
&lt;p>If we pick a number of low-end sub notes and plot the corresponding &amp;ldquo;filters&amp;rdquo; from the &lt;code>MusicalMelTransform&lt;/code> you can see how this works more concretely:&lt;/p>
&lt;figure>
&lt;img src="https://willdrevo.com/static/img/musical_mel/low_freq_filters.png" alt="" width=800>
&lt;figcaption>&lt;/figcaption>
&lt;/figure>
&lt;p>Low notes are impossibly close to each other, especially under 100hz, but that&amp;rsquo;s life (unless you can stomach &lt;a href="https://dsp.stackexchange.com/a/46657/">the speed of a CQT transform&lt;/a>). This package tries to cleverly interpolate FFT bins to mel pitch center bins so that lower frequencies are &amp;ldquo;discernible&amp;rdquo; from each other. But keep in mind we only have what the humble FFT offers us! We are just interpolating.&lt;/p>
&lt;p>Contrast this to a normal FFT. The FFT spaces features linearly, so at the top of the frequency range we end up with many, many features that aren&amp;rsquo;t as musically relevant.&lt;/p>
&lt;p>To illustrate, let&amp;rsquo;s compare the resulting features for different transforms across different musically-relevant frequency ranges so we can see how different transforms vary:&lt;/p>
&lt;figure>
&lt;img src="https://willdrevo.com/static/img/musical_mel/freq_bin_distribution.png" alt="" width=800>
&lt;figcaption>&lt;/figcaption>
&lt;/figure>
&lt;p>As you can see:&lt;/p>
&lt;ul>
&lt;li>The vanilla FFT has a huge number of features, most of which are on veryyy high frequencies &amp;gt;6khz, which is non-ideal&lt;/li>
&lt;li>Under 150hz, where low or sub-&amp;ldquo;bass&amp;rdquo; lives, &lt;code>MusicalMelTransform&lt;/code> smoothly interpolates, giving a model better features to work with&lt;/li>
&lt;li>Under 500hz, the &lt;code>MusicalMelTransform&lt;/code> still has the best coverage &amp;ndash; where most all the bass, root notes, and fundamental frequencies reside&lt;/li>
&lt;li>For a transform with the exact same number of features, the torchaudio transform has ~1.5x as many features from 1khz and up&lt;/li>
&lt;li>But if we&amp;rsquo;re willing to spend a few more features, an optimized &lt;code>MusicalMelTransform&lt;/code> with passthrough @ 5khz to let the FFT bins come through &amp;ldquo;covers&amp;rdquo; the torchaudio mel transform pretty much everywhere! So we can (except for the 1-3khz band) have our cake and eat it too.&lt;/li>
&lt;/ul>
&lt;h3 id="-warning-of-non-magic-">⚠️ Warning of non-magic ⚠️&lt;/h3>
&lt;p>It&amp;rsquo;s important to remember all mel features are derivative of the FFT. If you&amp;rsquo;re working with a small FFT of, like 128 or whatever, this package won&amp;rsquo;t work miracles!&lt;/p>
&lt;p>Your resolution on low end will still be crap.&lt;/p>
&lt;p>I wouldn&amp;rsquo;t use this package below FFT size of 512, tbh. But by cleverly assigning and interpolating those FFT bins you do have, this package is a way to &amp;ldquo;stretch&amp;rdquo; the resolution you do have to make discrimination on the low end easier.&lt;/p>
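&lt;p>A quick back-of-envelope shows why:&lt;/p>
&lt;pre>&lt;code class="language-python"># FFT bin spacing vs semitone spacing down low (44.1khz, 2048-point FFT)
bin_spacing_hz = 44100 / 2048                   # ~21.5 hz between FFT bins

a1_hz = 55.0                                    # the note A1
semitone_gap_hz = a1_hz * (2 ** (1 / 12) - 1)   # ~3.3 hz to the next semitone
# several neighboring low notes land inside a single FFT bin
&lt;/code>&lt;/pre>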
&lt;p>The main benefit is simply that all the features you have are, by definition, musically relevant.&lt;/p>
&lt;h3 id="characteristics-of-mel-transforms-and-some-helpful-tweaks-to-make">Characteristics of mel transforms, and some helpful tweaks to make&lt;/h3>
&lt;p>Here are some plots of mel bins (the x-axis dots + colored lines) as composed of FFT bin centers (the vertical grey lines) as we move up in frequency. We&amp;rsquo;ll talk through some implications.&lt;/p>
&lt;p>If we zoom in to the first (very lowest) filters on &lt;code>MusicalMelTransform&lt;/code> @ 2048 FFT size, 44.1khz you can see how related the lowest filters are. Because the FFT bins themselves are ~20hz apart, the mel bins below are just sliiightly different linear combinations of 2-3 low bins:&lt;/p>
&lt;figure>
&lt;img src="https://willdrevo.com/static/img/musical_mel/lowest_filters.png" alt="" width=800>
&lt;figcaption>&lt;/figcaption>
&lt;/figure>
&lt;p>The situation, of course, gets much better as we move up in frequency to even 400-800Hz range:&lt;/p>
&lt;figure>
&lt;img src="https://willdrevo.com/static/img/musical_mel/400_800hz_filters.png" alt="" width=800>
&lt;figcaption>&lt;/figcaption>
&lt;/figure>
&lt;p>And just as with any mel scale, once we get up to the really high frequencies (8th octave), the mels:&lt;/p>
&lt;ol>
&lt;li>Span multiple bins&lt;/li>
&lt;li>Ignore bins halfway between mel (pitch) centers&lt;/li>
&lt;/ol>
&lt;figure>
&lt;img src="https://willdrevo.com/static/img/musical_mel/high_filters.png" alt="" width=800>
&lt;figcaption>&lt;/figcaption>
&lt;/figure>
&lt;p>For reference, the top note on an 88-key piano is C8 &amp;ndash; these frequencies are all above that! (unless you have a &lt;a href="https://en.wikipedia.org/wiki/B%C3%B6sendorfer">Bösendorfer&lt;/a>)&lt;/p>
&lt;p>These mostly-ignored bins between filters are usually fine, since at such high frequencies we are generally hearing harmonics, which cluster in neighborhoods around each other at harmonic intervals. So throwing out much of the contribution of a few bins is less important.&lt;/p>
&lt;p>But as the frequencies continue the gaps get larger. And if some of that information is important (or you&amp;rsquo;d rather just pick an arbitrary point to have higher resolution than mels!), you can use &lt;code>MusicalMelTransform&lt;/code>&amp;rsquo;s &lt;code>passthrough_cutoff_hz&lt;/code> argument.&lt;/p>
&lt;p>Here I show what happens using &lt;code>passthrough_cutoff_hz=5000&lt;/code> and &lt;code>passthrough_grouping_size=3&lt;/code>. This effectively means, &amp;ldquo;after 5khz, don&amp;rsquo;t compute mel bins, just pass through the original FFT bins, grouping every 3 bins together&amp;rdquo;. This is the result:&lt;/p>
&lt;figure>
&lt;img src="https://willdrevo.com/static/img/musical_mel/passthrough_5khz_3_bins.png" alt="" width=800>
&lt;figcaption>&lt;/figcaption>
&lt;/figure>
&lt;p>Here you can see that after 5khz, we simply start grouping every three consecutive FFT bins into a mel bin. While it depends on your cutoff, generally the higher you set &lt;code>passthrough_cutoff_hz&lt;/code>, the larger your &lt;code>passthrough_grouping_size&lt;/code> should be.&lt;/p>
&lt;p>And of course these passthrough bins are no longer directly centered on musical notes.&lt;/p>
&lt;h3 id="scaling--normalization">Scaling &amp;amp; normalization&lt;/h3>
&lt;p>You will also notice that the magnitudes of each FFT bin going into the mel bins get much smaller than 1.0 as we climb frequencies. This is because pitches are spread across many more bins at high frequencies, and the plots have the &lt;code>norm=True&lt;/code> parameter set, which normalizes each filter to a total weight of 1.&lt;/p>
&lt;p>Due to all this rescaling, I suggest using &lt;code>learnable_weights=&amp;quot;fft&amp;quot;&lt;/code> as this inserts a vector of learnable parameters that helps you scale the original FFT magnitudes (or power, depending on your setting for &lt;code>power&lt;/code>) for your usecase. You probably want to have &lt;code>norm=False&lt;/code> in this case.&lt;/p>
&lt;p>Otherwise the &lt;code>MusicalMelTransform&lt;/code> has no learnable weights.&lt;/p>
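&lt;p>Putting the pieces from this post together, a configuration might look like the sketch below. The argument names are the ones discussed above; check the repo for exact signatures and defaults:&lt;/p>
&lt;pre>&lt;code class="language-python">transform = MusicalMelTransform(
    interval=1.0,                  # semitone-centered bins
    passthrough_cutoff_hz=5000,    # above 5khz, pass FFT bins straight through...
    passthrough_grouping_size=3,   # ...grouping every 3 bins together
    learnable_weights=&amp;quot;fft&amp;quot;,      # learnable rescaling of the FFT magnitudes
    norm=False,                    # let the learned weights handle scaling
)
&lt;/code>&lt;/pre>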
&lt;h3 id="dont-ignore-the-bitter-lesson">Don&amp;rsquo;t ignore the bitter lesson&lt;/h3>
&lt;p>We should be careful here &amp;ndash; the temptation to ignore &lt;a href="https://www.cs.utexas.edu/~eunsol/courses/data/bitter_lesson.pdf">The Bitter Lesson&lt;/a> by constantly tweaking the &lt;code>f_max&lt;/code>, &lt;code>passthrough_cutoff_hz&lt;/code>, &lt;code>passthrough_grouping_size&lt;/code>, &lt;code>norm&lt;/code>, etc with your transform to make your network perform better is real.&lt;/p>
&lt;p>At some point we just need the information to flow through to a reasonable network that will learn from it.&lt;/p>
&lt;p>While I do think the Bitter Lesson applies less in a realtime or resource-constrained scenario, do think your architecture and data through before spending your days tweaking your mel transform settings.&lt;/p>
&lt;p>The gainz you seek are in the former, not the latter.&lt;/p>
&lt;h3 id="summary">Summary&lt;/h3>
&lt;p>Again, to reiterate: a mel transform is not magic! It is a series of linear combinations on the original FFT bins.&lt;/p>
&lt;p>But if you&amp;rsquo;re clever about it, it really does help!&lt;/p>
&lt;p>Check out the repo here, make a PR, and open an issue if you spot a problem!&lt;/p>
&lt;a href="https://github.com/worldveil/musical_mel_transform_torch">&lt;img src="https://gh-card.dev/repos/worldveil/musical_mel_transform_torch.svg">&lt;/a>
&lt;h3 id="about-vjlabai">About VJLab.AI&lt;/h3>
&lt;figure>
&lt;img src="https://willdrevo.com/static/img/vjlab/audioslice_td.png" alt="" width=800>
&lt;figcaption>Our realtime stem splitter feeding into GLSL shaders in TouchDesigner&lt;/figcaption>
&lt;/figure>
&lt;p>If you&amp;rsquo;re curious to learn more about what kinds of things we&amp;rsquo;re doing at &lt;a href="https://vjlab.ai">VJLab.AI&lt;/a> with all this stuff, check out:&lt;/p>
&lt;ul>
&lt;li>A &lt;a href="https://youtu.be/colb1meAr-M?feature=shared&amp;amp;t=474">video&lt;/a> showcasing our tool, &lt;a href="https://vjlab.ai/p/audioslice-realtime-stem-splitter-for-touchdesigner/">AudioSlice&lt;/a>, that &lt;a href="https://www.instagram.com/stories/highlights/17962725260622555/">I have personally used to perform visuals&lt;/a> for acts like John Summit, Dom Dolla, Gorgon City, Benny Benassi, GriZ and many more&lt;/li>
&lt;li>Our &lt;a href="https://vjlab.ai/p/beatsage/">beat tracker&lt;/a>, BeatSage, for live concert VJs&lt;/li>
&lt;/ul>
&lt;p>To stay up to date with what we&amp;rsquo;re doing:&lt;/p>
&lt;ul>
&lt;li>You can follow my &lt;a href="https://www.youtube.com/@its-drevo">YouTube&lt;/a> account for tutorials&lt;/li>
&lt;li>or &lt;a href="https://www.instagram.com/its.drevo/">Instagram&lt;/a> for tutorials and teasers&lt;/li>
&lt;li>Or our new &lt;a href="https://docs.google.com/forms/d/e/1FAIpQLSe1ZFUTfdiJ-W563tILt-F9KBo75PgvgPTlAUBxDEFzfUaHGA/viewform?usp=dialog">email list&lt;/a> for updates on new tools, models, repos, and updates to our existing apps&lt;/li>
&lt;/ul>
&lt;p>Our next generation of realtime audio models for visual artists and live performers are coming soon :)&lt;/p></description></item><item><title>Music AI state of the union: an ISMIR '24 summary</title><link>https://willdrevo.com/2024/12/05/music-ai-state-of-the-union-an-ismir-24-summary/</link><pubDate>Thu, 05 Dec 2024 01:20:03 -0500</pubDate><guid>https://willdrevo.com/2024/12/05/music-ai-state-of-the-union-an-ismir-24-summary/</guid><description>&lt;figure>
&lt;img src="https://willdrevo.com/static/img/ismir/crowd.jpeg" alt="ISMIR '24">
&lt;figcaption>ISMIR '24 held in San Francisco&lt;/figcaption>
&lt;/figure>
&lt;p>&lt;a href="https://ismir2024.ismir.net/">ISMIR&lt;/a> &amp;lsquo;24 (the conference for the International Society for Music Information Retrieval) this year was fantastic. I had an absolute blast getting to meet up with the brightest minds in the music AI space.&lt;/p>
&lt;p>The pace of innovation in music AI is absolutely breathtaking.&lt;/p>
&lt;p>For this post I chose a few themes I noticed at the conference. In each section I&amp;rsquo;ll describe my favorite paper and mention a few other papers to check out. You can see a full list of ISMIR &amp;lsquo;24 papers &lt;a href="https://ismir2024program.ismir.net/papers.html">here&lt;/a>.&lt;/p>
&lt;aside id="toc">
&lt;h4>Table of Contents&lt;/h4>
&lt;nav id="TableOfContents">
&lt;ul>
&lt;li>
&lt;ul>
&lt;li>&lt;a href="#theme-1-latent-spaces---discrete-and-continuous">Theme #1: Latent spaces - discrete and continuous&lt;/a>&lt;/li>
&lt;li>&lt;a href="#theme-2-diffusion-for-audio-generation">Theme #2: Diffusion for audio generation&lt;/a>&lt;/li>
&lt;li>&lt;a href="#theme-3-self-supervised-learning-ssl-techniques-gaining-steam">Theme #3: Self-supervised learning (SSL) techniques gaining steam&lt;/a>&lt;/li>
&lt;li>&lt;a href="#theme-4-music-stem-separation-mss-separation-by-query">Theme #4: Music stem separation (MSS): separation by query&lt;/a>&lt;/li>
&lt;li>&lt;a href="#theme-5-better-transcription-data-better-transcription-models">Theme #5: Better transcription data, better transcription models&lt;/a>&lt;/li>
&lt;li>&lt;a href="#theme-6-attribution">Theme #6: Attribution&lt;/a>&lt;/li>
&lt;li>&lt;a href="#summary">Summary&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/nav>
&lt;/aside>
&lt;p>Finally, if you’re in the music AI space and want to be friends or grab a coffee, hit me up on &lt;a href="https://x.com/itsdrevo">twitter&lt;/a> or shoot me a &lt;a href="https://www.linkedin.com/in/willdrevo">message&lt;/a>!&lt;/p>
&lt;p>I&amp;rsquo;m currently working on a new stealth project building realtime models that make audioreactive light shows &lt;a href="https://www.instagram.com/stories/highlights/17962725260622555/">like Coachella&lt;/a> possible &amp;ndash; a perfect fit for ISMIR.&lt;/p>
&lt;h3 id="theme-1-latent-spaces---discrete-and-continuous">Theme #1: Latent spaces - discrete and continuous&lt;/h3>
&lt;p>A recent trend in audio is training better latent space representations. They help with both compression and generation tasks. The two are somewhat related — audio is an extremely information dense modality, and bottlenecking information is playing out much like we saw in the image world once diffusion started happening in latent space rather than pixel space.&lt;/p>
&lt;p>Neural codecs using RVQ (ie: &lt;a href="https://github.com/facebookresearch/encodec">Encodec&lt;/a>, &lt;a href="https://github.com/descriptinc/descript-audio-codec">DAC&lt;/a>) or continuous autoencoders are the two preferred types of information bottlenecks today.&lt;/p>
&lt;p>Codecs are better at high quality reconstruction and phase coherence, but reconstruction falls apart if you shift them in time. The codebook vectors can be used as discrete tokens, or the last layer before quantization can be used as a continuous latent.&lt;/p>
&lt;p>Continuous latents are wonderful for downstream tasks and are quite good at capturing lower frequency or harmonic components, though often at the expense of phase when decoded.&lt;/p>
&lt;p>My favorite ISMIR &amp;lsquo;24 paper on this theme was:&lt;/p>
&lt;p>📚 &lt;em>&lt;strong>Music2Latent: Consistency Autoencoders for Latent Audio Compression&lt;/strong>&lt;/em> [&lt;a href="https://arxiv.org/abs/2408.06500">paper&lt;/a>] [&lt;a href="https://github.com/SonyCSLParis/music2latent">github&lt;/a>], the PhD work of &lt;a href="https://x.com/marco_ppasini">Marco Pasini&lt;/a> in partnership with Sony Paris.&lt;/p>
&lt;p>&lt;em>Music2Latent&lt;/em> broke the mold of difficult-to-train audio autoencoders (no GAN!) and trains with a single loss term. Most interestingly, as Marco revealed in the poster session, if one takes two latent embeddings from &lt;em>Music2Latent&lt;/em>, interpolates between them, and then decodes, you get audio that sounds like the two original waveforms mixed together in equal proportion. Extremely cool, and a huge step forward towards a better latent space for generative models.&lt;/p>
&lt;p>The only disappointing part is that the model will not be released, and that the code is under CC BY-NC 4.0 :/ but the code is on &lt;a href="https://github.com/SonyCSLParis/music2latent">Github&lt;/a>!&lt;/p>
&lt;p>Other noteworthy papers:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://arxiv.org/abs/2406.10970">Joint Audio and Symbolic Conditioning for Temporally Controlled Text-to-Music Generation&lt;/a>
&lt;ul>
&lt;li>An excellent example of a hybrid approach using Encodec: &lt;code>&amp;quot;...we use the continuous tensor z as the latent representation, while leveraging the discrete representation q for audio conditioning.&amp;quot;&lt;/code>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="https://arxiv.org/abs/2407.12563">Audio Conditioning for Music Generation via Discrete Bottleneck Features&lt;/a>
&lt;ul>
&lt;li>This is another FAIR paper, and so they also use Encodec, but as tokens in an autoregressive model&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="theme-2-diffusion-for-audio-generation">Theme #2: Diffusion for audio generation&lt;/h3>
&lt;p>Increasingly diffusion is being used for audio generation. It has a few nice properties:&lt;/p>
&lt;ul>
&lt;li>Inference can happen in parallel (not autoregressive)&lt;/li>
&lt;li>We can borrow a lot of techniques from image diffusion generation&lt;/li>
&lt;li>We don&amp;rsquo;t have to think about tokenization&lt;/li>
&lt;/ul>
&lt;p>The star paper here for me was another from both Sony and &lt;a href="https://x.com/c4dm">Queen Mary University of London&lt;/a>:&lt;/p>
&lt;p>📚 &lt;em>&lt;strong>Diff-A-Riff: Musical Accompaniment Co-creation via Latent Diffusion Models&lt;/strong>&lt;/em> [&lt;a href="https://arxiv.org/abs/2406.08384">paper&lt;/a>].&lt;/p>
&lt;p>And coincidentally, it was trained using &lt;em>Music2Latent&lt;/em>! So this is a nice segue from the last theme.&lt;/p>
&lt;p>First off, the &lt;em>Diff-A-Riff&lt;/em> generation quality is incredible. &lt;a href="https://sonycslparis.github.io/diffariff-companion/">Take a listen for yourself&lt;/a>.&lt;/p>
&lt;p>&lt;em>Diff-A-Riff&lt;/em> generates audio, conditioned by other stems, to create a target stem (the &amp;ldquo;accompaniment&amp;rdquo;). So you give it a guitar and a bass line and tell it to create a drum stem of the same length, and it will. You can even guide the accompaniment creation by conditioning with either an audio snippet or a text prompt.&lt;/p>
&lt;figure>
&lt;img src="https://willdrevo.com/static/img/ismir/diff-a-riff.png" alt="Diff-A-Riff">
&lt;figcaption>Diff-A-Riff allows conditioning with either text or audio&lt;/figcaption>
&lt;/figure>
&lt;p>As you might expect CLAP is used to achieve a shared text-audio space, but they have several other clever ways of handling the conditioning. Though the code and models will not be open sourced, it’s a really fascinating paper and some stellar work by the Sony Paris team.&lt;/p>
&lt;p>Back on theme: while it’s tempting to say that diffusion looks like the winning approach for audio generation, I don’t think we can quite be sure.&lt;/p>
&lt;p>We know &lt;a href="https://suno.com/">Suno&lt;/a> uses an autoregressive architecture (at least in v2-3) (&lt;a href="https://open.spotify.com/episode/2c1yL8hlttlkCs6nPysVi0?si=9e378b7d0fdf47fb">see this podcast&lt;/a> with their CTO, &lt;a href="https://x.com/mikeyshulman">Mikey Shulman&lt;/a>), and their generation quality is the best in the world for full-length tracks. And I don&amp;rsquo;t actually know what &lt;a href="https://www.udio.com/">Udio&lt;/a> uses, but if you do let me know!&lt;/p>
&lt;p>Other noteworthy papers:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://arxiv.org/abs/2404.10301">Long-form music generation with latent diffusion&lt;/a> (Stable Audio paper)&lt;/li>
&lt;li>&lt;a href="https://arxiv.org/abs/2408.00196">Combining audio control and style transfer using latent diffusion&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://arxiv.org/abs/2406.10970">Joint Audio and Symbolic Conditioning for Temporally Controlled Text-to-Music Generation&lt;/a> (this one uses flow matching, but it&amp;rsquo;s just a great paper)&lt;/li>
&lt;/ul>
&lt;h3 id="theme-3-self-supervised-learning-ssl-techniques-gaining-steam">Theme #3: Self-supervised learning (SSL) techniques gaining steam&lt;/h3>
&lt;p>Across a number of tasks like &lt;a href="https://arxiv.org/abs/2411.04152">beat tracking&lt;/a>, &lt;a href="https://arxiv.org/pdf/2408.02514">stem-affinity&lt;/a>, and &lt;a href="https://arxiv.org/abs/2407.07408">tonality estimation&lt;/a>, self-supervised learning (SSL) techniques started to shine this year at ISMIR.&lt;/p>
&lt;p>These techniques are especially important in the music space where labeled data is far more limited than in the text, image, or video domains.&lt;/p>
&lt;p>Favorite SSL paper:&lt;/p>
&lt;p>📚 &lt;em>&lt;strong>Stem-JEPA: A Joint-Embedding Predictive Architecture for Musical Stem Compatibility Estimation&lt;/strong>&lt;/em> [&lt;a href="https://arxiv.org/abs/2408.02514">paper&lt;/a>] [&lt;a href="https://github.com/SonyCSLParis/Stem-JEPA">github&lt;/a>] by &lt;a href="https://x.com/howariou">Alain Riou&lt;/a> et al, from Institut Polytechnique de Paris and Sony.&lt;/p>
&lt;p>Basically the gist here is that, given a few stems aligned in time, you can train a model to output the &lt;em>latent representation&lt;/em> of yet another stem that best fits the existing stem mixture (and any conditioning signals you supply).&lt;/p>
&lt;figure>
&lt;img src="https://willdrevo.com/static/img/ismir/stem-jepa.png" alt="Stem JEPA">
&lt;figcaption>The Stem-JEPA architecture&lt;/figcaption>
&lt;/figure>
&lt;p>So, why is this nice?&lt;/p>
&lt;p>Well, if you want to generate, say a bassline for your jazzy vocal, what are your options?&lt;/p>
&lt;ul>
&lt;li>Manually swap in and out stems, listening for compatibility
&lt;ul>
&lt;li>Time consuming&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Generate the missing stem
&lt;ul>
&lt;li>Expensive FLOPS-wise&lt;/li>
&lt;li>Requires you to have such a generative model in the first place with extremely high quality&lt;/li>
&lt;li>You&amp;rsquo;d still need to score the generated stem against your existing stems to make sure it&amp;rsquo;s compatible, or have a model that uses the existing mix stems as conditioning&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Train some sort of model to score your existing stems for compatibility
&lt;ul>
&lt;li>Expensive computationally &amp;ndash; we’d have to score each existing stem in your database against your currently active stem mixture to calculate a ranking by score&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>The JEPA approach here lets us instead generate the &amp;ldquo;idea&amp;rdquo; of which kind of stem would fit best.&lt;/p>
&lt;p>With this, we can then query a database of stems (with precomputed JEPA embeddings) to find which are the most compatible, using a simple nearest-neighbor approach. This does require precomputing embeddings for all the stems in your dataset, but that’s easily done ahead of time. At inference time, the JEPA system can be much faster. For that reason, &lt;em>Stem-JEPA&lt;/em> is a wonderfully clever piece of work.&lt;/p>
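&lt;p>To make the retrieval step concrete, here&amp;rsquo;s a minimal sketch (names and shapes are my own, not from the paper): with a precomputed embedding matrix for your stem library, compatibility lookup reduces to a cosine-similarity nearest-neighbor search.&lt;/p>
&lt;pre>&lt;code>import numpy as np

def top_k_compatible(query, library, k=5):
    # Normalize rows so dot products become cosine similarities
    q = query / np.linalg.norm(query)
    lib = library / np.linalg.norm(library, axis=1, keepdims=True)
    sims = lib @ q
    # Indices of the k most compatible stems, best match first
    return np.argsort(-sims)[:k]

# library: one precomputed JEPA embedding per stem in your database;
# query: the predicted embedding for the missing stem
library = np.random.randn(10_000, 512).astype(np.float32)
query = np.random.randn(512).astype(np.float32)
print(top_k_compatible(query, library))
&lt;/code>&lt;/pre>
&lt;p>In practice you&amp;rsquo;d reach for an approximate nearest-neighbor index rather than this brute-force scan, but the principle is the same.&lt;/p>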
&lt;p>A downside: when training, &lt;em>Stem-JEPA&lt;/em> does require split stems (which are less plentiful in the world than mixed audio). Luckily, it appears this model is quite data efficient!&lt;/p>
&lt;p>With ~100x less data, downstream tasks using this learned embedding space are on par with representations generated with &lt;a href="https://github.com/PandoraMedia/music-audio-representations">MULE&lt;/a> (trained on 117k hours) and &lt;a href="https://github.com/openai/jukebox">Jukebox&lt;/a> (1.7M songs). &lt;em>Stem-JEPA&lt;/em> was trained on Sony&amp;rsquo;s 20k multitracks (only ~1,350 hours by comparison).&lt;/p>
&lt;p>A few other papers I enjoyed in the self-supervised realm at ISMIR &amp;lsquo;24:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://arxiv.org/abs/2407.07408">STONE: Self-supervised Tonality Estimator&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://hal.science/hal-04733487/document">SKY: Self-supervised Learning of Major and Minor Keys from Audio&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://arxiv.org/abs/2411.04152">A Contrastive Self-Supervised Learning scheme for beat tracking amenable to few-shot learning&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="theme-4-music-stem-separation-mss-separation-by-query">Theme #4: Music stem separation (MSS): separation by query&lt;/h3>
&lt;p>We’re all familiar with the traditional Vocals, Drums, Bass &amp;amp; Other (VDBO) separation — you input an audio mix, and a model like Demucs or RoFormer outputs an estimate of each stem in this fixed set.&lt;/p>
&lt;p>Today, &lt;em>fine-tuned, offline, single-stem&lt;/em> MSS extraction models can reach ~8-12 dB SDR against the ground-truth stems, which is very impressive.&lt;/p>
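&lt;p>For reference, the basic (non-scale-invariant) SDR is just the energy ratio between the ground-truth stem and the estimation error, in dB &amp;ndash; a quick sketch:&lt;/p>
&lt;pre>&lt;code>import numpy as np

def sdr(reference, estimate):
    # Signal-to-distortion ratio in dB between a ground-truth stem
    # and a model's estimate (both 1-D arrays of samples)
    noise = reference - estimate
    return 10 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))
&lt;/code>&lt;/pre>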
&lt;p>However, offline SDR gains on those fronts show diminishing returns, and the field is increasingly moving towards:&lt;/p>
&lt;ul>
&lt;li>Extracting a larger set of stems (e.g. piano, acoustic guitar, electric guitar)&lt;/li>
&lt;li>Extracting a stem by a text or audio &amp;ldquo;query&amp;rdquo;&lt;/li>
&lt;li>Making the separation process more efficient&lt;/li>
&lt;/ul>
&lt;p>&lt;em>Realtime MSS&lt;/em> is a different story (largely ignored at ISMIR &amp;lsquo;24), and you can contact me if you want to chat about this :)&lt;/p>
&lt;p>But for the offline MSS theme, my favorite paper was led by the indomitable &lt;a href="https://x.com/appoggiaturaaa">Karn Watcharasupat&lt;/a>, who has a number of papers on this topic.&lt;/p>
&lt;p>📚 &lt;em>&lt;strong>A Stem-Agnostic Single-Decoder System for Music Source Separation Beyond Four Stems&lt;/strong>&lt;/em> [&lt;a href="https://arxiv.org/abs/2406.18747">paper&lt;/a>] [&lt;a href="https://github.com/kwatcharasupat/query-bandit">github&lt;/a>].&lt;/p>
&lt;p>Their &lt;em>Banquet&lt;/em> model architecture lets you train effectively infinite &amp;ldquo;decoders&amp;rdquo; for different stems. Each decoder is simply a FiLM query embedding paired with a known, fixed stem type.&lt;/p>
&lt;p>The benefit, of course, is that unlike other architectures, the number of parameters dedicated to decoding a single stem drops from millions to a few hundred &amp;ndash; the cost of a single vector!&lt;/p>
&lt;p>So in the last step of the model at inference time, you use the FiLM query vector to extract the stem you&amp;rsquo;re after &amp;ndash; a sort of latent space &amp;ldquo;mask&amp;rdquo;.&lt;/p>
&lt;p>Even better, there&amp;rsquo;s nothing inherently keeping the query set (and therefore the stems that are extractable) fixed. In their paper, the FiLM query vector set was, in fact, frozen per stem due to training instability, but it feels like a similar architecture could support arbitrary embeddings being used to extract arbitrary stems. This is the next frontier of MSS in my opinion.&lt;/p>
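&lt;p>For readers unfamiliar with FiLM: a FiLM layer simply scales and shifts intermediate features using a per-query conditioning vector. A toy version of the idea (my own sketch, not the &lt;em>Banquet&lt;/em> code) looks like:&lt;/p>
&lt;pre>&lt;code>import numpy as np

class FiLMQuery:
    """Toy FiLM conditioning: each stem type owns a small (gamma, beta)
    pair that modulates the shared separator's hidden features."""
    def __init__(self, num_stems, feat_dim, seed=0):
        rng = np.random.default_rng(seed)
        # A few hundred parameters per stem -- just two vectors
        self.gamma = 1.0 + 0.01 * rng.standard_normal((num_stems, feat_dim))
        self.beta = np.zeros((num_stems, feat_dim))

    def __call__(self, features, stem_id):
        # features: (time, feat_dim) activations from the shared encoder
        return features * self.gamma[stem_id] + self.beta[stem_id]

film = FiLMQuery(num_stems=16, feat_dim=256)
hidden = np.random.randn(1000, 256)
vocals_branch = film(hidden, stem_id=3)  # "decode" stem number 3
&lt;/code>&lt;/pre>
&lt;p>The point is that swapping decoders is just swapping a vector, which is why the per-stem parameter count collapses so dramatically.&lt;/p>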
&lt;p>Being able to query a mix for a bass guitar, for example, using an audio snippet of a similar stem (or an isolated snippet of said bass guitar from another part of the track) feels like the correct UI for MSS to extract the exact stem you&amp;rsquo;re after.&lt;/p>
&lt;p>As a final note, MSS is close to my heart as the area I work most heavily in. At VJ Labs we work (among other realtime techniques) on realtime MSS — something in which we proudly surpass the SOTA :) But alas! No papers about realtime MSS this year at ISMIR!&lt;/p>
&lt;p>Other stem separation (MSS) related papers from this year&amp;rsquo;s ISMIR:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://arxiv.org/abs/2409.04702">Mel-RoFormer for Vocal Separation and Vocal Melody Transcription&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://www.audiolabs-erlangen.de/resources/MIR/2024-ISMIR-PianoSepEval2">Notewise Evaluation of Source Separation: A Case Study For Separated Piano Tracks&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://drive.google.com/file/d/1hmHE0nv8wZsj51ajCsSdN_UehrQKrMDt/view">Classical Guitar Duet Separation using GuitarDuets - a Dataset of Real and Synthesized Guitar Recordings&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="theme-5-better-transcription-data-better-transcription-models">Theme #5: Better transcription data, better transcription models&lt;/h3>
&lt;p>While models are of course getting better, the transcription fight seems to be won largely on the data front, with both natural and synthetic data.&lt;/p>
&lt;p>My favorite paper in this space was undoubtedly for guitar transcription:&lt;/p>
&lt;p>📚 &lt;em>&lt;strong>GAPS: A Large and Diverse Classical Guitar Dataset and Benchmark Transcription Model&lt;/strong>&lt;/em> [&lt;a href="https://arxiv.org/abs/2408.08653">paper&lt;/a>], from first authors &lt;a href="https://x.com/xavriley">Xavier Riley&lt;/a> &amp;amp; &lt;a href="https://x.com/nicolasguozixun">Nicolas Guo&lt;/a> from &lt;a href="https://x.com/c4dm">C4DM&lt;/a>.&lt;/p>
&lt;p>I highly encourage you to &lt;a href="https://youtu.be/xifkG2tTEwU?feature=shared&amp;amp;t=56">watch the video showing the played vs transcribed MIDI side by side&lt;/a>. The results are stunning.&lt;/p>
&lt;figure>
&lt;a href="https://youtu.be/xifkG2tTEwU?feature=shared&amp;t=56">
&lt;img src="https://willdrevo.com/static/img/ismir/gaps.png" alt="GAPS dataset presentation">
&lt;/a>
&lt;figcaption>The GAPS guitar transcription dataset&lt;/figcaption>
&lt;/figure>
&lt;p>Piano transcription datasets (&lt;a href="https://magenta.tensorflow.org/datasets/maestro">MAESTRO&lt;/a>, &lt;a href="https://inria.hal.science/inria-00544155/en">MAPS&lt;/a>, etc.) are much larger today; guitar has no comparable corpus. So Xavier &amp;amp; team created their own dataset and used it to fine-tune a &lt;a href="https://arxiv.org/pdf/2010.01815">piano transcription model from Bytedance&lt;/a>.&lt;/p>
&lt;p>Transcription data pipelines are no joke (extensive alignment and quality checking), so even though the dataset is on the smaller side, it&amp;rsquo;s quite impressive that ~14 hours of guitar was so effective.&lt;/p>
&lt;p>Notably, the model is a &amp;ldquo;simple&amp;rdquo; (ie: non-Transformer) CRNN (log mel frontend + convolutional features + bidirectional RNN) operating at roughly 10ms granularity.&lt;/p>
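&lt;p>To give a sense of scale, a toy version of that kind of stack (my own sketch in PyTorch, not the GAPS code; layer sizes are invented) is only a few layers deep:&lt;/p>
&lt;pre>&lt;code>import torch
import torch.nn as nn

class ToyCRNN(nn.Module):
    """Sketch of a log-mel CRNN transcriber: conv features, a
    bidirectional GRU, then per-frame (~10 ms) pitch logits."""
    def __init__(self, n_mels=229, n_pitches=88, hidden=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.rnn = nn.GRU(16 * n_mels, hidden, batch_first=True,
                          bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_pitches)

    def forward(self, log_mel):
        # log_mel: (batch, time, n_mels)
        x = self.conv(log_mel.unsqueeze(1))   # (batch, 16, time, n_mels)
        x = x.permute(0, 2, 1, 3).flatten(2)  # (batch, time, 16 * n_mels)
        x, _ = self.rnn(x)
        return self.head(x)                   # per-frame pitch logits

model = ToyCRNN()
logits = model(torch.randn(1, 500, 229))  # 500 frames at ~10 ms = 5 seconds
&lt;/code>&lt;/pre>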
&lt;p>Other transcription papers:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://researchdiscovery.drexel.edu/view/pdfCoverPage?instCode=01DRXU_INST&amp;amp;filePid=13549920770004721&amp;amp;download=true">Leveraging Unlabeled Data to Improve Automatic Guitar Tablature Transcription&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://drive.google.com/file/d/1CHhf2YqFLE4yhviOEnoiWPkquE-MqLBy/view">Semi-Supervised Piano Transcription Using Pseudo-Labeling Techniques&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://repositori.upf.edu/bitstream/handle/10230/61103/kim_ismir_meth.pdf?sequence=1&amp;amp;isAllowed=y">A Method for MIDI Velocity Estimation for Piano Performance by a U-Net with Attention and FiLM&lt;/a>&lt;/li>
&lt;/ul>
&lt;!-- * Streaming Piano Transcription Based on Consistent Onset and Offset Decoding with Sustain Pedal Detection
* Scoring Time Intervals Using Non-Hierarchical Transformer for Automatic Piano Transcription
* Robust and Accurate Audio Synchronization Using Raw Features From Transcription Models -->
&lt;h3 id="theme-6-attribution">Theme #6: Attribution&lt;/h3>
&lt;p>Having a trail of provenance for which music samples, ideas, models, or styles inspired or created a given piece of music was also clearly a theme at this year&amp;rsquo;s ISMIR, though more so in conversations and panels than papers.&lt;/p>
&lt;p>As you might imagine, there’s a huge storm coming in terms of the rights holders of the world (record labels, copyright holders, artists) wanting their piece of the generative AI pie.&lt;/p>
&lt;p>The big questions fall at the input and output:&lt;/p>
&lt;ul>
&lt;li>➡️ On the &lt;em>input&lt;/em> side: is training models on copyrighted audio “fair use”?&lt;/li>
&lt;li>⬅️ On the &lt;em>output&lt;/em> side: by what metric is a new piece of audio deemed to “copy” another, and to what extent?&lt;/li>
&lt;/ul>
&lt;p>To be fair, papers aren’t the place to tackle these issues. Likely the US Supreme Court will have that honor. But the various technical approaches being explored reflect these thorny issues.&lt;/p>
&lt;p>The only paper really worth mentioning was:&lt;/p>
&lt;p>📚 &lt;em>&lt;strong>Exploring Musical Roots: Applying Audio Embeddings to Empower Influence Attribution for a Generative Music Model&lt;/strong>&lt;/em> [&lt;a href="https://arxiv.org/pdf/2401.14542">paper&lt;/a>] [&lt;a href="https://exploring-musical-roots.notion.site/Exploring-musical-roots-an-audio-walkthrough-83da76f6311b46198b992d372b37e70f">examples&lt;/a>]&lt;/p>
&lt;p>The paper basically amounts to &amp;ldquo;fingerprinting&amp;rdquo; a dataset of audio using CLAP and CLMR embeddings, then querying that dataset nearest-neighbor style and measuring how similar the retrieved audio is to the query audio.&lt;/p>
&lt;p>If you listen to the examples (or just think about what&amp;rsquo;s being done), it&amp;rsquo;s clear that this is not a good approach.&lt;/p>
&lt;p>This falls on the &amp;ldquo;output&amp;rdquo; side of the attribution question. And querying based on feel or vibe (basically what CLAP and CLMR are good at) is just going to return matches that are of a similar style or genre, not musical infringement.&lt;/p>
&lt;p>To me, the more sensible approaches center around a couple of &amp;ldquo;output&amp;rdquo;-based infringements:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Sampling (copying audio)&lt;/strong>
&lt;ul>
&lt;li>&lt;em>&amp;ldquo;Did the artist literally copy and paste another piece of audio?&amp;rdquo;&lt;/em>&lt;/li>
&lt;li>Spectral (traditional) audio fingerprinting is a much better approach here, with far less computation, fewer learnable parameters, and far fewer false positives&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Structural similarity (copying structure)&lt;/strong>
&lt;ul>
&lt;li>&lt;em>&amp;ldquo;Did the artist directly rip off the chords or melody or lyrics?&amp;rdquo;&lt;/em>&lt;/li>
&lt;li>This likely is solved technically with transcription models and some kind of MIDI similarity metric&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>For either approach, the thorny issue remains: &amp;ldquo;to what extent&amp;rdquo; is a piece of music considered to be a copy? And if such a determination is made, what are the monetary and access consequences for the creators, the original rights holders, and the public?&lt;/p>
&lt;p>This line of thinking merits an entire post (or book) of its own, so I&amp;rsquo;ll stop here.&lt;/p>
&lt;p>Interested readers, artists, or music AI researchers should check out my favorite book on the subject: &lt;a href="https://lessig.org/product/free-culture/">Free Culture&lt;/a> by the famous &lt;a href="https://hls.harvard.edu/faculty/lawrence-lessig/">Lawrence Lessig&lt;/a>, founder of &lt;a href="https://creativecommons.org/">Creative Commons&lt;/a> (yes, that one!). It&amp;rsquo;s an indispensable read.&lt;/p>
&lt;!-- ### Bonus section: my favorite paper overall was...
📚 *ST-ITO: Controlling Audio Effects for Style Transfer with Inference-Time Optimization* [[paper](https://arxiv.org/abs/2410.21233)] [[github](https://github.com/csteinmetz1/st-ito)], led by [Chris Steinmetz](https://x.com/csteinmetz1) from [Suno](https://suno.com/).
This paper was also one of the three winners of the ISMIR best paper award.
The problem statement is simple: could we transfer the style of one audio segment to another, using the (non-differentiable) VSTs availiable in your DAW?
The approach is quite clever.
&lt;figure>
&lt;img src="https://willdrevo.com/static/img/ismir/st-ito.png" alt="ST-ITO">
&lt;figcaption>ST-ITO&lt;/figcaption>
&lt;/figure> -->
&lt;h3 id="summary">Summary&lt;/h3>
&lt;p>ISMIR &amp;lsquo;24 was a blast. I&amp;rsquo;m already looking forward to next year!&lt;/p>
&lt;p>See you all in Korea in &amp;lsquo;25 🇰🇷&lt;/p></description></item><item><title>Audio Fingerprinting</title><link>https://willdrevo.com/fingerprinting-and-audio-recognition-with-python/</link><pubDate>Fri, 15 Nov 2013 18:07:10 -0700</pubDate><guid>https://willdrevo.com/fingerprinting-and-audio-recognition-with-python/</guid><description>&lt;blockquote>
&lt;p>Note: this post was authored way, way back in my grad school days (in 2013!) but continues to be quite popular and cited in a number of papers. So I&amp;rsquo;ve copied and pasted it over to this newish blog site. Please keep in mind there are more advanced and scalable fingerprinting systems out there these days, but this is an excellent introduction and example codebase to start from. Enjoy!&lt;/p>
&lt;/blockquote>
&lt;p>The first day I tried out Shazam, I was blown away. Next to GPS and surviving the fall down a flight of stairs, being able to recognize a song from a vast corpus of audio was the most incredible thing I&amp;rsquo;d ever seen my phone do. This recognition works through a process called &lt;a href="http://en.wikipedia.org/wiki/Acoustic_fingerprint">audio fingerprinting&lt;/a>. Examples include:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="http://www.ee.columbia.edu/~dpwe/papers/Wang03-shazam.pdf">Shazam&lt;/a>&lt;/li>
&lt;li>&lt;a href="http://www.midomi.com/">SoundHound / Midomi&lt;/a>&lt;/li>
&lt;li>&lt;a href="http://acoustid.org/chromaprint">Chromaprint&lt;/a>&lt;/li>
&lt;li>&lt;a href="http://echoprint.me/">Echoprint&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>After a few weekends of puzzling through academic papers and writing code, I came up with the Dejavu Project, an open-source audio fingerprinting project in Python. You can &lt;a href="https://github.com/worldveil/dejavu">see it here on Github&lt;/a>.&lt;/p>
&lt;a href="https://github.com/worldveil/dejavu">&lt;img src="https://gh-card.dev/repos/worldveil/dejavu.svg">&lt;/a>
&lt;p>On my testing dataset, Dejavu exhibits 100% recall when reading an unknown wave file from disk or listening to a recording for at least 5 seconds.&lt;/p>
&lt;p>Following is all the knowledge you need to understand audio fingerprinting and recognition, starting from the basics. Those with signals experience should skip to &amp;ldquo;Peak Finding&amp;rdquo;.&lt;/p>
&lt;aside id="toc">
&lt;h4>Table of Contents&lt;/h4>
&lt;nav id="TableOfContents">
&lt;ul>
&lt;li>&lt;a href="#music-as-a-signal">Music as a signal&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#sampling">Sampling&lt;/a>&lt;/li>
&lt;li>&lt;a href="#spectrograms">Spectrograms&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#peak-finding">Peak Finding&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#fingerprint-hashing">Fingerprint hashing&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#learning-a-song-database-structure">Learning a Song: Database structure&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#fingerprints-table">Fingerprints table&lt;/a>&lt;/li>
&lt;li>&lt;a href="#songs-table">Songs table&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#fingerprint-alignment">Fingerprint Alignment&lt;/a>&lt;/li>
&lt;li>&lt;a href="#how-well-it-works">How well it works&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#1-reading-from-disk">1. Reading from Disk&lt;/a>&lt;/li>
&lt;li>&lt;a href="#2-audio-over-laptop-microphone">2. Audio over laptop microphone&lt;/a>&lt;/li>
&lt;li>&lt;a href="#3-compressed-streamed-music-played-on-my-iphone">3. Compressed streamed music played on my iPhone&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#performance-speed">Performance: Speed&lt;/a>&lt;/li>
&lt;li>&lt;a href="#performance-storage">Performance: Storage&lt;/a>&lt;/li>
&lt;li>&lt;a href="#conclusion">Conclusion&lt;/a>&lt;/li>
&lt;/ul>
&lt;/nav>
&lt;/aside>
&lt;h2 id="music-as-a-signal">Music as a signal&lt;/h2>
&lt;p>As a computer scientist, my familiarity with the &lt;a href="http://en.wikipedia.org/wiki/Fast_Fourier_transform">Fast Fourier Transform (FFT)&lt;/a> was only that it was a cool way to multiply polynomials in &lt;code>O(nlog(n))&lt;/code> time. Luckily it is much cooler for doing signal processing, its canonical usage.&lt;/p>
&lt;p>Music, it turns out, is digitally encoded as just a long list of numbers. In an uncompressed .wav file, there are a lot of these numbers - 44100 per second per channel. This means a 3 minute long song has almost 16 million samples.&lt;/p>
&lt;blockquote>
&lt;p>3 min * 60 sec * 44100 samples per sec * 2 channels = 15,876,000 samples&lt;/p>
&lt;/blockquote>
&lt;p>A channel is a separate sequence of samples that a speaker can play. Think of having two earbuds - this is a &amp;ldquo;stereo&amp;rdquo;, or two channel, setup. A single channel is called &amp;ldquo;mono&amp;rdquo;. Today, modern surround sound systems can support many more channels. But unless the sound is recorded or mixed with the same number of channels, the extra speakers are redundant and some speakers will just play the same stream of samples as other speakers.&lt;/p>
&lt;h3 id="sampling">Sampling&lt;/h3>
&lt;p>Why 44100 samples per second? The choice seems quite arbitrary, but it relates to the &lt;a href="http://en.wikipedia.org/wiki/Nyquist%E2%80%93Shannon_sampling_theorem">Nyquist-Shannon Sampling Theorem&lt;/a>. This is a long, mathematical way to say that there is a theoretical limit on the maximum frequency we can capture accurately when recording. This maximum frequency is based on how &lt;em>fast&lt;/em> we sample the signal.&lt;/p>
&lt;p>If this doesn&amp;rsquo;t make sense, think about watching a fan blade that makes a full revolution at a rate of exactly once per second (1 Hz). Now imagine keeping your eyes closed, but opening them briefly once per second. If the fan still happens to be making exactly a full revolution every 1 second as well, it will appear as though the fan blade hasn&amp;rsquo;t moved! Each time you open your eyes, the blade happens to be in the same spot. But there&amp;rsquo;s a problem. In fact, as far as you know, the fan blade could be making 0, 1, 2, 3, 10, 100, or even 1 million spins per second and you would never know - it would still appear stationary! Thus in order to be assured you are correctly sampling (or &amp;ldquo;seeing&amp;rdquo;) higher frequencies (or &amp;ldquo;spins&amp;rdquo;), you need to sample (or &amp;ldquo;open your eyes&amp;rdquo;) more frequently. To be exact, we need to sample twice as frequently as the frequency we want to see to make sure we&amp;rsquo;re detecting it.&lt;/p>
&lt;p>In the case of recording audio, the accepted rule is that we&amp;rsquo;re OK missing out on frequencies above 22050 Hz since humans can&amp;rsquo;t even hear frequencies above 20,000 Hz. Thus by Nyquist, we have to sample &lt;em>twice&lt;/em> that:&lt;/p>
&lt;blockquote>
&lt;p>Samples per sec needed = Highest-Frequency * 2 = 22050 * 2 = 44100&lt;/p>
&lt;/blockquote>
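&lt;p>You can see aliasing happen in a few lines of numpy (my own toy demo): sample a tone above the Nyquist frequency and it becomes indistinguishable from one below it.&lt;/p>
&lt;pre>&lt;code>import numpy as np

fs = 100                                 # sample rate: 100 Hz, Nyquist = 50 Hz
t = np.arange(0, 1, 1 / fs)
tone_40 = np.sin(2 * np.pi * 40 * t)     # below Nyquist: captured faithfully
tone_60 = np.sin(2 * np.pi * 60 * t)     # above Nyquist: aliases down to 40 Hz

# The 60 Hz tone, sampled at 100 Hz, is exactly the 40 Hz tone (flipped)
print(np.allclose(tone_60, -tone_40))    # True
&lt;/code>&lt;/pre>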
&lt;p>The MP3 format compresses this in order to 1) save space on your hard drive, and 2) irritate audiophiles, but a pure .wav formatted file on your computer is just a list of 16 bit integers (with a small header).&lt;/p>
&lt;h3 id="spectrograms">Spectrograms&lt;/h3>
&lt;p>Since these samples are a signal of sorts, we can repeatedly use an FFT over small windows of time in the song&amp;rsquo;s samples to create a &lt;a href="http://en.wikipedia.org/wiki/Spectrogram">spectrogram&lt;/a> of the song. Here&amp;rsquo;s a spectrogram of the first few seconds of &amp;ldquo;Blurred Lines&amp;rdquo; by Robin Thicke.&lt;/p>
&lt;p>&lt;img src="https://willdrevo.com/static/img/dejavu/spectrogram_no_peaks.png" alt="Blurred Lines">&lt;/p>
&lt;p>As you can see, it&amp;rsquo;s just a 2D array with amplitude as a function of time and frequency. Each FFT gives us the strength (amplitude) of the signal at each frequency for one small window of time &amp;ndash; a single column. Sliding the window along the song and stacking the columns gives us the 2D spectrogram.&lt;/p>
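&lt;p>If you want to try this yourself, &lt;code>scipy&lt;/code> will compute a spectrogram in a couple of lines (a sketch, assuming a wav file on disk; window sizes are illustrative):&lt;/p>
&lt;pre>&lt;code>import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

fs, samples = wavfile.read("song.wav")  # fs = 44100 for CD-quality audio
if samples.ndim == 2:
    samples = samples.mean(axis=1)      # mix stereo down to mono

# 4096-sample FFT windows with 50% overlap
freqs, times, amps = spectrogram(samples, fs=fs, nperseg=4096, noverlap=2048)
amps = 10 * np.log10(amps + 1e-10)      # log scale, like the plots here
&lt;/code>&lt;/pre>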
&lt;p>It&amp;rsquo;s important to note that the frequency and time values are discretized, each representing a &amp;ldquo;bin&amp;rdquo;, while the amplitudes are real valued. The color shows the real value (red -&amp;gt; higher, green -&amp;gt; lower) of the amplitude at the discretized (time, frequency) coordinate.&lt;/p>
&lt;p>As a thought experiment, if we were to record and create a spectrogram of a single tone, we&amp;rsquo;d get a straight horizontal line at the frequency of the tone. This is because the frequency does not vary from window to window.&lt;/p>
&lt;p>Great. So how does this help us recognize audio? Well, we&amp;rsquo;d like to use this spectrogram to identify this song uniquely. The problem is that if you have your phone in your car and you try to recognize the song on the radio, you&amp;rsquo;ll get noise - someone is talking in the background, another car honking its horn, etc. We have to find a robust way to capture unique &amp;ldquo;fingerprints&amp;rdquo; from the audio signal.&lt;/p>
&lt;h2 id="peak-finding">Peak Finding&lt;/h2>
&lt;p>Now that we&amp;rsquo;ve got a spectrogram of our audio signal, we can start by finding &amp;ldquo;peaks&amp;rdquo; in amplitude. We define a peak as a (time, frequency) pair corresponding to an amplitude value which is the greatest in a local &amp;ldquo;neighborhood&amp;rdquo; around it. Other (time, frequency) pairs around it are lower in amplitude, and thus less likely to survive noise.&lt;/p>
&lt;p>Finding peaks is an entire problem itself. I ended up treating the spectrogram as an image and using the image processing toolkit and techniques from &lt;code>scipy&lt;/code> to find peaks. A combination of a high pass filter (accentuating high amplitudes) and &lt;code>scipy&lt;/code> local maxima structs did the trick.&lt;/p>
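&lt;p>A minimal version of that idea with &lt;code>scipy.ndimage&lt;/code> (not Dejavu&amp;rsquo;s exact parameters): a cell is a peak if it equals the maximum of its neighborhood and clears an amplitude threshold.&lt;/p>
&lt;pre>&lt;code>import numpy as np
from scipy.ndimage import maximum_filter

def find_peaks(amps, neighborhood=20, min_amp=10):
    # A cell is a local peak if it equals the max over its neighborhood...
    local_max = maximum_filter(amps, size=neighborhood) == amps
    # ...and is loud enough to plausibly survive noise
    peaks = np.logical_and(local_max, np.greater(amps, min_amp))
    freq_idx, time_idx = np.nonzero(peaks)
    return list(zip(time_idx, freq_idx))  # (time bin, frequency bin) pairs
&lt;/code>&lt;/pre>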
&lt;p>Once we&amp;rsquo;ve extracted these noise-resistant peaks, we have found points of interest in a song that identify it. We are effectively &amp;ldquo;squashing&amp;rdquo; the spectrogram down once we&amp;rsquo;ve found the peaks. The amplitudes have served their purpose, and are no longer needed.&lt;/p>
&lt;p>Let&amp;rsquo;s plot them to see what it looks like:&lt;/p>
&lt;p>&lt;img src="https://willdrevo.com/static/img/dejavu/spectrogram_peaks.png" alt="Blurred Lines">&lt;/p>
&lt;p>You&amp;rsquo;ll notice there are a lot of these. Tens of thousands per song, actually. The beauty is that since we&amp;rsquo;ve done away with amplitude, we only have two things, time and frequency, which we&amp;rsquo;ve conveniently made into discrete, integer values. We&amp;rsquo;ve binned them, essentially.&lt;/p>
&lt;p>We have a somewhat double-edged situation: on one hand, we have a system that will bin peaks from a signal into discrete (time, frequency) pairs, giving us some leeway to survive noise. On the other hand, since we&amp;rsquo;ve discretized, we&amp;rsquo;ve reduced the information of the peaks from infinite to finite, meaning that peaks found in one song can (hint: will!) collide with peaks extracted from other songs. Different songs can and most likely will emit some of the same peaks! So what now?&lt;/p>
&lt;h3 id="fingerprint-hashing">Fingerprint hashing&lt;/h3>
&lt;p>So we might have similar peaks. No problem, let&amp;rsquo;s combine peaks into fingerprints! We&amp;rsquo;ll do this by using a hash function.&lt;/p>
&lt;p>A &lt;a href="http://en.wikipedia.org/wiki/Hash_function">hash function&lt;/a> takes an integer input and returns another integer as output. The beauty is that a good hash function will not only return the &lt;em>same&lt;/em> output integer each time the input is the same, but also that very few different inputs will have the same output.&lt;/p>
&lt;p>By looking at our spectrogram peaks and combining pairs of peak frequencies along with the time difference between them, we can create a hash, representing a unique fingerprint for this song.&lt;/p>
&lt;pre>&lt;code>hash(frequencies of peaks, time difference between peaks) = fingerprint hash value
&lt;/code>&lt;/pre>
&lt;p>There are lots of different ways to do this: Shazam has their own, SoundHound another, and so on. You can peruse my source to see how I do it, but the point is that by taking into account more than a single peak&amp;rsquo;s values you create fingerprints that have more entropy and therefore contain more information. Thus they are more powerful identifiers of songs, since they will collide less.&lt;/p>
&lt;p>You can visualize what is going on with the zoomed-in annotated spectrogram snippet below:&lt;/p>
&lt;p>&lt;img src="https://willdrevo.com/static/img/dejavu/spectrogram_zoomed.png" alt="Blurred Lines">&lt;/p>
&lt;p>The Shazam whitepaper likens these groups of peaks to a sort of &amp;ldquo;constellation&amp;rdquo; used to identify the song. In reality they use pairs of peaks along with the time delta between them. You can imagine lots of different ways to group points into fingerprints. On one hand, more peaks in a fingerprint means a rarer fingerprint that more strongly identifies a song. But more peaks also means a fingerprint less robust in the face of noise.&lt;/p>
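&lt;p>As an illustration only (Dejavu&amp;rsquo;s real parameters live in the repo), pairing each peak with a handful of the peaks that follow it, and hashing the pair of frequencies plus their time delta, might look like:&lt;/p>
&lt;pre>&lt;code>import hashlib

def generate_fingerprints(peaks, fan_value=5):
    """peaks: list of (time_bin, freq_bin) pairs, sorted by time."""
    for i, (t1, f1) in enumerate(peaks):
        # Pair each peak with the next few peaks ahead of it in time
        for (t2, f2) in peaks[i + 1 : i + 1 + fan_value]:
            dt = t2 - t1
            key = f"{f1}|{f2}|{dt}".encode()
            # Emit (hash, offset); the offset anchors the pair in the track
            yield hashlib.sha1(key).hexdigest(), t1
&lt;/code>&lt;/pre>
&lt;p>Raising &lt;code>fan_value&lt;/code> emits more fingerprints per song (more storage, better recall) &amp;ndash; the same trade-off discussed in the storage section below.&lt;/p>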
&lt;h2 id="learning-a-song-database-structure">Learning a Song: Database structure&lt;/h2>
&lt;p>Now we can get into how such a system works. An audio fingerprinting system has two tasks:&lt;/p>
&lt;ol>
&lt;li>Learn new songs by fingerprinting them&lt;/li>
&lt;li>Recognize unknown songs by searching for them in the database of learned songs&lt;/li>
&lt;/ol>
&lt;p>For this, we&amp;rsquo;ll use our knowledge thus far and MySQL for the database functionality. Our database schema will contain two tables:&lt;/p>
&lt;ul>
&lt;li>fingerprints&lt;/li>
&lt;li>songs&lt;/li>
&lt;/ul>
&lt;h3 id="fingerprints-table">Fingerprints table&lt;/h3>
&lt;p>The fingerprints table will have the following fields:&lt;/p>
&lt;div class="highlight">&lt;div style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
&lt;table style="border-spacing:0;padding:0;margin:0;border:0;">&lt;tr>&lt;td style="vertical-align:top;padding:0;margin:0;border:0;">
&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">2
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">3
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">4
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">5
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">6
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">7
&lt;/span>&lt;/code>&lt;/pre>&lt;/td>
&lt;td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">CREATE&lt;/span> &lt;span style="color:#66d9ef">TABLE&lt;/span> fingerprints (
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> hash binary(&lt;span style="color:#ae81ff">10&lt;/span>) &lt;span style="color:#66d9ef">not&lt;/span> &lt;span style="color:#66d9ef">null&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> song_id mediumint unsigned &lt;span style="color:#66d9ef">not&lt;/span> &lt;span style="color:#66d9ef">null&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">offset&lt;/span> int unsigned &lt;span style="color:#66d9ef">not&lt;/span> &lt;span style="color:#66d9ef">null&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">INDEX&lt;/span>(hash),
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">UNIQUE&lt;/span>(song_id, &lt;span style="color:#66d9ef">offset&lt;/span>, hash)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>);
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/td>&lt;/tr>&lt;/table>
&lt;/div>
&lt;/div>&lt;p>First, notice we have not only a hash and a song ID, but an offset. This corresponds to the time window in the spectrogram where the hash originated. This will come into play later when we need to filter through our matching hashes. Only the hashes that &amp;ldquo;align&amp;rdquo; will be from the true signal we want to identify (more on this in the &amp;ldquo;Fingerprint Alignment&amp;rdquo; section below).&lt;/p>
&lt;p>Second, we&amp;rsquo;ve made an &lt;code>INDEX&lt;/code> on our hash - with good reason. All of the queries will need to match that, so we need a really quick retrieval there.&lt;/p>
&lt;p>Next, the &lt;code>UNIQUE&lt;/code> index just ensures we don&amp;rsquo;t have duplicates. No need to waste space or unduly weight matching of audio by having duplicates lying around.&lt;/p>
&lt;p>If you&amp;rsquo;re scratching your head on why I used a &lt;code>binary(10)&lt;/code> field for the hash, the reason is that we&amp;rsquo;ll have a &lt;em>lot&lt;/em> of these hashes and cutting down space is imperative. Below is a graph of the number of fingerprints for each song:&lt;/p>
&lt;p>&lt;img src="https://willdrevo.com/static/img/dejavu/num_fingerprints.png" alt="Fingerprint counts">&lt;/p>
&lt;p>At the front of the pack is &amp;ldquo;Mirrors&amp;rdquo; by Justin Timberlake, with over 240k fingerprints, followed by &amp;ldquo;Blurred Lines&amp;rdquo; by Robin Thicke with 180k. At the bottom is the acapella &amp;ldquo;Cups&amp;rdquo;, which is a sparsely instrumented song - just voice and literally a cup. In contrast, listen to &amp;ldquo;Mirrors&amp;rdquo;. You&amp;rsquo;ll notice the obvious &amp;ldquo;wall of noise&amp;rdquo; instrumentation and arranging that fills out the frequency spectrum from high to low, meaning that the spectrogram abounds with peaks in high and low frequencies alike. The average is well over 100k fingerprints per song for this dataset.&lt;/p>
&lt;p>With this many fingerprints, we need to cut down on unnecessary disk storage at the hash value level. For our fingerprint hash, we&amp;rsquo;ll start by using a &lt;code>SHA-1&lt;/code> hash and then cutting it down to half its size (just the first 20 characters). This cuts our byte usage per hash in half:&lt;/p>
&lt;blockquote>
&lt;p>char(40) =&amp;gt; char(20) goes from 40 bytes to 20 bytes&lt;/p>
&lt;/blockquote>
&lt;p>Next we&amp;rsquo;ll take this hex encoding and convert it to binary, once again cutting the space down considerably:&lt;/p>
&lt;blockquote>
&lt;p>char(20) =&amp;gt; binary(10) goes from 20 bytes to 10 bytes&lt;/p>
&lt;/blockquote>
&lt;p>Much better. We went from 320 bits down to 80 bits for the &lt;code>hash&lt;/code> field, a reduction of 75%.&lt;/p>
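&lt;p>Concretely, the truncation and packing in Python look like this (a sketch, not the exact Dejavu code; the hashed key is a made-up peak pair):&lt;/p>
&lt;pre>&lt;code>import hashlib

full_hex = hashlib.sha1(b"523|1402|17").hexdigest()  # 40 hex chars = 160 bits
short_hex = full_hex[:20]        # keep the first 20 hex chars = 80 bits
blob = bytes.fromhex(short_hex)  # 10 raw bytes, fits a binary(10) column
assert len(blob) == 10
&lt;/code>&lt;/pre>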
&lt;p>My first try at the system, I used a &lt;code>char(40)&lt;/code> field for each hash - this resulted in over 1 GB of space for fingerprints alone. With &lt;code>binary(10)&lt;/code> field, we cut down the table size to just 377 MB for 5.2 million fingerprints.&lt;/p>
&lt;p>We do lose some of the information - our hashes will, statistically speaking, collide much more often now. We&amp;rsquo;ve reduced the &amp;ldquo;entropy&amp;rdquo; of the hash considerably. However, it&amp;rsquo;s important to remember that our entropy (or information) also includes the &lt;code>offset&lt;/code> field, which is 4 bytes. This brings the total entropy of each of our fingerprints to:&lt;/p>
&lt;blockquote>
&lt;p>10 bytes (hash) + 4 bytes (offset) = 14 bytes = 112 bits = 2^112 ~= 5.2+e33 possible fingerprints&lt;/p>
&lt;/blockquote>
&lt;p>Not too shabby. We&amp;rsquo;ve saved ourselves 75% of the space and still managed to have an unimaginably large fingerprint space to work with. Guarantees on the distribution of keys are a hard argument to make, but we certainly have enough entropy to go around.&lt;/p>
&lt;h3 id="songs-table">Songs table&lt;/h3>
&lt;p>The songs table will be pretty vanilla; essentially we&amp;rsquo;ll just use it to hold information about songs. We&amp;rsquo;ll need it to pair a &lt;code>song_id&lt;/code> to the song&amp;rsquo;s string name.&lt;/p>
&lt;div class="highlight">&lt;div style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">
&lt;table style="border-spacing:0;padding:0;margin:0;border:0;">&lt;tr>&lt;td style="vertical-align:top;padding:0;margin:0;border:0;">
&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">2
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">3
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">4
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">5
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">6
&lt;/span>&lt;span style="white-space:pre;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">7
&lt;/span>&lt;/code>&lt;/pre>&lt;/td>
&lt;td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sql" data-lang="sql">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">CREATE&lt;/span> &lt;span style="color:#66d9ef">TABLE&lt;/span> songs (
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> song_id mediumint unsigned &lt;span style="color:#66d9ef">not&lt;/span> &lt;span style="color:#66d9ef">null&lt;/span> auto_increment,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> song_name varchar(&lt;span style="color:#ae81ff">250&lt;/span>) &lt;span style="color:#66d9ef">not&lt;/span> &lt;span style="color:#66d9ef">null&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> fingerprinted tinyint &lt;span style="color:#66d9ef">default&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">PRIMARY&lt;/span> &lt;span style="color:#66d9ef">KEY&lt;/span> (song_id),
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">UNIQUE&lt;/span> &lt;span style="color:#66d9ef">KEY&lt;/span> song_id (song_id)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>);
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/td>&lt;/tr>&lt;/table>
&lt;/div>
&lt;/div>&lt;p>The &lt;code>fingerprinted&lt;/code> flag is used by Dejavu internally to decide whether or not to fingerprint a file. We set the bit to 0 initially and only set it to 1 after the fingerprinting process (usually two channels) is complete.&lt;/p>
&lt;h2 id="fingerprint-alignment">Fingerprint Alignment&lt;/h2>
&lt;p>Great, so now we&amp;rsquo;ve listened to an audio track, performed FFT in overlapping windows over the length of the song, extracted peaks, and formed fingerprints. Now what?&lt;/p>
&lt;p>Assuming we&amp;rsquo;ve already performed this fingerprinting on known tracks, ie we have already inserted our fingerprints into the database labeled with song IDs, we can simply match.&lt;/p>
&lt;p>Our pseudocode looks something like this:&lt;/p>
&lt;pre>&lt;code>channels = capture_audio()
fingerprints_matching = []
for channel_samples in channels:
    # fingerprint what we heard, then look the hashes up in the database
    hashes = process_audio(channel_samples)
    fingerprints_matching += find_database_matches(hashes)
predicted_song = align_matches(fingerprints_matching)
&lt;/code>&lt;/pre>
&lt;p>What does it mean for hashes to be aligned? Let&amp;rsquo;s think about the sample that we are listening to as a subsegment of the original audio track. Once we do this, the hashes we extract out of the sample will have an &lt;code>offset&lt;/code> that is &lt;em>relative&lt;/em> to the start of the sample.&lt;/p>
&lt;p>The problem of course, is that when we originally fingerprinted, we recorded the &lt;em>absolute&lt;/em> offset of the hash. The relative hashes from the sample and the absolute hashes from the database won&amp;rsquo;t ever match unless we started recording a sample from exactly the start of the song. Pretty unlikely.&lt;/p>
&lt;p>But while they may not be the same, we do know something about the matches from the real signal behind the noise. We know all the relative offsets will be the same distance apart. This requires the assumption that the track is being played and sampled at the same speed it was recorded and released in the studio. Actually, we&amp;rsquo;d be out of luck anyway if the playback speed were different, since this would affect the frequency of the playback and therefore the peaks in the spectrogram. At any rate, the playback speed assumption is a good (and important) one.&lt;/p>
&lt;p>Under this assumption, for each match we calculate a difference between the offsets:&lt;/p>
&lt;blockquote>
&lt;p>difference = database offset from original track - sample offset from recording&lt;/p>
&lt;/blockquote>
&lt;p>which will always yield a positive integer, since the database track will always be at least the length of the sample. All of the true matches will have this same difference. Thus our matches from the database are altered to look like:&lt;/p>
&lt;blockquote>
&lt;p>(song_id, difference)&lt;/p>
&lt;/blockquote>
&lt;p>Now we simply look over all of the matches and predict the song ID whose most common difference has the largest count. This is easy to imagine if you visualize it as a histogram.&lt;/p>
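&lt;p>That histogram vote is nearly a one-liner with &lt;code>collections.Counter&lt;/code> (a sketch; the tuple layout is illustrative):&lt;/p>
&lt;pre>&lt;code>from collections import Counter

def align_matches(matches):
    """matches: iterable of (song_id, db_offset, sample_offset) tuples
    returned by the database lookup."""
    votes = Counter(
        (song_id, db_offset - sample_offset)
        for song_id, db_offset, sample_offset in matches
    )
    (song_id, diff), count = votes.most_common(1)[0]
    return song_id, diff, count  # winning song, its offset, vote count
&lt;/code>&lt;/pre>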
&lt;p>And that&amp;rsquo;s all there is to it!&lt;/p>
&lt;h2 id="how-well-it-works">How well it works&lt;/h2>
&lt;p>To truly get the benefit of an audio fingerprinting system, recognition can&amp;rsquo;t take a long time. It&amp;rsquo;s a bad user experience, and furthermore, a user may decide to try to match the song with only a few precious seconds of audio left before the radio station goes to a commercial break.&lt;/p>
&lt;p>To test Dejavu&amp;rsquo;s speed and accuracy, I fingerprinted a list of 45 songs from the US VA Top 40 from July 2013 (I know, their counting is off somewhere). I tested in three ways:&lt;/p>
&lt;ol>
&lt;li>Reading the raw mp3 -&amp;gt; wav data from disk,&lt;/li>
&lt;li>Playing the song over the speakers with Dejavu listening on the laptop microphone, and&lt;/li>
&lt;li>Playing compressed, streamed music over my iPhone&amp;rsquo;s speakers.&lt;/li>
&lt;/ol>
&lt;p>Below are the results.&lt;/p>
&lt;h3 id="1-reading-from-disk">1. Reading from Disk&lt;/h3>
&lt;p>Reading from disk was an overwhelming 100% recall - no mistakes were made over the 45 songs I fingerprinted. Since Dejavu gets all of the samples from the song (without noise), it would be a nasty surprise if reading the same file from disk didn&amp;rsquo;t work every time!&lt;/p>
&lt;h3 id="2-audio-over-laptop-microphone">2. Audio over laptop microphone&lt;/h3>
&lt;p>Here I wrote a script to randomly choose &lt;code>n&lt;/code> seconds of audio from the original mp3 file to play and have Dejavu listen over the microphone. To be fair, I only allowed segments of audio that were more than 10 seconds from the start/end of the track to avoid listening to silence.&lt;/p>
&lt;p>Additionally my friend was even talking and I was humming along a bit during the whole process, just to throw in some noise.&lt;/p>
&lt;p>Here are the results for different values of listening time (&lt;code>n&lt;/code>):&lt;/p>
&lt;p>&lt;img src="https://willdrevo.com/static/img/dejavu/accuracy.png" alt="Matching time">&lt;/p>
&lt;p>This is pretty rad. For the percentages:&lt;/p>
&lt;table border="1" align="center" cellpadding="10">
&lt;tr align="center">
&lt;th>Number of Seconds&lt;/th>
&lt;th>Number Correct&lt;/th>
&lt;th>Percentage Accuracy&lt;/th>
&lt;/tr>
&lt;tr align="center">
&lt;td>1&lt;/td>
&lt;td>27 / 45&lt;/td>
&lt;td>60.0%&lt;/td>
&lt;/tr>
&lt;tr align="center">
&lt;td>2&lt;/td>
&lt;td>43 / 45&lt;/td>
&lt;td>95.6%&lt;/td>
&lt;/tr>
&lt;tr align="center">
&lt;td>3&lt;/td>
&lt;td>44 / 45&lt;/td>
&lt;td>97.8%&lt;/td>
&lt;/tr>
&lt;tr align="center">
&lt;td>4&lt;/td>
&lt;td>44 / 45&lt;/td>
&lt;td>97.8%&lt;/td>
&lt;/tr>
&lt;tr align="center">
&lt;td>5&lt;/td>
&lt;td>45 / 45&lt;/td>
&lt;td>100.0%&lt;/td>
&lt;/tr>
&lt;tr align="center">
&lt;td>6&lt;/td>
&lt;td>45 / 45&lt;/td>
&lt;td>100.0%&lt;/td>
&lt;/tr>
&lt;/table>
&lt;p>Even with only a single second, randomly chosen from anywhere in the song, Dejavu is getting 60%! Going from 1 to 2 seconds gets us to around 96%, while a perfect score required 5 or more seconds. Honestly, when I was testing this myself, I found Dejavu beat me - identifying a song from only 1-2 seconds heard out of context is pretty hard. I had even been listening to these same songs for two days straight while debugging&amp;hellip;&lt;/p>
&lt;p>In conclusion, Dejavu works amazingly well, even with next to nothing to work with.&lt;/p>
&lt;h3 id="3-compressed-streamed-music-played-on-my-iphone">3. Compressed streamed music played on my iPhone&lt;/h3>
&lt;p>Just to try it out, I tried playing music from my Spotify account (160 kbit/s compressed) through my iPhone&amp;rsquo;s speakers with Dejavu again listening on my MacBook mic. I saw no degradation in performance; 1-2 seconds was enough to recognize any of the songs.&lt;/p>
&lt;h2 id="performance-speed">Performance: Speed&lt;/h2>
&lt;p>On my MacBook Pro, matching was done at 3x listening speed with a small constant overhead. To test, I tried different recording times and plotted the recording time plus the time to match. Since the speed is mostly invariant of the particular song and more dependent on the length of the spectrogram created, I tested on a single song, &amp;ldquo;Get Lucky&amp;rdquo; by Daft Punk:&lt;/p>
&lt;p>&lt;img src="https://willdrevo.com/static/img/dejavu/matching_time.png" alt="Matching time">&lt;/p>
&lt;p>As you can see, the relationship is quite linear. The line you see is a least-squares linear regression fit to the data, with the corresponding line equation:&lt;/p>
&lt;blockquote>
&lt;p>1.364757 * record time - 0.034373 = time to match&lt;/p>
&lt;/blockquote>
&lt;p>Notice that since the matching itself is single threaded, the total time includes the recording time. This makes sense with the 3x speed in purely matching, as:&lt;/p>
&lt;blockquote>
&lt;p>1 (recording) + 1/3 (matching) = 4/3 ~= 1.33, which is close to the fitted slope of 1.364757&lt;/p>
&lt;/blockquote>
&lt;p>if we disregard the minuscule constant term.&lt;/p>
&lt;p>The overhead of peak finding is the bottleneck - I experimented with multithreading and realtime matching, and alas, it wasn&amp;rsquo;t meant to be in Python. An equivalent Java or C/C++ implementation would most likely have little trouble keeping up, applying FFT and peak finding in realtime.&lt;/p>
&lt;p>An important caveat is, of course, the round trip time (RTT) for making matches. Since my MySQL instance was local, I didn&amp;rsquo;t have to deal with the latency penalty of transferring fingerprint matches over the air. This would add RTT to the constant term in the overall calculation, but would not affect the matching process.&lt;/p>
&lt;h2 id="performance-storage">Performance: Storage&lt;/h2>
&lt;p>For the 45 songs I fingerprinted, the database used 377 MB of space for 5.4 million fingerprints. In comparison, the disk usage is given below:&lt;/p>
&lt;table border="1" align="center" cellpadding="10">
&lt;tr align="center">
&lt;th>Audio Information Type&lt;/th>&lt;th>Storage in MB&lt;/th>
&lt;/tr>
&lt;tr align="center">
&lt;td>mp3&lt;/td>&lt;td>339&lt;/td>
&lt;/tr>
&lt;tr align="center">
&lt;td>wav&lt;/td>&lt;td>1885&lt;/td>
&lt;/tr>
&lt;tr align="center">
&lt;td>fingerprints&lt;/td>&lt;td>377&lt;/td>
&lt;/tr>
&lt;/table>
&lt;p>There&amp;rsquo;s a pretty direct trade-off between the necessary record time and the amount of storage needed. Adjusting the amplitude threshold for peaks and the fan value for fingerprinting will add more fingerprints and bolster the accuracy at the expense of more space.&lt;/p>
&lt;p>It&amp;rsquo;s true, the fingerprints take up a surprising amount of space (slightly more than raw MP3 files). This seems alarming until you consider there are tens and sometimes hundreds of thousands of hashes per song. We&amp;rsquo;ve traded off the pure information of the entire audio signal in the wave files for about 20% of that storage in fingerprints. We&amp;rsquo;ve also enabled matching songs very reliably in five seconds, so our space / speed tradeoff appears to have paid off.&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>Audio fingerprinting seemed magical the first time I saw it. But with a small amount of knowledge about signal processing and basic math, it&amp;rsquo;s a fairly accessible field.&lt;/p>
&lt;p>My hope is that anyone reading this will check out the Dejavu Project and drop a few stars on me or, better yet, fork it! Check out Dejavu here:&lt;/p>
&lt;blockquote>
&lt;p>&lt;a href="https://github.com/worldveil/dejavu">https://github.com/worldveil/dejavu&lt;/a>&lt;/p>
&lt;/blockquote></description></item></channel></rss>