Downloading YouTube and Soundcloud Audio with Python and Pandas

December 17, 2014

As a way to cross-pollinate our music interests and listen to some new grooves, a friend of mine included me in a mixtape swap for 2014 – everyone picked one track to submit to the group. All of the tracks got put in a Google poll, and naturally it was easy enough to get them downloaded into a CSV and emailed out.

Rather than copying and pasting my way through the list to try to watch them all (or making a YouTube playlist which I couldn’t watch/listen to on my flight home), I decided to script it.

Turns out, it’s pretty easy in Python with the help of youtube-dl.

Caveat: this post is purely an exercise in knowledge, and you shouldn’t do anything illegal.

YouTube Downloader

youtube-dl is a great little open-source collaboration that allows you to pull video and audio from a number of different video sites.

If you install it with pip,

$ pip install youtube-dl

you can see all the sites currently supported (all 493 of them!):

$ youtube-dl --extractor-descriptions | wc -l
493
$ youtube-dl --extractor-descriptions | head -n 15
1up.com
220.ro
24video
3sat
4tube
56.com
5min
8tracks
9gag
abc.net.au
AcademicEarth:Course
AddAnime
AdultSwim
Aftonbladet
Allocine

You’ll find everything from Soundcloud to YouTube to YouPorn. Obviously I haven’t verified all these work, but the devs have made it quite easy to add new sources.

Conveniently, it’s also a nicely wrapped command line utility. However, I wanted to be able to use YouTube videos programmatically in a few other projects, so I wanted to be able to script it out.

Don’t forget: youtube-dl depends on the layout of sites on the web, so its extractors have to change whenever those sites do. Save yourself the trouble and periodically update the tool:

$ pip install --upgrade youtube-dl

Otherwise you might find that extractors stop functioning without any apparent reason.

Scripting `youtube-dl`

You can see all the options for youtube-dl, but beware, there are a ton. The good news is there’s a nice correspondence between the command line opts and the ones for scripting.

It’s quite easy to do something simple, like download a video of a bunny taking a shower:

import youtube_dl

options = {'outtmpl': '%(id)s'}  # save file as the YouTube ID
with youtube_dl.YoutubeDL(options) as ydl:
    ydl.download(['http://www.youtube.com/watch?v=ZHWZf1Z4B5k'])

Of course, if we only want the audio (as with a song), we’d rather not download the entire video file. We can adjust our options to download just the audio and convert it to the format of our choosing!

options = {
  'format': 'bestaudio/best',
  'outtmpl': '%(id)s.%(ext)s',    # name the file after the ID of the video
  'noplaylist': True,             # only download single song, not playlist
  'postprocessors': [{            # scripting counterpart of --extract-audio
    'key': 'FFmpegExtractAudio',  # only keep the audio
    'preferredcodec': 'mp3',      # convert to mp3
  }],
}

Note that the slashes in the format field express an order of preference – the first choice will be tried, and if it isn’t available, the next one (after the slash) will be used.
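
To make the fallback rule concrete, here is a toy model of how a selector such as 'bestaudio/best' is read. It only illustrates the left-to-right order of preference and is not youtube-dl’s actual selection code; pick_format and the available set are hypothetical names made up for this sketch.

```python
def pick_format(selector, available):
    """Return the first alternative in `selector` found in `available`.

    Toy illustration only: alternatives are separated by slashes and are
    tried left to right. Returns None when nothing matches.
    """
    for choice in selector.split("/"):
        if choice in available:
            return choice
    return None


# A source offering a separate audio stream gets the first preference...
print(pick_format("bestaudio/best", {"bestaudio", "best"}))
# ...while a source with only a combined stream falls back to the second.
print(pick_format("bestaudio/best", {"best"}))
```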

We also don’t want to automate the downloading of an entire playlist (though sometimes that comes in handy).

Also, some of the time youtube-dl isn’t able to extract just the audio, and will have to download the video first and then convert it. Under the hood, youtube-dl hands the conversion to ffmpeg (or avconv), so one of them needs to be installed. You can see the postprocessors here.
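
Since the conversion step leans on an external ffmpeg (or avconv) binary, it can save some head-scratching to check for one before starting a long batch. The following is a minimal sketch using only the standard library (shutil.which needs Python 3.3 or newer); find_converter is a hypothetical helper, not part of youtube-dl.

```python
import shutil


def find_converter(candidates=("ffmpeg", "avconv")):
    """Return the path of the first converter found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


converter = find_converter()
if converter is None:
    print("No ffmpeg/avconv found; audio conversion will likely fail.")
else:
    print("Using converter at %s" % converter)
```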

Metadata from YouTube videos

You can also use youtube-dl to get information about videos. The type and variety of fields you can extract depends on the source extractor, but let’s look at an example from YouTube.

import youtube_dl

# download metadata
ydl = youtube_dl.YoutubeDL()
r = None
url = "https://www.youtube.com/watch?v=hwsXo6fsmso"
with ydl:
    # don't download, much faster 
    r = ydl.extract_info(url, download=False)  

# print some typical fields
print("%s uploaded by '%s', has %d views, %d likes, and %d dislikes" % (
    r['title'], r['uploader'], r['view_count'], r['like_count'], r['dislike_count']))

outputs:

Skrillex - Ease My Mind with Niki & The Dove [AUDIO] uploaded by 'Skrillex', has 5318382 views, 54090 likes, and 1103 dislikes
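
Because the available fields depend on the extractor, formatting with %d can blow up when a count is missing or None. Here is one hedged way to build the same kind of summary defensively; describe is a hypothetical helper, and the sample dict merely stands in for whatever extract_info returns.

```python
def describe(info):
    """Build a one-line summary from an info dict, tolerating missing fields."""
    title = info.get("title") or "Unknown title"
    uploader = info.get("uploader") or "unknown uploader"
    parts = []
    for key, label in (("view_count", "views"),
                       ("like_count", "likes"),
                       ("dislike_count", "dislikes")):
        value = info.get(key)
        if value is not None:  # skip counts the extractor did not provide
            parts.append("%d %s" % (value, label))
    stats = ", ".join(parts) if parts else "no statistics available"
    return "%s uploaded by '%s': %s" % (title, uploader, stats)


# Made-up dict standing in for the result of extract_info().
sample = {"title": "Some Track", "uploader": "Some Artist", "view_count": 1200}
print(describe(sample))
```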

Putting it all together

Let’s go right from a CSV with pandas to mp3 files on disk. Here are a few lines from a hypothetical CSV:

Artist;Title;Link
Skrillex;Ease My Mind;https://www.youtube.com/watch?v=hwsXo6fsmso
Elvis Costello; Pump It Up; https://www.youtube.com/watch?v=tpprOGsLWUo
Donnie Trumpet & The Social Experiment; Sunday Candy; https://soundcloud.com/donnietrumpet/donnie-trumpet-the-social-experiment-sunday-candy
Mos Def; Mathematics; https://www.youtube.com/watch?v=m5vw4ajnWGA
Thelonious Monk; Friday the 13th; https://www.youtube.com/watch?v=NT9xGJvW13c
Hauschka; The Key; https://www.youtube.com/watch?v=D2HX3peUN8o

Reading it in is pretty simple:

In[1]: df = pd.read_csv("videos.csv", sep=";", skipinitialspace=True)
In[2]: df.loc[13]  # print out a single row
Out[2]: 
Artist                                         Tame Impala
Title                                             Elephant
Link           https://www.youtube.com/watch?v=B6CZTDCb7g4
Name: 13, dtype: object
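
Hand-assembled exports tend to pick up stray separators, and a row with one field too many is enough to make the parse fail. As a rough pre-flight check, the standard library’s csv module accepts the same delimiter and skipinitialspace settings and can flag rows whose width differs from the header. find_bad_rows and the sample text are made up for illustration.

```python
import csv
import io


def find_bad_rows(text, delimiter=";"):
    """Return (line_number, fields) for rows whose width differs from the header."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter,
                        skipinitialspace=True)
    header = next(reader)
    bad = []
    for fields in reader:
        if len(fields) != len(header):
            bad.append((reader.line_num, fields))
    return bad


# Made-up sample: the second data row carries an extra, empty field.
sample = (
    "Artist;Title;Link\n"
    "Artist A; Song A; https://example.com/a\n"
    "Artist B; Song B; ; https://example.com/b\n"
)
for line_num, fields in find_bad_rows(sample):
    print("line %d has %d fields: %r" % (line_num, len(fields), fields))
```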

Downloading will also take a while, so I want the script to cache results and know what it has already downloaded without doing too much work.

Here’s the quick and dirty script!

import youtube_dl
import pandas as pd
import os
import traceback

CSV = "videos.csv"

# create directory
savedir = "mixtape"
if not os.path.exists(savedir):
    os.makedirs(savedir)

def make_savepath(title, artist, savedir=savedir):
    return os.path.join(savedir, "%s--%s.mp3" % (title, artist))

# create YouTube downloader
options = {
    'format': 'bestaudio/best',       # choice of quality
    'outtmpl': '%(id)s.%(ext)s',      # name the file after the ID of the video
    'noplaylist': True,               # only download single song, not playlist
    'postprocessors': [{
        'key': 'FFmpegExtractAudio',  # only keep the audio
        'preferredcodec': 'mp3',      # convert to mp3 (needs ffmpeg or avconv)
    }],
}
ydl = youtube_dl.YoutubeDL(options)

with ydl:

    # read in videos CSV with pandas
    df = pd.read_csv(CSV, sep=";", skipinitialspace=True)
    df.Link = df.Link.str.strip()  # strip space from URLs

    # for each row, download
    for _, row in df.iterrows():
        print("Downloading: %s from %s..." % (row.Title, row.Link))

        # download location, skip anything that is already there
        savepath = make_savepath(row.Title, row.Artist)
        if os.path.exists(savepath):
            print("%s already downloaded, continuing..." % savepath)
            continue

        # download the audio, then move the converted file into place
        try:
            result = ydl.extract_info(row.Link, download=True)
            os.rename(result['id'] + ".mp3", savepath)
            print("Downloaded and converted %s successfully!" % savepath)

        except Exception:
            print("Can't download audio! %s\n" % traceback.format_exc())
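
One thing the script glosses over is that the file name is built straight from the Title and Artist columns, so a slash in either one would be read as a directory separator and the rename would fail. A small cleaning step, sketched here with a hypothetical safe_name helper and a deliberately conservative set of characters, could be applied inside make_savepath.

```python
import re


def safe_name(text, replacement="_"):
    """Replace characters that are awkward in file names (a conservative guess)."""
    cleaned = re.sub(r'[\\/:*?"<>|]+', replacement, text.strip())
    return cleaned or replacement


# The slash would otherwise be treated as a directory separator.
print(safe_name("AC/DC"))
# Ordinary titles pass through unchanged.
print(safe_name("Pump It Up"))
```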

You can also do something similar on the command line:

Start by adding this to your ~/.bash_profile:

# downloading from youtube
alias ydl="youtube-dl --extract-audio --audio-format wav"

Then whenever you open a new terminal, you can simply type ydl followed by any supported YouTube, Soundcloud, etc. link and it will automatically download into the current working directory:

$ ydl http://www.youtube.com/watch?v=ZHWZf1Z4B5k

I find the flexibility of keeping everything in Python and extracting metadata there is more powerful, however.

Conclusion

Super easy! Be aware that downloading audio sometimes fails (resulting in empty or corrupted files), so some kind of more thorough verification of the download and conversion is definitely needed.
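
As a starting point for that verification, a cheap check is to confirm the file exists and is not suspiciously small. This sketch does not prove the audio is playable; looks_downloaded is a hypothetical helper and the min_bytes threshold is an arbitrary value picked for illustration.

```python
import os
import tempfile


def looks_downloaded(path, min_bytes=1024):
    """Cheap sanity check: the file exists and is at least min_bytes long."""
    return os.path.isfile(path) and os.path.getsize(path) >= min_bytes


# Demonstration with throwaway files rather than real downloads.
with tempfile.TemporaryDirectory() as tmp:
    empty = os.path.join(tmp, "empty.mp3")
    open(empty, "wb").close()

    filled = os.path.join(tmp, "filled.mp3")
    with open(filled, "wb") as handle:
        handle.write(b"\0" * 2048)

    print(looks_downloaded(empty))   # an empty file fails the check
    print(looks_downloaded(filled))  # a 2 KB file passes it
```

In the script above, a failed check could simply delete the file so that the next run tries that track again.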

Will Drevo

Previously, Product @ Anyscale and ML @ Coinbase, co-founder of two startups, and research at Microsoft Research.

Studied CS & music composition at MIT, Machine Learning at MIT CSAIL, entrepreneur, traveler, and electronic music producer.

Follow me at itsdrevo.