MP3 Blogs and wget

07 Jul 2004

Here’s an interesting equation: Most bands and labels are posting free mp3s of their latest music on their sites. Add to that an army of fans scouring these sites daily, then blogging what they find. The result is a constant stream of new music being discovered, sorted, commented on, and publicized.

But how to keep up?

For a while, I just visited a couple of interesting and well-written mp3 blogs, but then they’d link to a couple more, and I’d start reading those. And then that happened a few dozen more times. My desire to stay in touch was in conflict with my increasingly limited free time.

Wget to the rescue. It’s a utility for unix/linux/etc. that goes and gets stuff from Web and FTP servers – kind of like a browser but without actually displaying what it downloads. And since it’s one of those awesomely configurable command line programs, there is very little it can’t do. So I run wget, give it the URLs to those mp3 blogs, and let it scrape all the new audio files it finds. Then I have it keep doing that on a daily basis, save everything into a big directory, and have a virtual radio station of hand-filtered new music. Neat.

Here’s how I do it:

wget -r -l1 -H -t1 -nd -N -np -w5 -A.mp3 -erobots=off -i ~/mp3blogs.txt

And here’s what this all means:

-r -H -l1 -np These options tell wget to download recursively. That means it goes to a URL, downloads the page there, then follows every link it finds. The -H tells the app to span domains, meaning it should follow links that point away from the blog. And the -l1 (a lowercase L with a numeral one) means to only go one level deep; that is, don’t follow links on the linked site. In other words, these options work together to ensure that you don’t send wget off to download the entire Web – or at least as much as will fit on your hard drive. Rather, it will take each blog from your list, follow the links on that page, and stop there. The -np switch stands for “no parent”, which instructs wget to never follow a link up to a parent directory.

We don’t, however, want all the links – just those that point to audio files we haven’t yet seen. Including -A.mp3 tells wget to only download files that end with the .mp3 extension. And -N turns on timestamping, which means wget won’t download something with the same name unless it’s newer.

To keep things clean, we’ll add -nd, which makes the app save everything it finds in one directory, rather than mirroring the directory structure of linked sites. And -erobots=off tells wget to ignore the standard robots.txt files. Normally, this would be a terrible idea, since we’d want to honor the wishes of the site owner. However, since we’re only grabbing one file per site, we can safely skip these and keep our directory much cleaner. Also, along the lines of good net citizenship, we’ll add the -w5 to wait 5 seconds between each request so as not to pound the poor blogs. The -t1 limits wget to a single try per file, so a flaky server doesn’t hold up the whole run.

Finally, -i ~/mp3blogs.txt is a little shortcut. Typically, I’d just add a URL to the command line with wget and start the downloading. But since I wanted to visit multiple mp3 blogs, I listed their addresses in a text file (one per line) and told wget to use that as the input.
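If the single-letter switches are hard to keep straight, here is a rough sketch of the same command using wget's long option names, along with what the input file might look like. The two example.com addresses are placeholders I made up, not real mp3 blogs, and the function names are just for illustration.

```shell
#!/bin/sh
# Sketch only: the example.com addresses are placeholders, not real mp3 blogs.

# Write a sample input file with one blog URL per line.
write_sample_list() {
    cat > "$1" <<'EOF'
http://blog-one.example.com/
http://blog-two.example.com/
EOF
}

# The options from this post spelled out in long form,
# including the 5-second wait between requests.
fetch_mp3s() {
    wget --recursive --level=1 --span-hosts --no-parent \
         --accept=.mp3 --timestamping \
         --no-directories --execute robots=off --wait=5 \
         --tries=1 \
         --input-file="$1"
}

# Usage (nothing runs automatically):
#   write_sample_list ~/mp3blogs.txt
#   fetch_mp3s ~/mp3blogs.txt
```

The long names do exactly what the short ones do; they are just easier to read six months later.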

I put this in a cron job, run it every day, and save everything to a local directory. And since it timestamps, the app only downloads new stuff. I should probably figure out a way to import into iTunes automatically with a script and generate a smart playlist, so I can walk in, hit play, and have the music just go.
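For the cron part, something along these lines should work. This is a sketch rather than my exact setup: the script path, the directory name, and the 6 a.m. schedule are all assumptions you would adjust to taste.

```shell
#!/bin/sh
# Hypothetical wrapper, saved as e.g. ~/bin/fetch-mp3s.sh (path is an assumption).
# Example crontab line to run it every day at 06:00:
#   0 6 * * * $HOME/bin/fetch-mp3s.sh run >/dev/null 2>&1

MUSIC_DIR="${MUSIC_DIR:-$HOME/mp3blog-music}"   # where downloads pile up (assumed name)
BLOG_LIST="${BLOG_LIST:-$HOME/mp3blogs.txt}"    # the list of blog URLs, one per line

fetch_daily() {
    # Create the download directory if needed, then work inside it,
    # because -nd makes wget save files into the current directory.
    mkdir -p "$MUSIC_DIR" && cd "$MUSIC_DIR" || return 1
    # -N compares timestamps against the files already here,
    # so only new or updated mp3s are fetched on each run.
    wget -r -l1 -H -t1 -nd -N -np -w5 -A.mp3 -erobots=off -i "$BLOG_LIST"
}

# Only download when called with "run", so the file can be sourced safely.
if [ "${1:-}" = "run" ]; then
    fetch_daily
fi
```

Running from inside the same directory every day is what lets the timestamping do its job, since wget checks against the files it finds there.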

The following are a couple of lists of mp3 blogs that you can use to find authors that match your musical tastes. Put their URLs in your text file and off you go.

mp3 blogs: defining fair use
Close Your Eyes: mp3 blogs