Downloading en masse

Railscasts has a podcast on iTunes which lets you download all past episodes, so this post is moot. But, I thought, what if it wasn't? What if I found a cool site with a bunch of media files I wanted to download? I knew wget could do this, but it had been a while. So here's what I did.
Get the links
First I downloaded the archive page. It's full of links to other pages, each of which has a .mov media file on it. What I really needed was a list of those pages. Ok, I fetched the archive's HTML source with curl into a file named archive.txt:
curl -o archive.txt http://railscasts.com/episodes/archive
However, this page is full of HTML source and all I need is the list of episodes. So a simple:
grep href archive.txt | grep episodes > urls_href.txt
gets me that. But it's still full of text that looks like this:
<a href="/episodes/195-my-favorite-web-apps-in-2009">My Favorite Web Apps in 2009</a>
Clean the links
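Ok now I just need to trim all this crap out. First, there's a bunch of whitespace out front in the links. Let's use sed to edit the file in place. This works on OSX and Linux but won't work on Solaris (inline boo). Give -i an explicit backup suffix: written as -ie, the e is taken as the suffix and you end up with a stray urls_href.txte file.
sed -i.bak -e 's/^[ ]*//' urls_href.txt
Now those spaces out front are gone.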
Ok, now we need to trim down to the relative link to the episode. We want a URL that looks like:
http://railscasts.com/episodes/30-pretty-page-title
So at this point sed was failing me because the regex syntax is different than I'm used to. So let's switch to perl.
cat urls_href.txt | perl -e 'while(<>) { s/\<a\s*(.*)\>(.*)\<\/a\>/$1/; print}' > urls_href_clean.txt
We're almost there. We have URLs that look like this:
href="/episodes/17-habtm-checkboxes"
href="/episodes/16-virtual-attributes"
href="/episodes/15-fun-with-find-conditions"
We need a prefix of the domain and to get rid of those quotes:
cat urls_href_clean.txt | perl -e 'while(<>) { s/href\=/http\:\/\/railscasts.com/; s/\"//; s/\"//; print}' > urls_href_super_clean.txt
Despite our horribly unmaintainable "super" naming convention, we now have a text file full of URLs that looks like this:
http://railscasts.com/episodes/17-habtm-checkboxes
http://railscasts.com/episodes/16-virtual-attributes
http://railscasts.com/episodes/15-fun-with-find-conditions
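For what it's worth, the grep, sed and perl steps above can probably be collapsed into one short pipeline. This is only a sketch: it assumes a grep that supports -o (GNU and BSD do, stock Solaris does not), the function name is made up, and the sample archive.txt below is invented stand-in content rather than the real page.

```shell
# Made-up sample input standing in for the downloaded archive page.
workdir=$(mktemp -d)
cat > "$workdir/archive.txt" <<'EOF'
<html><body>
    <a href="/episodes/17-habtm-checkboxes">HABTM Checkboxes</a>
    <a href="/episodes/16-virtual-attributes">Virtual Attributes</a>
    <a href="/about">About</a>
</body></html>
EOF

# Keep only the episode href attributes, then swap the attribute
# wrapper for the domain and drop the trailing quote.
extract_episode_urls() {
    grep -o 'href="/episodes/[^"]*"' "$1" |
        sed -e 's|^href="|http://railscasts.com|' -e 's|"$||'
}

extract_episode_urls "$workdir/archive.txt" > "$workdir/urls.txt"
cat "$workdir/urls.txt"
```

The resulting urls.txt has one absolute URL per line, which is the shape wget -i expects.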
Scrape
Fire up wget using our text file as input (-i). Recurse (-r), go only two levels deep (-l2), skip files that aren't newer than the local copy by comparing timestamps (-N), resume partially downloaded files (-c), span hosts (-H), no directories (-nd), disrespect the robots.txt file (-erobots=off) and accept only .mov files (-A).
wget -r -erobots=off -l2 -H -Nc -nd -A.mov -i urls_href_super_clean.txt
You might want to create a working directory for this before you run it. And probably run it under "screen" if you have that command. After you do, you'll eventually have a directory full of railscasts. Or, you could just subscribe to their feed on iTunes. Getting it through iTunes is a lot easier but that wasn't the point.
Goddamn Solaris

Solaris is a rock-solid serious OS with a dependable hardware platform. Sun has always been solid technically. Sun pours a lot of good tech into Solaris. ZFS is amazing. DTrace is revolutionary. Zones are flexible and lightweight. They're doing good engineering all around in a crazy number of areas. But we're going to temporarily ignore those things and talk about why Solaris is horrifying. Hopefully Oracle can fix a number of these Goddamn Problems™. If not, maybe OpenSolaris will/has.
I'm open to counter-points except for the argument "but you can change it". I know a lot of this is configurable (like adding packages or editing the default files) but that's not the point. Linux comes out of the box "right" or, more specifically, easier/better. It's especially annoying because Sun boxes are so much more expensive and are sooo close to perfection. And many times the box isn't yours or the baseline isn't yours so you can't just add packages or reconfigure a global config file.
1. Default Bash Prompt.
Seriously, -bash-3.00# in 2010? Which box am I on? I don't know! What directory am I in? Every box, I edit /etc/profile to have export PS1='[\u@\h \W]\$ '
2. No `locate'.
Everyone does a find /. Yay. I love waiting 5 minutes to find one file. In Linux, I schedule updatedb at 2am and find files sub-second.
3. No `screen'.
Greatest command in the world and it's an optional install from sunfreeware. So I have a crappy default term program and no way to start processes except with nohup command &. Boo.
4. Solaris tar is stupid.
Can't have filenames greater than 100 characters. Ok, use gtar. Great. Gtar doesn't do bzip or gzip on the fly. Argh. Why can't Solaris just include and use exclusively GNU tar (as well as everything else GNU)?!
5. No network package installs
Yum is great. Aptitude is better. Emerge is slow and dangerous (try moving between major Gentoo profiles). But at least I can download packages from the net on the fly. Ubuntu even tells me what to install when a command isn't found!
6. You know what I meant Mr. Picky.
If I type ls file -l I get -l: No such file or directory and then a non-long file return. Because it thinks I meant "show me two files file and -l". When you type a command in Solaris and forget to put a switch on, you're doomed to Ctrl-A, insert the switch and make it look like ls -l file. In Linux, it knows what you meant. Most core commands are more flexible in this way under Linux. It's goddamn maddening when you go back to Solaris.
7. UFS is pathetic.
ZFS is pure bliss but it's tricky to put on your root partition. You can't do it during the default install and you have to migrate everything. Which is not only complicated but hard to do if you have UFS permissions all over the place (next point). Also, UFS logging should be the default. I don't want to do fsck checks if my box crashes. Really. W(hy)TF isn't logging the default? Who doesn't want journaling on?! What's the drawback?! *head asplode*
8. ACL translations
The problem with migrating is that fine-grained POSIX ACLs (UFS) aren't compatible with ZFS ACLs. You control POSIX ACLs on UFS with setfacl & getfacl. You control ACLs on ZFS with chmod. The two commands are very different in syntax. So you have to "port" them. Which is especially maddening if you do a `man get_acl'. There's a goddamn C system call that can translate the ACLs built into the OS! How do you think this works?
cp /zfs/my_file /ufs/
The OS doesn't strip the extended ACLs off the file when it appears in /ufs/my_file! The capability is in the OS but it's not exposed as a command or another acl_translate call. You'd have to write your own utility ... hmm. Maybe that's a project.
9. Embrace GCC plz
GCC can do a zillion things more than CC. Get rid of CC, embrace GNU. Keep format, keep whatever bios-replacement commands you want. Keep the nice stuff from Solaris but please don't try to compete with better GNU utilities that already are switch compatible with Solaris tools. Some of these GNU binaries are literally drop-in replacements with Solaris compatible switches. And even if not, make Solaris 11 a GNU platform milestone and let customers decide if they want to upgrade.
10. X11, CDE, openwin
I love the new /SP/console and LOM stuff. I love the console architecture. I love all the things that PCs can't do (unless you buy Vendor specific add-in cards). I do not love CDE (yes Gnome is there) and OpenWindows. /usr/openwin/bin/xclock should not be the path to xclock. xclock should be in my default /usr/bin/ path and Xorg should be the only X11 anything anywhere. I'm sick of not having Xnest and all the other awesome Linux standard tools.
11. netstat -pan doesn't work.
Solaris netstat won't show me the process number of a network port. This is heresy. Just add the goddamn -p. Do you know how useful it is to do something like netstat -pan | grep `ps -ef|grep -v grep |grep java | awk '{print $2}'` | grep LISTEN? Works on Linux. Bam! All the java listening ports. Done. Great for security, debugging and scripting.
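Since Linux netstat -p puts "PID/Program name" in the last column, the ps round trip can probably be skipped by matching on that column directly. A rough sketch, assuming Linux-style TCP lines on stdin; the function name and the sample output are made up so it runs without root or a JVM.

```shell
# Print the local address of every TCP socket a given program is listening on.
# Reads Linux "netstat -pan" style output on stdin; the program name is matched
# against the last column, which looks like "1234/java".
listening_ports() {
    awk -v prog="$1" '$6 == "LISTEN" && $7 ~ ("^[0-9]+/" prog "$") { print $4 }'
}

# Hypothetical sample output for illustration only.
sample='tcp        0      0 0.0.0.0:8080            0.0.0.0:*               LISTEN      1234/java
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      567/sshd
tcp        0      0 10.0.0.5:8080           10.0.0.9:51234          ESTABLISHED 1234/java'

printf '%s\n' "$sample" | listening_ports java
```

On a real box that would be netstat -pan | listening_ports java, run as root so other users' processes show up. Program names containing spaces would shift the columns and need more care.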
12. Solaris ps -ef is stupid.
Linux ps auxww is God. GNU ps rules. Holy ass this is annoying on Solaris.
13. No top.
All you get is prstat. Linux (some distros) has this awesome improvement on top called htop. Htop lets you nice, kill and highlight processes. It's really great but I'd be happy with top on default Solaris installs. Even my Mac has top on it. I'd love to know why adding a binary in /usr/bin is so hard. Does it break some other command called top? Do things detect top in the $PATH and set Linux mode to true?
*breathe*
Rants are great. By the end of the list, I realize I'm nitpicking. But day in and day out, I'm praying to the Sun gods that they develop the absolute crap out of Solaris and get an amazing milestone together for Solaris 11 or Solaris 12 or whatever. Something that will appeal to the ever-increasing Ubuntu and casual hacker community instead of reminiscing about legacy compatibility until x86 mediocrity makes the obscure Solaris way of doing things completely irrelevant.
I remember when Solaris didn't even have ssh by default. I guess this "behind" strategy continues. Maybe OpenSolaris can become the default on Sparc and people can stay behind until they feel like stepping into a different baseline. But even outside of packages and baselines, leaving nice things (like UFS logging) off by default is just maddening.
Subversion Aptitude Error
I used to have an SVN repository up and running. Then my server crashed. Nothing was really important so I never rebuilt it. However my server backup files referenced the SVN modules in apache. I suppose during the crazy rebuild time, I restored an old conf file that referenced dav_svn and I disabled the module by deleting the file.
So now I want to get SVN back because Git (while great) lacks a nice GUI. I'm not converting to SVN. I'm just going to put stuff in both places for a while and then merge them after playing with Versions.
Anyway, why you are here. You're getting an ERROR: Module dav_svn does not exist! when trying to install subversion with aptitude install libapache2-svn? That's what I was getting. I strace'd and googled it for a bit and nothing was working. Eventually I found the original files and put them in their place, and that seemed to resolve the package installation. However, aptitude still thinks the conf files are there even after removing. So this method will get you past the aptitude install and let you install/uninstall as you like (I tested install/uninstall about five times). And then it will remove your existing svn configs. So please don't have anything regarding mod_svn that you want to keep.
I found the originals from sysinf0.klabs.be.
Create or edit /etc/apache2/mods-available/dav_svn.conf
# dav_svn.conf - Example Subversion/Apache configuration
#
# For details and further options see the Apache user manual and
# the Subversion book.
#
# NOTE: for a setup with multiple vhosts, you will want to do this
# configuration in /etc/apache2/sites-available/*, not here.
#
# URL controls how the repository appears to the outside world.
# In this example clients access the repository as http://hostname/svn/
# Note, a literal /svn should NOT exist in your document root.
#<Location /svn>
# Uncomment this to enable the repository
#DAV svn
# Set this to the path to your repository
#SVNPath /var/lib/svn
# Alternatively, use SVNParentPath if you have multiple repositories under
# a single directory (/var/lib/svn/repo1, /var/lib/svn/repo2, ...).
# You need either SVNPath or SVNParentPath, but not both.
#SVNParentPath /var/lib/svn
# Access control is done at 3 levels: (1) Apache authentication, via
# any of several methods. A "Basic Auth" section is commented out
# below. (2) Apache <Limit> and <LimitExcept>, also commented out
# below. (3) mod_authz_svn is a svn-specific authorization module
# which offers fine-grained read/write access control for paths
# within a repository. (The first two layers are coarse-grained; you
# can only enable/disable access to an entire repository.) Note that
# mod_authz_svn is noticeably slower than the other two layers, so if
# you don't need the fine-grained control, don't configure it.
# Basic Authentication is repository-wide. It is not secure unless
# you are using https. See the 'htpasswd' command to create and
# manage the password file - and the documentation for the
# 'auth_basic' and 'authn_file' modules, which you will need for this
# (enable them with 'a2enmod').
#AuthType Basic
#AuthName "Subversion Repository"
#AuthUserFile /etc/apache2/dav_svn.passwd
# To enable authorization via mod_authz_svn
#AuthzSVNAccessFile /etc/apache2/dav_svn.authz
# The following three lines allow anonymous read, but make
# committers authenticate themselves. It requires the 'authz_user'
# module (enable it with 'a2enmod').
#<LimitExcept GET PROPFIND OPTIONS REPORT>
#Require valid-user
#</LimitExcept>
#</Location>
Create or edit /etc/apache2/mods-available/dav_svn.load
# Depends: dav
LoadModule dav_svn_module /usr/lib/apache2/modules/mod_dav_svn.so
LoadModule authz_svn_module /usr/lib/apache2/modules/mod_authz_svn.so
After putting those in:
sudo aptitude remove libapache2-svn
sudo aptitude install libapache2-svn
sudo /etc/init.d/apache2 restart
Now when you do:
sudo aptitude search libapache2-svn
c libapache2-svn - Subversion server modules for Apache
You'll see that annoying little 'c' from aptitude. That means that it's not installed but the config files are still hanging around. Purge the config files with:
sudo aptitude purge libapache2-svn
And you'll see:
aptitude search libapache2-svn
p libapache2-svn - Subversion server modules for Apache
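If the one-letter flags are hard to remember, a tiny helper can spell them out. This is just a sketch with a made-up function name; it only looks at the first character of an aptitude search line and covers the common states (i, c, p, v).

```shell
# Translate the first column of an "aptitude search" line into plain words.
describe_state() {
    case "$1" in
        i*) echo "installed" ;;
        c*) echo "removed, config files remain" ;;
        p*) echo "purged or never installed" ;;
        v*) echo "virtual package" ;;
        *)  echo "unknown state" ;;
    esac
}

describe_state 'c libapache2-svn - Subversion server modules for Apache'
describe_state 'p libapache2-svn - Subversion server modules for Apache'
```

The first call reports the leftover-config state and the second the clean state you get after the purge.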
Awk process shows up every night
A weird problem turned up at work towards the end of a Java development / identity management project. The symptom was that every night an awk process would show up in the process table with some weird program arguments.
It was similar to this:
ps auxww|grep awk
0 13076 12539 16 0 2456 264 pipe_w S ? 0:00 awk -v progname=/etc/cron.daily/logrotate progname { print progname ":\n" progname=""; } { print; }
Of course, it wouldn't happen when logrotate would be forced to run, it would simply happen at night only. So we had to wait a day, try something and wait again. So after many tests and troubleshooting we figured out that it was related to file permissions. This was a surprise. Why would file permissions cause something to hang forever like this? Permissions usually cause black/white problems like "cannot open file" or "horrific death exception".
I figured out why this happens and satisfied my inquisitiveness so that I could return to sanity. It has to do with run-parts. It has logic that finds the executable files in a directory, runs each one and pipes its output through awk. You can see the "awk -v progname" string inside the if statement near the bottom.
#!/bin/bash
# run-parts - concept taken from Debian
# keep going when something fails
set +e
if [ $# -lt 1 ]; then
    echo "Usage: run-parts <dir>"
    exit 1
fi
if [ ! -d $1 ]; then
    echo "Not a directory: $1"
    exit 1
fi
# Ignore *~ and *, scripts
for i in $1/*[^~,] ; do
    [ -d $i ] && continue
    # Don't run *.{rpmsave,rpmorig,rpmnew,swp} scripts
    [ "${i%.rpmsave}" != "${i}" ] && continue
    [ "${i%.rpmorig}" != "${i}" ] && continue
    [ "${i%.rpmnew}" != "${i}" ] && continue
    [ "${i%.swp}" != "${i}" ] && continue
    [ "${i%,v}" != "${i}" ] && continue
    if [ -x $i ]; then
        $i 2>&1 | awk -v "progname=$i" \
            'progname {
                print progname ":\n"
                progname="";
            }
            { print; }'
    fi
done
exit 0
...Mystery solved.
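To see what that awk program is actually for, it helps to run it on its own. The sketch below wraps the same awk program in a function; the function name and the job paths are made up for the example.

```shell
# Same awk program that run-parts uses, wrapped in a function.
label_output() {
    awk -v "progname=$1" \
        'progname {
            print progname ":\n"
            progname="";
        }
        { print; }'
}

# A job that prints something gets a one-time header with its name.
printf 'rotated 3 logs\nall done\n' | label_output /etc/cron.daily/example

# A silent job produces no output at all, not even the header.
printf '' | label_output /etc/cron.daily/quiet
```

So awk is only there to label whatever each cron job prints. It exits once it sees end-of-file on the pipe, that is, once everything holding the write end has closed it, which is presumably why it can linger in the process table while a job is stuck.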
Upgrade fest.
I'm following an upgrade guide on Gentoo's lovely doc site. GCC was majorly out of date (going from 3.3 to 4.1.1), and hopefully you can still read this after all is said and done.
Right now, apache is in a weird state and I need to emerge a ton of crap:
# /etc/init.d/apache2 restart
* Apache2 has detected a syntax error in your configuration files:
Syntax error on line 6 of /etc/apache2/modules.d/70_mod_php.conf:
Cannot load /usr/lib/apache2-extramodules/libphp4.so into server: libXrender.so.1: cannot open shared object file: No such file or directory
Need X11 and a million other things put back on. Cobwebs from leaving it alone for so long.
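One way to find out up front which libraries are gone, rather than one Apache restart at a time, is to filter ldd output for unresolved entries. A small sketch; the function name and the sample ldd output are invented for illustration.

```shell
# List the shared libraries that ldd could not resolve.
# Reads ldd output on stdin and prints one library name per line.
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Hypothetical ldd output, modelled on the Apache error above.
sample_ldd='    libxml2.so.2 => /usr/lib/libxml2.so.2 (0xb7e4a000)
    libXrender.so.1 => not found
    libc.so.6 => /lib/libc.so.6 (0xb7c1f000)'

printf '%s\n' "$sample_ldd" | missing_libs
```

Against the real module it would be ldd /usr/lib/apache2-extramodules/libphp4.so | missing_libs.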
lstat test
Following an extremely hard interview question, I went to `man lstat' and tried to code up a test based on the system documentation alone. It was not entirely successful; however, after a tip-off from an online resource I came up with this.
A courier4 upgrade snag.
I ran into a weird error with Thunderbird. When I would reply to all in an email, some people were CC'd. When I hit send, I got a relay error. I restarted courier-imap and did all sorts of stuff and eventually decided to update all my mail software on my Gentoo server.
It was a long shot but it actually fixed the CC: problem. Whether or not it directly fixed the CC: problem, I cannot prove.
I did a
# emerge courier-imap postfix
without enabling ~x86 unstable packages or bleeding edge stuff. It happily emerged and didn't work. I couldn't log into IMAP but postfix ran fine.
I was running courier3 and emerge picked up courier4.0.1. Postfix went from 2.0.4 to 2.2.5. Postfix did some upgrade bits on databases and config files (I think). But courier-imap was dead in the water.
I had (and still have) trouble getting courier to log more, such as debug messages. All I got was:
imapd: authentication error: Input/output error
That wasn't quite enough. But luckily this message was enough to run into a forum post:
authdaemond: /usr/lib/courier-authlib/libauthpam.so.0: undefined symbol: nscd_flush_cache
This is a shared library problem. Someone linked against a file I don't have, so it bombs at runtime. I found this bit via Google cache (the original hint has been removed..weird).
Now test the POP3 using any MUA. If you get an error message (with DEBUG_LOGIN=2)
libauthpam.so.0: undefined symbol: nscd_flush_cache
you are using broken version of courier-libpam. Try this:
# echo ">=net-libs/courier-authlib-0.57" >> /etc/portage/package.mask
# emerge courier-authlib
I have had problems with 0.57 versions, version 0.55 works fine.
I did exactly that. Masked 0.57 and re-emerged. I restarted a few services:
# /etc/init.d/courier-authlib restart
# /etc/init.d/courier-imapd restart
And was able to get to my IMAP mail again.
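A side note on the echo >> step: run it twice and the mask ends up in the file twice. A guarded append avoids that. This is a sketch with a made-up function name, and it writes to a scratch file rather than the real /etc/portage/package.mask.

```shell
# Append an atom to a Portage mask file unless that exact line is already there.
mask_atom() {
    atom=$1
    mask_file=$2
    grep -qxF -- "$atom" "$mask_file" 2>/dev/null ||
        printf '%s\n' "$atom" >> "$mask_file"
}

# Try it against a scratch file.
scratch=$(mktemp)
mask_atom '>=net-libs/courier-authlib-0.57' "$scratch"
mask_atom '>=net-libs/courier-authlib-0.57' "$scratch"   # second call is a no-op
cat "$scratch"
```

The grep flags ask for a quiet (-q), whole-line (-x), fixed-string (-F) match, so the >= in the atom is not treated as a pattern.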
Strip off tabs in vim
When you paste a block of text into a PuTTY window, many times you'll get an increasing number of leading tabs. Not so if you use gnome-terminal (IIRC). Quite annoying in a Windows world.
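Strip tabs and spaces from the current position to the end of the file with the following, where <tab> stands for a literal Tab keypress (strictly, \s already matches tabs, so the bracket part is optional):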
:.,$s/^[<tab>]*\s*//
Or perhaps you only want a small block in the middle of the file changed. First, turn on line numbers.
:set number
Then search and replace on specific line numbers (in this example lines 15 through 41).
:15,41s/^[<tab>]*\s*//
Then use Ctrl-V (down arrow or h,j,k,l keys to select block) and hit ">" to re-indent. Works much better than reformatting by hand.
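If the file isn't open in vim, roughly the same range-limited cleanup can be done from the shell with sed. A sketch, assuming a sed that understands POSIX character classes; the function name and sample text are made up.

```shell
# Remove leading spaces and tabs from lines FIRST through LAST of a file,
# printing the result to stdout (the file itself is left alone).
# Usage: strip_indent FILE FIRST LAST
strip_indent() {
    sed "${2},${3}s/^[[:space:]]*//" "$1"
}

# Made-up sample with the staircase indentation a bad paste leaves behind.
pasted=$(mktemp)
printf 'keep\n\tone\n\t\ttwo\n\t\t\tthree\n' > "$pasted"
strip_indent "$pasted" 2 4
```

For the vim example above, the equivalent call would be strip_indent yourfile 15 41.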
Verifying an ssh key fingerprint
I'm sure you have seen something like this when you have connected to an ssh host.
The authenticity of host 'host (1.2.3.4)' can't be established.
RSA key fingerprint is 44:99:ff:33:66:88:cc:66:aa:22:00:00:ee:11:99:33.
Are you sure you want to continue connecting (yes/no)?
Great. Now what? What to do with that cryptic garbage up top? Log into the box or call the admin over the phone and verify the key.
$ ssh-keygen -l -f /etc/ssh/ssh_host_rsa_key.pub
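Comparing sixteen hex pairs by eye is error-prone, so it may be worth letting the shell do it. The sketch below assumes an OpenSSH new enough to have the -E option (newer releases print SHA256 fingerprints unless told otherwise); both function names are made up.

```shell
# Print a public key's fingerprint as colon-separated MD5 hex, the format
# shown in the prompt above. The hash is requested explicitly with -E md5.
fingerprint_md5() {
    ssh-keygen -l -E md5 -f "$1" | awk '{ sub(/^MD5:/, "", $2); print $2 }'
}

# Succeed only if the key file's fingerprint equals the one you were shown.
# Usage: fingerprint_matches KEYFILE FINGERPRINT
fingerprint_matches() {
    [ "$(fingerprint_md5 "$1")" = "$2" ]
}
```

On the server, pass the host's public key file and the fingerprint from the prompt, for example fingerprint_matches "$keyfile" 44:99:ff:33:66:88:cc:66:aa:22:00:00:ee:11:99:33 && echo ok.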