I mirrored RubyGems just to see how big it would be. I used the rubygems-mirror gem. It’s pretty simple: cd into a directory with a lot of space (e.g. /opt/gems or something) and run the gem mirror command.
After a massive initial load of 155k gems, the mirror was about 45GB (as of now; it grows pretty quickly every week). The gem mirror command is smart enough to download only the deltas when you run it again:
$ gem mirror
Total gems: 170843
Fetching 16176 gems
Then I wanted to know the size of all the latest gems only. If I had to do a lazy sneakernet, this might be one method of grabbing a whole bunch of dependencies (of course this would never work). Regardless of that, I still wanted to know what percentage of ruby gems space is old versions.
So I wrote a Ruby program to find the latest version of every gem file and total up their sizes. I was not very happy with my experiments with #sort and #sort_by. The biggest problem is that it took about 63 HOURS to run. I knew it had lots of problems but I didn’t want to kill it. I wanted to see how badly it really ran.
I’m not going to post the actual code. You can see the old version at this git commit url. The basic gist of the crappy algorithm was something like this:
Find all the files in the gem mirror on the filesystem. Get the basename of each file (i.e. strip the path): /tmp/foo-0.1.gem -> foo-0.1.gem. Go through all the basenames (gem names) and find each one's gem family.
Here’s the problem. I had a massive flat list of 170k gem names, and I was running a find_all over it to sort the gems into gem families, so every family meant another scan of the list. For example: there might be foo-0.1.gem, foo-0.2.gem and foo-async-0.1.gem. In this example, there are two gem families among the three gems: foo and foo-async are two different gems with their own versions. Later on, I would:
Do a version compare. Push the latest version name to an array. Delete the gem family name from the gem_names array.
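The steps above have roughly the following shape. This is a sketch reconstructed for illustration, not the code from the old commit: the filename regex, the method name latest_slow and the use of Gem::Version for the version compare are all assumptions.

```ruby
require "rubygems"

# Assumed filename layout: name-version[-platform].gem
GEM_FILE = /\A(.+?)-(\d+(?:\.[0-9A-Za-z]+)*)(?:-(.+))?\.gem\z/

# Slow on purpose: every family triggers a full scan of what is left.
def latest_slow(gem_names)
  remaining = gem_names.grep(GEM_FILE) # ignore names that do not parse
  latest = []
  until remaining.empty?
    family = remaining.first[GEM_FILE, 1]
    # find_all walks the whole remaining array for each family
    members = remaining.find_all { |name| name[GEM_FILE, 1] == family }
    # Version compare: Gem::Version knows that 0.10 is newer than 0.2
    latest << members.max_by { |name| Gem::Version.new(name[GEM_FILE, 2]) }
    # Delete the family from the working list (another full pass)
    remaining -= members
  end
  latest
end

p latest_slow(%w[foo-0.1.gem foo-0.2.gem foo-async-0.1.gem])
```

On three names this is instant. With tens of thousands of families, each one pays for a scan of the whole list plus the delete, and that cost multiplies.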
Sounded good on paper. And then it took about 63 hours to run (227305.19 seconds) and the CPU was absolutely pegged the entire time. This algorithm was easy to come up with in IRB using a small test data set, but scaling up to the real use case completely sucked. So I pushed it to GitHub for versioning and rewrote the loop.
The latest version runs in 8.5 seconds and spits out a total size of all the latest ruby gems at 6.5GB. Of course, this information is useless since it’s not going to check compatibility or anything. I was just curious to know how much space is back versions.
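To answer the percentage question with the two rounded totals quoted in this post (45GB for the whole mirror, 6.5GB for the latest versions), here is the arithmetic as a quick sketch. The variable names are just for illustration.

```ruby
total_gb  = 45.0 # whole mirror, as quoted above
latest_gb = 6.5  # latest versions only, as quoted above

old_gb  = total_gb - latest_gb
old_pct = (old_gb / total_gb * 100).round(1)

puts "old versions: #{old_gb} GB (#{old_pct}%)"
# prints "old versions: 38.5 GB (85.6%)"
```

So roughly 86% of the mirror is back versions, give or take the rounding in the two inputs.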
The real key to the new version is the fact that I’m using a proper “grouped” data structure (Hash) instead of a massive flat Array. This allows the regexes and other operations to work on a smaller data set. The compound nature of the previous inefficiency is pretty amazing (hours to seconds).
So hopefully you can see from the above that a huge array from a File glob is flat, which makes regexes and grouping operations over it very time consuming. Ruby’s magic group_by method walks the array once and buckets the entries into a Hash keyed by gem family, and then it’s much easier to regex out versions and do other things within each small bucket.
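A minimal sketch of the grouped approach, under some assumptions: the filename regex, the method names latest_gem_files and total_size, and the use of Gem::Version are illustrative choices and not necessarily what the repo does.

```ruby
require "rubygems"

# Assumed filename layout: name-version[-platform].gem
GEM_NAME_PATTERN = /\A(.+?)-(\d+(?:\.[0-9A-Za-z]+)*)(?:-(.+))?\.gem\z/

# One pass to group paths by gem family, then pick the newest per family.
def latest_gem_files(paths)
  paths
    .select { |path| File.basename(path).match?(GEM_NAME_PATTERN) }
    .group_by { |path| File.basename(path)[GEM_NAME_PATTERN, 1] }
    .map do |_family, files|
      files.max_by { |path| Gem::Version.new(File.basename(path)[GEM_NAME_PATTERN, 2]) }
    end
end

# Sum the on-disk size in bytes of the given files.
def total_size(paths)
  paths.sum { |path| File.size(path) }
end

p latest_gem_files(%w[/tmp/foo-0.1.gem /tmp/foo-0.2.gem /tmp/foo-async-0.1.gem])
```

Against a real mirror the input would be something like Dir.glob("/opt/gems/**/*.gem"), where the path is whatever directory you mirrored into. One caveat of this sketch: platform-specific builds of the same version land in the same family and only one of them is kept.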
See below for the code inline or take a look at the github repo.
Algorithm win. Rubygem mirror size curiosity complete. 6.5GB is current gems out of 45GB (right now).