Schemaless Data Collection

Ruby — Dillon @ 10:32 am

I’ve had this idea for schemaless data collection for a while now. It seems like everyone is trying to ETL data in. Inevitably, people start writing mapper programs and documentation, trying to build a massive Rosetta stone. “They call person_name name? We’ll call everything name. Let’s write this all down. Person_name = name, so say we all.” What happens is a lot of investigation and work to determine their ENTIRE schema just to make a copy of it. In the case of XML parsing, sometimes I actually do need to know what the entire source looks like just so I can loop through it. What a pain. Not to mention that if the source changes formats, I have to do all this work again to re-understand their schema and change my mappers to reflect their change. Maybe I even have to do a mass migration on my end to bring everything up to date. Schemaless data collection lets you keep copying the data when the source changes, and can even tell you, historically, when the schema changed. In an RDBMS, this is impossible without blobs or something else horrible.

This example only shows the collection of the data. The advantage, as I will show, is that any kind of querying can be done later very easily. What I won’t show is that normalization can also be done later, in parallel and in batches, and that the whole thing lives in a horizontally scalable database. Of course, the catch is that you need keys and a structured data source to start with. This won’t work with CSV and other simple formats.

For example, say we wanted to load XML files into a database. MongoDB is great for this because it can make the coding dead simple: for each attribute and child element in the XML, create a matching attribute or child in the database document.
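As a rough sketch of that attribute-and-child idea, here is a small recursive converter built on REXML from Ruby’s standard distribution. The helper name element_to_hash, the "#text" key and the sample XML are all made up for illustration; this is not how Hash.from_xml (used later in this post) works internally, and its output will differ in the details.

```ruby
require 'rexml/document'

# Hypothetical helper (not part of any library): turn an XML element into a
# nested Hash. Attributes and child elements both become keys, and repeated
# child names are collected into an Array.
def element_to_hash(element)
  hash = {}
  element.attributes.each { |name, value| hash[name] = value }

  # Keep non-blank text of a non-leaf element under an assumed "#text" key.
  text = element.text.to_s.strip
  hash['#text'] = text unless text.empty?

  element.elements.each do |child|
    value = if child.has_elements? || child.attributes.size > 0
              element_to_hash(child)
            else
              child.text
            end

    if hash.key?(child.name)
      existing = hash[child.name]
      hash[child.name] = existing.is_a?(Array) ? existing << value : [existing, value]
    else
      hash[child.name] = value
    end
  end

  hash
end

# Made-up sample input, loosely shaped like a wiki export.
xml = <<~XML
  <page lang="en">
    <title>Ford Motor</title>
    <revision>
      <id>16130856</id>
      <comment>redir</comment>
    </revision>
    <tag>a</tag>
    <tag>b</tag>
  </page>
XML

root = REXML::Document.new(xml).root
doc = { root.name => element_to_hash(root) }
p doc
```

The point is only that no per-field mapping is written anywhere: whatever keys the source has end up in the document.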

In a normal database, I would have to parse out the XML file and create normalized rows all over the place or use blobs. Of course blobs are useless for search. Let me show you an example.

First, let’s take a look at the XML returned by a Wikipedia exporter URL (trimmed the XSD line a bit for readability):

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.5/" 
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
  xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.5.xsd" version="0.5" xml:lang="en">
  <siteinfo>
    <sitename>Wikipedia</sitename>
    <base>http://en.wikipedia.org/wiki/Main_Page</base>
    <generator>MediaWiki 1.18wmf1</generator>
    <case>first-letter</case>
    <namespaces>
      <namespace key="-2" case="first-letter">Media</namespace>
      <namespace key="-1" case="first-letter">Special</namespace>
      <namespace key="0" case="first-letter" />
      <namespace key="1" case="first-letter">Talk</namespace>
      <namespace key="2" case="first-letter">User</namespace>
      <namespace key="3" case="first-letter">User talk</namespace>
      <namespace key="4" case="first-letter">Wikipedia</namespace>
      <namespace key="5" case="first-letter">Wikipedia talk</namespace>
      <namespace key="6" case="first-letter">File</namespace>
      <namespace key="7" case="first-letter">File talk</namespace>
      <namespace key="8" case="first-letter">MediaWiki</namespace>
      <namespace key="9" case="first-letter">MediaWiki talk</namespace>
      <namespace key="10" case="first-letter">Template</namespace>
      <namespace key="11" case="first-letter">Template talk</namespace>
      <namespace key="12" case="first-letter">Help</namespace>
      <namespace key="13" case="first-letter">Help talk</namespace>
      <namespace key="14" case="first-letter">Category</namespace>
      <namespace key="15" case="first-letter">Category talk</namespace>
      <namespace key="100" case="first-letter">Portal</namespace>
      <namespace key="101" case="first-letter">Portal talk</namespace>
      <namespace key="108" case="first-letter">Book</namespace>
      <namespace key="109" case="first-letter">Book talk</namespace>
    </namespaces>
  </siteinfo>
  <page>
    <title>Ford Motor</title>
    <id>255240</id>
    <redirect />
    <revision>
      <id>16130856</id>
      <timestamp>2003-06-30T02:32:03Z</timestamp>
      <contributor>
        <username>Infrogmation</username>
        <id>4444</id>
      </contributor>
      <minor/>
      <comment>redir</comment>
      <text xml:space="preserve" bytes="32">#REDIRECT [[Ford Motor Company]]</text>
    </revision>
  </page>
</mediawiki>

Even though this page is just a redirect (see the text element at the end in mediawiki markup), it’s still very long. For a multitude of posts or documents, creating a mapper and tightly handling the document might be very annoying. Even worse, we might delay sucking data in because there is so much mapping work to do. We might even start writing documents detailing the source format, what we will call the attributes internally and document schema changes as they occur.

What I propose is to forget all that mapping and just load the document as-is. We will use a document database (MongoDB) to make this magic happen.

# we are going to intentionally use the vanilla mongo driver
require 'mongo'
require 'nokogiri'
require 'open-uri'
require 'active_support/core_ext' # from rails
 
include Mongo
pages = Connection.new('localhost', 27017).db('loadtest').collection('pages')
 
wikipedia_page = "http://en.wikipedia.org/wiki/Special:Export/Ford_Motor"
 
# noblanks drops whitespace-only text nodes; had problems without it
doc = Nokogiri::XML(open(wikipedia_page)) { |config| config.noblanks }
page = Hash.from_xml(doc.to_s)    # here's the magical method from rails
pages.insert page

Ok, this is pretty cool. In 10 lines of Ruby, I’m downloading an XML file and inserting it into a new collection called ‘pages’ in a database that hasn’t even been created yet (as long as MongoDB is running). Great! But it quickly falls apart.

If you run it again, you now have two documents (rows) in Mongo. Boo. Not to mention, if you actually query mongo you see this (trimmed the XSD line a bit for readability):

> db.pages.find()
{
   "_id":ObjectId("4ec6c07e5a498d64a1000001"),
   "mediawiki":{
      "xmlns":"http://www.mediawiki.org/xml/export-0.5/",
      "xmlns:xsi":"http://www.w3.org/2001/XMLSchema-instance",
      "xsi:schemaLocation":"http://www.mediawiki.org/xml/export-0.5.xsd",
      "version":"0.5",
      "xml:lang":"en",
      "siteinfo":{
         "sitename":"Wikipedia",
         "base":"http://en.wikipedia.org/wiki/Main_Page",
         "generator":"MediaWiki 1.18wmf1",
         "case":"first-letter",
         "namespaces":{
            "namespace":[
               "Media",
               "Special",
               {
                  "key":"0",
                  "case":"first-letter"
               },
               "Talk",
               "User",
               "User talk",
               "Wikipedia",
               "Wikipedia talk",
               "File",
               "File talk",
               "MediaWiki",
               "MediaWiki talk",
               "Template",
               "Template talk",
               "Help",
               "Help talk",
               "Category",
               "Category talk",
               "Portal",
               "Portal talk",
               "Book",
               "Book talk"
            ]
         }
      },
      "page":{
         "title":"Ford Motor",
         "id":"255240",
         "redirect":null,
         "revision":{
            "id":"16130856",
            "timestamp":"2003-06-30T02:32:03Z",
            "contributor":{
               "username":"Infrogmation",
               "id":"4444"
            },
            "minor":null,
            "comment":"redir",
            "text":"#REDIRECT [[Ford Motor Company]]"
         }
      }
   }
}

There’s a whole lot of metadata in there and really all I care about is the content (maybe). So in some cases, you might want to filter incoming data. I have to actually look at my data and pick which attributes I want. Now I’m tightly bound to the source document and have to worry about it changing etc.
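One way to keep an eye on the source changing is to compare the set of key paths between two stored documents. This is a minimal sketch, assuming the documents are plain nested hashes; key_paths is a hypothetical helper and the two sample documents are invented for the example.

```ruby
# Hypothetical helper: list every key path in a nested Hash/Array structure,
# using dots to join nested keys.
def key_paths(value, prefix = nil)
  case value
  when Hash
    value.flat_map do |key, child|
      path = prefix ? "#{prefix}.#{key}" : key.to_s
      [path] + key_paths(child, path)
    end
  when Array
    # Array elements share the path of the key that holds the array.
    value.flat_map { |child| key_paths(child, prefix) }
  else
    []
  end
end

# Invented documents: the second one renames revision.text to revision.body.
old_doc = { "title" => "Ford Motor", "revision" => { "id" => "1", "text" => "..." } }
new_doc = { "title" => "Ford Motor", "revision" => { "id" => "2", "body" => "..." } }

old_paths = key_paths(old_doc).uniq
new_paths = key_paths(new_doc).uniq

added   = new_paths - old_paths
removed = old_paths - new_paths

puts "added:   #{added.inspect}"
puts "removed: #{removed.inspect}"
```

Run against documents collected on different dates, a diff like this would show roughly when a key appeared or disappeared.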

But let’s do it anyway. All we have to do is change one line:

pages.insert page["mediawiki"]["page"]

Now our inserted document looks like this:

> db.pages.find()
{
   "_id":ObjectId("4ec6c1775a498d64f2000001"),
   "title":"Ford Motor",
   "id":"255240",
   "redirect":null,
   "revision":{
      "id":"16130856",
      "timestamp":"2003-06-30T02:32:03Z",
      "contributor":{
         "username":"Infrogmation",
         "id":"4444"
      },
      "minor":null,
      "comment":"redir",
      "text":"#REDIRECT [[Ford Motor Company]]"
   }
}
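Indexing straight into the parsed hash with ["mediawiki"]["page"] is where the coupling to the source format creeps in. As a hedged sketch, here are two ways to make that coupling behave predictably when a key goes missing; the small hash below is a stand-in for the parsed export, and the "article" key is invented to simulate a renamed element.

```ruby
# Stand-in for the parsed export; only the shape matters here.
page = {
  "mediawiki" => {
    "version" => "0.5",
    "page" => { "title" => "Ford Motor", "id" => "255240" }
  }
}

# Hash#dig (Ruby 2.3+) returns nil instead of raising when a key is missing.
wiki_page = page.dig("mediawiki", "page")
missing   = page.dig("mediawiki", "article", "title")

# Hash#fetch raises KeyError, so a source format change fails loudly.
error_message = nil
begin
  page.fetch("mediawiki").fetch("article")
rescue KeyError => e
  error_message = e.message
end

puts wiki_page["title"]
puts missing.inspect
puts error_message
```

Whether a silent nil or a loud error is better depends on whether you would rather skip odd documents or stop the import and look.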

Of course, we had to clear out the pages collection and re-run it. That’s just because we haven’t written any logic yet to check for existence. But let’s take a break here and talk about what we can do even with this piddly little bit of 10 lines of Ruby. We can query:

> db.pages.find({}, {'title':1} )
{ "_id" : ObjectId("4ec6c1775a498d64f2000001"), "title" : "Ford Motor" }
{ "_id" : ObjectId("4ecafffd5a498d0136000001"), "title" : "Nissan" }

Here we are just showing that we have two pages from Wikipedia stored. The title:1 is like SELECT title FROM pages; in SQL. So if we wanted to search on attributes, it’s pretty easy:

> db.pages.find({ title:/^N/ }, {title:1})
{ "_id" : ObjectId("4ecafffd5a498d0136000001"), "title" : "Nissan" }

The shell is pretty forgiving about quoting keys: 'title' and title both work.
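If the mongo shell syntax is unfamiliar, the same filter and projection can be mimicked over plain Ruby hashes. This is only an analogy for what the query means, not how MongoDB executes it; the in-memory array and the numeric _id values are placeholders.

```ruby
# Placeholder documents standing in for the 'pages' collection.
pages = [
  { "_id" => 1, "title" => "Ford Motor", "redirect" => nil },
  { "_id" => 2, "title" => "Nissan",     "redirect" => nil }
]

# Filter: like { title: /^N/ } in the shell.
matches = pages.select { |doc| doc["title"] =~ /^N/ }

# Projection: like { title: 1 }, which also returns _id by default.
# Hash#slice needs Ruby 2.5 or later.
projected = matches.map { |doc| doc.slice("_id", "title") }

p projected
```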

In the next part, we’ll dive into handling updates and other formats.

Rspec output formats

Ruby — Dillon @ 5:59 pm

Some examples of rspec2 output formats. If you are using Guard and Spork to speed up your test suite, you pass --format blah in the Guardfile. For example:

guard 'rspec', :version => 2, :cli => '--drb --color --format doc' do
  watch(%r{^spec/.+_spec\.rb$})
  ...
end

You can specify multiple formats with --format one --format two.

Anyway, here are some shots of what the output looks like:

--format doc

--format progress

--format nested --format progress

--format nested

--format html

HTML format is also the same as the Textmate format. I couldn’t get the output to go to a file like the documentation says. Maybe it hasn’t been updated for rspec2?

Messing with Method Missing

Ruby — Dillon @ 11:46 am


We’re going to play with method_missing and, to a lesser extent, monkey patching. All of this code is designed to work in one source file or irb session. It will run procedurally from beginning to end, so you can copy it in pieces into a single .rb file or follow along in irb. No need to break it out into separate files or restart irb.

We’ll start with a simple person class that is initialized with a name and an age.

class Person
  attr_accessor :name, :age, :problems
 
  def initialize(name, age)
    @name = name
    @age = age
  end
end

Creating a person is as simple as passing “James” and 99 as arguments.

puts Person.new("James", 99).inspect
# => #<Person:0x007f9e2a136878 @name="James", @age=99>

Now I have a Person object as expected. The twist comes when you look at this line by itself and realize that you can’t tell what 99 is. Is it the age? Is it the problems? Of course, you might opt to simply get rid of the constructor and set the instance variables manually. But let’s try something fancier.

First, we’ll try to invoke a method that doesn’t exist.

begin
  Person.create_with_name_and_problems("James", 99)
  # => undefined method `create_with_name_and_problems'
  #    for Person:Class (NoMethodError)
rescue NoMethodError
  # just continue
end

We will get an exception here. For the sake of our single source file, we’ll catch the exception and continue.

So when we try to call #create_with_name_and_problems, so that 99 is clearly a problem count and not an age, the method doesn’t exist. We could create that method, but that’s not very scalable: we’d have to create every permutation of possible construction options.

Instead what we are going to do is use method_missing to handle calls to unknown methods and at the same time set the instance variables and return an object.

class Person
  def initialize
  end
 
  def self.method_missing(meth, *args, &block)
    puts "OH NO!  No method!"
    if meth.to_s =~ /^create_with_(.+)$/
      self.run_create_with_method($1, *args, &block)
    else
      super
    end
  end
 
  # Defined on the class (self.) so it matches the class-level method_missing.
  def self.respond_to?(meth, include_private = false)
    if meth.to_s =~ /^create_with_.+$/
      true
    else
      super
    end
  end
 
  def self.run_create_with_method(attrs, *args, &block)
    attrs = attrs.split('_and_')
    # #transpose will zip the two arrays together like so:
    #   [[:a, :b, :c], [1, 2, 3]].transpose
    #   # => [[:a, 1], [:b, 2], [:c, 3]]
    attrs_with_args = [attrs, args].transpose
    attributes = Hash[attrs_with_args]
    p = Person.new
    attributes.keys.each do |a|
      p.instance_variable_set "@#{a}", attributes[a]
    end
    return p
  end
 
end

First we reopen the Person class (monkey patch) and redefine a parameter-less initialize method. Next we create a method_missing method on the class object that looks for any method that starts with “create_with_”. If it does then it creates a new object with the correct instance variables set. Finally, the respond_to? method ensures that our Person class is advertising that #create_with_ methods are valid to outside calls.
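In current Ruby, the usual companion to method_missing is respond_to_missing? rather than an override of respond_to? itself, because it also lets Object#method return a Method object for the dynamic names. Here is a minimal sketch of that pairing on a separate, hypothetical Robot class (the class, its attributes and the PATTERN constant are invented so the Person example stays untouched). It also checks the argument count up front, since transposing attribute names against a different number of arguments would otherwise fail with a less helpful error.

```ruby
class Robot
  attr_reader :name, :model

  PATTERN = /\Acreate_with_(.+)\z/

  # Class-level hook: handles Robot.create_with_x_and_y(...) style calls.
  def self.method_missing(meth, *args, &block)
    match = PATTERN.match(meth.to_s)
    return super unless match

    attrs = match[1].split('_and_')
    unless attrs.size == args.size
      raise ArgumentError, "expected #{attrs.size} arguments, got #{args.size}"
    end

    robot = new
    attrs.zip(args).each do |attr, value|
      robot.instance_variable_set("@#{attr}", value)
    end
    robot
  end

  # Keeps respond_to? and method() consistent with method_missing.
  def self.respond_to_missing?(meth, include_private = false)
    PATTERN.match?(meth.to_s) || super
  end
end

robot = Robot.create_with_model_and_name("astromech", "R2")
puts robot.name
puts Robot.respond_to?(:create_with_name)
puts Robot.respond_to?(:destroy_all)
```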

Ok, so now our Person object is ready to be used. We can create James again this time with a name and a number of problems. We can even create a person with all three attributes and vary the order.

puts Person.create_with_name_and_problems("James",99).inspect
# <Person:0x007ffc8a835250 @name="James", @problems=99>
 
puts Person.create_with_age_and_problems_and_name(55, 99, "Jay-Z").inspect
# <Person:0x007ffc8b0ae990 @name="Jay-Z", @age=55, @problems=99>

So in actuality, this is a bit contrived. It’s cool to have these dynamic methods created for us, but doing it this way is a little too much work just to get parameterized constructors. The better way would be to use a hash for initialization. See below:

class Person
  attr_reader :name, :age, :problems
 
  def initialize args
    args.each do |k,v|
      instance_variable_set("@#{k}", v) unless v.nil?
    end
  end
end
 
p = Person.new(:name => "James", :age => 99, :problems => 99)
# <Person:0x007fda421044f0 @name="James", @age=99, @problems=99>
 
puts p.name       # James
puts p.age        # 99
puts p.problems   # 99
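On Ruby 2.0 and later, keyword arguments give much the same call-site readability as the hash, with the bonus that a misspelled key raises an error instead of silently setting a stray instance variable. A small sketch, using a hypothetical KeywordPerson class so it can live in the same file as the Person examples:

```ruby
class KeywordPerson
  attr_reader :name, :age, :problems

  # Every attribute is optional and defaults to nil.
  def initialize(name: nil, age: nil, problems: nil)
    @name = name
    @age = age
    @problems = problems
  end
end

james = KeywordPerson.new(name: "James", problems: 99)
puts james.name
puts james.problems
puts james.age.inspect

# A typo in a keyword is rejected with ArgumentError.
typo_error = nil
begin
  KeywordPerson.new(nmae: "James")
rescue ArgumentError => e
  typo_error = e.message
end
puts typo_error
```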
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.
(c) 2012 SQUARISM