This is a question I posted to Stack Overflow. It’s something I’ve been wondering about for a while. The polling nature of ETL has always bugged me, but now the parsing part of it seems annoying too.
ETL is pretty commonplace. Data is out there somewhere, so you go get it. After you get it, it’s probably in a weird format, so you transform it into something and then load it somewhere. The only problem I see with this method is that you have to write the transform rules. Of course, I can’t think of anything better. I suppose you could load whatever you get into a blob (SQL) or into an object/document (NoSQL), but then I think you’re just delaying the parsing. Eventually you’ll have to parse it into something structured (assuming you want to). So is there anything better? Does it have a name? Does this problem have a name?
Example
Ok, let me give you an example. I’ve got a printer, an ATM and a voicemail system. They’re all network enabled or I can give you connectivity. How would you collect the state from all these devices? For example, the printer dumps a text file when you type status over port 9000:
> status
===============
has_paper:true
jobs:0
ink:low
The ATM has a CLI after you connect on port whatever and you can type individual commands to get different values:
maint-mode> GET BILLS_1
[$1 bills]: 7
maint-mode> GET BILLS_5
[$5 bills]: 2
etc ...
The voicemail system requires certain key sequences to get any kind of information over a network port:
telnet> 7,9*
0 new messages
telnet> 7,0*
2 total messages
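To make the "transform rules" concrete, here is a rough Ruby sketch of what one rule per device might look like for the sample outputs above. The method names (`parse_printer`, `parse_atm`, `parse_voicemail`, `coerce`) and the resulting hash keys are made up for illustration, and the parsing assumes the output looks exactly like the samples, which real devices may not guarantee.

```ruby
# Hypothetical transform rules: one small parser per device format.
# Each turns the raw text a device returns into a flat Hash.

# Turn "true"/"false"/digits into Ruby values; leave everything else a String.
def coerce(value)
  case value
  when 'true'    then true
  when 'false'   then false
  when /\A\d+\z/ then value.to_i
  else value
  end
end

# Printer: "key:value" lines; the prompt and the "====" ruler have no colon.
def parse_printer(text)
  text.each_line.with_object({}) do |line, state|
    key, value = line.strip.split(':', 2)
    next if value.nil? || key.empty?
    state[key] = coerce(value)
  end
end

# ATM: a "GET <NAME>" command line followed by a "[label]: <count>" reply.
def parse_atm(text)
  state = {}
  pending = nil
  text.each_line do |line|
    line = line.strip
    if (m = line.match(/\Amaint-mode> GET (\w+)\z/))
      pending = m[1]
    elsif pending && (m = line.match(/\]:\s*(\d+)\z/))
      state[pending] = m[1].to_i
      pending = nil
    end
  end
  state
end

# Voicemail: replies of the form "<count> <kind> messages".
def parse_voicemail(text)
  text.scan(/^(\d+) (\w+) messages$/).to_h do |count, kind|
    ["#{kind}_messages", count.to_i]
  end
end

printer_dump = <<~'TXT'
  > status
  ===============
  has_paper:true
  jobs:0
  ink:low
TXT

p parse_printer(printer_dump)
```

For the printer dump this yields a hash with `has_paper` as `true`, `jobs` as `0` and `ink` as `"low"`. Even in this tiny form, every device needs its own rule, which is exactly the part that bugs me.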

I watched Aman Gupta’s (@tmm1) talk on debugging ruby from Ruby Conf 2010. He did a really good job of walking through real world examples of debugging he’s done on even core Rails code and other very mainstream libraries. The talk was very deep and he had to hurry along to get through all the slides. So after the talk was over I really wanted a one-pager with just some reminders of the tools he used.
I recommend watching his talk in full and then referring to this sheet. It might make only partial sense otherwise. There are also many other good videos from Ruby Conf 2010 on confreaks, as well as from other archived conferences. It’s made it into my watch list. DHH’s keynote at Ruby Conf ’10 is especially funny as he goes on about freedom and monkey patching (he calls it freedom patching).
Aman’s original slides about debugging ruby are also available.
Download: Debugging Ruby Cheat Sheet PDF

Ok, so some background. I’m curious about this post at lunarlogicpolska.com, which shows MySQL and MongoDB living together in harmony. Some data is in MySQL (nice and structured), some data is in a crazy-fast document database. It doesn’t matter. DataMapper combines many sources behind a common abstraction layer, and your models can pick and choose which one to use. At least that’s the dream. I was going to play around with this but hit a bit of a snag.
First, the post at lunarlogicpolska.com basically boils down to this bit:
require 'rubygems'
require 'dm-core'
DataMapper.setup(:default, "mysql://localhost/examples")
DataMapper.setup(:logs, "mongo://localhost/examples")
Unfortunately, this wouldn’t run on my box. I had installed my gems like so: `gem install dm-mongo-adapter --pre` and got a big ol’ problem while installing rdoc (weird):
RDoc::Parser::Ruby failure around line 220 of
lib/dm-core/query/conditions/operation.rb
[snip]
The internal error was:
(RDoc::Error) Name or symbol expected (got #)
ERROR: While generating documentation for dm-core-0.10.2
... MESSAGE: Name or symbol expected (got #)
Running the same gem install command again got rid of the rdoc error. Ok, no problem. This is a prototype I’m trying to build, so no sweat if rdoc installation is a bit flaky. Unfortunately, I didn’t get much farther than this with the actual code until I changed lunarlogicpolska’s example to this:
require 'mongo_adapter'
DataMapper.setup(:default, "mysql://user:pass@localhost/database")
DataMapper.setup(:logs, "mongo://host:port/database")
I found this out by playing around with irb. The order matters:
ruby-1.9.2-p136 :001 > require 'rubygems'
=> false
ruby-1.9.2-p136 :002 > require 'dm-core'
=> true
ruby-1.9.2-p136 :003 > DataMapper.setup(:logs, "mongo://localhost/examples")
LoadError: no such file to load -- dm-mongo-adapter
from <internal:lib/rubygems/custom_require>:29:in `require'
from <internal:lib/rubygems/custom_require>:29:in `require'
from adapters.rb:163:in `load_adapter'
from adapters.rb:133:in `adapter_class'
from adapters.rb:13:in `new'
from dm-core.rb:266:in `setup'
from (irb):3
from irb:16:in `<main>'
ruby-1.9.2-p136 :004 > require 'mongo_adapter'
LoadError: no such file to load -- mongo_adapter
from <internal:lib/rubygems/custom_require>:33:in `require'
from <internal:lib/rubygems/custom_require>:33:in `rescue in require'
from <internal:lib/rubygems/custom_require>:29:in `require'
from (irb):4
from irb:16:in `<main>'
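The first LoadError hints at how the adapter gets picked: a `mongo://` URI led to an attempted `require` of `dm-mongo-adapter`, so the URI scheme appears to select the library name. The sketch below only mimics that naming pattern with Ruby's standard `uri` library. It is not dm-core's actual code, and `adapter_library_for` and `connection_parts` are hypothetical helpers. Only the `mongo` case is confirmed by the trace above.

```ruby
require 'uri'

# Illustration only: derive the library name that the LoadError above
# suggests gets required for a given connection URI ("dm-<scheme>-adapter").
def adapter_library_for(connection_uri)
  scheme = URI.parse(connection_uri).scheme
  "dm-#{scheme}-adapter"
end

# Break a connection URI into the pieces an adapter would care about.
# The port is nil when the URI does not spell one out.
def connection_parts(connection_uri)
  uri = URI.parse(connection_uri)
  {
    adapter:  uri.scheme,
    host:     uri.host,
    port:     uri.port,
    database: uri.path.sub(%r{\A/}, '')
  }
end

puts adapter_library_for('mongo://localhost/examples')
p connection_parts('mongo://localhost:27017/examples')
```

Seen this way, the error is less mysterious: the name derived from the URI is not a file that can be loaded on my box, whatever the gem itself is called.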
You can see in the irb session above that I required dm-core first, but then it can’t find either dm-mongo-adapter or mongo_adapter. Quit irb and try it a different way:
> require 'mongo_adapter'
=> true
> DataMapper.setup(:logs, "mongo://localhost:27017/examples")
=> #<DataMapper::Mongo::Adapter:0x000000015b1a70 @name=:logs ...
Works great. I have a DataMapper::Mongo::Adapter object. Unfortunately, I didn’t get any farther than this because the current prerelease isn’t compatible with mongo 1.6. It just complains that :slave_ok needs to be set to true, even though I’m connecting to the master node of my replica set.
Mongo::ConfigurationError: Trying to connect directly to slave; if this is what
you want, specify :slave_ok => true.
So even though this isn’t a complete fix, this little test will be useful when the new drop of dm-mongo-adapter hits. And then I can figure out if DataMapper is as awesome as I think it is.