
Mortar Hawk: hadoop made easy

September 30, 2011

Yesterday a couple of guys from Mortar came to explain their hadoop platform. You can see a short demo here. I wanted to explain it at a really high level, because it’s cool and a big deal for someone like me: I’m not a computer scientist by training, and Mortar lets me work with huge amounts of data relatively easily. In other words, I don’t know what the interface will ultimately be for analytics people like me to get at massive data, but it will be something like this, if not this exactly.

To back up a second, for people who are nodding off, here’s the thing. If you have terabytes of data to crunch, you can’t just put it all on your computer, take a look, and crunch it, because your computer is too small. So you need to pre-crunch. That’s pretty much the problem we need to solve, and people have solved it in one of two ways.

The first is to put your data onto a big relational database, in the cloud or somewhere, and use SQL or some such language to do the crunching (and aggregating and what have you) until it’s small enough to deal with, then download it and finish it off on your computer. The second, called MapReduce (the idea started at Google) or hadoop (the open-source implementation, started at Yahoo), lets you work on the raw data directly where it lies, for example on the Amazon cloud, where it’s offered as Elastic MapReduce, a hosted version of hadoop. You do the work in iterative steps: map steps and reduce steps.
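
If the words “map” and “reduce” are meaningless to you, here’s a toy sketch in plain python (no hadoop involved, and the word-count example is the field’s “hello world,” not anything Mortar-specific): the map step turns each record into key-value pairs, and the reduce step combines all the values that share a key.

```python
from itertools import groupby
from operator import itemgetter

# Toy illustration of the map and reduce steps, in plain python.
# A real hadoop job would run these functions in parallel over
# chunks of the data and handle the shuffle/sort between them.

def map_step(record):
    """Emit a (key, value) pair for every word in a line of text."""
    for word in record.split():
        yield (word.lower(), 1)

def reduce_step(key, values):
    """Combine all values that share a key; here, just sum the counts."""
    return (key, sum(values))

if __name__ == "__main__":
    records = ["the quick brown fox", "the lazy dog", "the fox again"]

    # map: flatten every record into (word, 1) pairs
    pairs = [pair for record in records for pair in map_step(record)]

    # shuffle/sort: group pairs by key (hadoop does this part for you)
    pairs.sort(key=itemgetter(0))

    # reduce: one call per distinct key
    for key, group in groupby(pairs, key=itemgetter(0)):
        print(reduce_step(key, (count for _, count in group)))
```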

Actually there’s an argument to be made (I heard it at the Strata conference) that data scientists should never use hadoop at all and should always just use relational databases. However, that doesn’t seem economical, at least the way things are set up at my work. Please comment if you have an opinion about this, because it’s interesting to me how split the data science community seems to be on the issue.

On the other hand, if you can make using hadoop as easy as using SQL, then who cares? That’s kind of what’s happened with Mortar. Let me explain.

Mortar has a web-based interface with two windows: on top, a pig window, and on the bottom, a python editor. The pig window is in charge, and you can call python functions from the pig script once you’ve defined them in the editor below. Pig is something like SQL but procedural: you tell it when to join, when to aggregate, and which functions to apply in which order. Then pig figures out how to turn your code into map-reduce steps, including how many passes it needs. They say pig is good at this, but my guess is that if you really don’t know anything about how map-reduce works, it’s possible to write pig code that’s super inefficient.
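
To give a flavor of the python half, here’s a hypothetical UDF of the kind you’d write in the bottom editor and call from the pig script on top. Everything in it (the function name, the schema string, the URL parsing) is made up for illustration; the outputSchema decorator is the convention Pig’s jython support uses for declaring what a UDF returns.

```python
# A hypothetical python UDF of the kind you'd define in the bottom editor
# and call from the pig script on top. When Pig runs the script, its jython
# runtime injects the outputSchema decorator; the fallback below is only so
# the file can also be run on its own for a quick sanity check.
try:
    outputSchema  # provided by Pig's jython UDF runtime
except NameError:
    def outputSchema(schema_string):
        def decorator(func):  # no-op stand-in for local testing
            return func
        return decorator

@outputSchema("domain:chararray")
def extract_domain(url):
    """Pull the host out of a URL, e.g. 'http://example.com/a/b' -> 'example.com'."""
    if url is None:
        return None
    return url.split("://", 1)[-1].split("/", 1)[0]

if __name__ == "__main__":
    print(extract_domain("http://example.com/some/page"))  # example.com
```

In the pig script you’d then call something like myudfs.extract_domain(url) inside a FOREACH … GENERATE, though the exact wiring between the two windows is Mortar’s business.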

One cool feature, which I think comes from pig itself (pig has an ILLUSTRATE operator) but in any case is nicely surfaced in the Mortar interface, is that you can ask it to “illustrate” the resulting map-reduce code: it takes a small sample of your data and shows example rows (of “every type,” in a certain sense) at every step of the process. This is super useful for catching bugs, since you can check that everything looks right on a small data set before launching the real job.

The interface is well designed and easy to use. Overall it turns a pretty scary, giant data job into something I could probably get comfortable with in about a week. And new hires who know python can get up to speed really quickly.

There are some issues right now, but the Mortar guys seem eager to improve the product quickly. To name a few:

  • it’s not yet connected to git (although you can save pig and python code you’ve already run),
  • you can’t import most python modules, only super basic ones like math, and that goes for modules you’ve written yourself too; right now you have to copy and paste your code into their editor,
  • they won’t ever be able to let you import numpy, because they’re actually using jython under the hood and numpy is C-based,
  • it doesn’t automatically shut down the cluster after your job is finished, and
  • it doesn’t yet allow people to share a cluster.

These last two mean that you have to be pretty on top of your stuff, which is too bad if you want to start a job, leave for the night, bike home, feed your kids, and put them to bed. Which is kind of my style.

Please tell me if any of you know other approaches that give python-savvy (but not java-savvy) analytics nerds easy access to hadoop!

  1. October 2, 2011 at 12:59 am

    Assuming I want to work in Python, what is gained by going with a Hadoop implementation versus Python’s multiprocessing module?

  2. October 2, 2011 at 6:23 am

    I don’t know about python’s multiprocessing module; does that run on the cloud?

  3. October 2, 2011 at 10:13 am

    I’m not sure. A quick search didn’t turn up anything. There are instructions for using a “Manager” to run multiprocessing across multiple machines (http://docs.python.org/library/multiprocessing.html#using-a-remote-manager), but nothing cloud-specific, so I’m not sure how much of a pain it would be to use this with the available commercial clouds.

  4. October 7, 2011 at 12:05 pm

    Hadoop natively supports streaming to non-Java programs written in Python, Ruby, C, etc. You can use whatever language you like to write map and reduce jobs that read line-based input from STDIN and write line-based output to STDOUT. You lose access to some of the more advanced features of the Hadoop API and incur a performance penalty, but for basic MapReduce work this is a fine way to go.

    If you want to run Python code that is more tightly integrated into Hadoop, take a look at Pydoop, which enables direct communication between the Java and Python layers without having to serialize through the STD streams. I haven’t used this, but I’ve heard good things.

    Streaming runs as native Python (no translation to Jython) so you can use C extensions like numpy if you like. I don’t know if this is the case for Pydoop.
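
    For concreteness, here’s a minimal sketch of the streaming pattern described above, again doing a word count (the file name and the exact hadoop invocation will vary with your setup): hadoop feeds lines to the mapper on STDIN, sorts the mapper’s output by key, and feeds the sorted lines to the reducer.

    ```python
    #!/usr/bin/env python
    # Minimal sketch of a hadoop streaming job in pure python. Run this file
    # as "wordcount.py map" for the mapper and "wordcount.py reduce" for the
    # reducer; the file name and the word-count task are just illustrations.
    import sys

    def mapper():
        for line in sys.stdin:
            for word in line.split():
                print("%s\t%d" % (word.lower(), 1))

    def reducer():
        # Input arrives sorted by key, so counts for a word are contiguous.
        current_word, count = None, 0
        for line in sys.stdin:
            word, value = line.rstrip("\n").split("\t")
            if word != current_word:
                if current_word is not None:
                    print("%s\t%d" % (current_word, count))
                current_word, count = word, 0
            count += int(value)
        if current_word is not None:
            print("%s\t%d" % (current_word, count))

    if __name__ == "__main__":
        if len(sys.argv) > 1 and sys.argv[1] == "reduce":
            reducer()
        else:
            mapper()
    ```

    You can simulate the whole pipeline locally with cat input.txt | python wordcount.py map | sort | python wordcount.py reduce before handing the two commands to hadoop streaming.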

  5. October 7, 2011 at 12:13 pm

    Currently there is a broad debate over whether very large datasets are best handled by traditional table-based databases or by less structured, searchable document stores. I don’t know the specifics, but this divide infuses the whole of Big Data, not just questions of analysis methodology. NoSQL is the buzzword to start with.

  6. Johann Hibschman
    October 11, 2011 at 9:57 am

    I’m coming to this game late, but I’ve never been that impressed by the mapreduce-style frameworks. I mostly do model estimation. If I could sit a bunch of raw data on a compute node, then repeatedly send it a parameter vector and get a cost function back, I’d be happy. However, this kind of persistence seems hard to make happen. I know there are ways to do it, but it’s never seemed easy or like the first thing the framework developers had in mind. (Ideally, I want each node to suck a fraction of the data into RAM, then send it parameters and get costs back.)

    For now, though, I find that taking a random sample of the data and fitting on that works pretty well. I can run multiple samples and get some sense of the parameter error, etc.

    I did some work with kdb, which convinced me that column databases were the way to go for most analytical problems. Standard databases like Sybase always turn into a bottleneck.
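
    Not cloud-scale, but the keep-the-data-resident pattern described above can at least be sketched on one machine with python’s multiprocessing: each worker process loads its shard into RAM once, then answers repeated “what does this parameter vector cost on your shard?” requests over a pipe. The shard loader and the squared-error cost below are invented for illustration; distributing this across a cluster is exactly the part that’s still hard.

    ```python
    # Single-machine sketch: each worker loads its shard of data into RAM once,
    # then sits in a loop returning the partial cost of whatever parameter
    # vector the parent sends it. The shard loader and cost function are fake.
    from multiprocessing import Process, Pipe

    def load_shard(shard_id):
        """Stand-in for reading shard_id's slice of the raw data from disk."""
        return [(x, 2.0 * x + shard_id) for x in range(1000)]

    def partial_cost(data, theta):
        """Squared error of the model y = theta[0] + theta[1] * x on this shard."""
        return sum((y - (theta[0] + theta[1] * x)) ** 2 for x, y in data)

    def worker(shard_id, conn):
        data = load_shard(shard_id)      # pay the I/O cost once
        while True:
            theta = conn.recv()          # parameter vector from the parent
            if theta is None:            # sentinel: time to shut down
                break
            conn.send(partial_cost(data, theta))

    if __name__ == "__main__":
        n_workers = 4
        pipes, procs = [], []
        for shard_id in range(n_workers):
            parent_conn, child_conn = Pipe()
            p = Process(target=worker, args=(shard_id, child_conn))
            p.start()
            pipes.append(parent_conn)
            procs.append(p)

        # The optimizer loop would live here; send a few example vectors.
        for theta in [(0.0, 1.0), (0.5, 2.0), (1.5, 2.0)]:
            for conn in pipes:
                conn.send(theta)
            total = sum(conn.recv() for conn in pipes)
            print(theta, total)

        for conn in pipes:               # tell the workers to exit
            conn.send(None)
        for p in procs:
            p.join()
    ```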
