One language to rule them all

June 1, 2012

Right now there seems to be a choice one has to make between languages: either it’s a high-level language that a data scientist knows or can learn quickly, or it’s fast and/or production-ready.

So as the quant, I’ve gotten used to prototyping in MATLAB or Python and then, if what I have been working on goes into production, it typically needs to be explained to a developer and rewritten in Java or some such.

This is a pain in the ass for two reasons. First, it takes forever to explain. Second, if we later need to change it, it’s very painful to work with a different developer than the one who did the original rewrite, and people move around a lot.

Now that I’m working with huge amounts of data, it’s gotten even more complicated: there are three issues instead of two. The third is the map-reduce part of the modeling, where you move around and aggregate data, which, if you’re a data scientist, means some kind of high-level language like Pig.

Actually there are four issues: because the huge data is typically stored in the Amazon cloud or similar, there is also the technical issue of firing up nodes in a cluster and getting them to run the code and return the answers to a place where a data scientist can find them. This is kinda technical for your typical data scientist, at least one like me who specializes in model design, and has been solved only in specific situations, i.e. for specific languages (Elastic-R and Mortar Data are two examples; please tell me if you know more).

Is there a big-data solution where all the modeling can be done in one open source language and then go into production as is?

People have been telling me Clojure/Cascalog is the answer. But as far as I know there’s no super easy way to run this on the cloud. It would be great to see that happen.

  1. June 1, 2012 at 7:14 am

    You may want to take a look at languages like F# or OCaml. They are quite good at modelling (at least much better than Java), are used quite a lot in the financial world, and have performance similar to Java’s. I know people who prefer F# to Mathematica, Matlab and Python.
    Although the F# compiler is open source and runs on Linux using Mono, its development is managed by Microsoft and it probably runs best under .NET and on Microsoft’s cloud (Azure).

  2. Someone
    June 1, 2012 at 9:45 am

    Well, you could try using SageMath and then Cython; that should be fast enough, I guess? Or did you already include Cython when you talked about having used Python?

  3. Itamar Turner-Trauring
    June 1, 2012 at 11:06 am

    PyPy holds the promise of being a version of Python that can do math very, very quickly. It’s already mostly compatible with regular Python, and can do math orders of magnitude faster (the benchmarks on http://speed.pypy.org/ that tend to be faster are the ones involving calculations). They’re also working on a NumPy port.

  4. Marshall Quander
    June 1, 2012 at 1:14 pm

    Actually, Heroku can run your Clojure on the cloud, if you’re interested.

    http://blog.heroku.com/archives/2011/7/5/clojure_on_heroku/

  5. LT
    June 1, 2012 at 7:03 pm

    Julia maybe?
    http://julialang.org/

  6. June 3, 2012 at 2:47 pm

    I would say that you almost certainly want Cython. With Cython, you can write clean Python, then profile. When you find the slow parts, you can move them into their own module, annotate them with types, and then compile to machine code. It’s a tiny bit harder than some of the other options, but it doesn’t require a major code rewrite, and it integrates well with NumPy.

    PyPy is fast, but it does not yet integrate well with NumPy, so it’s pretty much out at this point.

    Here are a couple of videos on how this is done:
    http://pyvideo.org/video/614/high-performance-python-i
    http://pyvideo.org/video/620/high-performance-python-ii

    There’s also an in-depth tutorial on IPython on the same site. If you aren’t using the new IPython notebook for prototyping/exploring, you might want to look into it.
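
As an illustration of the "write clean Python, then profile" step described in this comment, here is a minimal sketch using only the standard library's cProfile and pstats modules. The function names (`slow_sum_of_squares`, `pipeline`, `profile_report`) are hypothetical and invented for this example; they do not come from the post or the linked videos.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n):
    """Deliberately naive pure-Python loop: the kind of hot spot a profile exposes."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def pipeline(n):
    """Hypothetical stand-in for a prototype: cheap setup plus one expensive step."""
    labels = [str(i) for i in range(100)]
    return len(labels), slow_sum_of_squares(n)


def profile_report(func, *args):
    """Run func under cProfile; return its result and a text report
    sorted by cumulative time (top 10 entries)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(10)
    return result, buffer.getvalue()


result, report = profile_report(pipeline, 200_000)
print(report)
```

In a report like this, the numeric loop should dominate the cumulative time. That function would then be the candidate to move into its own module and annotate for Cython, while the rest of the prototype stays as ordinary Python.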

  7. June 5, 2012 at 10:44 am

    This is exactly the problem our software company has spent three years working to solve.

    http://www.broadstreetanalytics.com/technology.html

    (The website is dated, but this conveys the idea).

    Our language, FORA, is a lot like Python in syntax and scripting capability. If you like Python (we do, too), you can learn FORA in an afternoon. But FORA is designed for JIT compilation, management of big data, and parallel computation. Working inside the FORA IDE, you can boot however many machines you need on Amazon EC2, and your code and computations are automatically distributed between machines.

    This is still early technology. If you have a concrete problem, and especially concerns about production-level reliability, we’re not a solution. You currently have to code ML libraries in FORA from scratch. But we’d love to hear from interested beta testers.

    Alex
    alex.leeds@broadstreetanalytics.com

  8. Simon Thornington
    June 12, 2012 at 11:42 am

    I’m just starting in data/computational analytics from a C++ background. What are the best resources (meetups, assemblies, etc.) in NYC to learn more about the field and the state of the art?

  9. June 18, 2012 at 11:56 am

    Cascalog and Clojure will work fine on Amazon’s Elastic MapReduce (as Cascalog just compiles to Hadoop MapReduce in the end). The nice thing about Cascalog is that you can run the queries in a REPL on a small dataset in memory and then use those same queries on a huge Hadoop cluster.

    Once I’ve built it you’ll be able to do the same on Mastodon C and it will be zero carbon. 😀
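
The idea in this comment, testing a query on a small in-memory dataset before running the same logic on a cluster, can be illustrated with a toy map-reduce in plain Python. This is only a sketch of the general pattern: it is not Cascalog, it does not use any Hadoop API, and every name in it (`map_record`, `reduce_values`, `run_local`, the sample data) is hypothetical.

```python
from collections import defaultdict


def map_record(record):
    """Map step: emit (key, value) pairs for one input record."""
    user, amount = record
    yield user, amount


def reduce_values(key, values):
    """Reduce step: aggregate all values seen for one key."""
    return key, sum(values)


def run_local(records, mapper, reducer):
    """Tiny in-memory driver: group mapper output by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


# Made-up sample data, small enough to inspect by hand.
sample = [("ann", 3), ("bob", 5), ("ann", 4)]
totals = run_local(sample, map_record, reduce_values)
print(totals)
```

The mapper and reducer know nothing about how they are driven, so in principle the same two functions could be handed to a distributed driver in place of `run_local`. That separation is what makes the test-small, run-big workflow possible.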
