Home > open source tools > R is mostly like python but sometimes like SQL

R is mostly like python but sometimes like SQL

January 17, 2013

I’m learning a bit of R in my current stint at ThoughtWorks. Coming from python, I was happy to see most of the plotting functions are very similar, as well as many of the vector-level data handling functions. Besides the fact that lists start at 1 instead of 0, things were looking pretty familiar.

But then I came across something that totally changed my mind. In R they have these data frames, which are like massive excel spreadsheets: very structured matrices with named columns and rows, on which you can perform parallelized operations.

One thing I noticed right away about these rigid data structures is that they make handling missing data very easy. So if you have a huge data frame where a few rows are missing a few data points, then one command, na.omit, gets rid of your problem. Sometimes you don’t even need that, you can just perform your operation on your NA’s, and you just get back more NA’s where appropriate.

This ease-of-use for crappy data is good and bad: good because it’s convenient, bad because you never feel the pain of missing data. When I use python, I rely on dictionaries of dictionaries (of dictionaries) to store my data, and I have to make specific plans for missing data, which means it’s a pain but also that I have to face up to bad data directly.

But that’s not why I think R is somewhat like SQL. It’s really because of how bad “for” loops are in R.

So I was trying to add a new column to my rather large (~65,000 row) dataframe. Adding a column is very easy indeed, if the value in the new column is a simple function of the values in the current columns, because of the way you can parallelize operations. So if the new value is the square of the first column value plus the second column value, it can do it on the whole columns all at once and it’s super fast.

In my case, though, the new value required a look-up in the table itself, which may or may not work, and then required a decision depending on whether it worked. For the life of me I couldn’t figure out how to do it using iterated “apply” or “lapply” functions in the existing dataframe. Of course it’s easy to do using a “for” loop, but that is excruciatingly slow.

Finally I realized I needed to think like a SQL programmer, and build a new dataframe which consisted of the look-up row, if it existed, along with a unique identifier in common with the row I start with. Then I merged the two dataframes, which is like a SQL join, using that unique identifier as the pivot. This would never happen in python with a dataset of this size, because dictionaries are very unstructured and fast.

Easy peasy lemon squeazy, once you understand it, but it made me realize that the approach to learning a new language by translating each word really doesn’t work. You need to think like a Parisian to really speak French.

Categories: open source tools
  1. January 17, 2013 at 7:47 am

    http://pandas.pydata.org/pandas-docs/dev/dsintro.html

    Pandas is a package that provides data frame support (among other things) in python. It builds on fast data structures from numpy. Not to say you dont need R if you know python, since as things stand R has arguably more mature ML/statistics support; ggplot2 too I think is prettier than matplotlib.

  2. Marcos
    January 17, 2013 at 8:00 am

    Pandas were recommended to me also (“Python for Data Analysis” is a good reference).

  3. Leon
    January 17, 2013 at 10:17 am

    I don’t know too much about the specifics R or the exact nature of the problem you were working on, but from your description and what I know about programming languages such as Scheme and Haskell I don’t understand why lapply wouldn’t work.

    R does have higher order functions, which means you can refer to variables defined in enclosing scopes, not just those variables defined inside the function itself. So you should be able to write something along the lines of

    lapply(frame, function(x): if predicate(frame[newindex(x)]) then … else …)

    Of course I’m making up a bit of syntax here because I don’t really know R, but from what I understand of the mechanisms available in the language, this should work.

    • January 17, 2013 at 10:19 am

      You are probably right, and I certainly didn’t prove that it wasn’t possible to do that way. Even so, it was almost trivial to do, and dramatically easier to read, when I realized I should build a supplementary dataframe and then merge. And it’s of course important for code to be readable as well as fast.

  4. January 17, 2013 at 10:24 am

    Would perhaps ‘ifelse’ in R have helped you?

    • January 17, 2013 at 10:26 am

      Possibly! But really I’m trying to make a point about thinking in R, not trying to solve this specific problem :)

  5. NotRelevant
    January 17, 2013 at 1:32 pm

    It only benefits you to speak and think like a Parisian if you are in Paris, which is to say that the modifications to how you think as required by the software’s advantages and limitations are not necessarily beneficial. But they probably are.

  6. January 17, 2013 at 8:27 pm

    I really appreciate and applaud your attempt to “think in R”. I get the sense that many people criticize R because they try to use it like Python, Java, C++, etc. without recognizing R wasn’t designed as an all-purpose programming language.

    One thing to keep in mind as you learn: for loops aren’t that slow. The poor performance is caused by data.frame subsetting. I.e. the for loop will be much quicker if you convert your data.frame to a list. Loops on atomic types are relatively quick too.

  7. lpsmith
    January 18, 2013 at 9:26 am

    Readability is in the eye of the beholder, to an extent. Lexical scope is well worth training yourself to read; it is very powerful and elegant once you get the hang of it. I googled around a bit and skimmed some references, here are the most reasonable looking I found:

    Section 10.7 of “An Introduction to R”

    “Lexical Scope and Statistical Computing” by Robert Gentleman and Ross Ihaka

    Closure on wikipedia

  8. Artjam
    January 18, 2013 at 7:24 pm

    For months I couldn’t get the “apply” functions at all, then I finally figured it out: apply(x, function(x) [here is the tricky part] actual_function(x), …)

    • bodanker
      January 19, 2013 at 1:11 pm

      You don’t *need* an anonymous function; e.g. this works:
      apply(x, 2, mean, trim=0.1, na.rm=TRUE)

      You can also use *apply functions by sending the subset to something other than the first argument of a function by specifying arguments by name:
      lapply(x, gsub, pattern=”^foo*”, replacement=”bar”, ignore.case=TRUE)

      That said, anonymous functions can be useful if you want to do more complicated manipulations.

      Also worth noting is that apply() converts ‘x’ to a matrix. So if you use apply() on a data.frame with multi-type columns, your data.frame will be converted to a matrix of the highest type (often character, see the Details sections of ?as.matrix).

  9. January 19, 2013 at 12:55 am

    Cathy, if you stick with dictionaries in Python you might be able to use `defaultdict` to give you those ‘None’ values without crashing your scripts… $0.02

  10. January 20, 2013 at 5:56 pm

    Well, when you discover that you’ve outgrown your desktop’s RAM, join the R High-Performance Computing mailing list and pick up a copy of “Parallel R”. ;-)

  11. SamGG
    January 23, 2013 at 5:29 pm

    If you think R, stop thinking at cell is the way.
    If you like SQL, take a look at plyr http://plyr.had.co.nz/
    It does statistics using a kind of GROUP BY

  12. April 4, 2013 at 10:13 am

    I missed this post. I’m quite the opposite from you. I first learned R, then Python and then SQL. I don’t think of R as SQL. Actually I found a bit hard to move from thinking the R way to thinking in the SQL way.

    The R way, I thinking, is to vectorize things. Think about it. You never see recursive functions in R. And that’s because you vectorize for loops, but not recursive functions.
    Also, don’t copy data.frames, since R internally make several redundant copies of data.frames. Ifelse is a vectorized way of solve your problem (as is lapply and alike).

    Last, but not least, and contrary what I said above, with the data.table package, things in R are more like SQL. Data.table is really fast, and it seems to me, the way to go in R whenever it’s possible.

  1. No trackbacks yet.
Comments are closed.
Follow

Get every new post delivered to your Inbox.

Join 1,538 other followers

%d bloggers like this: