R is mostly like python but sometimes like SQL
I’m learning a bit of R in my current stint at ThoughtWorks. Coming from python, I was happy to see most of the plotting functions are very similar, as well as many of the vector-level data handling functions. Besides the fact that lists start at 1 instead of 0, things were looking pretty familiar.
But then I came across something that totally changed my mind. In R they have these data frames, which are like massive excel spreadsheets: very structured matrices with named columns and rows, on which you can perform parallelized operations.
One thing I noticed right away about these rigid data structures is that they make handling missing data very easy. So if you have a huge data frame where a few rows are missing a few data points, then one command, na.omit, gets rid of your problem. Sometimes you don’t even need that, you can just perform your operation on your NA’s, and you just get back more NA’s where appropriate.
This ease-of-use for crappy data is good and bad: good because it’s convenient, bad because you never feel the pain of missing data. When I use python, I rely on dictionaries of dictionaries (of dictionaries) to store my data, and I have to make specific plans for missing data, which means it’s a pain but also that I have to face up to bad data directly.
But that’s not why I think R is somewhat like SQL. It’s really because of how bad “for” loops are in R.
So I was trying to add a new column to my rather large (~65,000 row) dataframe. Adding a column is very easy indeed, if the value in the new column is a simple function of the values in the current columns, because of the way you can parallelize operations. So if the new value is the square of the first column value plus the second column value, it can do it on the whole columns all at once and it’s super fast.
In my case, though, the new value required a look-up in the table itself, which may or may not work, and then required a decision depending on whether it worked. For the life of me I couldn’t figure out how to do it using iterated “apply” or “lapply” functions in the existing dataframe. Of course it’s easy to do using a “for” loop, but that is excruciatingly slow.
Finally I realized I needed to think like a SQL programmer, and build a new dataframe which consisted of the look-up row, if it existed, along with a unique identifier in common with the row I start with. Then I merged the two dataframes, which is like a SQL join, using that unique identifier as the pivot. This would never happen in python with a dataset of this size, because dictionaries are very unstructured and fast.
Easy peasy lemon squeazy, once you understand it, but it made me realize that the approach to learning a new language by translating each word really doesn’t work. You need to think like a Parisian to really speak French.