Data science: tools vs. craft

October 4, 2011

I’ve enjoyed how many people are reading the post I wrote about hiring a data scientist for a business. It’s been interesting to see how people react to it. One consistent reaction is that I’m just saying that a data scientist needs to know undergraduate level statistics.

On some level this is true: undergrad statistics majors can learn everything they need to know to become data scientists, especially if they also take some computer science classes. But I would add that it’s really not about familiarity with a specific set of tools that defines a data scientist. Rather, it’s about being a craftsperson (and a salesman) with those tools.

To set up an analogy: I’m not a chef because I know about casserole dishes.

By the way, I’m not trying to make it sound super hard and impenetrable. First of all I hate it when people do that and second of all it’s not at all impenetrable as a field. In fact I’d say it the other way: I’d prefer smart nerdy people to think they could become data scientists even without a degree in statistics, because after all basic statistics is pretty easy to pick up. In fact I’ve never studied statistics in school.

To get to the heart of the matter, it’s more about what a data scientist does with their sometimes basic tools than what the tools are. In my experience the real challenges are things like

  1. Defining the question in the first place: are we asking the question right? Is an answer to this question going to help our business? Or should we be asking another question?
  2. Once we have defined the question, we are dealing with issues like dirty data, too little data, too much data, data that’s not at all normally distributed, or data that is only a proxy for our actual problem.
  3. Once we manhandle the data into a workable form, we encounter questions like, is that signal or noise? Are the errorbars bigger than the signal? How many more weeks or months of data collection will we need to go through before we trust this signal enough to bet the business on it?
  4. Then of course we go back to: should we have asked a different question, one whose answer would not have been as perfect but that we definitely could have answered?
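
To make item 3 a bit more concrete, here is a minimal sketch of the "are the errorbars bigger than the signal?" check. It assumes the weekly measurements are roughly independent and that a normal approximation (z = 1.96) is good enough. The function names and the numbers are made up for illustration, not taken from any real project.

```python
import math
import statistics


def signal_vs_noise(weekly_effects, z=1.96):
    """Return (mean, standard error, True if the mean exceeds its error bar)."""
    n = len(weekly_effects)
    mean = statistics.mean(weekly_effects)
    std = statistics.stdev(weekly_effects)  # sample standard deviation
    stderr = std / math.sqrt(n)
    return mean, stderr, abs(mean) > z * stderr


def weeks_needed(weekly_effects, z=1.96):
    """Rough number of weekly observations needed before the error bar
    shrinks below the mean, assuming mean and spread stay as observed."""
    mean = statistics.mean(weekly_effects)
    std = statistics.stdev(weekly_effects)
    if mean == 0:
        return math.inf  # no signal at all: no amount of data will help
    return math.ceil((z * std / abs(mean)) ** 2)


# Hypothetical weekly lift (in percent) from some change we are evaluating.
weekly_lift = [0.8, -0.5, 1.9, 0.2, -1.1, 1.5]

mean, stderr, trusted = signal_vs_noise(weekly_lift)
print(f"mean={mean:.2f}, error bar=+/-{1.96 * stderr:.2f}, trusted={trusted}")
print("weeks of data needed (rough):", weeks_needed(weekly_lift))
```

With these made-up numbers the error bar is about twice the size of the signal, so the honest answer is "keep collecting data", and the second function gives a back-of-the-envelope guess at how long. Real data will rarely be this tidy, which is exactly the point of the list above.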

In other words, once we boil something down to a question in statistics it’s kind of a breeze. Even so, nothing is ever as standard as what you would find in a stats class: the chances of being asked a question that looks like a stats class problem are zero. You always need to dig deeply enough into your data and the relevant statistics to understand what the basic goal of that t-test or statistic was, and then modify the standard methodology so that it’s appropriate to your problem.

My advice to the business people is to get someone who is really freaking smart and who has also demonstrated the ability to work independently and creatively, and who is very good at communicating. And now that I’ve written the above issues down, I realize that another crucial aspect to the job of the data scientist is the ability to create methodology on the spot and argue persuasively that it is kosher.

A useful thing for this last part is to have broad knowledge of the standard methods and to be able to hack together a bit of the relevant part of each; this requires lots of reading of textbooks and research papers. Next, the data scientist has to actually understand the methods well enough to implement them in code. In fact the data scientist should try a bunch of things, to see what is more convincing and what is easier to explain. Finally, the data scientist has to sell it to everyone else.

Come to think of it, the same can be said about being a quant at a hedge fund. Since there’s money on the line, you can be sure that management wants you to be able to defend your methodology down to the tiniest detail (yes, I do think that being a quant at a hedge fund is a form of a data science job, and this woman agrees with me).

I would argue that an undergrad education probably doesn’t give enough perspective to do all of this, even though the basic mathematical tools are there. You need to be comfortable building things from scratch and dealing with people in intense situations. I’m not sure how to train someone for the latter, but for the former a Ph.D. can be a good sign, as can having taken on a creative project and really made something. They should also be super quantitative, but not necessarily a statistician.

  1. October 4, 2011 at 9:12 am

    Ahem, the person you referred to above as “this guy” is actually a woman. See here:


  2. October 4, 2011 at 9:13 am

    awesome! I’ve corrected the text, thanks!


  3. October 4, 2011 at 9:36 am

    I would add a few more things to Cathy’s post. As an undergrad, maybe liberal arts is more important than statistics (though some math is important). That may not mean a degree from a liberal arts school, but coursework and life experiences that show an interest in a lot of stuff. Liberal arts teaches you to identify problems, not just solve problems. Later specialization, like a PhD, is good, since it shows more technical skills and dedication, that is, the ability to solve hard problems. Yet, there is an on-the-job education component too. Look for someone who has around 5 years of work experience in a pressure-cooker, maybe even at times hostile, work environment. Make sure they can tell you of a few monumental screw-ups that they learned from. Are they still standing and smiling and playing with data? Well, that might be a good data scientist. On the soft skill side, look for someone who can finish a project…maybe they have outside interests like a family or a passionate hobby (read: they want to go home). As for the employer: data scientists can be a pain. They need an intellectual sandbox, a mentor, and an advocate. I liked Moneyball for many reasons, but I think it showed this well. Not every advocate needs to look like Brad Pitt (though it wouldn’t hurt), but his character did not give up on the geek even when there were losses. It seems to me that market-timing companies should steer clear of data scientists. It will not end well. Data scientists help (and like to help) value-driven companies (read: in it for the long haul). Of course, there is no magic formula. Job matches are tricky. But as my PhD advisor wisely said when I was looking for my first post-PhD job: “you only need one job”…and employers only need one employee per position (though you could help the economy and hire two).


  4. October 4, 2011 at 9:52 am

    Really really good comments, thanks!


  5. October 7, 2011 at 11:55 am

    I just mailed this around to my coworkers. It’s a concise description of issues we’re constantly grappling with.

