A good data scientist is hard to find
As a data scientist at an internet start-up, I am something of a quantitative handyman. I go where there is need for quantitative thinking. The business model of my company is super quantitative, which means I have lots of work. I have recently categorized the kinds of things I do into 4 bins:
- I visualize data for business people to digest. This is a kind of fancy data science-y way of saying I design reports. It’s actually a hugely critical part of the business, since our clients are less quantitative than we are and need to feel like they understand the situation, so clear, honest, and easily digestible visuals are a priority.
- I forecast behavior using models. This means I forecast what users on a website will do, based on their attributes and on what users with similar attributes have done in the past. I also do things like stress test the business itself, to answer questions like: what would happen to our revenue stream if one of our advertisers jumped out of the auction?
- I measure. This is where the old-school statistics comes in: deciding whether things are statistically significant and what our confidence intervals are. It’s related to reporting as well, but it’s a separate task.
- I help decide whether business ideas are quantitatively reasonable. Will there be enough data to answer this question? How long will we need to collect data to have a statistically significant answer to that? This is kind of like being a McKinsey consultant on data steroids.
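That last question, how long we’d need to collect data for a statistically significant answer, usually comes down to a back-of-the-envelope power calculation. Here’s a minimal sketch using the standard two-proportion normal approximation; the function name and the conversion rates are made-up illustrations, not numbers from my actual work:

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.8):
    """Per-arm sample size for a two-sided two-proportion z-test.

    Textbook normal-approximation formula: p1 is the baseline rate,
    p2 the rate we hope to detect.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value, e.g. ~1.96
    z_beta = NormalDist().inv_cdf(power)           # power quantile, e.g. ~0.84
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)

# Detecting a lift from a 2% to a 2.5% conversion rate takes over
# ten thousand users per arm -- which may mean weeks of traffic.
n = sample_size_two_proportions(0.02, 0.025)
```

The punchline for the business person is the denominator: halving the effect size you want to detect quadruples the sample you need, which is often the difference between a feasible experiment and a fantasy.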
So why is it so hard to find a good data scientist?
Here’s why. Most data scientists don’t really think that the last two items above (measuring and vetting business ideas) are their job. It is far less sexy to honestly find the confidence interval of a prediction than it is to model behavior. Data scientists are considered magical when they forecast behavior that was hitherto unknown, and they are considered total downers when they tell their CEO, “hey, there’s just not enough data to start that business you want to start,” or “hey, this data is actually really fat-tailed and our confidence intervals suck.”
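To make the fat-tailed complaint concrete, here’s a toy simulation of my own (not data from my actual job): draw samples from a Pareto distribution with infinite variance and check how often the textbook normal 95% confidence interval for the mean actually covers the true mean. It lands well below the advertised 95%.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(0)  # fixed seed so the toy run is reproducible

def pareto_sample(alpha, n):
    # Pareto(x_min=1) via inverse-CDF sampling: x = (1 - U)^(-1/alpha)
    return [(1 - random.random()) ** (-1 / alpha) for _ in range(n)]

# For alpha = 1.5 the mean is finite, alpha/(alpha-1) = 3,
# but the variance is infinite -- a genuinely fat tail.
true_mean = 3.0

hits, trials = 0, 2000
for _ in range(trials):
    xs = pareto_sample(1.5, 50)
    m, half = mean(xs), 1.96 * stdev(xs) / sqrt(len(xs))
    if m - half <= true_mean <= m + half:
        hits += 1

coverage = hits / trials  # noticeably less than the nominal 0.95
```

When the intervals you hand your CEO are “95%” in name only, saying so out loud is exactly the downer job I’m describing.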
In other words, it’s something like what the head of risk management had to face at a big bank taking risks in 2007. There’s a responsibility to warn people that too much confidence in the models is bad, but then there’s the political reality of the situation, where you just want to be liked and you don’t actually have the power to stop the relevant decisions anyway. And there’s the added issue at a start-up that the models are yours, and you want them to be liked (and to be invincible).
It’s far easier to focus on visualizing and modeling, or, to stay even sexier and more mystical, on modeling alone, and let the business make decisions that could ultimately not work out, or act on data that’s pure noise.
How do you select for a good data scientist? Look for one who speaks clearly and directly, and who emphasizes skepticism. Look for one who is ready to vent about how people trust models too much, and who is pushy enough to speak up at a meeting and be the annoying person who holds people back from drinking too much Kool-Aid.