Why and how to hire a data scientist for your business
Here are the annotated slides from my Strata talk. The audience consisted of business people interested in big data. Many of them were coming from startups that are newly formed or are currently being formed, and are wondering who to hire.
When do you need a data scientist?
When you have too much data for Excel to handle: data scientists know how to deal with large data sets.
When your data visualization skills are being stretched: as we will see, data scientists are skilled (or should be) at data visualization and should be able to figure out a way to visualize most quantitative things that you can describe with words.
When you aren’t sure if something is noise or information: this is a big one, and we will come back to it.
When you don’t know what a confidence interval is: this is related to the above; it refers to the fact that almost every number you see coming out of your business is actually an estimate of something, and the question you constantly face is, how trustworthy is that estimate?
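To make the idea of a confidence interval concrete, here is a minimal sketch in Python. It assumes you have a short list of daily revenue figures (the numbers below are made up for illustration) and uses a normal approximation, which is rough for a sample this small; a data scientist would likely pick a method suited to your data.

```python
import math
import statistics

def mean_confidence_interval(values, z=1.96):
    """Approximate 95% confidence interval for the mean (normal approximation)."""
    n = len(values)
    mean = statistics.mean(values)
    # Standard error of the mean: sample standard deviation / sqrt(n)
    stderr = statistics.stdev(values) / math.sqrt(n)
    return mean - z * stderr, mean + z * stderr

# Hypothetical daily revenue figures, for illustration only
daily_revenue = [1020.0, 980.0, 1100.0, 950.0, 1005.0, 990.0, 1045.0]
low, high = mean_confidence_interval(daily_revenue)
print(f"average daily revenue is roughly between {low:.0f} and {high:.0f}")
```

The width of the interval is the answer to "how trustworthy is that estimate?": a narrow interval means the average is pinned down well, a wide one means you should be careful about acting on it.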
Let’s take a step back: Should you need a data scientist?
Are you asking the right questions? Is there a business that you’re not in that you could be in if you were thinking more quantitatively? Big data is making things possible that weren’t just a few years ago.
Are you getting the most out of your data? In other words, are you sitting on a bunch of delicious data and not even trying to mine it for your business?
Are you anticipating shocks to your business? As we will see, data scientists can help you do this in ways you may be surprised at.
Are you running your business sufficiently quantitatively? Are you not collecting the data (or not collecting it in a centralized way) that would lead to opportunities for data mining?
So, you’ve decided to hire a Data Scientist (nice move!)
What do you need to get started?
Data storage. You gotta keep all your data in one place and in some unified format.
Data access: usually through a database (there are tradeoffs between the different types). Specifically, you can pay for someone else to run a convenient SQL database that people know how to use walking in the door without much training, or you could set up something that’s open source and “free”, but then it will probably take more time to set up and may take the data scientists longer to figure out how to use. The investment here is in creating tools that make the data convenient to use.
Larger-scale or less uniform data may require Hadoop access (and someone with real tech expertise to set it up). The larger your data is the more complicated and developed your skills need to be to access it. But it’s getting easier (and other people here at the conference can tell you all you need to know about services like this).
Who and how should you hire? It’s not obvious how to hire a data scientist, especially if your business so far consists of less mathematical people.
A math major? Perhaps a Masters in statistics? Or a Ph.D. in machine learning? If you’re looking for someone to implement a specific thing, then you just need proof that they’re smart and know some relevant stuff. But typically you’re asking more than that: you’re asking for them to design models to answer hard questions and even to figure out what the right questions are. For that reason you need to see that the candidate has the ability to think independently and creatively. A Ph.D. is evidence of this but not the only evidence: some people could get into grad school, or even go for a while, but decide they are not academically minded, and that’s okay (but you should be looking for someone who could have gotten a Ph.D. if they’d wanted to). As long as they went somewhere and challenged themselves and did new stuff and created something, that’s what you want to see. I’ll talk about specific skills you’d like in a later section, but keep in mind that these are people who are freaking smart and can learn new skills, so you shouldn’t obsess over something small like whether they already know SQL.
What should the job description include? Things like: super quantitative, can work independently, knows machine learning or time series analysis, knows data visualization and statistics, knows how to program, loves data.
Who even interviews someone like this? Consider getting a data scientist as a consultant just to interview a candidate to see if they are as smart as they claim to be. But at the same time you want to make sure they are good communicators, so ask them to explain their stuff to you (and ask them to explain stuff that has been on your mind lately too) and make sure they can.
Also: don’t confuse a data scientist with a software engineer! Software engineers focus on their craft and aren’t expected to be experts at the craft of modeling. Likewise, data scientists know how to program only in the sense that they typically know how to use a scripting language like Python to manipulate the data into a form where they can do analytics on it. They sometimes even know a bit of Java or C, but they aren’t software engineers, and asking them to be is missing the point of their value to your business.
What do you want from them?
Here are some basic skills you should be looking for when you’re hiring a data scientist. These skills are general enough that any candidate should have some form of all of them (but again, don’t be too choosy about exactly how they can address the needs below, because if they’re super smart they can learn more):
- Data grappling skills: they should know how to move data around and manipulate data with some programming language or languages.
- Data viz experience: they should know how to draw informative pictures of data. That should in fact be the very first thing they do when they encounter new data.
- Knowledge of stats, errorbars, confidence intervals: ask them to explain this stuff to you. They should be able to.
- Experience with forecasting and prediction, both general and specific (ex): lots of variety here, and if you have more than one data scientist position open, I’d try to get people from different backgrounds (finance and machine learning, for example) because you’ll get great cross-pollination that way.
- Great communication skills: data scientists will be a big part of your business and will contribute to communications with big clients.
What does a Data Scientist want from you? This is an important question because data scientists are in high demand and are highly educated and can get poached easily.
Interesting, challenging work. We’re talking about nerds here, and they love puzzles, and they get bored easily. Make sure they have opportunities to work on good stuff or they’ll get other jobs. Make sure they are encouraged to think of their own projects when it’s possible.
Lots of great data (data is sexy!): data scientists love data, they play with it and become intimate with it. Make sure you have lots of data, or at least really high-quality data, or soon will, before asking a data scientist to work for you. Data science is an experimental science and cannot be done without data!
To be needed, and to have central importance to the business. Hopefully it’s obvious that you will want your data scientists to play a central role in your business.
To be part of a team that is building something: this should be true of anyone working in business, especially startups. If your candidate wants to write academic papers and sit around while they get published, then hire someone else.
A good and ethically sound work atmosphere.
Cash money. Most data scientists aren’t totally focused on money though or they would go into finance.
Further business reasons for hiring a Data Scientist
Reporting help: automatically generated daily reports can be a pain to set up and can require lots of tech work and may even require a dedicated person to generate charts. Data scientists can pull together certain kinds of reports in a matter of days or weeks and generate them every day with cronjobs. Here’s a sample picture of something I did at my job:
A/B testing: data scientists help you set up A/B testing rigorously.
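As a rough illustration of what "rigorously" can mean here, the sketch below runs a standard two-proportion z-test on conversion counts from two variants. The function name and the counts are made up for illustration, and a real setup would also involve choosing sample sizes and stopping rules in advance.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates between A and B."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the assumption that A and B are the same
    pooled = (conv_a + conv_b) / (n_a + n_b)
    stderr = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / stderr
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: 200 of 10,000 visitors converted on A, 260 of 10,000 on B
z, p_value = two_proportion_z_test(200, 10_000, 260, 10_000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A small p-value suggests the difference between the variants is unlikely to be noise alone; a large one means you have not collected enough evidence to tell them apart.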
Beyond A/B testing: adaptability and customization. What you really want to do is get beyond A/B testing. Instead of having the paradigm where customers come to the ad and respond in a certain way, we want to have the (right) ad come to the customer.
Knowing whether changes in your numbers are noise, seasonality, or something that requires action. If revenue goes down in a certain week, is that because of noise? Or is it because it always goes down the week after Labor Day? Data scientists can answer questions like this.
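One very simple way to approach the Labor Day question is to compare this year's change against the changes seen in the same week in previous years. The sketch below does that with made-up numbers and a hypothetical helper name; with only a handful of past years it is a sanity check rather than a real seasonality model.

```python
import statistics

def is_unusual_drop(current_change, past_changes, threshold=2.0):
    """Flag a change that is far below what the same week showed in past years."""
    mean = statistics.mean(past_changes)
    spread = statistics.stdev(past_changes)
    # How many standard deviations the current change sits from the historical mean
    z = (current_change - mean) / spread
    return z < -threshold, z

# Hypothetical week-over-week revenue changes for the week after Labor Day
past_changes = [-0.08, -0.10, -0.07, -0.09]

unusual, z = is_unusual_drop(-0.09, past_changes)
print(f"a 9% drop: unusual={unusual}, z={z:.2f}")

unusual_big, z_big = is_unusual_drop(-0.20, past_changes)
print(f"a 20% drop: unusual={unusual_big}, z={z_big:.2f}")
```

In this toy data a 9% drop is in line with what happens every year, while a 20% drop stands out and probably deserves a closer look.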
What-if analysis: you can ask data scientists to estimate what would happen to revenue (or some other stat) if a client drops you, or if you gain a new client, or if someone doubles their bid at an auction (more on this later).
Help with business planning: Will there be enough data to answer a given question? Will there be enough data to optimize on the answer? These are some of the most difficult and most important questions, and the fact that a data scientist can help you answer them means they will be central to the business.
Education for senior management: senior people who talk to and recruit new clients will need to be able to explain how to think about the data, the signals, the stats, and the errorbars in a rigorous and credible way. Data scientists can and should take on the role of an educator for situations like this.
Mathematically sound communication to clients: you may have situations where you need the data scientists to talk directly to clients or to their data scientists. This is yet another reason to make sure you hire someone with excellent communication skills, because they will be representing your business to really smart people who can see through bullshit.
Case Study: Stress Tests
We can learn from finance: the idea of a stress test is stolen directly from finance, where we look at how replays of things like the credit crisis would affect portfolios. I wanted to do something like that but for general environmental effects that a business like mine, which hosts an advertising platform, encounters.
You know how big changes will affect your business directionally and specifically. But do you know how combinations will play out? Stress tests allow you to combine changes and estimate their overall effect quantitatively. For example, say we want to know how lowering or raising their bids (by some scalar amount) will affect advertisers’ impression share (the number of times their ads get displayed to users). Then we can run that as a scenario (for each advertiser separately) using the last two weeks (say) of auction data with everything else kept the same, and compare it to what actually happened in the last two weeks. This gives an estimate of how such a change would affect impression share in the future. Here’s a heat map of possible results of such a “stress test”:
We could also:
- run scenarios which combine things like the above
- run scenarios which ask different questions: how would advertisers be affected if a new advertiser entered the auction? If we change the minimum bid? If one of the servers fails? If we grow into new markets?
- run scenarios from the perspective of the business: how would revenue change if the bids change?
In the end, stress tests can benefit any client-facing person or anyone who wants to anticipate revenue, which covers many of the verticals of the business.
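The bid-scaling scenario described above can be sketched in a few lines of Python. This is a toy version under strong assumptions: each auction has a single slot that goes to the highest bid, the auction log is a made-up list of bids per advertiser, and impression share is taken as the fraction of auctions won. A real ad platform's auction rules would be more involved.

```python
from collections import Counter

def impression_share(auctions, scale=None):
    """Replay logged auctions, optionally scaling some advertisers' bids.

    auctions: list of dicts mapping advertiser -> bid (hypothetical log format)
    scale:    dict mapping advertiser -> bid multiplier for the scenario
    """
    scale = scale or {}
    wins = Counter()
    for bids in auctions:
        adjusted = {adv: bid * scale.get(adv, 1.0) for adv, bid in bids.items()}
        # Assumption: a single slot per auction, won by the highest adjusted bid
        winner = max(adjusted, key=adjusted.get)
        wins[winner] += 1
    return {adv: count / len(auctions) for adv, count in wins.items()}

# Made-up auction log standing in for "the last two weeks" of data
auctions = [
    {"A": 1.00, "B": 1.20},
    {"A": 0.90, "B": 0.80},
    {"A": 1.10, "B": 1.30},
    {"A": 0.70, "B": 1.00},
]

baseline = impression_share(auctions)                      # what actually happened
scenario = impression_share(auctions, scale={"A": 1.25})   # A raises bids by 25%
print("baseline:", baseline)
print("scenario:", scenario)
```

Running the same replay over a range of multipliers for each advertiser gives the grid of numbers that a heat map would display, and passing several multipliers at once is one way to combine scenarios.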