All the good data nowadays is private – what’s the point of having a data science Ph.D.?
I go back and forth on whether there should be an undergrad major or Ph.D. program in data science. On the one hand, I am convinced it’s a burgeoning field which will need all the smart people it can get in the next few years or decades. On the other hand, I’m just not sure how capable academics really are at teaching the required skills. Let me explain.
It’s not that professors aren’t super smart and great at what they do. But the truth is, they typically don’t have access to the kind of data that’s now available to data scientists working at Google or Facebook or other tech companies (see this recent New York Times article on the subject). Even where I work, which is a medium-sized start-up, I have access to data that many academics would kill for. This means I get to play with an incredibly rich resource, assuming I have built up the toolset to do so.
So while academics are creating (unrealistic) models of “influence” based on weird assumptions about how information gets propagated through networks, nerds at Facebook and Google and Foursquare just get to see it happen in real time. There’s an enormous advantage to having the data at your fingertips – you get good results fast. But then, since it’s all proprietary, you can’t publish them (a topic for another post).
Another thing: since academics typically don’t have this kind of big data, they also don’t have to create tools or methods for taming huge data. Sometimes I hear statisticians say that data science is just statistics, but they are typically missing the point of this “taming” aspect of data science. Namely, if we use state-of-the-art proven statistical methods on 15 terabytes of data and it takes 50 years to come up with an answer, then guess what, it doesn’t work.
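To make the “it takes 50 years” point concrete, here is a rough back-of-envelope sketch. The record size, the machine throughput, and the idea that the “proven” method is quadratic in the number of records are all assumptions I’m making up for illustration; only the 15 terabytes comes from the paragraph above.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

# Assumptions for illustration only (none of these come from real measurements):
BYTES_PER_RECORD = 100        # hypothetical average record size
OPS_PER_SECOND = 1e9          # hypothetical throughput of one machine
DATA_BYTES = 15 * 10**12      # the 15 terabytes mentioned above


def runtime_years(n_records, exponent, ops_per_second=OPS_PER_SECOND):
    """Years needed if a method does roughly n_records ** exponent basic operations."""
    total_ops = float(n_records) ** exponent
    return total_ops / ops_per_second / SECONDS_PER_YEAR


n_records = DATA_BYTES // BYTES_PER_RECORD

# A single pass over the data (exponent 1) versus an all-pairs method (exponent 2).
linear_years = runtime_years(n_records, 1)
quadratic_years = runtime_years(n_records, 2)

print(f"records:          {n_records:.2e}")
print(f"linear method:    {linear_years * SECONDS_PER_YEAR:.0f} seconds")
print(f"quadratic method: {quadratic_years:.2e} years")
```

Under these made-up numbers the single-pass method finishes in minutes, while the all-pairs method blows way past the 50-year mark. The exact figures don’t matter; the point is that the scaling behavior of a method, not just its statistical pedigree, decides whether it works at this size.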
At the same time, data science isn’t purely a matter of algorithmic running time either, and a computer scientist without a good statistical background would be equally wrong if they said that data science is just machine learning.
For that matter, data science also isn’t purely speculative research – there’s a bottom-line business aspect to it, and the intention is (usually) to make a profit. But there’s no way someone with a business degree who doesn’t know how to model can be a data scientist either.
End result: To teach data science for reals, you’d need to form an interdisciplinary department across business, computer science, applied math, and statistics. And even if departments do collaborate (by the way, this is a huge “if” – it seems politically impossible in some of the universities I’ve talked to), I’m not sure how well strictly academic departments can really teach the nitty-gritty of data science, because they just don’t have good enough data.
On the other hand, I think it’s a good idea to try, because it is a great opportunity to teach at least some basic stuff and to instill a code of ethics in young data scientists.
The way things work now, the tech industry takes in former mathematicians, physicists, computer scientists, and statisticians and puts them on projects creating models of human behavior (I’ll include finance in that category) that are infinitely scalable and sometimes nearly infinitely scaled. Nobody is ever taught to stop and think about how their models are going to be used, or about the long-term effects those models will have.
In spite of all the data problems and political obstacles, I feel that for the sake of this conversation, i.e. the one about the personal responsibility of a modeler, we should go ahead and make a program, because it’s important and it isn’t gonna happen in your typical finance firm or tech startup.