All the good data nowadays is private – what’s the point of having a data science Ph.D.?

May 25, 2012

I go back and forth on whether there should be an undergrad major or Ph.D. program in data science. On the one hand, I am convinced it’s a burgeoning field which will need all the smart people it can get in the next few years or decades. On the other hand, I’m just not sure how capable academics really are at teaching the required skills. Let me explain.

It’s not that professors aren’t super smart and great at what they do. But the truth is, they typically don’t have access to the kind of data that’s now available to data scientists working at Google or Facebook or other tech companies (see this recent New York Times article on the subject). Even where I work, which is a medium-sized start-up, I have access to data which many academics would kill for. This means I get to play with an incredibly rich resource, assuming I have built up the toolset to do so.

So while academics are creating (unrealistic) models of “influence” based on weird assumptions about how information gets propagated through networks, nerds at Facebook and Google and Foursquare just get to see it happen in real time. There’s an enormous advantage to having the data at your fingertips – you get good results fast. But then since it’s all proprietary you can’t publish it (a topic for another post).

Another thing: since academics typically don’t have this kind of big data, they also don’t have to create tools or methods for taming huge data. Sometimes I hear statisticians say that data science is just statistics, but they are typically missing the point of this “taming” aspect of data science. Namely, if we use state-of-the-art proven statistical methods on 15 terabytes of data and it takes 50 years to come up with an answer, then guess what, it doesn’t work.
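To make the “taming” point a bit more concrete, here is a minimal sketch of one standard trick: computing a mean and variance in a single pass (Welford’s online algorithm), so the data never has to fit in memory. The function name and the toy input are illustrative assumptions, not anything from a particular company’s toolkit, and real pipelines would also spread the work across machines.

```python
from typing import Iterable, Tuple


def streaming_mean_variance(values: Iterable[float]) -> Tuple[int, float, float]:
    """One-pass (Welford) mean and sample variance over any iterable.

    Hypothetical helper for illustration: it keeps only three numbers in
    memory, no matter how many values stream through it.
    """
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    # Sample variance is undefined for fewer than two points; report 0.0 here.
    variance = m2 / (count - 1) if count > 1 else 0.0
    return count, mean, variance


# Toy usage: a generator stands in for a file or log too big to load at once.
count, mean, variance = streaming_mean_variance(float(i) for i in range(1, 6))
```

The same statistic computed the textbook way would need all the values at once, or at least two passes over them. With terabytes of data, that difference alone can decide whether you get an answer at all.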

At the same time, data science isn’t purely a matter of algorithmic running time either, and a computer scientist without a good statistical background would be equally wrong if they said that data science is just machine learning.

For that matter, data science also isn’t purely speculative research – there’s a bottom-line business aspect to it, and the intention is (usually) to make a profit. But there’s no way someone with a business degree who doesn’t know how to model can be a data scientist either.

End result: To teach data science for reals, you’d need to form an interdisciplinary department across business, computer science, applied math, and statistics. And even if departments do collaborate (by the way, this is a huge “if” – it seems politically impossible in some of the universities I’ve talked to), I’m not sure how well strictly academic departments can really teach the nitty-gritty of data science, because they just don’t have good enough data.

On the other hand, I think it’s a good idea to try, because it is a great opportunity to teach at least some basic stuff and to instill a code of ethics in young data scientists.

The way things work now, the tech industry takes in former mathematicians, physicists, computer scientists, and statisticians and puts them on projects creating models of human behavior (I’ll include finance in that category) that are infinitely scalable and sometimes nearly infinitely scaled. Nobody is ever taught to stop and think about how their models are going to be used, or about the long-term effects those models will have.

In spite of all the data problems and political obstacles, I feel that for the sake of this conversation, i.e. the personal responsibility of a modeler, we should go ahead and make a program, because it’s important and it isn’t gonna happen in your typical finance firm or tech startup.

Categories: data science
  1. suresh
    May 25, 2012 at 9:44 am

    I saw this just after getting a quote from a private databank for 6K per newspaper for 19th century historical plaintext newspaper dumps. Three points: a) in the academy you can think about linking datasets across lots of domains, and have ready knowledge of the data your colleagues are collecting. Ultimately, the private sector can generate lots of information about its own processes and institutions, but probably isn’t so great about thinking across contexts (facebook and google, of course, contain entire societies within them). b) The tools data scientists are developing shouldn’t stay locked up in the private sector behind a patent wall. c) The symbiosis of public and private sector has already happened in fields like computer graphics etc (I’m sure other readers know more about this than me). So you do your lab rotation, then spend your third year working as a consultant for a start-up your supervisor is founding that arbitrages research into cash money. I’m betting this is already well under way in data science (it sure is in finance, no?)


    • May 25, 2012 at 9:49 am

      I like this: “Ultimately, the private sector can generate lots of information about its own processes and institutions, but probably isn’t so great about thinking across contexts”

      But I think there’s a feedback loop here – when things like google and facebook get as big as they get, then they define context. Another case of the model informing the world instead of the world informing the model.


  2. Lill Mila
    May 25, 2012 at 9:59 am

    This is my first week working as a Data Scientist (actual title: Data Analyst, which is even more ambiguous than the first). But, it’s been a long road including both private companies and a master’s degree in applied math with a cross-focus in machine learning to feel like I can call myself that.

    Even living in Boston, I had to look really hard for a program that considered working in industry as important as the actual classes you take. I ended up in a 3 semester, intensive program that encouraged professional development on many levels. My point in sharing all that is, you’re right. It takes many different skills and ways of thinking to do the job well. I’ll hopefully have more insight when I’ve been doing it for a while, but I appreciate the post and look forward to more!


  3. Theory of Real-World Data
    May 25, 2012 at 10:36 am

    Data Rich Generators like Google & Facebook should form partnerships with Data Science Departments at leading Universities. This is a Natural Progression that allows Data Scientists to work with the Real World Data. Much like Medical Students do Clinical Training with REAL patients, Data Scientists need to be working with REAL Data (as much as possible). In the Medical Profession, Patient Information is protected by confidentiality laws like HIPAA (Health Insurance Portability and Accountability Act), which makes it illegal to share or discuss patient information with anyone NOT directly involved in the care of the patient.

    In theory, Data Science Confidentiality Laws, Rules, Practices & Procedures will allow Data Science Students to work with REAL-World data without compromising the EXCLUSIVITY of the data. Worst case scenario, a Data Science Student who misuses Confidential Information would be punished criminally, which would ruin their Professional & Academic Career(s). Data Scientists are EXTREMELY SMART individuals, so the POSSIBILITY of going to prison is an EXTREMELY EFFECTIVE deterrent for them.

    These same confidentiality rules, practices & procedures could be adjusted accordingly, to allow for collaboration across Disciplines, Industries, etc. The OBVIOUS first collaboration must be between Data Science Department(s), Data Rich Generators like Google/Facebook and (insert Law School or Law Firm name here)


  4. mmmmbacon
    May 25, 2012 at 12:19 pm

    Seems like kaggle.com is a good tool for students (and professionals) to do good research on real world data.


  5. May 25, 2012 at 1:07 pm

    We need a Coursera course sequence which teaches this.

    Does anyone have any comments on the relevance of Jeffrey Ullman’s book for teaching data science?



  6. May 25, 2012 at 1:14 pm

    I think that much of whatever data IS available is often not being analyzed properly, either (or both) because of incompetent or clueless statisticians, or because of the poorly developed statistical methods currently dominant in financial and medical applications:

    “Odds Are, It’s Wrong”
    Science fails to face the shortcomings of statistics
    By Tom Siegfried
    Science News, March 27th, 2010; Vol.177 #7 (p. 26)

    The comments that follow the article (online only?) include more specific references that for completeness should have been cited by Siegfried, methinks.


  7. Mike Maltz
    May 25, 2012 at 2:50 pm

    I’m writing a chapter on data visualization in criminal justice. Here’s my first two paragraphs (still in draft):

    “Among the many attributes of the criminal justice system, one that stands out is the sheer quantity of data it generates. Since Western governments do not (generally) have secret arrests, trials, or imprisonments, almost every transaction is recorded in some database, most of which are computerized – and large. And these databases, although subject to restrictions on privacy and confidentiality, are (generally) open to public – and researcher – scrutiny and analysis.

    “Analyzing these data sets can be very useful, to look for patterns in offender and offending characteristics, and to determine the effect of different policies on outcomes. As the criminal justice system focuses more and more on “evidence-based” policies, developing practical methods of analyzing these large data sets becomes more necessary. One very useful set of analytic tools for this purpose falls under the heading of data visualization techniques.”

    So web-based data sets are not the only sources.


  8. Mike Maltz
    May 25, 2012 at 2:52 pm

    Oops! “Here ARE” my first two paragraphs. Mu English teacher mother would be embarrassed for me. And the chapter is for an encyclopedia.


  9. Mike Maltz
    May 25, 2012 at 3:07 pm

    Mu = My. I should read before pressing the “submit” button!


  10. May 29, 2012 at 2:56 pm

    Hi Cathy– came across your blog (love it!) when I was writing my blog post on managing data scientists…what are your thoughts on personality types and the need to distinguish the analyst, researcher and data science roles in a typical team?



  11. Simon Thornington
    June 12, 2012 at 10:58 am

    Are there any intro/refresher classes for the relevant domains in NYC? I’m looking for something in the evenings, accessible, engaging, and I don’t mind paying.

