
Data Science needs more pedagogy

February 4, 2012

Yesterday Flowing Data posted an article about the history of data science (h/t Chris Wiggins). Turns out the field and the name were around at least as early as 2001, when statistician William Cleveland was already planning it out. He broke the field down into parts like this:

  • Multidisciplinary Investigation (25%) — collaboration with subject areas
  • Models and Methods for Data (20%) — more traditional applied statistics
  • Computing with Data (15%) — hardware, software, and algorithms
  • Pedagogy (15%) — how to teach the subject
  • Tool Evaluation (5%) — keeping track of new tech
  • Theory (20%) — the math behind the data

First of all, this is a great list, and super prescient for its time. In fact it's a better description of what data science should be than of what's actually happening.

The post mentions that we probably don't see that much theory, but I've certainly seen my share of it at Meetups and such. Most of the time the speaker launches straight into the theory, and I spend half the talk on my phone googling terms.

The post also mentions we don't see much pedagogy, and here I strongly concur. By "pedagogy" I'm not talking about just teaching other people what you did or how you came up with a model, but rather explaining how you thought about modeling: why you made the decisions you did, what the context for those decisions was, and what the other options were (that you thought of). It's more of a philosophy of modeling.

It’s not hard to pinpoint why we don’t get much in the way of philosophy. The field is teeming with super nerds who are focused on the very cool model they wrote and the very nerdy open source package they used, combined with some weird insight they gained as a physics Ph.D. student somewhere. It’s hard enough to sort out their terminology, never mind expecting a coherent explanation with broad context, explained vocabulary, and confessed pitfalls. The good news is that some of them are super smart and they share specific ideas and sometimes even code (yum).

In other words, most data scientists (who make cool models) think and talk at the level of 0.02 feet, whereas pedagogy is something you actually need to step back to see. I'm not saying that no attempt is ever made at this, but my experiences have been pretty bad. Even a simple, thoughtful comparison of how different fields (Bayesian statisticians, machine learners, or finance quants) go about doing the same thing (like cleaning data, removing outliers, or choosing the strength of a Bayesian prior) would be useful, and would lead to insights like: why do these fields do it this way whereas those fields do it that way? Is it because of the nature of the problems they're trying to solve?
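
To make that concrete, here's a toy sketch of the kind of side-by-side comparison I mean, using outlier removal as the example. The specific recipes (a 3-standard-deviation cut, Tukey's 1.5 IQR fences, 1%/99% winsorization) are just common conventions I'm attributing to each camp for illustration, not anyone's official method:

    import numpy as np

    rng = np.random.default_rng(0)
    # mostly well-behaved data plus a few gross outliers
    x = np.concatenate([rng.normal(0.0, 1.0, 1000), [8.0, -9.0, 12.0]])

    # A classical-statistics habit: drop anything more than
    # 3 standard deviations from the mean.
    z = (x - x.mean()) / x.std()
    x_zscore = x[np.abs(z) <= 3]

    # An exploratory/robust habit (Tukey fences): drop anything
    # beyond 1.5 * IQR outside the quartiles.
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    x_tukey = x[(x >= q1 - 1.5 * iqr) & (x <= q3 + 1.5 * iqr)]

    # A finance-quant habit: don't drop anything; winsorize (clip)
    # at the 1st/99th percentiles so extreme values can't dominate.
    lo, hi = np.percentile(x, [1, 99])
    x_winsor = np.clip(x, lo, hi)

    print(len(x), len(x_zscore), len(x_tukey), len(x_winsor))

The point isn't which recipe is "right": each convention encodes different assumptions about the data and different costs of being wrong, and that's exactly the context a good pedagogical treatment would spell out.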

A good pedagogical foundation for data science would keep us from going down the same dead-end roads as each other, keep us from introducing the same biases into multiple models, and make the entire field more efficient and better at communicating. If you know of a good reference for something like this, please tell me.

  1. February 4, 2012 at 11:01 am

Our publishing system encourages 2000 incremental modeling papers for every one higher-level review. When you study how the methods generalize across problems, it's clear that we know very little about which specific algorithms work best and why. But there are general ideas (e.g. dimensionality reduction, classifiers, clustering) that are used to form the building blocks of solutions.

    Right now, Ph.D. programs teach the building blocks and rely on the research team to teach how to apply them.

    • February 4, 2012 at 11:02 am

And since there are few "data science" Ph.D. programs, this is a problematic approach, even if it worked consistently.

      • February 4, 2012 at 1:19 pm

The field has a big problem in that the language is so different across disciplines. It's a serious barrier to entry and can take years for someone to learn.

But the lack of agreement on methods doesn't strike me as a huge problem. The reason is that the various methods usually deliver performance that is nearly equal or can be made equal. The only real differences are the cost of compute cycles and of human time.

  2. February 8, 2012 at 12:26 pm

This general problem applies to data analysis across domains. I work in data analysis for network security, and explaining how and why we do analysis in a particular way remains a significant challenge.
