Columbia data science course, week 1: what is data science?

September 6, 2012 Cathy O'Neil, mathbabe

I’m attending Rachel Schutt’s Columbia University Data Science course on Wednesdays this semester and I’m planning to blog the class. Here’s what happened yesterday at the first meeting.

Syllabus

Rachel started by going through the syllabus. Here were her main points:

The prerequisites for this class are: linear algebra, basic statistics, and some programming.
The goals of this class are: to learn what data scientists do, and to learn to do some of those things.
Rachel will teach for a couple of weeks, then we will have guest lectures.
The profiles of those speakers vary considerably, as do their backgrounds. Yet they are all data scientists.
We will be resourceful with readings: part of being a data scientist is realizing lots of stuff isn’t written down yet.
There will be 6-10 homework assignments, due every two weeks or so.
The final project will be an internal Kaggle competition. This will be a team project.
There will also be an in-class final.
We’ll use R and Python, mostly R. The support will be mainly for R. Download RStudio.
If you’re only interested in learning Hadoop and working with huge data, take Bill Howe’s Coursera course. We will get to big data, but not until the last part of the course.

The current landscape of data science

So, what is data science? Is data science new? Is it real? What is it?

This is an ongoing discussion, but Michael Driscoll’s answer is pretty good:

Data science, as it’s practiced, is a blend of Red-Bull-fueled hacking and espresso-inspired statistics.

But data science is not merely hacking, because when hackers finish debugging their Bash one-liners and Pig scripts, few care about non-Euclidean distance metrics.

And data science is not merely statistics, because when statisticians finish theorizing the perfect model, few could read a ^A delimited file into R if their job depended on it.

Data science is the civil engineering of data. Its acolytes possess a practical knowledge of tools & materials, coupled with a theoretical understanding of what’s possible.

Driscoll also refers to Drew Conway’s Venn diagram of data science from 2010:

[Image: Drew Conway’s data science Venn diagram]

We also may want to look at Nathan Yau’s “sexy skills of data geeks” from his “Rise of the Data Scientist” in 2009:

Statistics – traditional analysis you’re used to thinking about
Data Munging – parsing, scraping, and formatting data
Visualization – graphs, tools, etc.

But wait, is data science a bag of tricks? Or is it just the logical extension of other fields like statistics and machine learning?

For one argument, see Cosma Shalizi’s posts here and here and my posts here and here, which constitute an ongoing discussion of the difference between a statistician and a data scientist.

Also see ASA President Nancy Geller’s 2011 Amstat News article, “Don’t shun the ‘S’ word,” where she defends statistics.

One thing’s for sure: in data science, nobody hands you a clean data set, and nobody tells you what method to use. Moreover, the development of the field is happening in industry, not academia.

In 2011, DJ Patil described how he and Jeff Hammerbacher, in 2008, coined the term data scientist. However, in 2001, William Cleveland wrote a paper about data science (see Nathan Yau’s post on it here).

So data science existed before data scientists? Is this semantics, or does it make sense?

This raises the question: can you define data science by what data scientists do? Who gets to define the field, anyway? There’s lots of buzz and hype. Does the media get to define it, or should we rely on the practitioners, the self-appointed data scientists? Or is there some actual authority? Let’s leave these as open questions for now.

Columbia just decided to start an Institute for Data Sciences and Engineering with Bloomberg’s help. The only question is why there’s a picture of a chemist on the announcement. There are 465 job openings in New York for data scientists last time we checked. That’s a lot. So even if data science isn’t a real field, it has real jobs.

Note that most of the job descriptions ask data scientists to be experts in computer science, statistics, communication, and data visualization, and to have deep domain expertise. Nobody is an expert in everything, which is why it makes more sense to create teams of people with different profiles and different expertise who, together, can cover all those things.

Here are other players in the ecosystem:

O’Reilly and their Strata Conference
DataKind
Meetup groups
VC firms like Union Square Ventures are pouring big money into data science startups
Kaggle hosts data science competitions
Chris Wiggins, professor of applied math at Columbia, has been instrumental in connecting techy undergrads with New York start-ups through his summer internship program HackNY.

Note: Wikipedia didn’t have an entry on data science until 2012. This is a new term if not a new subject.

How do you start a Data Science project?

Say you’re working with some website with an online product. You want to track and analyze user behavior. Here’s a way of thinking about it:

The user interacts with the product.
The product has a front end and a back end.
The user starts taking actions: clicks, etc.
Those actions get logged.
The logs include timestamps; they capture all the key user activity around the product.
The logs then get processed in pipelines: that’s where data munging, joining, and mapreducing occur.
These pipelines generate nice, clean, massive data sets.
These data sets are typically keyed by user, or song (like if you work at a place like Pandora), or however you want to see your data.
These data sets then get analyzed, modeled, etc.
They ultimately give us new ways of understanding user behavior.
This new understanding gets embedded back into the product itself.
We’ve created a circular process: we change the user’s interaction with the product by starting from an examination of that interaction. This differentiates the job of the data scientist from the traditional data analyst role, which might analyze users for likelihood of purchase but probably wouldn’t change the product itself; instead, the analyst might retarget advertising toward more likely buyers.
The data scientist also reports to the CEO or head of product what she’s seeing with respect to the user: what’s happening with the user experience and what patterns are emerging. This is where communication and reporting skills, as well as data viz skills and old-time storytelling skills, come in. The data scientist builds the narrative around the product.
Sometimes you have to scrape the web to get auxiliary info, because either the relevant data isn’t being logged or it isn’t actually being generated by the users.
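
To make the middle of that loop a bit more concrete, here is a minimal Python sketch of the "logs go in, a clean data set keyed by user comes out" step. It was not shown in class. The log format (tab-separated timestamp, user, action, item), the sample lines, and the names `parse_line` and `summarize_by_user` are all made up for illustration, and a real pipeline would run over far more data than fits in a list.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical raw log lines: ISO timestamp, user id, action, item (tab-separated).
RAW_LOGS = [
    "2012-09-05T10:00:00\talice\tplay\tsong_1",
    "2012-09-05T10:03:10\talice\tskip\tsong_2",
    "2012-09-05T10:05:00\tbob\tplay\tsong_1",
    "this line is malformed",
    "2012-09-05T10:07:30\talice\tplay\tsong_3",
]


def parse_line(line):
    """Return a dict for a well-formed log line, or None if it cannot be parsed."""
    parts = line.split("\t")
    if len(parts) != 4:
        return None
    timestamp, user, action, item = parts
    try:
        when = datetime.fromisoformat(timestamp)
    except ValueError:
        return None
    return {"time": when, "user": user, "action": action, "item": item}


def summarize_by_user(lines):
    """Munge raw log lines into one summary record per user."""
    summary = defaultdict(
        lambda: {"events": 0, "plays": 0, "first_seen": None, "last_seen": None}
    )
    for line in lines:
        event = parse_line(line)
        if event is None:
            continue  # nobody hands you clean data: drop lines that fail parsing
        record = summary[event["user"]]
        record["events"] += 1
        if event["action"] == "play":
            record["plays"] += 1
        if record["first_seen"] is None or event["time"] < record["first_seen"]:
            record["first_seen"] = event["time"]
        if record["last_seen"] is None or event["time"] > record["last_seen"]:
            record["last_seen"] = event["time"]
    return dict(summary)


user_table = summarize_by_user(RAW_LOGS)
```

Keying the same events by `item` instead of `user` would give the song-level view mentioned above; the choice of key is just a choice about how you want to see your data.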

Profile yourself

Rachel then handed out index cards and asked everyone to profile themselves (on a relative rather than absolute scale) with respect to their skill levels in the following domains:

software engineering,
math,
stats,
machine learning,
domain expertise,
communication and presentation skills, and
data viz

We taped the index cards up and got to see how everyone else thought of themselves. There was quite a bit of variation, which is cool – lots of people in the class are coming from social science.

And again, a data science team works best when different skills (profiles) are represented in different people, since nobody is good at everything. It makes me think that it might be easier to define a “data science team” than to define a data scientist.

Thought experiment: can we use data science to define data science?

We broke into small groups to think about this question. Then we had a discussion. Some ideas:

Yes: do a Google search for “data science” and run a text mining model on the results.
But wait, that would depend on you being a descriptivist rather than a prescriptivist with respect to language. Do we let the masses define data science (where “the masses” refers to whatever Google’s search engine finds)? Or do we refer to an authority such as the Oxford English Dictionary?
Actually the OED probably doesn’t have an entry yet and we don’t have time to wait for it. Let’s agree that there’s a spectrum, and one authority doesn’t feel right and “the masses” doesn’t either.
How about we look at practitioners of data science and see how they describe what they do (maybe in a word cloud for starters), then see how people who claim to be other things, like statisticians, physicists, or economists, describe what they do? Then we can try a clustering algorithm or some other model and see whether, given “the stuff I do” as input, it gives a good prediction of what field I’m in.
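
Here is a toy Python sketch of that last idea, just to show its shape. It is not what the class built. It swaps in a very simple supervised model (bag-of-words vectors compared by cosine similarity against one word-count "centroid" per field) rather than a true clustering algorithm, and the self-descriptions, field labels, and function names are all invented for illustration. Real input would come from a survey or from scraped profiles.

```python
import math
from collections import Counter

# Made-up self-descriptions, labeled by field (purely illustrative).
TRAINING = [
    ("statistician", "i design experiments and fit regression models to test hypotheses"),
    ("statistician", "i estimate confidence intervals and test hypotheses with survey samples"),
    ("economist", "i study markets prices and labor policy with regression models"),
    ("economist", "i forecast inflation and analyze trade policy and markets"),
    ("data scientist", "i clean logs build pipelines and fit models that ship in the product"),
    ("data scientist", "i scrape data build pipelines visualize it and present to the product team"),
]


def bag_of_words(text):
    """Count how often each lowercase word appears in the text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_centroids(examples):
    """Sum the word counts of all descriptions belonging to each field."""
    centroids = {}
    for field, text in examples:
        centroids.setdefault(field, Counter()).update(bag_of_words(text))
    return centroids


def predict_field(text, centroids):
    """Return the field whose summed word counts are most similar to the text."""
    words = bag_of_words(text)
    return max(centroids, key=lambda field: cosine(words, centroids[field]))


centroids = build_centroids(TRAINING)
guess = predict_field("i build pipelines from logs for the product", centroids)
```

With six hand-written sentences this proves nothing about the fields themselves. The point is only that "the stuff I do" can be turned into a vector and compared, and whether the fields separate cleanly on real descriptions is exactly the open question.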

Just for comparison, check out what Harlan Harris recently did inside the field of data science: he took a survey and used clustering to define subfields of data science, which gave rise to this picture:

It was a really exciting first week, and I’m looking forward to more!

Categories: data science, math education, statistics

Comments (12)

Chris Mulligan

September 6, 2012 at 1:08 pm

Great summary from the perspective of my chair.

Ivan

September 6, 2012 at 2:39 pm

“There are 465 job openings in New York for data scientists last time we checked.”

Have you checked properly or just used indeed.com search? If it is the latter then there are many jobs with no relation to data science and also some headhunter ads.

Luigi Draghi

September 7, 2012 at 8:26 am

+1000

Very interesting indeed

c.gutierrez

September 9, 2012 at 1:16 pm

Finally .. I have a graphic that describes what I do. No more awkward answers followed up by “So, are you a statistician then?”

medh2000

September 9, 2012 at 4:22 pm

Very nice article. I would think that data science could be defined as the use of applied statistics with data mining and AI in the business world.

Ruben Castillo

September 9, 2012 at 6:48 pm

Thanks for the notes and diagrams!

mickwags

September 9, 2012 at 8:08 pm

Thanks for the great notes! Good job!

rt

September 10, 2012 at 9:25 am

Very helpful!

Rp

September 10, 2012 at 10:46 pm

Great notes; appreciate your effort. Thanks.

Ryan Swanstrom

September 11, 2012 at 6:34 pm

Reblogged this on Data Science 101 and commented:
Here is a good overview of the first week of the Columbia Data Science course.

Jonathan

September 14, 2012 at 10:03 am

Great summary. Thanks. Maybe you’ll inspire me to take a course like this.

But one section of the Venn Diagram provoked a very strong reaction, actually two strong reactions. The “danger zone” label on the intersection of hacking and substantive expertise.

What’s wrong with hacking without math skills? There are lots of useful things that hackers can do that have nothing to do with math. Sure, they are not part of data science but it doesn’t make them particularly dangerous.

Note the “particularly” in that last sentence because my second, stronger, reaction is that the label implies the rest of the diagram is NOT in the danger zone. But, any data analysis can be done, or interpreted badly. This can either be innocent (just not knowing better) or malicious. Your recent post on the Chicago teachers is a good illustration of the latter.

Justin Jacoby Smith

October 1, 2012 at 10:29 pm

You will have a very popular blog, I think. 🙂
