Columbia data science course, week 1: what is data science?
I’m attending Rachel Schutt’s Columbia University Data Science course on Wednesdays this semester and I’m planning to blog the class. Here’s what happened yesterday at the first meeting.
Rachel started by going through the syllabus. Here were her main points:
- The prerequisites for this class are: linear algebra, basic statistics, and some programming.
- The goals of this class are: to learn what data scientists do. and to learn to do some of those things.
- Rachel will teach for a couple weeks, then we will have guest lectures.
- The profiles of those speakers vary considerably, as do their backgrounds. Yet they are all data scientists.
- We will be resourceful with readings: part of being a data scientist is realizing lots of stuff isn’t written down yet.
- There will be 6-10 homework assignments, due every two weeks or so.
- The final project will be an internal Kaggle competition. This will be a team project.
- There will also be an in-class final.
- We’ll use R and python, mostly R. The support will be mainly for R. Download RStudio.
- If you’re only interested in learning hadoop and working with huge data, take Bill Howe’s Coursera course. We will get to big data, but not til the last part of the course.
The current landscape of data science
So, what is data science? Is data science new? Is it real? What is it?
This is an ongoing discussion, but Michael Driscoll’s answer is pretty good:
Data science, as it’s practiced, is a blend of Red-Bull-fueled hacking and espresso-inspired statistics.
But data science is not merely hacking, because when hackers finish debugging their Bash one-liners and Pig scripts, few care about non-Euclidean distance metrics.
And data science is not merely statistics, because when statisticians finish theorizing the perfect model, few could read a ^A delimited file into R if their job depended on it.
Data science is the civil engineering of data. Its acolytes possess a practical knowledge of tools & materials, coupled with a theoretical understanding of what’s possible.
Driscoll also refers to Drew Conway’s Venn diagram of data science from 2010:
We also may want to look at Nathan Yau’s “sexy skills of data geeks” from his “Rise of the Data Scientist” in 2009:
- Statistics – traditional analysis you’re used to thinking about
- Data Munging – parsing, scraping, and formatting data
- Visualization – graphs, tools, etc.
But wait, is data science a bag of tricks? Or is it just the logical extension of other fields like statistics and machine learning?
Also see ASA President Nancy Geller’s 2011 Amstat News article, “Don’t shun the ‘S’ word,” where she defends statistics.
One thing’s for sure, in data science, nobody hands you a clean data set, and nobody tells you what method to use. Moreover, the development of the field is happening in industry, not academia.
In 2011, DJ Patil described how he and Jeff Hammerbacher, in 2008, coined the term data scientist. However, in 2001, William Cleveland wrote a paper about data science (see Nathan Yau’s post on it here).
So data science existed before data scientists? Is this semantics, or does it make sense?
It begs the question, can you define data science by what data scientists do? Who gets to define the field, anyway? There’s lots of buzz and hype – does the media get to define it, or should we rely on the practitioners, the self-appointed data scientists? Or is there some actual authority? Let’s leave these as open questions for now.
Columbia just decided to start an Institute for Data Sciences and Engineering with Bloomberg’s help. The only question is why there’s a picture of a chemist on the announcement. There are 465 job openings in New York for data scientists last time we checked. That’s a lot. So even if data science isn’t a real field, it has real jobs.
Note that most of the job descriptions ask data scientists to be experts in computer science, statistics, communication, data visualization, and to have expert domain expertise. Nobody is an expert in everything, which is why it makes more sense to create teams of people who have different profiles and different expertise, which together, as a team, can specialize in all those things.
Here are other players in the ecosystem:
- O’Reilly and their Strata Conference
- Meetup groups
- VC firms like Union Square Ventures are pouring big money into data science startups
- Kaggle hosts data science competitions
- Chris Wiggins, professor of applied math at Columbia, has been instrumental in connecting techy undergrads with New York start-ups through his summer internship program HackNY.
Note: wikipedia didn’t have an entry on data science until this 2012. This is a new term if not a new subject.
How do you start a Data Science project?
Say you’re working with some website with an online product. You want to track and analyse user behavior. Here’s a way of thinking about it:
- The user interacts with product.
- The product has a front end and a back end.
- The user starts taking actions: clicks, etc.
- Those actions get logged.
- The logs include timestamps; they capture all the key user activity around the product.
- The logs then get processed in pipelines: that’s where data munging, joining, and mapreducing occur.
- These pipelines generate nice, clean, massive data sets.
- These data sets are typically keyed by user, or song (like if you work at a place like Pandora), or however you want to see your data.
- These data sets then get analyzed, modeled, etc.
- They ultimately give us new ways of understanding user behavior.
- This new understanding gets embedded back into the product itself.
- We’ve created a circular process of changing the user interaction with the product by starting with examining the user interaction with the product. This differentiates the job of the data scientist from the traditional data analyst role, which might analyze users for likelihood of purchase but probably wouldn’t change the product itself but rather retarget advertising or something to more likely buyers.
- The data scientist also reports to the CEO or head of product what she’s seeing with respect to the user, what’s happening with the user experience, what are the patterns she’s seeing. This is where communication and reporting skills, as well as data viz skills and old-time story telling skills come in. The data scientist builds the narrative around the product.
- Sometimes you have to scrape the web, to get auxiliary info, because either the relevant data isn’t being logged or it isn’t actually being generated by the users.
Rachel then handed out index cards and asked everyone to profile themselves (on a relative rather than absolute scale) with respect to their skill levels in the following domains:
- software engineering,
- machine learning,
- domain expertise,
- communication and presentation skills, and
- data viz
We taped the index cards up and got to see how everyone else thought of themselves. There was quite a bit of variation, which is cool – lots of people in the class are coming from social science.
And again, a data science team works best when different skills (profiles) are represented in different people, since nobody is good at everything. It makes me think that it might be easier to define a “data science team” than to define a data scientist.
Thought experiment: can we use data science to define data science?
We broke into small groups to think about this question. Then we had a discussion. Some ideas:
- Yes: google search data science and perform a text mining model
- But wait, that would depend on you being a usagist rather than a prescriptionist with respect to language. Do we let the masses define data science (where “the masses” refers to whatever google’s search engine finds)? Or do we refer to an authority such as the Oxford English Dictionary?
- Actually the OED probably doesn’t have an entry yet and we don’t have time to wait for it. Let’s agree that there’s a spectrum, and one authority doesn’t feel right and “the masses” doesn’t either.
- How about we look at practitioners of data science, and see how they describe what they do (maybe in a word cloud for starters), and then see how people who claim to be other things like statisticians or physics or economics describe what they do, and then we can try to use a clustering algorithm or some other model and see if, when it takes as input “the stuff I do”, it gives me a good prediction on what field I’m in.
Just for comparison, check out what Harlan Harris recently did inside the field of data science: he took a survey and used clustering to define subfields of data science, which gave rise to this picture:
It was a really exciting first week, I’m looking forward to more!