## Columbia Data Science course, week 14: Presentations

In the final week of Rachel Schutt’s Columbia Data Science course we heard from two groups of students as well as from Rachel herself.

Data Science; class consciousness

The first team of presenters consisted of Yegor, Eurry, and Adam. Many others whose names I didn’t write down contributed to the research, visualization, and writing.

First they showed us the very cool graphic explaining how self-reported skills vary by discipline. The data they used came from the class itself, which did this exercise on the first day:

So the star in the middle is the average for the whole class, and each star along the side corresponds to the average (self-reported) skills of people within a specific discipline. The dotted lines on the outside stars show the “average” star, so it’s easier to see how things vary per discipline compared to the average.

Surprises: Business people seem to think they’re really great at everything except communication. Journalists are better at data wrangling than engineers.

We will get back to the accuracy of self-reported skills later.

We were asked, do you see your reflection in your star?

Also, take a look at the different stars. How would you use them to build a data science team? Would you want people who are good at different skills? Is it enough to have all the skills covered? Are there complementary skills? Are the skills additive, or do you need overlapping skills among team members?

Thought Experiment

If all data which had ever been collected were freely available to everyone, would we be better off?

Some ideas were offered:

• all nude photos are included. [Mathbabe interjects: it's possible to not let people take nude pics of you. Just sayin'.]
• so are passwords, credit scores, etc.
• how do we make secure transactions between a person and her bank considering this?
• what does it mean to be “freely available” anyway?

The data of power; the power of data

You see a lot of people posting crap like this on Facebook:

But here’s the thing: the Berner Convention doesn’t exist. People are posting this to their walls because they care about their privacy. People think they can exercise control over their data but they can’t. Stuff like this gives one a false sense of security.

In Europe the privacy laws are stricter, and you can request your data from Irish Facebook and they’re supposed to hand it over, but it’s still not easy to do successfully.

And it’s not just data that’s being collected about you – it’s data you’re collecting. As scientists we have to be careful about what we create, and take responsibility for our creations.

As Francois Rabelais said,

Wisdom entereth not into a malicious mind, and science without conscience is but the ruin of the soul.

Or as Emily Bell from Columbia said,

Every algorithm is editorial.

We can’t be evil during the day and take it back at hackathons at night. Just as journalists need to be aware that the way they report stories has consequences, so do data scientists. As a data scientist one has impact on people’s lives and how they think.

Here are some takeaways from the course:

• We’ve gained significant powers in this course.
• In the future we may have the opportunity to do more.
• With data power comes data responsibility.

Who does data science empower?

The second presentation was given by Jed and Mike. Again, they had a bunch of people on their team helping out.

Thought experiment

“Anything which uses science as part of its name isn’t: political science, creation science, computer science.”

- Hal Abelson, MIT CS prof

Keeping this in mind, if you could re-label data science, would you? What would you call it?

Some comments from the audience:

• Let’s call it “modellurgy,” the craft of beating mathematical models into shape instead of metal
• Let’s call it “statistics”

Does it really matter what data science is? What should it end up being?

Chris Wiggins from Columbia contends there are two main views of what data science should end up being. The first stems from John Tukey, co-inventor of the fast Fourier transform, inventor of the box plot, and father of exploratory data analysis. Tukey advocated for a style of research he called “data analysis”, emphasizing the primacy of data and therefore computation. His descriptions of data analysis, which he saw as part of doing statistics, are very similar to what people call data science today.

The other perspective comes from Jim Gray, a computer scientist at Microsoft. He saw the scientific ideals of the enlightenment age as expanding and evolving. We’ve gone from the theories of Darwin and Newton to experimental and computational approaches of Turing. Now we have a new science, a data-driven paradigm. It’s actually the fourth paradigm of all the sciences, the first three being experimental, theoretical, and computational. See more about this here.

Wait, can data science be both?

Note it’s difficult to stick Computer Science and Data Science on this line.

Statistics is a tool that everyone uses. Data science also could be seen that way, as a tool rather than a science.

Who does data science?

Here’s a graphic showing the make-up of Kaggle competitors. Teams of students collaborated to collect, wrangle, analyze and visualize this data:

The size of each block corresponds to how many people in active competitions have an education background in a given field. We see that almost a quarter of competitors are computer scientists. The shading corresponds to how often they compete. So we see the business finance people do more competitions on average than the computer science people.

Consider this: the only people doing math competitions are math people. If you think about it, it’s kind of amazing how many different backgrounds are represented above.

We got some cool graphics created by the students who collaborated to get the data, process it, visualize it and so on.

Which universities offer courses on Data Science?

There will be 26 universities in total by 2013 that offer data science courses. The balls are centered at the center of gravity of a given state, and the balls are bigger if there are more in that state.

Where are data science jobs available?

Observations:

• We see more professional schools offering data science courses on the west coast.
• It would also be interesting to see this corrected for population size.
• Only two states had no jobs.
• Massachusetts #1 per capita, then Maryland

McKinsey says there will be hundreds of thousands of data science jobs in the next few years. There’s a massive demand in any case. Some of us will be part of that. It’s up to us to make sure what we’re doing is really data science, rather than validating previously held beliefs.

We need to advance human knowledge if we want to take the word “scientist” seriously.

How did this class empower you?

You are one of the first people to take a data science class. There’s something powerful there.

Thank you Rachel!

Last Day of Columbia Data Science Class, What just happened? from Rachel’s perspective

Recall the stated goals of this class were:

• learn about what it’s like to be a data scientist
• be able to do some of what a data scientist does

Hey we did this! Think of all the guest lectures; they taught you a lot of what it’s like to be a data scientist, which was goal 1. Here’s what I wanted you guys to learn before the class started based on what a data scientist does, and you’ve learned a lot of that, which was goal 2:

Mission accomplished! Mission accomplished?

Thought experiment that I gave to myself last Spring

How would you design a data science class?

• It’s not a well-defined body of knowledge, subject, no textbook!
• It’s popularized and celebrated in the press and media, but there’s no “authority” to push back
• I’m intellectually disturbed by the idea of teaching a course when the body of knowledge is ill-defined
• I didn’t know who would show up, and what their backgrounds and motivations would be
• Could it become redundant with a machine learning class?

My process

I asked questions of myself and of other people. I gathered information, and endured existential angst about data science not being a “real thing.” I needed to give it structure.

Then I started to think about it this way: while I recognize that data science has the potential to be a deep research area, it’s not there yet, and in order to actually design a class, let’s take a pragmatic approach: Recognize that data science exists. After all, there are jobs out there. I want to help students to be qualified for them. So let me teach them what it takes to get those jobs. That’s how I decided to approach it.

In other words, from this perspective, data science is what data scientists do. So it’s back to the list of what data scientists do. I needed to find structure on top of that, so the structure I used as a starting point was the data scientist profiles.

Data scientist profiles

This was a way to think about your strengths and weaknesses, as well as a link between speakers. Note that the profile makes it easy to focus on “technical skills,” but it can be problematic in being too skills-based, and also because it has no scale and no notion of expertise. On the other hand it’s good in that it allows for and captures variability among data scientists.

I assigned weekly guest speakers topics related to their strengths. We held lectures, labs, and (optional) problem sessions. From this you got mad skillz:

• programming in R
• some Python
• you learned some best practices about coding

From the perspective of machine learning,

• you know a bunch of algorithms like linear regression, logistic regression, k-nearest neighbors, k-means, naive Bayes, and random forests
• you know what they are, what they’re used for, and how to implement them
• you learned machine learning concepts like training sets, test sets, over-fitting, bias-variance tradeoff, evaluation metrics, feature selection, supervised vs. unsupervised learning
• you learned about recommendation systems
• you’ve entered a Kaggle competition
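As a reminder of how little code one of these algorithms needs, here is a minimal sketch of k-nearest neighbors in plain Python. The function name `knn_predict`, the toy points, and the choice of Euclidean distance with a majority vote are assumptions made for illustration; they are not taken from the class materials.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over their labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated groups
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.8, 5.2)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (0.3, 0.3)))
print(knn_predict(points, labels, (4.5, 4.5)))
```

Writing the loop is the easy part. Choosing k, the distance, and the features, and evaluating on a held-out test set, is where the concepts in the list above come in.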

Importantly, you now know that if there is an algorithm and model that you don’t know, you can (and will) look it up and figure it out. I’m pretty sure you’ve all improved relative to how you started.

You’ve learned some data viz by taking FlowingData tutorials.

You’ve learned statistical inference, because we discussed

• observational studies,
• causal inference, and
• experimental design.
• We also learned some maximum likelihood topics, but I’d urge you to take more stats classes.

In the realm of data engineering,

• we showed you MapReduce and Hadoop
• we worked with 30 separate shards
• we used an API to get data
• we spent time cleaning data
• we’ve processed different kinds of data

As for communication,

• you wrote thoughts in response to blog posts
• you observed how different data scientists communicate or present themselves, and have different styles
• your final project required communicating among each other

As for domain knowledge,

• lots of examples were shown to you: social networks, advertising, finance, pharma, recommender systems, dallas art museum

I heard people have been asking the following: why didn’t we see more data science coming from non-profits, governments, and universities? Note that data science, the term, was born in for-profits. But the truth is I’d also like to see more of that. It’s up to you guys to go get that done!

How do I measure the impact of this class I’ve created? Is it possible to incubate awesome data science teams in the classroom? I might have taken you from point A to point B but you might have gone there anyway without me. There’s no counterfactual!

Can we set this up as a data science problem? Can we use a causal modeling approach? This would require finding students who were more or less like you but didn’t take this class, and using propensity score matching. It’s not a very well-defined experiment.
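To make the matching step concrete, here is a rough sketch of greedy one-to-one matching on a propensity score. It assumes the scores have already been estimated somehow (for example by a logistic regression on students' backgrounds); that estimation step is not shown. The function names, the caliper of 0.05, and all the numbers are hypothetical.

```python
def match_on_propensity(treated, controls, caliper=0.05):
    """Greedy 1:1 nearest-neighbour matching on an already-estimated propensity score.

    `treated` and `controls` are lists of (propensity_score, outcome) pairs.
    Returns a list of (treated_outcome, control_outcome) matched pairs.
    """
    available = list(controls)
    pairs = []
    for score, outcome in treated:
        if not available:
            break
        # Closest remaining control by propensity score
        best = min(available, key=lambda c: abs(c[0] - score))
        # Only accept the match if the scores are close enough
        if abs(best[0] - score) <= caliper:
            pairs.append((outcome, best[1]))
            available.remove(best)  # match without replacement
    return pairs

def average_effect(pairs):
    """Mean outcome difference (treated minus control) over matched pairs."""
    return sum(t - c for t, c in pairs) / len(pairs)

# Hypothetical numbers: (estimated propensity to take the class, later skill score)
took_class = [(0.80, 7.0), (0.60, 6.0), (0.30, 5.0)]
did_not = [(0.78, 6.0), (0.62, 5.5), (0.10, 3.0), (0.33, 4.0)]

pairs = match_on_propensity(took_class, did_not)
print(pairs)
print(average_effect(pairs))
```

The estimate is only as good as the comparison group and the covariates behind the scores, which is exactly why the experiment is not well-defined here.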

But the goal is important: in industry they say you can’t learn data science in a university, that it has to be on the job. But maybe that’s wrong, and maybe this class has proved that.

What has been the impact on you or on the outside world? I feel we have been contributing to the broader discourse.

Does it matter if there was impact? And does it matter if it can be measured or not? Let me switch gears.

What is data science again?

Data science could be defined as:

• A set of best practices used in tech companies, which is how I chose to design the course
• A space of problems that could be solved with data
• A science of data where you can think of the data itself as units

The bottom two have the potential to be the basis of a rich and deep research discipline, but in many cases, the way the term is currently used is:

• Pure hype

But how we define it matters less than what I want for you:

• to be problem solvers
• to be question askers
• to use data responsibly and make the world better, not worse.

More on being problem solvers: cultivate certain habits of mind

Here’s a possible list of things to strive for, taken from here:

Here’s the thing. Tons of people can implement k-nearest neighbors, and many do it badly. What matters is that you cultivate the above habits and remain open to continuous learning.

In education in traditional settings, we focus on answers. But what we probably should focus on is how a student behaves when they don’t know the answer. We need to have qualities that help us find the answer.

Thought experiment

How would you design a data science class around habits of mind rather than technical skills? How would you quantify it? How would you evaluate? What would students be able to write on their resumes?

Comments from the students:

• You’d need to keep making people do stuff they don’t know how to do while keeping them excited about it.
• have people do stuff in their own domains so we keep up wonderment and awe.
• You’d use case studies across industries to see how things work in different contexts

More on being question-askers

Some suggestions on asking questions of others:

• start with the assumption that you’re smart
• don’t assume the person you’re talking to knows more or less. You’re not trying to prove anything.
• be curious like a child, not worried about appearing stupid
• ask for clarification around notation or terminology
• ask for clarification around process: where did this data come from? how will it be used? why is this the right data to use? who is going to do what? how will we work together?

Some questions to ask yourself

• does it have to be this way?
• what is the problem?
• how can I measure this?
• what is the appropriate algorithm?
• how will I evaluate this?
• do I have the skills to do this?
• how can I learn to do this?
• who can I work with? Who can I ask?
• how will it impact the real world?

Data Science Processes

In addition to being problem-solvers and question-askers, I mentioned that I want you to think about process. Here are a couple processes we discussed in this course:

(1) Real World –> Generates Data –>
–> Collect Data –> Clean, Munge (90% of your time)
–> Exploratory Data Analysis –>
–> Feature Selection –>
–> Build Model, Build Algorithm, Visualize
–> Evaluate –> Iterate –>
–> Impact Real World

(2) Asking questions of yourselves and others –>
Identifying problems that need to be solved –>
Gathering information, Measuring –>
Learning to find structure in unstructured situations –>
Framing Problem –>
Creating Solutions –> Evaluating

Thought experiment

Come up with a business that improves the world and makes money and uses data

Comments from the students:

• autonomous self-driving cars you order with a smart phone
• find all the info on people and then show them how to make it private
• social network with no logs and no data retention

10 Important Data Science Ideas

Of all the blog posts I wrote this semester, here’s one I think is important:

10 Important Data Science Ideas

Confidence and Uncertainty

Let’s talk about confidence and uncertainty from a couple perspectives.

First, remember that statistical inference is extracting information from data, estimating, modeling, explaining but also quantifying uncertainty. Data Scientists could benefit from understanding this more. Learn more statistics and read Ben’s blog post on the subject.

Second, we have the Dunning-Kruger Effect.
Have you ever wondered why people don’t say “I don’t know” when they don’t know something? This is partly explained through an unconscious bias called the Dunning-Kruger effect.

Basically, people who are bad at something have no idea that they are bad at it and overestimate their ability. People who are super good at something underestimate their mastery of it. Actual competence may weaken self-confidence.

Thought experiment

Design an app to combat the Dunning-Kruger effect.

What are you optimizing for? What do you value?

• money, need some minimum to live at the standard of living you want to, might even want a lot.
• time with loved ones and friends
• doing good in the world
• personal fulfillment, intellectual fulfillment
• goals you want to reach or achieve
• being famous, respected, acknowledged
• ?
• some weighted function of all of the above. what are the weights?

What constraints are you under?

• external factors (factors outside of your control)
• your resources: money, time, obligations
• who you are, your education, strengths & weaknesses
• things you can or cannot change about yourself

There are many possible solutions that optimize what you value and take into account the constraints you’re under.

So what should you do with your life?

Remember that whatever you decide to do is not permanent, so don’t feel too anxious about it. You can always do something else later; people change jobs all the time.

But on the other hand, life is short, so always try to be moving in the right direction (optimizing for what you care about).

If you feel your way of thinking or perspective is somehow different than what those around you are thinking, then embrace and explore that; you might be onto something.

I’m always happy to talk to you about your individual case.

Next Gen Data Scientists

The second blog post I think is important is this “manifesto” that I wrote:

Next-Gen Data Scientists. That’s you! Go out and do awesome things, use data to solve problems, have integrity and humility.

Here’s our class photo!

## How math departments hire faculty

I just got back from a stimulating trip to Stony Brook to give the math colloquium there. I had a great time thanks to my gracious host Jason Starr (this guy, not this guy), and besides giving my talk (which I will give again in San Diego at the joint meetings next month) I enjoyed two conversations about the field of math which I think could be turned into data science projects. Maybe Ph.D. theses or something.

First, a system for deciding whether a paper on the arXiv is “good.” I will post about that on another day because it’s actually pretty involved and possibly important.

Second is the way people hire in math departments. This conversation will generalize to other departments, some more than others.

So first of all, I want to think about how the hiring process actually works. There are people who look at folders of applicants, say for tenure-track jobs. Since math is a pretty disjointed field, a majority of the folders will only be understood well enough for evaluation purposes by a few people in the department.

So in other words, the department naturally splits into clusters more or less along field lines: there are the number theorists and then there are the algebraic geometers and then there are the low-dimensional topologists, say.

Each group of people reads the folders from the field or fields that they have enough expertise in to understand. Then from among those they choose some they want to go to bat for. It becomes a political battle, where each group tries to convince the other groups that their candidates are more qualified. But of course it’s really hard to know who’s telling the honest truth. There are probably lots of biases in play too, so people could be overstating their cases unconsciously.

Some potential problems with this system:

1. if you are applying to a department where nobody is in your field, nobody will read your folder, and nobody will go to bat for you, even if you are really great. An exaggeration but kinda true.
2. in order to be convincing that “your guy is the best applicant,” people use things like who the advisor is or which grad school this person went to more than the underlying mathematical content.
3. if your department grows over time, this tends to mean that you get bigger clusters rather than more clusters. So if you never had a number theorist, you tend to never get one, even if you get more positions. This is a problem for grad students who want to become number theorists, but that probably isn’t enough to affect the politics of hiring.

So here’s my data science plan: test the above hypotheses. I said them because I think they are probably true, but it would not be impossible to create the dataset to test them thoroughly and measure the effects.

The easiest and most direct one to test is the third: cluster departments by subject by linking the people with their published or arXiv’ed papers. Watch the department change over time and see how the clusters change and grow versus how it might happen randomly. Easy peasy lemon squeezy if you have lots of data. Start collecting it now!
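Here is a toy sketch of what “versus how it might happen randomly” could mean in practice. It compares the share of new hires landing in fields a department already has with the share you would expect if each hire's field were simply drawn from the applicant pool. The null model, the function names, and all the data are assumptions for illustration; no such dataset exists yet.

```python
def share_into_existing(existing_fields, hires):
    """Fraction of new hires whose field was already represented in the department."""
    return sum(1 for field in hires if field in existing_fields) / len(hires)

def expected_share_if_random(existing_fields, pool_shares):
    """Expected fraction under a null model where each hire's field is drawn
    from the overall applicant pool, ignoring who is already in the department."""
    return sum(share for field, share in pool_shares.items() if field in existing_fields)

# Hypothetical department and applicant pool
existing = {"number theory", "algebraic geometry"}
hires = ["number theory", "algebraic geometry", "number theory", "topology"]
pool = {"number theory": 0.2, "algebraic geometry": 0.2, "topology": 0.3, "analysis": 0.3}

observed = share_into_existing(existing, hires)
expected = expected_share_if_random(existing, pool)
print(observed, expected)
```

A large gap between the two numbers, repeated across many departments and years, would be evidence that clusters grow rather than multiply.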

The first two are harder but could be related to the project of ranking papers. In other words, you have to define “is really great” to do this. It won’t mean you can say with confidence that X should have gotten a job at University Y, but it would mean you could say that if X’s subject isn’t represented in University Y’s clusters, then X’s chances of getting a job there, all other things being equal, are diminished by Z% on average. Something like that.

There are of course good things about the clustering. For example, it’s not that much fun to be the only person representing a field in your department. I’m not actually passing judgment on this fact, and I’m also not suggesting a way to avoid it (if it should be avoided).

## Columbia Data Science course, week 12: Predictive modeling, data leakage, model evaluation

This week’s guest lecturer in Rachel Schutt’s Columbia Data Science class was Claudia Perlich. Claudia has been the Chief Scientist at m6d for 3 years. Before that she was in a data analytics group at the IBM center that developed Watson, the computer that won Jeopardy!, although she didn’t work on that project. Claudia got her Ph.D. in information systems at NYU and now teaches a class to business students in data science, although mostly she addresses how to assess data science work and how to manage data scientists. Claudia also holds a master’s in Computer Science.

Claudia is a famously successful data mining competition winner. She won the KDD Cup in 2003, 2007, 2008, and 2009, the ILP Challenge in 2005, the INFORMS Challenge in 2008, and the Kaggle HIV competition in 2010.

She’s also been a data mining competition organizer, first for the INFORMS Challenge in 2009 and then for the Heritage Health Prize in 2011. Claudia claims to be retired from competition.

Claudia’s advice to young people: pick your advisor first, then choose the topic. It’s important to have great chemistry with your advisor, and don’t underestimate its importance.

Background

Here’s what Claudia historically does with her time:

• predictive modeling
• data mining competitions
• publications in conferences like KDD and journals
• talks
• patents
• teaching
• digging around data (her favorite part)

Claudia likes to understand something about the world by looking directly at the data.

Here’s Claudia’s skill set:

• plenty of experience doing data stuff (15 years)
• data intuition (for which one needs to get to the bottom of the data generating process)
• dedication to the evaluation (one needs to cultivate a good sense of smell)
• model intuition (we use models to diagnose data)

Claudia also addressed being a woman. She says it works well in the data science field, where her intuition is useful and is used. She claims her nose is so well developed by now that she can smell it when something is wrong. This is not the same thing as being able to prove something algorithmically. Also, people typically remember her because she’s a woman, even when she doesn’t remember them. It has worked in her favor, she says, and she’s happy to admit this. But then again, she is where she is because she’s good.

Someone in the class asked if papers submitted for journals and/or conferences are blind to gender. Claudia responded that it was, for some time, typically double-blind but now it’s more likely to be one-sided. And anyway there was a cool analysis that showed you can guess who wrote a paper with 80% accuracy just by knowing the citations. So making things blind doesn’t really help. More recently the names are included, and hopefully this doesn’t make things too biased. Claudia admits to being slightly biased towards institutions – certain institutions prepare better work.

Skills and daily life of a Chief Data Scientist

Claudia’s primary skills are as follows:

• Data manipulation: unix (sed, awk, etc), Perl, SQL
• Modeling: various methods (logistic regression, nearest neighbors, k-nearest neighbors, etc)
• Setting things up

She mentions that the methods don’t matter as much as how you’ve set it up, and how you’ve translated it into something where you can solve a question.

More recently, she’s been told that at work she spends:

• 40% of time as “contributor”: doing stuff directly with data
• 40% of time as “ambassador”: writing stuff, giving talks, mostly external communication to represent m6d, and
• 20% of time in “leadership” of her data group

At IBM it was much more focused on the first category. Even so, she has a flexible schedule at m6d and is treated well.

The goals of the audience

She asked the class, why are you here? Do you want to:

• become a data scientist? (good career choice!)
• work with a data scientist?
• work for a data scientist?
• manage a data scientist?

Most people were trying their hands at the first, but we had a few in each category.

She mentioned that it matters because the way she’d talk to people wanting to become a data scientist would be different from the way she’d talk to someone who wants to manage them. Her NYU class is more like how to manage one.

So, for example, you need to be able to evaluate their work. It’s one thing to check a bubble sort algorithm or check whether a SQL server is working, but checking a model which purports to give the probability of people converting is a different kettle of fish.

For example, try to answer this: how much better can that model get if you spend another week on it? Let’s face it, quality control is hard for yourself as a data miner, so it’s definitely hard for other people. There’s no easy answer.

There’s an old joke that comes to mind: What’s the difference between a scientist and a consultant? The scientist asks, how long does it take to get this right? whereas the consultant asks, how right can I get this in a week?

Insights into data

A student asks, how do you turn a data analysis into insights?

Claudia: this is a constant point of contention. My attitude is: I like to understand something, but what I like to understand isn’t what you’d consider an insight. My message may be, hey you’ve replaced every “a” by a “0”, or, you need to change the way you collect your data. In terms of useful insight, Ori’s lecture from last week, when he talked about causality, is as close as you get.

For example, decision trees you interpret, and people like them because they’re easy to interpret, but I’d ask, why does it look like it does? A slightly different data set would give you a different tree and you’d get a different conclusion. This is the illusion of understanding. I tend to be careful with delivering strong insights in that sense.

For more in this vein, Claudia suggests we look at Monica Rogati’s talk “Lies, damn lies, and the data scientist.”

Data mining competitions

Claudia drew a distinction between different types of data mining competitions.

On the one hand you have the “sterile” kind, where you’re given a clean, prepared data matrix, a standard error measure, and where the features are often anonymized. This is a pure machine learning problem.

Examples of this first kind are: KDD Cup 2009 and 2011 (Netflix). In such competitions, your approach would emphasize algorithms and computation. The winner would probably have heavy machines and huge modeling ensembles.

On the other hand, you have the “real world” kind of data mining competition, where you’re handed raw data, which is often in lots of different tables and not easily joined, where you set up the model yourself and come up with task-specific evaluations. This kind of competition simulates real life more.

Examples of this second kind are: KDD Cup 2007, 2008, and 2010. If you’re competing in this kind of competition your approach would involve understanding the domain, analyzing the data, and building the model. The winner might be the person who best understands how to tailor the model to the actual question.

Claudia prefers the second kind, because it’s closer to what you do in real life. In particular, the same things go right or go wrong.

How to be a good modeler

Claudia claims that data and domain understanding is the single most important skill you need as a data scientist. At the same time, this can’t really be taught – it can only be cultivated.

A few lessons learned about data mining competitions that Claudia thinks are overlooked in academics:

• Leakage: the contestant’s best friend and the organizer’s/practitioner’s worst nightmare. There’s always something wrong with the data, and Claudia has made an artform of figuring out how the people preparing the competition got lazy or sloppy with the data.
• Adapting learning to real-life performance measures beyond standard measures like MSE, error rate, or AUC (profit?)
• Feature construction/transformation: real data is rarely flat (i.e. given to you in a beautiful matrix) and good, practical solutions for this problem remain a challenge.
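To illustrate the second lesson, here is a small sketch of evaluating a scoring model by profit rather than by error rate. The gain of 10 per conversion, the cost of 1 per person targeted, the function names, and the scores are all made-up assumptions; the point is only that the cutoff gets chosen by the business measure.

```python
def profit_at_threshold(scores, converted, threshold, gain=10.0, cost=1.0):
    """Profit if we target everyone scoring at or above `threshold`.

    Each targeted person costs `cost`; each targeted person who converts earns `gain`.
    """
    profit = 0.0
    for score, did_convert in zip(scores, converted):
        if score >= threshold:
            profit += (gain if did_convert else 0.0) - cost
    return profit

def best_threshold(scores, converted, gain=10.0, cost=1.0):
    """Try each observed score as a cutoff and return the most profitable one."""
    return max(
        sorted(set(scores)),
        key=lambda t: profit_at_threshold(scores, converted, t, gain, cost),
    )

# Hypothetical model scores and whether each person actually converted
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
converted = [1, 0, 1, 0, 0, 0]

cutoff = best_threshold(scores, converted)
print(cutoff, profit_at_threshold(scores, converted, cutoff))
```

Changing the assumed gain or cost moves the best cutoff, which is the sense in which the performance measure has to be adapted to the real-life problem.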

Leakage

Leakage refers to something in the data that helps you predict the target in a way that isn’t fair. It’s a huge problem in modeling, and not just for competitions. Oftentimes it’s an artifact of reversing cause and effect.

Example 1: There was a competition where you needed to predict S&P in terms of whether it would go up or go down. The winning entry had an AUC (area under the ROC curve) of 0.999 out of 1. Since stock markets are pretty close to random, either someone’s very rich or there’s something wrong. There’s something wrong.
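For readers who haven't computed an AUC by hand, here is a minimal sketch using its pairwise interpretation: the probability that a randomly chosen positive case gets a higher score than a randomly chosen negative one. The function name and the two sets of toy scores are invented for illustration and have nothing to do with the actual competition data.

```python
def auc(scores, labels):
    """Area under the ROC curve: the probability that a randomly chosen positive
    is scored above a randomly chosen negative (ties count as half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical up (1) / down (0) days and two sets of model scores
labels = [1, 0, 1, 0, 1, 0]
honest_scores = [0.6, 0.5, 0.4, 0.6, 0.7, 0.3]
leaky_scores = [0.9, 0.1, 0.8, 0.2, 0.95, 0.05]

print(auc(honest_scores, labels))
print(auc(leaky_scores, labels))
```

An AUC of 0.5 is coin-flipping and 1.0 is a perfect ranking, so a score near 1 on something as noisy as a stock index should make you look for a leak before you celebrate.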

In the good old days you could win competitions this way, by finding the leakage.

Example 2: Amazon case study: big spenders. The target of this competition was to predict customers who spend a lot of money among customers using past purchases. The data consisted of transaction data in different categories. But a winning model identified that “Free Shipping = True” was an excellent predictor.

What happened here? The point is that free shipping is an effect of big spending. But it’s not a good way to model big spending, because in particular it doesn’t work for new customers or for the future. Note: timestamps are weak here. The data that included “Free Shipping = True” was simultaneous with the sale, which is a no-no. We need to only use data from beforehand to predict the future.
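One simple guard against this kind of leak is to filter every candidate feature by its timestamp before modeling. The sketch below assumes each record carries a reliable timestamp, which, as noted, was exactly what was weak in this case; the field names and dates are hypothetical.

```python
from datetime import datetime

def features_before(events, cutoff):
    """Keep only events recorded strictly before the prediction time."""
    return [e for e in events if e["timestamp"] < cutoff]

# Hypothetical customer history; the free-shipping flag is set at the moment of the sale
events = [
    {"timestamp": datetime(2012, 3, 1), "field": "category", "value": "books"},
    {"timestamp": datetime(2012, 6, 15), "field": "category", "value": "electronics"},
    {"timestamp": datetime(2012, 11, 20), "field": "free_shipping", "value": True},
]
prediction_time = datetime(2012, 11, 20)

usable = features_before(events, prediction_time)
print([e["field"] for e in usable])
```

With a strict cutoff, anything recorded at the same moment as the sale being predicted is dropped, including the free-shipping flag.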

Example 3: Again an online retailer, this time the target is predicting customers who buy jewelry. The data consists of transactions for different categories. A very successful model simply noted that sum(revenue) = 0 predicts jewelry customers very well.

What happened here? The people preparing this data removed jewelry purchases, but only included people who bought something in the first place. So people who had sum(revenue) = 0 were people who only bought jewelry. The fact that you only got into the dataset if you bought something is weird: in particular, you wouldn’t be able to use this on customers before they finished their purchase. So the model wasn’t being trained on the right data to make the model useful. This is a sampling problem, and it’s common.

Example 4: This happened at IBM. The target was to predict companies who would be willing to buy “websphere” solutions. The data was transaction data + crawled potential company websites. The winning model showed that if the term “websphere” appeared on the company’s website, then they were great candidates for the product.

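What happened? You can’t crawl the historical web, just today’s web. Companies that had already bought the product now mention it on their sites, so the feature is an effect of the purchase rather than a predictor of it.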

Thought experiment

You’re trying to study who has breast cancer. The patient ID, which seemed innocent, actually has predictive power. What happened?

In the above image, red means cancerous, green means not. It’s plotted by patient ID. We see three or four distinct buckets of patient identifiers. It’s very predictive depending on the bucket. This is probably a consequence of using multiple databases, some of which correspond to sicker patients.

A student suggests: for the purposes of the contest they should have renumbered the patients and randomized.

Claudia: would that solve the problem? There could be other things in common as well.

A student remarks: The important issue could be to see the extent to which we can figure out which dataset a given patient came from based on things besides their ID.

Claudia: Think about this: what do we want these models for in the first place? How well can you predict cancer?

Given a new patient, what would you do? If the new patient is in a fifth bin in terms of patient ID, then obviously don’t use the identifier model. But if it’s still in this scheme, then maybe that really is the best approach.

This discussion brings us back to the fundamental problem that we need to know what the purpose of the model is and how it is going to be used in order to decide how to build it and whether it’s working.

Pneumonia

During an INFORMS competition on pneumonia predictions in hospital records, where the goal was to predict whether a patient has pneumonia, a logistic regression which included the number of diagnosis codes as a numeric feature (AUC of 0.80) didn’t do as well as the one which included it as a categorical feature (0.90). What’s going on?

This had to do with how the person prepared the data for the competition:

The diagnosis code for pneumonia was 486. So the preparer removed that (and replaced it by a “-1”) if it showed up in the record (rows are different patients, columns are different diagnoses, there are at most 4 diagnoses, and “-1” means there’s nothing for that entry).

Moreover, to avoid telltale holes in the data, the preparer moved the other diagnoses to the left if necessary, so that only “-1”s were on the right.

There are two problems with this:

1. If the row has only “-1”s, then you know it started out with only pneumonia, and
2. If the row has no “-1”s, you know there’s no pneumonia (unless there are actually 5 diagnoses, but that’s less common).

This was enough information to win the competition.
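To see how this preparation leaks the label, here is a small sketch in Python. The toy records and the helper name `prepare` are made up for illustration; only the code 486, the four slots, and the “-1” padding come from the description above.

```python
PNEUMONIA = 486   # diagnosis code for pneumonia, as described above
SLOTS = 4         # maximum number of diagnosis columns
EMPTY = -1        # marker for "no diagnosis in this slot"

def prepare(diagnoses):
    """Mimic the preparer: drop the pneumonia code, shift left, pad with -1."""
    kept = [d for d in diagnoses if d != PNEUMONIA][:SLOTS]
    return kept + [EMPTY] * (SLOTS - len(kept))

# Hypothetical raw records (one list of diagnosis codes per patient).
raw = [
    [486],                  # pneumonia only
    [250, 486, 401],        # pneumonia plus two others
    [250, 401, 428, 585],   # four diagnoses, no pneumonia
    [401, 428],             # two diagnoses, no pneumonia
]
labels = [PNEUMONIA in r for r in raw]
prepared = [prepare(r) for r in raw]

for row, has_pneumonia in zip(prepared, labels):
    # The number of empty slots is the leaky feature.
    print(row, "empty slots:", row.count(EMPTY), "pneumonia:", has_pneumonia)
```

A row with four empty slots can only come from a pneumonia-only patient, and a row with none almost certainly has no pneumonia. Treating the count as a categorical feature lets a model pick up those two extreme cases separately, which is plausibly why it beat the numeric version.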

Note: winning a competition on leakage is easier than building good models. But even if you don’t explicitly understand and game the leakage, your model will do it for you. Either way, leakage is a huge problem.

How to avoid leakage

Claudia’s advice to avoid this kind of problem:

• You need a strict temporal cutoff: remove all information that was not already available just prior to the event of interest (e.g. patient admission).
• There has to be a timestamp on every entry and you need to keep it
• Removing columns asks for trouble
• Removing rows can introduce inconsistencies with other tables, also causing trouble
• The best practice is to start from scratch with clean, raw data after careful consideration
• You need to know how the data was created! I only work with data I pulled and prepared myself (or maybe Ori).
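The temporal cutoff in the first two bullets can be sketched in a few lines. This is only an illustration: the record layout, the field names, and the helper `features_before` are assumptions, not anything Claudia showed.

```python
from datetime import datetime

def features_before(entries, event_time):
    """Keep only entries timestamped strictly before the event of interest."""
    return [e for e in entries if e["timestamp"] < event_time]

# Hypothetical transaction log for one customer.
entries = [
    {"timestamp": datetime(2012, 3, 1), "field": "category", "value": "books"},
    {"timestamp": datetime(2012, 3, 9), "field": "category", "value": "music"},
    {"timestamp": datetime(2012, 3, 10), "field": "free_shipping", "value": True},
]
sale_time = datetime(2012, 3, 10)  # the event we want to predict

usable = features_before(entries, sale_time)
print([e["field"] for e in usable])
```

The free shipping flag is stamped at the same moment as the sale, so the strict inequality drops it, which is the behavior you want in the big spenders example.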

Evaluations

How do I know that my model is any good?

With powerful algorithms searching for patterns, there is a serious danger of overfitting. It’s a difficult concept, but the general idea is that “if you look hard enough you’ll find something” even if it does not generalize beyond the particular training data.

To avoid overfitting, we cross-validate and we cut down on the complexity of the model to begin with. Here’s a standard picture (although keep in mind we generally work in high dimensional space and don’t have a pretty picture to look at):

The picture on the left is underfit, in the middle is good, and on the right is overfit.

The model you use matters when it concerns overfitting:

So for the above example, unpruned decision trees overfit the most. This is a well-known problem with unpruned decision trees, which is why people use pruned decision trees.

Accuracy: meh

Claudia dismisses accuracy as a bad evaluation method. What’s wrong with accuracy? It’s inappropriate for regression obviously, but even for classification, if the vast majority of binary outcomes are 1, then a stupid model can be accurate but not good (guess it’s always “1”), and a better model might have lower accuracy.
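Here is a tiny illustration of that point in Python. The labels and both sets of predictions are invented for the example.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical labels: 95 positives followed by 5 negatives.
y_true = [1] * 95 + [0] * 5

# The "stupid" model: always guess 1.
always_one = [1] * 100

# A model that finds 4 of the 5 negatives but mislabels 10 positives.
finds_negatives = [1] * 85 + [0] * 10 + [0] * 4 + [1] * 1

print(accuracy(y_true, always_one))
print(accuracy(y_true, finds_negatives))
```

The always-1 model scores higher on accuracy even though it tells you nothing about who the negatives are.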

Probabilities matter, not 0s and 1s.

Nobody makes decisions on binary outcomes. I want to know the probability I have breast cancer, I don’t want to be told yes or no. It’s much more information. I care about probabilities.

How to evaluate a probability model

We separately evaluate the ranking and the calibration. To evaluate the ranking, we use the ROC curve and calculate the area under it, which typically ranges from 0.5 to 1.0. This is independent of scaling and calibration. Here’s an example of how to draw an ROC curve:
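The area under the ROC curve can also be computed without drawing anything: it equals the fraction of (positive, negative) pairs in which the positive example gets the higher score. The sketch below uses that pairwise definition on made-up labels and scores; the function name `auc` is mine.

```python
def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (positive, negative) pairs in which the positive
    example gets the higher score; ties count as half.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical labels and model scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

print(auc(y_true, scores))
# Rescaling the scores without changing their order leaves the AUC unchanged.
print(auc(y_true, [s / 10 for s in scores]))
```

Because only the ordering of the scores matters, this is a measure of ranking and says nothing about whether the probabilities themselves are right.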

Sometimes to measure rankings, people draw the so-called lift curve:

The key here is that the lift is calculated with respect to a baseline. You draw it at a given point, say 10%, by imagining that 10% of people are shown ads, and seeing how many people click versus if you randomly showed 10% of people ads.  A lift of 3 means it’s 3 times better.
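As a rough sketch of that calculation, with invented clicks and scores and a hypothetical helper `lift_at`:

```python
def lift_at(y_true, scores, fraction):
    """Response rate in the top-scored fraction divided by the overall rate."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    k = max(1, int(round(fraction * len(ranked))))
    top_rate = sum(y for _, y in ranked[:k]) / k
    base_rate = sum(y_true) / len(y_true)  # what random targeting would get
    return top_rate / base_rate

# Hypothetical data: ten people, two of whom click.
y_true = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]

print(lift_at(y_true, scores, 0.2))
```

Here the top 20% of people by score contains one clicker out of two, against a baseline click rate of 2 in 10, so the model does 2.5 times better than showing ads at random.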

How do you measure calibration? Are the probabilities accurate? If the model says probability of 0.57 that I have cancer, how do I know if it’s really 0.57? We can’t measure this directly. We can only bucket those predictions and then aggregately compare those in that prediction bucket (say 0.50-0.55) to the actual results for that bucket.
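The bucketing can be sketched as follows. The equal-width bins, the function name `calibration_table`, and the toy predictions are all assumptions made for the example.

```python
def calibration_table(y_true, probs, n_bins=5):
    """Bucket predictions and compare mean prediction to observed rate."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, probs):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    table = []
    for idx, members in enumerate(bins):
        if not members:
            continue  # skip empty buckets
        mean_pred = sum(p for p, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        table.append((idx, len(members), mean_pred, observed))
    return table

# Hypothetical model that only ever predicts 0.1 or 0.9.
probs = [0.1, 0.1, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9]
y_true = [0, 0, 0, 1, 1, 1, 1, 0]

for idx, n, mean_pred, observed in calibration_table(y_true, probs):
    print(f"bin {idx}: n={n} predicted={mean_pred:.2f} observed={observed:.2f}")
```

In this toy case the model says 0.1 where the observed rate is 0.25 and 0.9 where it is 0.75, so its predictions are more extreme than reality. With real data you would want many more examples per bucket before trusting the comparison.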

For example, here’s what you get when your model is an unpruned decision tree, where the blue diamonds are buckets:

A good model would show buckets right along the x=y curve, but here we’re seeing that the predictions were much more extreme than the actual probabilities. Why does this pattern happen for decision trees?

Claudia says that this is because trees optimize purity: they seek out pockets that have only positives or negatives. Therefore their predictions are more extreme than reality. This is generally true about decision trees: they do not generally perform well with respect to calibration.

Logistic regression looks better when you test calibration, which is typical:

Takeaways:

• Accuracy is almost never the right evaluation metric.
• Probabilities, not binary outcomes.
• Separate ranking from calibration.
• Ranking you can measure with nice pictures: ROC, lift
• Calibration is measured indirectly through binning.
• Different models are better than others when it comes to calibration.
• Calibration is sensitive to outliers.
• Measure what you want to be good at.
• Have a good baseline.

Choosing an algorithm

This is not a trivial question and in particular small tests may steer you wrong, because as you increase the sample size the best algorithm might vary: often decision trees perform very well but only if there’s enough data.

In general you need to choose your algorithm depending on the size and nature of your dataset and you need to choose your evaluation method based partly on your data and partly on what you wish to be good at. Sum of squared errors is the maximum likelihood loss function if your data can be assumed to be normal, but if you want to estimate the median, then use absolute errors. If you want to estimate a quantile, then minimize the weighted absolute error.
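A quick way to convince yourself of the link between loss function and target is a brute-force search over constant predictions. The data, the grid, and the function names below are invented for the illustration; the weighted absolute error is written in its usual “pinball” form.

```python
def squared_loss(c, xs):
    return sum((x - c) ** 2 for x in xs)

def absolute_loss(c, xs):
    return sum(abs(x - c) for x in xs)

def pinball_loss(c, xs, q):
    """Weighted absolute error: under-predictions cost q, over-predictions 1 - q."""
    return sum(q * (x - c) if x >= c else (1 - q) * (c - x) for x in xs)

# Hypothetical data with one large outlier.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 99]
candidates = [i / 10 for i in range(0, 1001)]  # grid from 0.0 to 100.0

best_sq = min(candidates, key=lambda c: squared_loss(c, xs))
best_abs = min(candidates, key=lambda c: absolute_loss(c, xs))
best_q75 = min(candidates, key=lambda c: pinball_loss(c, xs, 0.75))

print(best_sq, best_abs, best_q75)
```

The squared loss lands on the mean (15.0, dragged up by the outlier), the absolute loss on the median (5.0), and the weighted absolute loss with weight 0.75 on the 75th percentile (7.0).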

We worked on predicting the number of ratings a movie will get in the next year, and we assumed a Poisson distribution. In this case our evaluation method doesn’t involve minimizing the sum of squared errors, but rather something else which we found in the literature specific to the Poisson distribution, which depends on the single parameter $\lambda$:

Charity direct mail campaign

Let’s put some of this together.

Say we want to raise money for a charity. If we send a letter to every person in the mailing list we raise about $9000. We’d like to save money and only send letters to people who are likely to give – only about 5% of people generally give. How can we do that? If we use a (somewhat pruned, as is standard) decision tree, we get $0 profit: it never finds a leaf with majority positives.

If we use a neural network we still make only $7500, even if we only send a letter in the case where we expect the return to be higher than the cost.

This looks unworkable. But if your model is better, it’s not. A person makes two decisions here. First, they decide whether or not to give, then they decide how much to give. Let’s model those two decisions separately, using:

$E(\$|person) = P(response = 'yes'| person) \cdot E(\$|response = 'yes', person).$

Note we need the first model to be well-calibrated because we really care about the number, not just the ranking. So we will try logistic regression for the first half. For the second part, we train only on the examples where there are donations.

Altogether this decomposed model makes a profit of $15,000. The decomposition made it easier for the model to pick up the signals. Note that with infinite data, all would have been good, and we wouldn’t have needed to decompose. But you work with what you got.

Moreover, you are multiplying errors above, which could be a problem if you have a reason to believe that those errors are correlated.
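The two-stage decision rule described above can be sketched like this. The probabilities, gift sizes, and cost per letter are invented; in practice the first number would come from the calibrated response model and the second from the model trained on donors only.

```python
def expected_donation(p_give, amount_if_give):
    """E($ | person) = P(response = 'yes' | person) * E($ | response = 'yes', person)."""
    return p_give * amount_if_give

def should_mail(p_give, amount_if_give, cost_per_letter):
    """Send a letter only when the expected donation exceeds its cost."""
    return expected_donation(p_give, amount_if_give) > cost_per_letter

# Hypothetical people: (probability of giving, expected gift if they give).
people = [(0.05, 10.0), (0.02, 100.0), (0.30, 1.0)]
COST = 0.68  # hypothetical cost of one letter, in dollars

decisions = [should_mail(p, amt, COST) for p, amt in people]
print(decisions)  # [False, True, False]
```

Notice that the only person worth mailing is the one with the lowest probability of giving, which is why the probability has to be a real number you can multiply and not just a ranking.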

Parting thoughts

We are not meant to understand data. Data are outside of our sensory systems and there are very few people who have a near-sensory connection to numbers. We are instead meant to understand language.

We are not meant to understand uncertainty: we have all kinds of well-documented biases that prevent this from happening.

Modeling people in the future is intrinsically harder than figuring out how to label things that have already happened.

Even so we do our best, and this is through careful data generation, careful consideration of what our problem is, making sure we model it with data close to how it will be used, making sure we are optimizing to what we actually desire, and doing our homework in learning which algorithms fit which tasks.

## O’Reilly book deal signed for “Doing Data Science”

I’m very happy to say I just signed a book contract with my co-author, Rachel Schutt, to publish a book with O’Reilly called Doing Data Science.

The book will be based on the class Rachel is giving this semester at Columbia which I’ve been blogging about here.

For those of you who’ve been reading along for free as I’ve been blogging it, there might not be a huge incentive to buy it, but I can promise you more and better math, more explicit usable formulas, some sample code, and an overall better and more thought-out narrative.

It’s supposed to be published in May with a possible early release coming up at the end of February, in time for the O’Reilly Strata Santa Clara conference, where Rachel will be speaking about it and about other stuff curriculum related. Hopefully people will pick it up in time to teach their data science courses in Fall 2013.

Speaking of Rachel, she’s also been selected to give a TedXWomen talk at Barnard on December 1st, which is super exciting. She’s talking about advocating for the social good using data. Unfortunately the event is invitation-only, otherwise I’d encourage you all to go and hear her words of wisdom. Update: word on the street is that it will be video-taped.

## Columbia Data Science course, week 11: Estimating causal effects

This week in Rachel Schutt’s Data Science course at Columbia we had Ori Stitelman, a data scientist at Media6Degrees.

We also learned last night of a new Columbia course: STAT 4249 Applied Data Science, taught by Rachel Schutt and Ian Langmore. More information can be found here.

Ori’s background

Ori got his Ph.D. in Biostatistics from UC Berkeley after working at a litigation consulting firm. He credits that job with allowing him to understand data through exposure to tons of different data sets, since his job involved creating stories out of data to let experts testify at trials, e.g. for asbestos. In this way Ori developed his data intuition.

Ori worries that people ignore this necessary data intuition when they shove data into various algorithms. He thinks that when their method converges, they are convinced the results are therefore meaningful, but he’s here today to explain that we should be more thoughtful than that.

It’s very important when estimating causal parameters, Ori says, to understand the data-generating distributions, and that involves gaining subject matter knowledge that allows you to understand whether your necessary assumptions are plausible.

Ori says the first step in a data analysis should always be to take a step back and figure out what you want to know, write that down, and then find and use the tools you’ve learned to answer those directly. Later of course you have to decide how close you came to answering your original questions.

Thought Experiment

Ori asks, how do you know if your data may be used to answer your question of interest? Sometimes people think that because they have data on a subject matter they can answer any question about it.

Students had some ideas:

• You need coverage of your parameter space. For example, if you’re studying the relationship between household income and holidays but your data is from poor households, then you can’t extrapolate to rich people. (Ori: but you could ask a different question)
• Causal inference with no timestamps won’t work.
• You have to keep in mind what happened when the data was collected and how that process affected the data itself
• Make sure you have the base case: compared to what? If you want to know how politicians are affected by lobbyists’ money you need to see how they behave in the presence of money and in the presence of no money. People often forget the latter.
• Sometimes you’re trying to measure weekly effects but you only have monthly data. You end up using proxies. Ori: but it’s still good practice to ask the precise question that you want, then come back and see if you’ve answered it at the end. Sometimes you can even do a separate evaluation to see if something is a good proxy.
• Signal to noise ratio is something to worry about too: as you have more data, you can more precisely estimate a parameter. You’d think 10 observations about purchase behavior is not enough, but as you get more and more examples you can answer more difficult questions.

Ori explains confounders with a dating example

Frank has an important decision to make. He’s perusing a dating website and comes upon a very desirable woman – he wants her number. What should he write in his email to her? Should he tell her she is beautiful? How do you answer that with data?

You could have him select a bunch of beautiful women and, for half of them chosen at random, tell them they’re beautiful. Being random allows us to assume that the two groups have similar distributions of various features (note that’s an assumption).

Our real goal is to understand the future under two alternative realities, the treated and the untreated. When we randomize we are making the assumption that the treated and untreated populations are alike.

OK Cupid looked at this and concluded:

But note:

• It could say more about the person who says “beautiful” than the word itself. Maybe they are otherwise ridiculous and overly sappy?
• The recipients of emails containing the word “beautiful” might be special: for example, they might get tons of email, which would make it less likely for Frank to get any response at all.
• For that matter, people may be describing themselves as beautiful.

Ori points out that this fact, that she’s beautiful, affects two separate things:

1. whether Frank uses the word “beautiful” or not in his email, and
2. the outcome (i.e. whether Frank gets the phone number).

For this reason, the fact that she’s beautiful qualifies as a confounder. The treatment is Frank writing “beautiful” in his email.

Causal graphs

Denote by $W$ the list of all potential confounders. Note it’s an assumption that we’ve got all of them (and recall how unreasonable this seems to be in epidemiology research).

Denote by $A$ the treatment (so, Frank using the word “beautiful” in the email). We usually assume this to be binary (0/1).

Denote by $Y$ the binary (0/1) outcome (Frank getting the number).

We are forming the following causal graph:

In a causal graph, each arrow means that the ancestor is a cause of the descendant, where the ancestor is the node the arrow is coming out of and the descendant is the node the arrow is going into (see this book for more).

In our example with Frank, the arrow from beauty means that the woman being beautiful is a cause of Frank writing “beautiful” in the message. Both the man writing “beautiful” and the woman being beautiful are direct causes of her probability to respond to the message.

Setting the problem up formally

The building blocks in understanding the above causal graph are:

1. Ask question of interest.
2. Make causal assumptions (denote these by $P$).
3. Translate question into a formal quantity (denote this by $\Psi(P)$).
4. Estimate quantity (denote this by $\Psi(P_n)$).

We need domain knowledge in general to do this. We also have to take a look at the data before setting this up, for example to make sure we may make the

Positivity Assumption. We need treatment (i.e. data) in all strata of things we adjust for. So if we think gender is a confounder, we need to make sure we have data on women and on men. If we also adjust for age, we need data in all of the resulting bins.

What is the effect of ___ on ___?

This is the natural form of a causal question. Here are some examples:

1. What is the effect of advertising on customer behavior?
2. What is the effect of beauty on getting a phone number?
3. What is the effect of censoring on outcome? (censoring is when people drop out of a study)
4. What is the effect of drug on time until viral failure?, and the general case
5. What is the effect of treatment on outcome?

Look, estimating causal parameters is hard. In fact the effectiveness of advertising is almost always ignored because it’s so hard to measure. Typically people choose metrics of success that are easy to estimate but don’t measure what they want! Everyone makes decisions based on them anyway because it’s easier. This results in people being rewarded for finding people online who would have converted anyway.

Accounting for the effect of interventions

Thinking about that, we should be concerned with the effect of interventions. What’s a model that can help us understand that effect?

A common approach is the (randomized) A/B test, which involves the assumption that two populations are equivalent. As long as that assumption is pretty good, which it usually is with enough data, then this is kind of the gold standard.

But A/B tests are not always possible (or they are too expensive to be plausible). Often we need to instead estimate the effects in the natural environment, but then the problem is the guys in different groups are actually quite different from each other.

So, for example, you might find you showed ads to more people who are hot for the product anyway; it wouldn’t make sense to test the ad that way without adjustment.

The game is then defined: how do we adjust for this?

The ideal case

Similar to how we did this last week, we pretend for now that we have a “full” data set, which is to say we have god-like powers and we know what happened under treatment as well as what would have happened if we had not treated, as well as vice-versa, for every agent in the test.

Denote this full data set by $X:$

$X = (W, A, Y^*(1), Y^*(0)),$ where

• $W$ denotes the baseline variables (attributes of the agent) as above,
• $A$ denotes the binary treatment as above,
• $Y^*(1)$ denotes the binary outcome if treated, and
• $Y^*(0)$ denotes the binary outcome if untreated.

As a baseline check: if we observed this full data structure how would we measure the effect of A on Y? In that case we’d be all-powerful and we would just calculate:

$E(Y^*(1)) - E(Y^*(0)).$

Note that, since $Y^*(0)$ and $Y^*(1)$ are binary, the expected value $E(Y^*(0))$ is just the probability of a positive outcome if untreated. So in the case of advertising, the above is the conversion rate change when you show someone an ad. You could also take the ratio of the two quantities:

$E(Y^*(1))/E(Y^*(0)).$

This would be calculating how much more likely someone is to convert if they see an ad.

Note these are outcomes you can really do stuff with. If you know people convert at 30% versus 10% in the presence of an ad, that’s real information. Similarly if they convert 3 times more often.
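With god-like “full” data, both quantities are one-liners. The ten agents below are invented so that the numbers match the 30% versus 10% example.

```python
# Hypothetical "full" data: (Y*(1), Y*(0)) for each of ten agents.
full = [(1, 0), (1, 0), (1, 1), (0, 0), (0, 0),
        (0, 0), (0, 0), (0, 0), (0, 0), (0, 0)]

n = len(full)
rate_treated = sum(y1 for y1, _ in full) / n    # E(Y*(1))
rate_untreated = sum(y0 for _, y0 in full) / n  # E(Y*(0))

difference = rate_treated - rate_untreated
ratio = rate_treated / rate_untreated

print(round(difference, 2), round(ratio, 2))
```

The difference says an ad adds about 20 percentage points of conversion, and the ratio says people are about 3 times as likely to convert when shown the ad.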

In reality people use silly stuff like log odds ratios, which nobody understands or can interpret meaningfully.

The ideal case with functions

In reality we don’t have god-like powers, and we have to make do. We will make a bunch of assumptions. First off, denote by $U$ exogenous variables, i.e. stuff we’re ignoring. Assume there are functions $f_1, f_2,$ and $f_3$ so that:

• $W = f_1(U_W),$ i.e. the attributes $W$ are just functions of some exogenous variables,
• $A = f_2(W, U_A),$ i.e. the treatment depends in a nice way on some exogenous variables as well as the attributes we know about living in $W$, and
• $Y = f_3(A, W, U_Y),$ i.e. the outcome is just a function of the treatment, the attributes, and some exogenous variables.

Note the various $U$‘s could contain confounders in the above notation. That’s gonna change.

But we want to intervene on this causal graph as though it’s the intervention we actually want to make, i.e. what’s the effect of treatment $A$ on outcome $Y$?

Let’s look at this from the point of view of the joint distribution $P(W, A, Y) = P(W)P(A|W)P(Y|A,W).$ These terms correspond to the following in our example:

1. the probability of a woman being beautiful,
2. the probability that Frank writes an email to her saying that she’s beautiful, and
3. the probability that Frank gets her phone number.

What we really care about though is the distribution under intervention:

$P_a = P(W) P(Y_a| W),$

i.e. the probability knowing someone either got treated or not. To answer our question, we manipulate the value of $A,$ first setting it to 1 and doing the calculation, then setting it to 0 and redoing the calculation.

Assumptions

We are making a “Consistency Assumption / SUTVA” which can be expressed like this:

We have also assumed that we have no unmeasured confounders, which can be expressed thus:

We are also assuming positivity, which we discussed above.
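In standard potential-outcomes notation, with the symbols defined above, these three assumptions are commonly written as follows. This is the textbook form, offered as a sketch, and not necessarily the exact form Ori used.

Consistency: the outcome we observe is the potential outcome under the treatment actually received,

$Y = Y^*(A).$

No unmeasured confounders: within levels of $W$, treatment is independent of the potential outcomes,

$A \perp (Y^*(1), Y^*(0)) \mid W.$

Positivity: every stratum of $W$ has some chance of being treated and some chance of being untreated,

$0 < P(A = 1 \mid W) < 1.$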

Down to brass tacks

We only have half the information we need. We need to somehow map the stuff we have to the full data set as defined above. We make use of the following identity:

Recall we want to estimate $\Psi(P) = E(Y^*(1))/E(Y^*(0)),$ which by the above can be rewritten

$E_W(E(Y|A=1, W))/ E_W(E(Y|A=0, W)).$
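For readers who want the intermediate steps, one standard way to get from the full-data quantity to the observed-data one, under the three assumptions above, is sketched here for a fixed treatment level $a$ (my reconstruction, not a quote from the lecture):

$E(Y^*(a)) = E_W(E(Y^*(a)|W))$ by the law of iterated expectations,

$= E_W(E(Y^*(a)|A=a, W))$ because there are no unmeasured confounders (and positivity makes the conditioning well defined),

$= E_W(E(Y|A=a, W))$ by consistency.

Setting $a=1$ and $a=0$ and taking the ratio gives the expression above.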

We’re going to discuss three methods to estimate this quantity, namely:

1. MLE-based substitution estimator (MLE),
2. Inverse probability estimators (IPTW),
3. Double robust estimating equations (A-IPTW)

For the above models, it’s useful to think of there being two machines, called $g$ and $Q$, which generate estimates of the probability of the treatment knowing the attributes (that’s machine $g$) and the probability of the outcome knowing the treatment and the attributes (machine $Q$).

IPTW

In this method, which is also called importance sampling, we weight individuals that are unlikely to be shown an ad more than those likely. In other words, we up-sample in order to generate the distribution, to get the estimation of the actual effect.

To make sense of this, imagine that you’re doing a survey of people to see how they’ll vote, but you happen to do it at a soccer game where you know there are more young people than elderly people. You might want to up-sample the elderly population to make your estimate.

This method can be unstable if there are really small sub-populations that you’re up-sampling, since you’re essentially multiplying by a reciprocal.

The formula in IPTW looks like this:

Note the formula depends on the $g$ machine, i.e. the machine that estimates the treatment probability based on attributes. The problem is that people get the $g$ machine wrong all the time, which makes this method fail.

In words, when $a=1$ we are taking the sum of terms whose numerators are zero unless we have a treated, positive outcome, and we’re weighting them in the denominator by the probability of getting treated so each “population” has the same representation. We do the same for $a=0$ and take the difference.
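Read literally, that description corresponds to the usual IPTW form $\frac{1}{n}\sum_i \left( \frac{I(A_i=1)Y_i}{g(1|W_i)} - \frac{I(A_i=0)Y_i}{g(0|W_i)} \right)$, which I’m reconstructing from the words above. Here is a minimal Python sketch on eight made-up records with a single binary attribute, where the $g$ machine is just the treated fraction within each stratum. The data and function names are illustrative only.

```python
from collections import defaultdict

def fit_g(data):
    """Estimate g(1 | w) = P(A = 1 | W = w) by the treated fraction per stratum."""
    counts = defaultdict(lambda: [0, 0])  # w -> [n_treated, n_total]
    for w, a, _ in data:
        counts[w][0] += a
        counts[w][1] += 1
    return {w: treated / total for w, (treated, total) in counts.items()}

def iptw_mean(data, g1, a):
    """Weighted mean outcome had everyone received treatment level a."""
    total = 0.0
    for w, a_i, y in data:
        if a_i != a:
            continue  # the numerator is zero unless the person got level a
        prob = g1[w] if a == 1 else 1.0 - g1[w]
        total += y / prob
    return total / len(data)

# Hypothetical records (w, a, y): attribute, treatment, outcome.
data = [(0, 1, 1), (0, 0, 1), (0, 0, 0), (0, 0, 0),
        (1, 1, 1), (1, 1, 1), (1, 1, 0), (1, 0, 0)]

g1 = fit_g(data)
psi1 = iptw_mean(data, g1, 1)
psi0 = iptw_mean(data, g1, 0)
print(psi1 - psi0, psi1 / psi0)
```

On this toy data the naive comparison of treated and untreated gives 0.75 versus 0.25, while the weighted estimates are 5/6 and 1/6, because the treated group is dominated by the stratum that gets treated most often.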

MLE

This method is based on the $Q$ machine, which as you recall estimates the probability of a positive outcome given the attributes and the treatment, so the $P(Y|A,W)$ values.

This method is straightforward: shove everyone in the machine and predict how the outcome would look under both treatment and non-treatment conditions, and take the difference.

Note we don’t know anything about the underlying machine $Q$. It could be a logistic regression.
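As a sketch of the substitution idea, here the $Q$ machine is the simplest one imaginable: the observed outcome rate in each (treatment, attribute) cell. The eight records and the function names are made up, and a real $Q$ would usually be a fitted model such as a logistic regression.

```python
from collections import defaultdict

def fit_q(data):
    """Estimate Q(a, w) = P(Y = 1 | A = a, W = w) by the observed rate per cell."""
    cells = defaultdict(lambda: [0, 0])  # (a, w) -> [n_positive, n_total]
    for w, a, y in data:
        cells[(a, w)][0] += y
        cells[(a, w)][1] += 1
    return {key: pos / total for key, (pos, total) in cells.items()}

def substitution_mean(data, q, a):
    """Average the predicted outcome over everyone's W, with treatment set to a.

    Needs positivity: every (a, w) cell must appear in the data.
    """
    return sum(q[(a, w)] for w, _, _ in data) / len(data)

# Hypothetical records (w, a, y): attribute, treatment, outcome.
data = [(0, 1, 1), (0, 0, 1), (0, 0, 0), (0, 0, 0),
        (1, 1, 1), (1, 1, 1), (1, 1, 0), (1, 0, 0)]

q = fit_q(data)
psi1 = substitution_mean(data, q, 1)  # everyone treated
psi0 = substitution_mean(data, q, 0)  # nobody treated
print(psi1 - psi0, psi1 / psi0)
```

Everyone goes through the machine twice, once with $A$ set to 1 and once with $A$ set to 0, regardless of what treatment they actually got.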

Get ready to get worried: A-IPTW

What if our machines are broken? That’s when we bring in the big guns: double robust estimators.

They adjust for confounding through the two machines we have on hand, $Q$ and $g,$ and one machine augments the other depending on how well it works. Here’s the functional form written in two ways to illustrate the hedge:

and

Note: you are still screwed if both machines are broken. In some sense with a double robust estimator you’re hedging your bet.
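To make the hedge concrete, here is a sketch using one common way of writing the augmented estimator for a treatment level $a$: $\frac{1}{n}\sum_i \left( \frac{I(A_i=a)}{g(a|W_i)}(Y_i - Q(a,W_i)) + Q(a,W_i) \right)$. That form is my assumption about what was on the slide. The data and the “good” and “broken” machines are invented, with the good ones set to the cell frequencies of the toy data.

```python
def aiptw_mean(data, q, g1, a):
    """Double robust estimate of E(Y*(a)): Q's prediction plus a weighted residual."""
    total = 0.0
    for w, a_i, y in data:
        prob = g1[w] if a == 1 else 1.0 - g1[w]
        indicator = 1.0 if a_i == a else 0.0
        total += indicator / prob * (y - q[(a, w)]) + q[(a, w)]
    return total / len(data)

# Hypothetical records (w, a, y): attribute, treatment, outcome.
data = [(0, 1, 1), (0, 0, 1), (0, 0, 0), (0, 0, 0),
        (1, 1, 1), (1, 1, 1), (1, 1, 0), (1, 0, 0)]

# "Good" machines match the cell frequencies in the data above.
good_q = {(1, 0): 1.0, (0, 0): 1 / 3, (1, 1): 2 / 3, (0, 1): 0.0}
good_g1 = {0: 0.25, 1: 0.75}
# "Broken" machines ignore W entirely.
bad_q = {key: 0.5 for key in good_q}
bad_g1 = {0: 0.5, 1: 0.5}

q_broken = aiptw_mean(data, bad_q, good_g1, 1)
g_broken = aiptw_mean(data, good_q, bad_g1, 1)
both_broken = aiptw_mean(data, bad_q, bad_g1, 1)
print(q_broken, g_broken, both_broken)
```

With either machine broken the estimate of $E(Y^*(1))$ still comes out at 5/6, the same answer both machines give when they work. With both broken it falls back to 0.75, the naive treated average.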

“I’m glad you’re worried because I’m worried too.” – Ori

Simulate and test

I’ve shown you 3 distinct methods that estimate effects in observational studies. But they often come up with different answers. We set up huge simulation studies with known functions, i.e. where we know the functional relationships between everything, and then tried to infer those using the above three methods as well as a fourth method called TMLE (targeted maximum likelihood estimation).

As a side note, Ori encourages everyone to simulate data.

We wanted to know, which methods fail with respect to the assumptions? How well do the estimates work?

We started to see that IPTW performs very badly when you’re adjusting by very small things. For example, we got an estimate of 132 for the probability of someone getting sick. That’s not between 0 and 1, which is not good. But people use these methods all the time.

Moreover, as things get more complicated with lots of nodes in our causal graph, calculating stuff over long periods of time, populations get sparser and sparser and it has an increasingly bad effect when you’re using IPTW. In certain situations your data is just not going to give you a sufficiently good answer.
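It takes very little to reproduce this kind of blow-up. In the sketch below, which uses invented records and a hypothetical `cap` argument, a single treated person from a stratum where treatment is very rare pushes the estimated probability far above 1. Capping the weights is one common way to tame it.

```python
def iptw_treated_mean(records, cap=None):
    """IPTW estimate of E(Y*(1)); records are (a, y, g1) with g1 = P(A=1 | W)."""
    total = 0.0
    for a, y, g1 in records:
        if a != 1:
            continue  # only treated people contribute
        weight = 1.0 / g1
        if cap is not None:
            weight = min(weight, cap)  # truncate extreme weights
        total += weight * y
    return total / len(records)

# Hypothetical sample: nine ordinary people plus one treated person from a
# stratum where treatment is very rare (g1 = 0.001).
records = [(1, 1, 0.5), (1, 0, 0.5), (0, 0, 0.5), (0, 1, 0.5), (1, 0, 0.5),
           (0, 0, 0.5), (1, 0, 0.5), (0, 0, 0.5), (1, 1, 0.5), (1, 1, 0.001)]

uncapped = iptw_treated_mean(records)
capped = iptw_treated_mean(records, cap=5.0)
print(uncapped, capped)
```

The uncapped estimate is roughly 100 and the capped one is 0.9. The cap is an ad hoc choice, though, and the answer moves with it.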

Causal analysis in online display advertising

An overview of the process:

1. We observe people taking actions (clicks, visits to websites, purchases, etc.).
2. We use this observed data to build a list of “prospects” (people with a liking for the brand).
3. We subsequently observe the same user over the next few days.
4. The user visits a site where a display ad spot exists and bid requests are made.
5. An auction is held for the display spot.
6. If the auction is won, we display the ad.
7. We observe the user’s actions after displaying the ad.

But here’s the problem: we’ve introduced confounders. If you find people who convert at a high rate, people think you’ve done a good job. In other words, we are looking at the treated without looking at the untreated.

We’d like to ask the question, what’s the effect of display advertising on customer conversion?

As a practical concern, people don’t like to spend money on blank ads. So A/B tests are a hard sell.

We performed some what-if analysis predicated on the assumption that the group of users that sees the ad is different. Our process was as follows:

1. Select prospects that we got a bid request for on day 0
2. Observe if they were treated on day 1. For those treated set $A=1$ and for those not treated set $A=0.$ Collect attributes $W.$
3. Create outcome window to be the next five days following treatment; observe if outcome event occurs (visit to the website whose ad was shown).
4. Estimate model parameters using the methods previously described (our three methods plus TMLE).

Here are some results:

Note results vary depending on the method. And there’s no way to know which method is working the best. Moreover, this is when we’ve capped the size of the correction in the IPTW methods. If we don’t then we see ridiculous results:

## Medical research needs an independent modeling panel

I am outraged this morning.

I spent yesterday morning writing up David Madigan’s lecture to us in the Columbia Data Science class, and I can hardly handle what he explained to us: the entire field of epidemiological research is ad hoc.

This means that people are taking medication or undergoing treatments that may do them harm and probably cost too much because the researchers’ methods are careless and random.

Of course, sometimes this is intentional manipulation (see my previous post on Vioxx, also from an eye-opening lecture by Madigan). But for the most part it’s not. More likely it’s mostly caused by the human weakness for believing in something because it’s standard practice.

In some sense we knew this already. How many times have we read something about what to do for our health, and then a few years later read the opposite? That’s a bad sign.

And although the ethics are the main thing here, the money is a huge issue. It required $25 million for Madigan and his colleagues to implement the study on how good our current methods are at detecting things we already know. Turns out they are not good at this – even the best methods, which we have no reason to believe are being used, are only okay. Okay, $25 million is a lot, but then again there are literally billions of dollars being put into the medical trials and research as a whole, so you might think that the “due diligence” of such a large industry would naturally get funded regularly with such sums.

But you’d be wrong. Because there’s no due diligence for this industry, not in a real sense. There’s the FDA, but they are simply not up to the task.

One article I linked to yesterday from the Stanford Alumni Magazine, which talked about the work of John Ioannidis (I blogged about his work here called “Why Most Published Research Findings Are False“), summed the situation up perfectly (emphasis mine):

When it comes to the public’s exposure to biomedical research findings, another frustration for Ioannidis is that “there is nobody whose job it is to frame this correctly.” Journalists pursue stories about cures and progress—or scandals—but they aren’t likely to diligently explain the fine points of clinical trial bias and why a first splashy result may not hold up. Ioannidis believes that mistakes and tough going are at the essence of science. ”In science we always start with the possibility that we can be wrong. If we don’t start there, we are just dogmatizing.”

It’s all about conflict of interest, people. The researchers don’t want their methods examined, the pharmaceutical companies are happy to have various ways to prove a new drug “effective”, and the FDA is clueless.

Another reason for an AMS panel to investigate public math models. If this isn’t in the public’s interest I don’t know what is.

## Columbia Data Science course, week 10: Observational studies, confounders, epidemiology

This week our guest lecturer in the Columbia Data Science class was David Madigan, Professor and Chair of Statistics at Columbia. He received a bachelor’s degree in Mathematical Sciences and a Ph.D. in Statistics, both from Trinity College Dublin. He has previously worked for AT&T Inc., Soliloquy Inc., the University of Washington, Rutgers University, and SkillSoft, Inc. He has over 100 publications in such areas as Bayesian statistics, text mining, Monte Carlo methods, pharmacovigilance and probabilistic graphical models.

So Madigan is an esteemed guest, but I like to call him an “apocalyptic leprechaun”, for reasons which you will know by the end of this post. He’s okay with that nickname, I asked his permission.

Madigan came to talk to us about observational studies, of central importance in data science. He started us out with this:

Thought Experiment

We now have detailed, longitudinal medical data on tens of millions of patients. What can we do with it?

To be more precise, we have tons of phenomenological data: this is individual, patient-level medical record data. The largest of the databases has records on 80 million people: every prescription drug, every condition ever diagnosed, every hospital or doctor’s visit, every lab result, procedures, all timestamped.

But we still do things like we did in the Middle Ages; the vast majority of diagnosis and treatment is done in a doctor’s brain. Can we do better? Can you harness these data to do a better job delivering medical care?

Students responded:

1) There was a prize offered on Kaggle, called “Improve Healthcare, Win $3,000,000,” for predicting who is going to go to the hospital next year. Doesn’t that give us some idea of what we can do?

Madigan: keep in mind that they’ve coarsened the data for proprietary reasons. It’s a hugely important clinical problem, especially for a healthcare insurer. Can you intervene to avoid hospitalizations?

2) We’ve talked a lot about the ethical uses of data science in this class. It seems to me that there are a lot of sticky ethical issues surrounding this 80 million person medical record dataset.

Madigan: Agreed! What nefarious things could we do with this data? We could gouge sick people with huge premiums, or we could drop sick people from insurance altogether. It’s a question of what, as a society, we want to do.

What is modern academic statistics?

Madigan showed us Drew Conway’s Venn Diagram that we’d seen in week 1:

Madigan positioned the modern world of the statistician in the green and purple areas.

It used to be the case, say 20 years ago, according to Madigan, that academic statisticians would either sit in their offices proving theorems with no data in sight (they wouldn’t even know how to run a t-test) or sit around in their offices and dream up a new test, or a new way of dealing with missing data, or something like that, and then they’d look around for a dataset to whack with their new method. In either case, the work of an academic statistician required no domain expertise.

Nowadays things are different. The top stats journals go deeper into application areas, and the papers involve deep collaborations with people in the social sciences or other applied sciences. Madigan is setting an example tonight by engaging with the medical community.
Madigan went on to make a point about the modern machine learning community, which he is or was part of: it’s a newish academic field, with conferences and journals, etc., but is characterized by what stats was 20 years ago: invent a method, try it on datasets. In terms of domain expertise engagement, it’s a step backwards instead of forwards.

Comments like the above make me love Madigan.

Very few academic statisticians have serious hacking skills, with Mark Hansen being an unusual counterexample. But if all three is what’s required to be called data science, then I’m all for data science, says Madigan.

Madigan’s timeline

Madigan went to college in 1980 and specialized in math from day 1, for five years. In his final year, he took a bunch of stats courses, and learned a bunch about computers: Pascal, OS, compilers, AI, database theory, and rudimentary computing skills.

Then came 6 years in industry, working at an insurance company and a software company where he specialized in expert systems. It was a mainframe environment, and he wrote code to price insurance policies using what would now be described as scripting languages. He also learned about graphics by creating a graphic representation of a water treatment system. He learned about controlling graphics cards on PCs, but he still didn’t know about data.

Then he got a Ph.D. and went into academia. That’s when machine learning and data mining started, which he fell in love with: he was Program Chair of the KDD conference, among other things, before he got disenchanted. He learned C and Java, R and S+. But he still wasn’t really working with data yet.

He claims he was still a typical academic statistician: he had computing skills but no idea how to work with a large scale medical database, 50 different tables of data scattered across different databases with different formats.

In 2000 he worked for AT&T Labs. It was an “extreme academic environment”, and he learned Perl and did lots of stuff like web scraping.
He also learned awk and basic Unix skills. It was life altering and it changed everything: having tools to deal with real data rocks! It could just as well have been Python. The point is that if you don’t have the tools you’re handicapped. Armed with these tools he is afraid of nothing in terms of tackling a data problem.

In Madigan’s opinion, statisticians should not be allowed out of school unless they know these tools.

He then went to an internet startup where he and his team built a system to deliver real-time graphics on consumer activity.

Since then he’s been working in big medical data stuff. He’s testified in court trials related to medical trials, which was eye-opening for him in terms of explaining what you’ve done: “If you’re gonna explain logistic regression to a jury, it’s a different kind of a challenge than me standing here tonight.” He claims that super simple graphics help.

Carrotsearch

As an aside he suggests we go to this website, called carrotsearch, because there’s a cool demo on it.

What is an observational study?

Madigan defines it for us:

An observational study is an empirical study in which the objective is to elucidate cause-and-effect relationships in which it is not feasible to use controlled experimentation.

In tonight’s context, it will involve patients as they undergo routine medical care. We contrast this with designed experiments, which are pretty rare. In fact, Madigan contends that most data science activity revolves around observational data. Exceptions are A/B tests. Most of the time, the data you have is what you get. You don’t get to replay a day on the market where Romney won the presidency, for example.

Observational studies are done in contexts in which you can’t do experiments, and they are mostly intended to elucidate cause-and-effect. Sometimes you don’t care about cause-and-effect, you just want to build predictive models. Madigan claims there are many core issues in common with the two.
Here are some examples of tests you can’t run as designed studies, for ethical reasons:

• smoking and heart disease (you can’t randomly assign someone to smoke)
• vitamin C and cancer survival
• DES and vaginal cancer
• aspirin and mortality
• cocaine and birthweight
• diet and mortality

Pitfall #1: confounders

There are all kinds of pitfalls with observational studies.

For example, look at this graph, where you’re finding a best fit line to describe whether taking higher doses of the “bad drug” is correlated to higher probability of a heart attack:

It looks like, from this vantage point, the more drug you take the fewer heart attacks you have. But there are two clusters, and if you know more about those two clusters, you find the opposite conclusion:

Note this picture was rigged so the issue is obvious. This is an example of a “confounder.” In other words, the aspirin-taking or non-aspirin-taking of the people in the study wasn’t randomly distributed among the people, and it made a huge difference.

It’s a general problem with regression models on observational data. You have no idea what’s going on. Madigan: “It’s the wild west out there.”

Wait, and it gets worse. It could be the case that within each group there are males and females and if you partition by those you see that the more drugs they take the better again. Since a given person either is male or female, and either takes aspirin or doesn’t, this kind of thing really matters.

This illustrates the fundamental problem in observational studies, which is sometimes called Simpson’s Paradox.

[Remark from someone in the class: if you think of the original line as a predictive model, it's actually still the best model you can obtain knowing nothing more about the aspirin-taking habits or genders of the patients involved. The issue here is really that you're trying to assign causality.]
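To make the two-cluster picture concrete, here is a minimal sketch in Python with made-up numbers. Everything in it is an assumption for illustration (the group sizes, dose ranges, risk levels and noise are invented, not taken from the lecture): aspirin takers happen to get higher doses of the drug but start from a much lower baseline risk, while within each group the risk goes up with dose.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200  # patients per group (made-up)

# Hypothetical setup: non-aspirin takers get low doses, aspirin takers get high doses.
dose_no_aspirin = rng.uniform(0, 5, n)
dose_aspirin = rng.uniform(5, 10, n)

# Within each group, risk of heart attack rises with dose (slope +0.02),
# but aspirin takers start from a much lower baseline.
risk_no_aspirin = 0.60 + 0.02 * dose_no_aspirin + rng.normal(0, 0.02, n)
risk_aspirin = 0.20 + 0.02 * dose_aspirin + rng.normal(0, 0.02, n)

def slope(x, y):
    """Slope of the least-squares best fit line of y on x."""
    return np.polyfit(x, y, 1)[0]

# Best fit line ignoring the confounder: pool everyone together.
pooled_slope = slope(
    np.concatenate([dose_no_aspirin, dose_aspirin]),
    np.concatenate([risk_no_aspirin, risk_aspirin]),
)

# Best fit lines once you know who takes aspirin.
slope_no_aspirin = slope(dose_no_aspirin, risk_no_aspirin)
slope_aspirin = slope(dose_aspirin, risk_aspirin)

print("pooled slope:      ", pooled_slope)
print("slope, no aspirin: ", slope_no_aspirin)
print("slope, aspirin:    ", slope_aspirin)
```

With these invented numbers the pooled slope comes out negative (more drug looks protective) while both within-group slopes are positive, which is the sign flip described above. It's rigged, just like the picture, so the issue is obvious.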
The medical literature and observational studies

As we may not be surprised to hear, medical journals are full of observational studies. The results of these studies have a profound effect on medical practice, on what doctors prescribe, and on what regulators do.

For example, in this paper, entitled “Oral bisphosphonates and risk of cancer of oesophagus, stomach, and colorectum: case-control analysis within a UK primary care cohort,” Madigan reports that we see the very same kind of confounding problem as in the above example with aspirin. The conclusion of the paper is that the risk of cancer increased with 10 or more prescriptions of oral bisphosphonates.

It was published on the front page of the New York Times, the study was done by a group with no apparent conflict of interest, and the drugs are taken by millions of people. But the results were wrong.

There are thousands of examples of this, it’s a major problem and people don’t even get that it’s a problem.

Randomized clinical trials

One possible way to avoid this problem is randomized studies. The good news is that randomization works really well: because you’re flipping coins, all other factors that might be confounders (current or former smoker, say) are more or less removed, because I can guarantee that smokers will be fairly evenly distributed between the two groups if there are enough people in the study.

The truly brilliant thing about randomization is that randomization matches well on the possible confounders you thought of, but will also give you balance on the 50 million things you didn’t think of.

So, although you can algorithmically find a better split for the ones you thought of, that quite possibly wouldn’t do as well on the other things. That’s why we really do it randomly, because it does quite well on things you think of and things you don’t.

But there’s bad news for randomized clinical trials as well.
First off, it’s only ethically feasible if there’s something called clinical equipoise, which means the medical community really doesn’t know which treatment is better. If you have reason to think treating someone with a drug will be better for them than giving them nothing, you can’t randomly not give people the drug.

The other problem is that they are expensive and cumbersome. It takes a long time and lots of people to make a randomized clinical trial work.

In spite of the problems, randomized clinical trials are the gold standard for elucidating cause-and-effect relationships.

Rubin causal model

The Rubin causal model is a mathematical framework for understanding what information we know and don’t know in observational studies.

It’s meant to investigate the confusion when someone says something like “I got lung cancer because I smoked”. Is that true? If so, you’d have to be able to support the statement, “If I hadn’t smoked I wouldn’t have gotten lung cancer,” but nobody knows that for sure.

Define:

• $Z_i$ to be the treatment applied to unit $i$ (0 = control, 1 = treatment),
• $Y_i(1)$ to be the response for unit $i$ if $Z_i = 1$,
• $Y_i(0)$ to be the response for unit $i$ if $Z_i = 0$.

Then the unit level causal effect is $Y_i(1)-Y_i(0)$, but we only see one of $Y_i(0)$ and $Y_i(1)$.

Example: $Z_i$ is 1 if I smoked, 0 if I didn’t (I am the unit). $Y_i(1)$ is 1 or 0 depending on whether I got cancer while smoking, and $Y_i(0)$ is 1 or 0 depending on whether I got cancer while not smoking. The overall causal effect on me is the difference $Y_i(1)-Y_i(0)$. This is equal to 1 if I really got cancer because I smoked, it’s 0 if I got cancer (or didn’t) independent of smoking, and it’s -1 if I avoided cancer by smoking. But I’ll never know my actual value since I only know one term out of the two.

Of course, on a population level we do know how to infer that there are quite a few “1”s among the population, but we will never be able to assign a given individual that number.
This is sometimes called the fundamental problem of causal inference.

Confounding and Causality

Let’s say we have a population of 100 people that takes some drug, and we screen them for cancer. Say 30 of them get cancer, which gives them a cancer rate of 0.30. We want to ask the question, did the drug cause the cancer?

To answer that, we’d have to know what would’ve happened if they hadn’t taken the drug. Let’s play God and stipulate that, had they not taken the drug, we would have seen 20 get cancer, so a rate of 0.20. We typically say the causal effect is the ratio of these two numbers (i.e. the increased risk of cancer), so 1.5.

But we don’t have God’s knowledge, so instead we choose another population to compare this one to, and we see whether they get cancer or not, whilst not taking the drug. Say they have a natural cancer rate of 0.10. Then we would conclude, using them as a proxy, that the increased cancer rate is the ratio 0.30 to 0.10, so 3. This is of course wrong, and the problem is that the two populations have some underlying differences that we don’t account for.

If these were the “same people”, down to the chemical makeup of each of their molecules, this “by proxy” calculation would work of course.

The field of epidemiology attempts to adjust for potential confounders. The bad news is that it doesn’t work very well. One reason is that they heavily rely on stratification, which means partitioning the cases into subcases and looking at those. But there’s a problem here too. Stratification can introduce confounding.

The following picture illustrates how stratification could make the underlying estimates of the causal effects go from good to bad:

In the top box, the values of b and c are equal, so our causal effect estimate is correct. However, when you break it down by male and female, you get worse estimates of causal effects.

The point is, stratification doesn’t just solve problems.
There are no guarantees your estimates will be better if you stratify and all bets are off.

What do people do about confounding things in practice?

In spite of the above, experts in this field essentially use stratification as a major method for working through studies. They deal with confounding variables by essentially stratifying with respect to them. So if taking aspirin is believed to be a potential confounding factor, they stratify with respect to it.

For example, with this study, which studied the risk of venous thromboembolism from the use of certain kinds of oral contraceptives, the researchers chose certain confounders to worry about and concluded the following:

After adjustment for length of use, users of these oral contraceptives had at least twice the risk of clotting compared with users of other kinds of oral contraceptives.

This report was featured on ABC, and it was a big hoo-ha.

Madigan asks: wouldn’t you worry about confounding issues like aspirin or something? How do you choose which confounders to worry about? Wouldn’t you worry that the physicians who are prescribing them are different in how they prescribe? For example, might they give the newer one to people at higher risk of clotting?

Another study came out about this same question and came to a different conclusion, using different confounders. They adjusted for a history of clots, which makes sense when you think about it.

This is an illustration of how you sometimes forget to adjust for things, and the outputs can then be misleading.

What’s really going on here though is that it’s totally ad hoc, hit or miss methodology.

Another example is a study on oral bisphosphonates, where they adjusted for smoking, alcohol, and BMI. But why did they choose those variables?

There are hundreds of examples where two teams made radically different choices on parallel studies. We tested this by giving a bunch of epidemiologists the job to design 5 studies at a high level. There was zero consistency.
And an additional problem is that luminaries of the field hear this and say: yeah yeah yeah but I would know the right way to do it.

Is there a better way?

Madigan and his co-authors examined 50 studies, each of which corresponds to a drug and outcome pair, e.g. antibiotics with GI bleeding.

They ran about 5,000 analyses for every pair. Namely, they ran every epi study imaginable, and they did this all on 9 different databases.

For example, they looked at ACE inhibitors (the drug) and swelling of the heart (outcome). They ran the same analysis on the 9 different standard databases, the smallest of which has records of 4,000,000 patients, and the largest of which has records of 80,000,000 patients.

In this one case, for one database the drug triples the risk of heart swelling, but for another database it seems to have a 6-fold increase of risk. That’s one of the best examples, though, because at least it’s always bad news – it’s consistent.

On the other hand, for 20 of the 50 pairs, you can go from statistically significant in one direction (bad or good) to the other direction depending on the database you pick. In other words, you can get whatever you want. Here’s a picture, where the heart swelling example is at the top:

Note: the choice of database is never discussed in any of these published epidemiology papers.

Next they did an even more extensive test, where they essentially tried everything. In other words, every time there was a decision to be made, they did it both ways. The kinds of decisions they tweaked were of the following types: which database you tested on, the confounders you accounted for, the window of time you care about examining (spoze they have a heart attack a week after taking the drug, is it counted? 6 months?).

What they saw was that almost all the studies can get either side depending on the choices.

Final example, back to oral bisphosphonates.
A certain study concluded that it causes esophageal cancer, but two weeks later JAMA published a paper on the same issue which concluded it is not associated with elevated risk of esophageal cancer. And they were even using the same database. This is not so surprising now for us.

OMOP Research Experiment

Here’s the thing. Billions upon billions of dollars are spent doing these studies. We should really know if they work. People’s lives depend on it.

Madigan told us about his “OMOP 2010.2011 Research Experiment”.

They took 10 large medical databases, consisting of a mixture of claims from insurance companies and EHR (electronic health records), covering records of 200 million people in all. This is big data unless you talk to an astronomer.

They mapped the data to a common data model and then they implemented every method used in observational studies in healthcare. Altogether they covered 14 commonly used epidemiology designs adapted for longitudinal data. They automated everything in sight. Moreover, there were about 5000 different “settings” on the 14 methods.

The idea was to see how well the current methods do on predicting things we actually already know.

To locate things they know, they took 10 old drug classes: ACE inhibitors, beta blockers, warfarin, etc., and 10 outcomes of interest: renal failure, hospitalization, bleeding, etc.

For some of these the results are known. So for example, warfarin is a blood thinner and definitely causes bleeding. There were 9 such known bad effects.

There were also 44 known “negative” cases, where we are super confident there’s just no harm in taking these drugs, at least for these outcomes.

The basic experiment was this: run 5000 commonly used epidemiological analyses using all 10 databases. How well do they do at discriminating between reds and blues?

This is kind of like a spam filter test. We have training emails that are known spam, and you want to know how well the model does at detecting spam when it comes through.
Each of the models outputs the same thing: a relative risk (causal effect estimate) and an error.

This was an attempt to empirically evaluate how well epidemiology works, kind of the quantitative version of John Ioannidis’s work. We did the quantitative thing to show he’s right.

Why hasn’t this been done before? There’s conflict of interest for epidemiology – why would they want to prove their methods don’t work? Also, it’s expensive, it cost $25 million (of course that pales in comparison to the money being put into these studies). They bought all the data, made the methods work automatically, and did a bunch of calculations in the Amazon cloud. The code is open source.

In the second version, we zeroed in on 4 particular outcomes. Here’s the $25,000,000 ROC curve:

To understand this graph, we need to define a threshold, which we can start with at 2. This means that if the relative risk is estimated to be above 2, we call it a “bad effect”, otherwise call it a “good effect.” The choice of threshold will of course matter.

If it’s high, say 10, then you’ll never see a 10, so everything will be considered a good effect. Moreover these are old drugs, and a drug with a relative risk that high wouldn’t be on the market. This means your sensitivity will be low, and you won’t find any real problem. That’s bad! You should find, for example, that warfarin causes bleeding.

There’s of course good news too, with low sensitivity, namely a zero false-positive rate.

What if you set the threshold really low, at -10? Then everything’s bad, and you have a 100% sensitivity but very high false positive rate.

As you vary the threshold from very low to very high, you sweep out a curve in terms of sensitivity and false-positive rate, and that’s the curve we see above. There is a threshold (say 1.8) for which your false positive rate is 30% and your sensitivity is 50%.

This graph is seriously problematic if you’re the FDA. A 30% false-positive rate is out of control. This curve isn’t good.

The overall “goodness” of such a curve is usually measured as the area under the curve: you want it to be one, and if your curve lies on the diagonal the area is 0.5. This is tantamount to guessing randomly. So if your area under the curve is less than 0.5, it means your model is perverse.

The area under the curve above is 0.64. Moreover, of the 5000 analyses we ran, this is the single best analysis.

But note: this is the best if I can only use the same method for everything. In that case this is as good as it gets, and it’s not that much better than guessing.

But no epidemiologist would do that!

So what they did next was to specialize the analysis to the database and the outcome.
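As an aside, the threshold sweep described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the relative-risk numbers and labels below are invented (they are not the OMOP results), and the names `relative_risk`, `label` and `rates_at` are hypothetical.

```python
import numpy as np

# Made-up relative-risk estimates for ten drug-outcome pairs (NOT the OMOP results).
# label 1 = known bad effect, label 0 = known harmless pair.
relative_risk = np.array([3.1, 2.4, 1.9, 1.2, 0.9, 2.2, 1.6, 1.1, 0.8, 1.4])
label = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])

def rates_at(threshold):
    """Sensitivity and false-positive rate when RR > threshold is called a bad effect."""
    flagged = relative_risk > threshold
    sensitivity = flagged[label == 1].mean()          # share of true bad effects we catch
    false_positive_rate = flagged[label == 0].mean()  # share of harmless pairs we flag
    return sensitivity, false_positive_rate

# Sweep the threshold from very high (nothing flagged) to very low (everything flagged).
thresholds = np.concatenate([[np.inf], np.sort(relative_risk)[::-1], [-np.inf]])
points = [rates_at(t) for t in thresholds]
sens = np.array([p[0] for p in points])
fpr = np.array([p[1] for p in points])

# Area under the ROC curve via the trapezoid rule.
auc = float(np.sum(np.diff(fpr) * (sens[1:] + sens[:-1]) / 2))

print("threshold 2:  ", rates_at(2.0))
print("threshold 10: ", rates_at(10.0))
print("threshold -10:", rates_at(-10.0))
print("AUC:", auc)
```

A threshold of 10 flags nothing (zero sensitivity, zero false positives) and a threshold of -10 flags everything, matching the two extremes in the text; every threshold in between gives one point on the curve.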
And they got better results: for the Medicare database, and for acute kidney injury, their optimal model gives them an AUC of 0.92. They can achieve 80% sensitivity with a 10% false positive rate.

They did this using a cross-validation method. Different databases have different methods attached to them. One winning method is called “OS”, which compares within a given patient’s history (so compares times when the patient was on drugs versus when they weren’t). This is not widely used now.

The epidemiologists in general don’t believe the results of this study.

If you go to http://elmo/omop.org, you can see the AUC for a given database and a given method.

Note the data we used was up to mid-2010. To update this you’d have to get the latest version of the database, and rerun the analysis. Things might have changed.

Moreover, for an outcome where nobody has any idea which drugs cause it, you’re in trouble. This only applies when we have things to train on where we know the outcome pretty well.

Parting remarks

Keep in mind confidence intervals only account for sampling variability. They don’t capture bias at all. If there’s bias, the confidence interval or p-value can be meaningless.

What about models that epidemiologists don’t use? We have developed new methods as well (SCCS). We continue to do that, but it’s a hard problem.

Challenge for the students: we ran 5000 different analyses. Is there a good way of combining them to do better? Weighted average? Voting methods across different strategies?

Note the stuff is publicly available and might make a great Ph.D. thesis.

## The zit model

When my mom turned 42, I was 12 and a total wise-ass. For her present I bought her a coffee mug that had on it the phrase “Things could be worse. You could be old and still have zits”, to tease her about her bad skin. Considering how obnoxious that was, she took it really well and drank out of the mug for years.

Well, I’m sure you can all see where this is going.
I’m now 40 and I have zits. I was contemplating this in the bath yesterday, wondering if I’d ever get rid of my zits and wondering if taking long hot baths helps or not. They come and go, so it seems vaguely controllable.

Then I had a thought: well, I could collect data and see what helps. After all, I don’t always have zits. I could keep a diary of all the things that I think might affect the situation: what I eat (I read somewhere that eating cheese makes you have zits), how often I take baths vs. showers, whether I use zit cream, my hormones, etc. and certainly whether or not I have zits on a given day.

The first step would be to do some research on the theories people have about what causes zits, and then set up a spreadsheet where I could efficiently add my daily data. Maybe a Google form! I’m wild about Google forms.

After collecting this data for some time I could build a model which tries to predict zittage, to see which of those many inputs actually have signal for my personal zit model.

Of course I expect a lag between the thing I do or eat or use and the actual resulting zit, and I don’t know what that lag is (do you get zits the day after you eat cheese? or three days after eating cheese?), so I’ll expect some difficulty with this or even overfitting. Even so, this just might work!

Then I immediately felt tired because, if you think about spending your day collecting information like that about your potential zits, then you must be totally nuts.

I mean, I can imagine doing it just for fun, or to prove a point, or on a dare (there are few things I won’t do on a dare), but when it comes down to it I really don’t care that much about my zits.

Then I started thinking about technology and how it could help me with my zit model. I mean, you know about those bracelets you can wear that count your steps and then automatically record them on your phone, right?
Well, how long until those bracelets can be trained to collect any kind of information you can imagine?

• Baths? No problem. I’m sure they can detect moisture and heat.
• Cheese eating? Maybe you’d have to say out loud what you’re eating, but again not a huge problem.
• Hormones? I have no idea but let’s stipulate plausible: they already have an ankle bracelet that monitors blood alcohol levels.
• Whether you have zits? Hmmm. Let’s say you could add any variable you want with voice command.

In other words, in 5 years this project will be a snap when I have my handy dandy techno bracelet which collects all the information I want. And maybe whatever other information as well, because information storage is cheap. I’ll have a bounty of data for my zit model.

This is exciting stuff. I’m looking forward to building the definitive model, from which I can conclude that eating my favorite kind of cheese does indeed give me zits. And I’ll say to myself, worth it!

## Columbia Data Science course, week 9: Morningside Analytics, network analysis, data journalism

Our first speaker this week in Rachel Schutt’s Columbia Data Science course was John Kelly from Morningside Analytics, who came to talk to us about network analysis.

John Kelly

Kelly has four diplomas from Columbia, starting with a BA in 1990 from Columbia College, followed by a Masters, MPhil and Ph.D. in Columbia’s School of Journalism. He explained that studying communications as a discipline can mean lots of things, but he was interested in network sociology and statistics in political science.

Kelly spent a couple of terms at Stanford learning survey design and game theory and other quanty stuff. He describes the Columbia program in communications as a pretty DIY set-up, where one could choose to focus on the role of communication in society, the impact of press, impact of information flow, or other things.
Since he was interested in quantitative methods, he hunted them down, doing his master’s thesis work with Marc Smith from Microsoft. He worked on political discussions and how they evolve as networks (versus other kinds of discussions).

After college and before grad school, Kelly was an artist, using computers to do sound design. He spent 3 years as the Director of Digital Media here at Columbia School of the Arts.

Kelly taught himself Perl and Python when he spent a year in Viet Nam with his wife.

Kelly’s profile

Kelly spent quite a bit of time describing how he sees math, statistics, and computer science (including machine learning) as tools he needs to use and be good at in order to do what he really wants to do.

But for him the good stuff is all about domain expertise. He wants to understand how people come together, and when they do, what is their impact on politics and public policy. His company Morningside Analytics has clients like think tanks and political organizations that want to know how social media affects and creates politics. In short, Kelly wants to understand society, and the math and stats allow him to do that.

Communication and presentations are how he makes money, so that’s important, and visualizations are integral to both domain expertise and communications, so he’s essentially a viz expert. As he points out, Morningside Analytics doesn’t get paid to just discover interesting stuff, but rather to help people use it.

Whereas a company such as SocialFlow is venture funded, which means you can run a staff even if you don’t make money, Morningside is bootstrapped. It’s a different life, where we eat what we sow.

Case-attribute data vs. social network data

Kelly has a strong opinion about standard modeling through case-attribute data, which is what you normally see people feed to models with various “cases” (think people) who have various “attributes” (think age, or operating system, or search histories).
Maybe because it’s easy to store in databases or because it’s easy to collect this kind of data, there’s been a huge bias towards modeling with case-attribute data. Kelly thinks it’s missing the point of the questions we are trying to answer nowadays.

It started, he said, in the 1930s with early market research, and it was soon being applied to marketing as well as politics.

He named Paul Lazarsfeld and Elihu Katz as trailblazing sociologists who came here from Europe and developed the field of social network analysis. This is a theory based not only on individual people but also the relationships between them.

We could do something like this for the attributes of a data scientist, and we might have an arrow point from math to stats if we think math “underlies” statistics in some way. Note the arrows don’t always mean the same thing, though, and when you specify a network model to test a theory it’s important you make the arrows well-defined.

To get an idea of why network analysis is superior to case-attribute data analysis, think about this. The federal government spends money to poll people in Afghanistan. The idea is to see what citizens want and think to determine what’s going to happen in the future. But, Kelly argues, what’ll happen there isn’t a function of what individuals think, it’s a question of who has the power and what they think.

Similarly, imagine going back in time and conducting a scientific poll of the citizenry of Europe in 1750 to determine the future politics. If you knew what you were doing you’d be looking at who’s marrying who among the royalty.

In some sense the current focus on case-attribute data is a problem of what’s “under the streetlamp” – people are used to doing it that way.

Kelly wants us to consider what he calls the micro/macro (i.e.
individual versus systemic) divide: when it comes to buying stuff, or voting for a politician in a democracy, you have a formal mechanism for bridging the micro/macro divide, namely markets for buying stuff and elections for politicians. But most of the world doesn’t have those formal mechanisms, or indeed they have a fictive shadow of those things. For the most part we need to know enough about the actual social network to know who has the power and influence to bring about change.

Kelly claims that the world is a network much more than it’s a bunch of cases with attributes. For example, if you only understand how individuals behave, how do you tie things together?

History of social network analysis

Social network analysis basically comes from two places: graph theory, where Euler solved the Seven Bridges of Königsberg problem, and sociometry, started by Jacob Moreno in the 1970s, just as early computers got good at making large-scale computations on large data sets.

Social network analysis was germinated by Harrison White, emeritus at Columbia, contemporaneously with Columbia sociologist Robert Merton. Their essential idea was that people’s actions have to be related to their attributes, but to really understand them you also need to look at the networks that enable them to do something.

Core entities for network models

Kelly gave us a bit of terminology from the world of social networks:

• actors (or nodes in graph theory speak): these can be people, or websites, or what have you
• relational ties (edges in graph theory speak): for example, an instance of liking someone or being friends
• dyads: pairs of actors
• triads: triplets of actors; there are, for example, measures of triadic closure in networks
• subgroups: a subset of the whole set of actors, along with their relational ties
• group: the entirety of a “network”, easy in the case of Twitter but very hard in the case of e.g.
“liberals”
• relation: for example, liking another person
• social network: all of the above

Types of Networks

There are different types of social networks.

For example, in one-mode networks, the simplest case, you have a bunch of actors connected by ties. This is a construct you’d use to display a Facebook graph for example.

In two-mode networks, also called bipartite graphs, the connections only exist between two formally separate classes of objects. So you might have people on the one hand and companies on the other, and you might connect a person to a company if she is on the board of that company. Or you could have people and the things they’re possibly interested in, and connect them if they really are.

Finally, there are ego networks, each of which is typically the part of the network surrounding a single person. So for example it could be just the subnetwork of my friends on Facebook, who may also know each other in certain cases. Kelly reports that people with higher socioeconomic status have more complicated ego networks. You can see someone’s level of social status by looking at their ego network.

What people do with these networks

The central question people ask when given a social network is, who’s important here?

This leads to various centrality measures. The key ones are:

1. degree – This counts how many people are connected to you.
2. closeness – If you are close to everyone, you have a high closeness score.
3. betweenness – People who connect people who are otherwise separate. If information goes through you, you have a high betweenness score.
4. eigenvector – A person who is popular with the popular kids has high eigenvector centrality. Google’s PageRank is an example.

A caveat on the above centrality measures: the measurement people form an industry in which each tries to sell themselves as the authority. But experience tells us that each measure has its weaknesses and strengths. The main thing is to know you’re looking at the right network.
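To make the four centrality measures concrete, here is a minimal sketch in plain Python on a tiny friendship graph. Everything in it is an assumption for illustration: the graph, the names, and the helper functions are made up, and this is not the tooling Kelly described. For real work you would reach for an existing network package rather than hand-rolled loops.

```python
from collections import deque

# Hypothetical undirected friendship graph (adjacency sets); the names are made up.
GRAPH = {
    "ann": {"bob", "cat", "dan"},
    "bob": {"ann", "cat"},
    "cat": {"ann", "bob"},
    "dan": {"ann", "eve"},
    "eve": {"dan"},
}


def degree(graph):
    """Degree: how many actors each actor is tied to."""
    return {node: len(nbrs) for node, nbrs in graph.items()}


def distances(graph, source):
    """Breadth-first search: hop counts from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in graph[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def closeness(graph):
    """Closeness: (n - 1) / total distance to everyone else (connected graph assumed)."""
    n = len(graph)
    return {node: (n - 1) / sum(distances(graph, node).values()) for node in graph}


def betweenness(graph):
    """Betweenness via Brandes' algorithm: shortest paths that pass through each node."""
    score = {node: 0.0 for node in graph}
    for source in graph:
        order, dist = [], {source: 0}
        preds = {node: [] for node in graph}    # predecessors on shortest paths
        paths = {node: 0 for node in graph}     # number of shortest paths from source
        paths[source] = 1
        queue = deque([source])
        while queue:
            node = queue.popleft()
            order.append(node)
            for nbr in graph[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
                if dist[nbr] == dist[node] + 1:
                    paths[nbr] += paths[node]
                    preds[nbr].append(node)
        carried = {node: 0.0 for node in graph}
        for node in reversed(order):            # farthest nodes first
            for pred in preds[node]:
                carried[pred] += paths[pred] / paths[node] * (1 + carried[node])
            if node != source:
                score[node] += carried[node]
    # In an undirected graph every pair was counted from both ends.
    return {node: value / 2 for node, value in score.items()}


def eigenvector(graph, iterations=100):
    """Eigenvector centrality by power iteration: you score high if your neighbors do."""
    score = {node: 1.0 for node in graph}
    for _ in range(iterations):
        # Adding a node's own score (A + I) keeps the ranking but avoids oscillation.
        new = {n: score[n] + sum(score[m] for m in graph[n]) for n in graph}
        top = max(new.values())
        score = {n: value / top for n, value in new.items()}
    return score


for name, measure in [("degree", degree), ("closeness", closeness),
                      ("betweenness", betweenness), ("eigenvector", eigenvector)]:
    print(name, measure(GRAPH))
```

On this toy graph the measures disagree in an instructive way: "dan" is the only bridge to "eve" and so beats "bob" on betweenness, while "bob" sits in a tight triangle with the best-connected node and so beats "dan" on eigenvector centrality. Which ranking matters depends on the question being asked.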
For example, if you’re looking for a highly influential blogger in the Muslim Brotherhood, and you write down the top 100 bloggers in some large graph of bloggers, and start on the top of the list, and go down the list looking for a Muslim Brotherhood blogger, it won’t work: you’ll find someone who is both influential in the large network and who blogs for the Muslim Brotherhood, but they won’t be influential with the Muslim Brotherhood, but rather with transnational elites in the larger network. In other words, you have to keep in mind the local neighborhood of the graph.

Another problem with measures: experience dictates that, although something might work with blogs, when you work with Twitter you’ll need to get out new tools. Different data and different ways people game centrality measures make things totally different. For example, with Twitter, people create 5000 Twitter bots that all follow each other and some strategic other people to make them look influential by some measure (probably eigenvector centrality). But of course this isn’t accurate, it’s just someone gaming the measures.

Some network packages exist already and can compute the various centrality measures mentioned above.

Thought experiment

You’re part of an elite, well-funded think tank in DC. You can hire people and you have $10 million to spend. Your job is to empirically predict the future political evolution of Egypt. What kinds of political parties will there be? What is the country of Egypt gonna look like in 5, 10, or 20 years? You have access to exactly two of the following datasets for all Egyptians:

1. The Facebook network,
2. The Twitter network,
3. A complete record of who went to school with who,
4. The SMS/phone records,
5. The network data on members of all political organizations and private companies, and
6. Where everyone lives and who they talk to.

Note that things change over time: people might migrate off of Facebook, or political discussions might need to go underground if blogging is too public. Facebook alone gives a lot of information but sometimes people will try to be stealthy. Phone records might be a better representation for that reason.

If you think the above is ambitious, recall that Siemens from Germany sold Iran software to monitor their national mobile networks. In fact, Kelly says, governments are putting more energy into loading the field with allies, and less into shutting down the field. Pakistan hires Americans to do their pro-Pakistan blogging and Russians help Syrians.

In order to answer this question, Kelly suggests we change the order of our thinking. A lot of the reasoning he heard from the class was based on the question, what can we learn from this or that data source? Instead, think about it the other way around: what would it mean to predict politics in a society? What kind of data do you need to do that? Figure out the questions first, and then look for the data to help you answer them.

Morningside Analytics

Kelly showed us a network map of 14 of the world’s largest blogospheres. To understand the pictures, you imagine there’s a force, like a wind, which sends the nodes (blogs) out to the edge, but then there’s a counteracting force, namely the links between blogs, which attach them together.
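The wind-versus-links picture is the intuition behind force-directed layout, and the post later names the Fruchterman-Reingold algorithm for the proximity clustering. Here is a minimal sketch in that style, in plain Python. The function name `force_layout`, the constants (200 iterations, starting temperature, cooling rate), and the toy blog graph are all assumptions for illustration, not Morningside’s actual implementation.

```python
import math
import random


def force_layout(edges, iterations=200, size=1.0, seed=0):
    """Fruchterman-Reingold-style layout: all nodes repel, linked nodes attract."""
    rng = random.Random(seed)
    nodes = sorted({n for edge in edges for n in edge})
    pos = {n: [rng.uniform(0, size), rng.uniform(0, size)] for n in nodes}
    k = size / math.sqrt(len(nodes))      # ideal distance between nodes
    temperature = size / 10               # caps how far a node moves per step

    for _ in range(iterations):
        disp = {n: [0.0, 0.0] for n in nodes}
        # Repulsion (the "wind"): every pair of nodes pushes apart with force k^2 / d.
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                d = math.hypot(dx, dy) or 1e-9
                push = k * k / d
                disp[a][0] += dx / d * push
                disp[a][1] += dy / d * push
                disp[b][0] -= dx / d * push
                disp[b][1] -= dy / d * push
        # Attraction (the links): each edge pulls its endpoints together with force d^2 / k.
        for a, b in edges:
            dx = pos[a][0] - pos[b][0]
            dy = pos[a][1] - pos[b][1]
            d = math.hypot(dx, dy) or 1e-9
            pull = d * d / k
            disp[a][0] -= dx / d * pull
            disp[a][1] -= dy / d * pull
            disp[b][0] += dx / d * pull
            disp[b][1] += dy / d * pull
        # Move each node along its net force, limited by the cooling temperature.
        for n in nodes:
            length = math.hypot(disp[n][0], disp[n][1]) or 1e-9
            step = min(length, temperature)
            pos[n][0] += disp[n][0] / length * step
            pos[n][1] += disp[n][1] / length * step
        temperature *= 0.98
    return pos


# Hypothetical blogs: two tightly linked groups joined by a single link.
BLOG_LINKS = [("a1", "a2"), ("a2", "a3"), ("a1", "a3"),
              ("b1", "b2"), ("b2", "b3"), ("b1", "b3"),
              ("a3", "b1")]
for blog, (x, y) in force_layout(BLOG_LINKS).items():
    print(f"{blog}: ({x:.2f}, {y:.2f})")
```

The design point worth noticing is that nothing in the code knows about clusters. Blogs that link to each other heavily simply end up near each other, which is why the resulting blobs still need qualitative interpretation afterwards.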

Here’s an example of the Arabic blogosphere:

The different colors represent countries and clusters of blogs. The size of each dot represents its degree centrality, that is, the number of links it has to other blogs in the network. The physical structure of the blogosphere gives us insight.

If we analyze text using NLP, thinking of the blog posts as a pile of text or a river of text, then we see the micro or macro picture only – we lose the most important story. What’s missing there is social network analysis (SNA) which helps us map and analyze the patterns of interaction.

The 12 different international blogospheres, for example, look different. We infer that different societies have different interests which give rise to different patterns.

But why are they different? After all, they’re representations of some higher dimensional thing projected onto two dimensions. Couldn’t it be just that they’re drawn differently? Yes, but we do lots of text analysis that convinces us these pictures really are showing us something. We put an effort into interpreting the content qualitatively.

So for example, in the French blogosphere, we see a cluster that discusses gourmet cooking. In Germany we see various blobs discussing politics and lots of weird hobbies. In English we see two big blobs [mathbabe interjects: gay porn and straight porn?] They turn out to be conservative vs. liberal blogs.

In Russian, the blogging networks tend to force people to stay within them, which is why we see very well-defined partitioned blobs.

The proximity clustering is done using the Fruchterman-Reingold algorithm, where being in the same neighborhood means your neighbors are connected to other neighbors, so it’s really a collective phenomenon of influence. Then we interpret the segments. Here’s an example of English language blogs:

Think about social media companies: they are each built around the fact that they either have the data or that they have a toolkit – a patented sentiment engine or something, a machine that goes ping.

But keep in mind that social media is heavily a product of organizations that pay to move the needle (i.e. game the machine that goes ping). To decipher that game you need to see how it works, you need to visualize.

So if you are wondering about elections, look at people’s blogs within “the moms” or “the sports fans”. This is more informative than looking at partisan blogs where you already know the answer.

Kelly walked us through an analysis, once he has binned the blogosphere into its segments, of various types of links to partisan videos like MLK’s “I have a dream” speech and a gotcha video from the Romney campaign. In the case of the MLK speech, you see that it gets posted in spurts around the election cycle events all over the blogosphere, but in the case of the Romney campaign video, you see a concerted effort by conservative bloggers to post the video in unison.

That is to say, if you were just looking at a histogram of links, a pure count, it might look as if it had gone viral, but if you look at it through the lens of the understood segmentation of the blogosphere, it’s clearly a planned operation to game the “virality” measures.
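The contrast between a pure count and a segmented view can be sketched in a few lines of Python. The link records, segment labels, and the `summarize` helper below are hypothetical and invented for illustration; they are not Kelly’s data or method, just one simple way to ask "who posted this, and when?" alongside "how many times was it posted?".

```python
from collections import Counter

# Hypothetical link records: (video, blog segment, day the link was posted).
LINKS = [
    ("speech", "liberal", 1), ("speech", "moms", 3), ("speech", "sports", 6),
    ("speech", "conservative", 9), ("speech", "liberal", 12), ("speech", "moms", 15),
    ("gotcha", "conservative", 4), ("gotcha", "conservative", 4),
    ("gotcha", "conservative", 4), ("gotcha", "conservative", 5),
    ("gotcha", "conservative", 5), ("gotcha", "moms", 6),
]


def summarize(links, video):
    """Return the raw link count and how concentrated the links are by segment and day."""
    rows = [(segment, day) for v, segment, day in links if v == video]
    by_segment = Counter(segment for segment, _ in rows)
    by_day = Counter(day for _, day in rows)
    total = len(rows)
    return {
        "total": total,
        "top_segment_share": max(by_segment.values()) / total,
        "top_day_share": max(by_day.values()) / total,
    }


for video in ("speech", "gotcha"):
    print(video, summarize(LINKS, video))
```

In this made-up data both videos have the same raw count, so a histogram of links cannot tell them apart. Only the share coming from a single segment, and from a single day, shows that one of them was pushed in unison.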

Kelly also works with the Berkman Center for Internet and Society at Harvard. He analyzed the Iranian blogosphere in 2008 and again in 2011 and he found much the same in terms of clustering – young anti-government democrats, poetry, conservative pro-regime clusters dominated in both years.

However, only 15% of the blogs are the same from 2008 to 2011.

So, whereas people are often concerned about individuals (case-attribute model), the individual fish are less important than the schools of fish. By doing social network analysis, we are looking for the schools, because that way we learn about the salient interests of the society and how stable those interests are over time.

The moral of this story is that we need to focus on meso-level patterns, not micro- or macro-level patterns.

John Bruner

Our second speaker of the night was John Bruner, an editor at O’Reilly who previously worked as the data editor at Forbes. He is broad in his skills: he does research and writing on anything that involves data. Among other things at Forbes, he worked on an internal database on millionaires on which he ran simple versions of social media dynamics.

Writing technical journalism

Bruner explained the term “data journalism” to the class. He started this by way of explaining his own data scientist profile.

First of all, it involved lots of data viz. A visualization is a fast way of describing the bottom line of a data set. And at a big place like the NYTimes, data viz is its own discipline and you’ll see people with expertise in parts of dataviz – one person will focus on graphics while someone else will be in charge of interactive dataviz.

CS skills are pretty important in data journalism too. There are tight deadlines, and the data journalist has to be good with their tools and with messy data (because even federal data is messy). One has to be able to handle arcane formats or whatever, and often this means parsing stuff in python or what have you. Bruner uses javascript and python and SQL and Mongo among other tools.

Bruner was a math major in college at University of Chicago, then he went into writing at Forbes, where he slowly merged back into quantitative stuff. He found himself using mathematics in his work in preparing good representations of his research on, for example, contributions of billionaires to politicians, using circles and lines.

Statistics, Bruner says, informs the way you think about the world. It inspires you to write things: e.g., the “average” person is a woman with 250 followers but the median open Twitter account has 0 followers. So the median and mean are impossibly different because the data is skewed. That’s an inspiration right there for a story.
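To see how a skewed distribution pulls the mean and the median apart, here is a tiny Python sketch. The follower counts are made up, chosen only so that the result echoes the 250-versus-0 contrast above; they are not real Twitter data.

```python
from statistics import mean, median

# Hypothetical follower counts: most accounts have none, a couple have a great many.
followers = [0, 0, 0, 0, 0, 0, 0, 1, 2, 15, 482, 2500]

print("mean:  ", mean(followers))     # pulled up by the two big accounts
print("median:", median(followers))   # the typical account
```

Two accounts out of twelve are enough to drag the mean far away from what a typical account looks like, which is exactly the kind of gap that can seed a story.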

Bruner admits to being a novice in machine learning. However, he claims domain expertise as quite important. With the exception of people who can specialize in one subject, say at a governmental office or a huge daily, at a smaller newspaper you need to be broad, and you need to acquire a baseline layer of expertise quickly.

Of course communications and presentations are absolutely huge for data journalists. Their fundamental skill is translation: taking complicated stories and deriving meaning that readers will understand. They also need to anticipate questions, turn them into quantitative experiments, and answer them persuasively.

A bit of history of data journalism

Data journalism has been around for a while, but until recently it (as computer-assisted reporting) was a domain of Excel power users. Still, if you know how to write an Excel program, you’re an elite.

Things started to change recently: more data became available to us in the form of APIs, along with new tools and less expensive computing power, so we can analyze pretty large data sets on a laptop. Of course excellent viz tools make things more compelling, Flash is used for interactive viz environments, and javascript is getting way better.

Programming skills are now widely enough held so that you can find people who are both good writers and good programmers. Many people are English majors and know enough about computers to make it work, for example, or CS majors who can write.

In big publications like the NYTimes, the practice of data journalism is divided into fields: graphics vs. interactives, research, database engineers, crawlers, software developers, domain expert writers. Some people are in charge of raising the right questions but hand off to others to do the analysis. Charles Duhigg at the NYTimes, for example, studied water quality in New York, and made a FOIA request to the State of New York, and knew enough to know what would be in that FOIA request and what questions to ask, but someone else did the actual analysis.

At a smaller place, things are totally different. Whereas the NYTimes has 1000 people on its newsroom floor, the Economist has maybe 130, and Forbes has 70 or 80 people in their newsrooms. If you work for anything besides a national daily, you end up doing everything by yourself: you come up with a question, you go get the data, you do the analysis, then you write it up.

Of course you also help and collaborate with your colleagues when you can.

Advice Bruner has for the students in initiating a data journalism project: don’t have a strong thesis before you’ve interviewed the experts. Go in with a loose idea of what you’re searching for and be willing to change your mind and pivot if the experts lead you in a new and interesting direction.

## An AMS panel to examine public math models?

On Saturday I gave a talk at the AGNES conference to a room full of algebraic geometers. After introducing myself and putting some context around my talk, I focused on a few models:

• VaR,
• VAM,
• Credit scoring,
• E-scores (online version of credit scores), and
• The h-score model (I threw this in for the math people and because it’s an egregious example of a gameable model).

I wanted to formalize the important and salient properties of a model, and I came up with this list:

• Name – note the name often gives off a whiff of political manipulation by itself
• Underlying model – regression? decision tree?
• Underlying assumptions – normal distribution of market returns?
• Input/output – dirty data?
• Purported/political goal – how is it actually used vs. how its advocates claim they’ll use it?
• Evaluation method – every model should come with one. Not every model does. A red flag.
• Gaming potential – how does being modeled cause people to act differently?
• Reach – how universal and impactful is the model and its gaming?

In the case of VAM, it doesn’t have an evaluation method. There’s been no way for teachers to know if the model that they get scored on every year is doing a good job, even as it’s become more and more important in tenure decisions (the Chicago strike was largely related to this issue, as I posted here).

Here was my plea to the mathematical audience: this is being done in the name of mathematics. The authority that math is given by our culture, which is enormous and possibly not deserved, is being manipulated by people with vested interests.

So when the objects of modeling, the people and the teachers who get these scores, ask how those scores were derived, they’re often told “it’s math and you wouldn’t understand it.”

That’s outrageous, and mathematicians shouldn’t stand for it. We have to get more involved, as a community, with how mathematics is wielded on the population.

On the other hand, I wouldn’t want mathematicians as a group to get co-opted by these special interest groups either and become shills for the industry. We don’t want to become economists, paid by this campaign or that to write papers in favor of their political goals.

To this end, someone in the audience suggested the AMS might want to publish a book of ethics for mathematicians akin to the ethical guidelines that are published for the society of psychologists and lawyers. His idea is that it would be case-study based, which seems pretty standard. I want to give this some more thought.

We want to make ourselves available to understand high impact, public facing models to ensure they are sound mathematically, have reasonable and transparent evaluation methods, and are very high quality in terms of proven accuracy and understandability if they are used on people in high stakes situations like tenure.

One suggestion someone in the audience came up with is to have a mathematician “mechanical turk” service where people could send questions to a group of faceless mathematicians. Although I think it’s an intriguing idea, I’m not sure it would work here. The point is to investigate so-called math models that people would rather no mathematician laid their eyes on, whereas mechanical turks only answer questions someone else comes up with.

In other words, there’s a reason nobody has asked the opinion of the mathematical community on VAM. They are using the authority of mathematics without permission.

Instead, I think the math community should form something like a panel, maybe housed inside the American Mathematical Society (AMS), that trolls for models with the following characteristics:

• high impact – people care about these scores for whatever reason
• large reach – city-wide or national
• claiming to be mathematical – so the opinion of the mathematical community matters, or should.

After finding such a model, the panel should publish a thoughtful, third-party analysis of its underlying mathematical soundness. Even just one per year would have a meaningful effect if the models were chosen well.

As I said to someone in the audience (which was amazingly receptive and open to my message), it really wouldn’t take very long for a mathematician to understand these models well enough to have an opinion on them, especially if you compare it to how long it would take a policy maker to understand the math. Maybe a week, with the guidance of someone who is an expert in modeling.

So in other words, being a member of such a “public math models” panel could be seen as a community service job akin to being an editor for a journal: real work but not something that takes over your life.

Now’s the time to do this, considering the explosion of models on everything in sight, and I believe mathematicians are the right people to take it on, considering they know how to admit they’re wrong.

Tell me what you think.

## Columbia Data Science course, week 8: Data visualization, broadening the definition of data science, Square, fraud detection

This week in Rachel Schutt’s Columbia Data Science course we had two excellent guest speakers.

The first speaker of the night was Mark Hansen, who recently came from UCLA via the New York Times to Columbia with a joint appointment in journalism and statistics. He is a renowned data visualization expert and also an energetic and generous speaker. We were lucky to have him on a night where he’d been drinking an XXL latte from Starbucks to highlight his natural effervescence.

Mark started by telling us a bit about Gabriel Tarde (1843-1904).

Tarde was a sociologist who believed that the social sciences had the capacity to produce vastly more data than the physical sciences. His reasoning was as follows.

The physical sciences observe from a distance: they typically model or incorporate models which talk about an aggregate in some way – for example, biology talks about the aggregate of our cells. What Tarde pointed out was that this is a deficiency, basically a lack of information. We should instead be tracking every atom.

This is where Tarde points out that in the social realm we can do this, where cells are replaced by people. We can collect a huge amount of information about those individuals.

But wait, are we not missing the forest for the trees when we do this? Bruno Latour offers his take on Tarde as follows:

“But the ‘whole’ is now nothing more than a provisional visualization which can be modified and reversed at will, by moving back to the individual components, and then looking for yet other tools to regroup the same elements into alternative assemblages.”

In 1903, Tarde even foresees the emergence of Facebook, although he refers to a “daily press”:

“At some point, every social event is going to be reported or observed.”

Mark then laid down the theme of his lecture using a 2009 quote of Bruno Latour:

“Change the instruments and you will change the entire social theory that goes with them.”

Kind of like that famous physics cat, I guess, Mark (and Tarde) want us to newly consider

1. the way the structure of society changes as we observe it, and
2. ways of thinking about the relationship of the individual to the aggregate.

Mark’s Thought Experiment:

As data become more personal, as we collect more data about “individuals”, what new methods or tools do we need to express the fundamental relationship between ourselves and our communities, our communities and our country, our country and the world? Could we ever be satisfied with poll results or presidential approval ratings when we can see the complete trajectory of public opinions, individuated and interacting?

What is data science?

Mark threw up this quote from our own John Tukey:

“The best thing about being a statistician is that you get to play in everyone’s backyard”

But let’s think about that again – is it so great? Is it even reasonable? In some sense, to think of us as playing in other people’s yards, with their toys, is to draw a line between “traditional data fields” and “everything else”.

It’s maybe even implying that all our magic comes from the traditional data fields (math, stats, CS), and we’re some kind of super humans because we’re uber-nerds. That’s a convenient way to look at it from the perspective of our egos, of course, but it’s perhaps too narrow and arrogant.

And it begs the question, what is “traditional” and what is “everything else” anyway?

Mark claims that everything else should include:

• social science,
• physical science,
• geography,
• architecture,
• education,
• information science,
• digital humanities,
• journalism,
• design,
• media art

There’s more to our practice than being technologists, and we need to realize that technology itself emerges out of the natural needs of a discipline. For example, GIS emerges from geographers and text data mining emerges from digital humanities.

In other words, it’s not math people ruling the world, it’s domain practices being informed by techniques growing organically from those fields. When data hits their practice, each practice is learning differently; their concerns are unique to that practice.

Responsible data science integrates those lessons, and it’s not a purely mathematical integration. It could be a way of describing events, for example. Specifically, it’s not necessarily a quantifiable thing.

Bottom-line: it’s possible that the language of data science has something to do with social science just as it has something to do with math.

Processing

Mark then told us a bit about his profile (“expansionist”) and about the language Processing, in answer to a question about what is different when a designer takes up data or starts to code.

He explained it by way of another thought experiment: what is the use case for a language for artists? Students came up with a bunch of ideas:

• being able to specify shapes,
• faithful rendering of what visual thing you had in mind,
• being able to sketch,
• 3-d,
• animation,
• interactivity,
• Mark added publishing – artists must be able to share and publish their end results.

Processing is Java-based, with a simple “publish” button, etc. The language is adapted to the practice of artists. He mentioned that teaching designers to code meant, for him, stepping back and talking about iteration, if statements, etc., in other words stuff that seemed obvious to him but is not obvious to someone who is an artist. He needed to unpack his assumptions, which is what’s fun about teaching to the uninitiated.

He next moved on to close versus distant reading of texts. He mentioned Franco Moretti from Stanford. This is for Franco:

Franco thinks about “distant reading”, which means trying to get a sense of what someone’s talking about without reading line by line. This leads to PCA-esque thinking, a kind of dimension reduction of novels.
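One way to picture the “PCA-esque” idea is to turn each novel into a vector of word counts and ask for the single direction along which the novels differ most. The sketch below does that in plain Python with power iteration for the first principal component. The titles, words, and counts are made up, and this is only an illustration of dimension reduction, not a description of Moretti’s actual method.

```python
import math

# Hypothetical per-novel counts of three words; titles and numbers are made up.
WORDS = ["love", "ship", "murder"]
NOVELS = {
    "romance_a": [9, 1, 0],
    "romance_b": [8, 0, 1],
    "sea_tale_a": [1, 9, 1],
    "sea_tale_b": [0, 8, 0],
}


def first_component(rows, iterations=100):
    """Return the first principal direction and each row's score along it."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    direction = [1.0] + [0.0] * (d - 1)     # arbitrary starting guess
    for _ in range(iterations):
        # One power-iteration step: multiply by X^T X without forming the matrix.
        scores = [sum(x * w for x, w in zip(row, direction)) for row in centered]
        direction = [sum(s * row[j] for s, row in zip(scores, centered)) for j in range(d)]
        norm = math.sqrt(sum(w * w for w in direction))
        direction = [w / norm for w in direction]
    scores = [sum(x * w for x, w in zip(row, direction)) for row in centered]
    return direction, scores


direction, scores = first_component(list(NOVELS.values()))
print(dict(zip(WORDS, (round(w, 2) for w in direction))))
for title, score in zip(NOVELS, scores):
    print(f"{title:12s} {score:+.2f}")
```

Each novel is reduced from three numbers to one, and in this toy data that one number is enough to separate the two made-up genres without anyone reading a line of either.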

In other words, another cool example of how data science should integrate the way the experts in various fields figure it out. We don’t just go into their backyards and play, maybe instead we go in and watch them play and formalize and inform their process with our bells and whistles. In this way they can teach us new games, games that actually expand our fundamental conceptions of data and the approaches we need to analyze them.

Mark’s favorite viz projects

1) Nuage Vert, Helen Evans & Heiko Hansen: a projection onto a power plant’s steam cloud. The size of the green projection corresponds to the amount of energy the city is using. Helsinki and Paris.

2) One Tree, Natalie Jeremijenko: The artist cloned trees and planted the genetically identical seeds in several areas. Displays among other things the environmental conditions in each area where they are planted.

3) Dusty Relief, New Territories: here the building collects pollution around it, displayed as dust.

4) Project Reveal, New York Times R&D lab: this is a kind of magic mirror which wirelessly connects using facial recognition technology and gives you information about yourself. As you stand at the mirror in the morning you get that “come-to-jesus moment” according to Mark.

5) Million Dollar Blocks, Spatial Information Design Lab (SIDL): So there are crime stats for Google Maps, which are typically painful to look at. The SIDL is headed by Laura Kurgan, and in this piece she flipped the statistics. She went into the prison population data, and for every incarcerated person, she looked at their home address, measuring per home how much money the state was spending to keep the people who lived there in prison. She discovered that for some blocks the state was spending $1,000,000 to keep people in prison.

Moral of the above: just because you can put something on the map, doesn’t mean you should. Doesn’t mean there’s a new story. Sometimes you need to dig deeper and flip it over to get a new story.

New York Times lobby: Moveable Type

Mark walked us through a project he did with Ben Rubin for the NYTimes on commission (and he later went to the NYTimes on sabbatical). It’s in the lobby of their midtown headquarters at 8th and 42nd.

It consists of 560 text displays, two walls with 280 on each, and the idea is they cycle through various “scenes” which each have a theme and an underlying data science model.

For example, in one there are waves upon waves of digital ticker-tape like scenes which leave behind clusters of text, and where each cluster represents a different story from the paper. The text for a given story highlights phrases which make a given story different from others in some information-theory sense.

In another scene the numbers coming out of stories are highlighted, so you might see on a given box “18 gorillas”. In a third scene, crossword puzzles play themselves with sounds of pencil and paper.

The display boxes themselves are retro, with embedded linux processors running python, and a sound card on each box, which makes clicky sounds or wavy sounds or typing sounds depending on what scene is playing.

The data taken in is text from NY Times articles, blogs, and search engine activity.
Every sentence is parsed using Stanford NLP techniques, which diagram sentences.

Altogether there are about 15 “scenes” so far, and it’s code so one can keep adding to it. Here’s an interview with them about the exhibit:

Project Cascade: Lives on a Screen

Mark next told us about Cascade, which was joint work with Jer Thorp, data artist-in-residence at the New York Times. Cascade came about from thinking about how people share New York Times links on Twitter. It was in partnership with bitly.

The idea was to collect enough data so that we could see someone browse, encode the link in bitly, tweet that encoded link, see other people click on that tweet and see bitly decode the link, and then see those new people browse the New York Times. It’s a visualization of that entire process, much as Tarde suggested we should do.

There were of course data decisions to be made: a loose matching of tweets and clicks through time, for example. If 17 different tweets have the same url they don’t know which one you clicked on, so they guess (the guess actually seemed to involve probabilistic matching on time stamps so it’s an educated guess). They used the Twitter map of who follows who. If someone you follow tweets about something before you do then it counts as a retweet. It covers any nytimes.com link.

Here’s a NYTimes R&D video about Project Cascade:

Note: this was done 2 years ago, and Twitter has gotten a lot bigger since then.

Cronkite Plaza

Next Mark told us about something he was working on which just opened 1.5 months ago with Jer and Ben. It’s also news related, but this is projecting on the outside of a building rather than in the lobby; specifically, the communications building at UT Austin, in Cronkite Plaza.

The majority of the projected text is sourced from Cronkite’s broadcasts, but it also draws on local closed-captioned news sources.
One scene of this project has extracted the questions asked during local news – things like “how did she react?” or “What type of dog would you get?”. The project uses 6 projectors.

Goals of these exhibits

They are meant to be graceful and artistic, but should also teach something. At the same time we don’t want to be overly didactic. The aim is to live in between art and information. It’s a funny place: increasingly we see a flattening effect when tools are digitized and made available, so that statisticians can code like a designer (we can make things that look like design) and similarly designers can make something that looks like data.

What data can we get? Be a good investigator: a small polite voice which asks for data usually gets it.

eBay transactions and books

Again working jointly with Jer Thorp, Mark investigated a day’s worth of eBay’s transactions that went through Paypal and, for whatever reason, two years of book sales. How do you visualize this? Take a look at the yummy underlying data:

Here’s how they did it (it’s ingenious). They started with the text of Death of a Salesman by Arthur Miller. They used a mechanical turk mechanism to locate objects in the text that you can buy on eBay.

When an object is found it moves it to a special bin, so “chair” or “flute” or “table.” When it has a few collected buy-able objects, it then takes the objects and sees where they are all for sale on the day’s worth of transactions, and looks at details on outliers and such. After examining the sales, the code will find a zipcode in some quiet place like Montana.

Then it flips over to the book sales data, looks at all the books bought or sold in that zip code, picks a book (which is also on Project Gutenberg), and begins to read that book and collect “buyable” objects from that. And it keeps going. Here’s a video:

Public Theater Shakespeare Machine

The last thing Mark showed us is joint work with Rubin and Thorp, installed in the lobby of the Public Theater.
The piece itself is an oval structure with 37 bladed LED displays, set above the bar. There’s one blade for each of Shakespeare’s plays. Longer plays are in the long end of the oval, Hamlet you see when you come in.

The data input is the text of each play. Each scene does something different – for example, it might collect noun phrases that have something to do with the body from each play, so the “Hamlet” blade will only show a body phrase from Hamlet. In another scene, various kinds of combinations or linguistic constructs are mined:

• “high and mighty” “good and gracious” etc.
• “devilish-holy” “heart-sore” “ill-favored” “sea-tossed” “light-winged” “crest-fallen” “hard-favoured” etc.

Note here that the digital humanities, through the MONK Project, offered intense xml descriptions of the plays. Every single word is given hooha and there’s something on the order of 150 different parts of speech.

As Mark said, it’s Shakespeare so it stays awesome no matter what you do, but here we see we’re successively considering words as symbols, or as thematic, or as parts of speech. It’s all data.

Ian Wong from Square

Next Ian Wong, an “Inference Scientist” at Square who dropped out of an Electrical Engineering Ph.D. program at Stanford, talked to us about Data Science in Risk. He conveniently started with his takeaways:

1. Machine learning is not equivalent to R scripts. ML is founded in math, expressed in code, and assembled into software. You need to be an engineer and learn to write readable, reusable code: your code will be reread more times by other people than by you, so learn to write it so that others can read it.
2. Data visualization is not equivalent to producing a nice plot. Rather, think about visualizations as pervasive and part of the environment of a good company.
3. Together, they augment human intelligence.
We have limited cognitive abilities as human beings, but if we can learn from data, we create an exoskeleton, an augmented understanding of our world through data.

Square

Square was founded in 2009. There were 40 employees in 2010, and there are 400 now. The mission of the company is to make commerce easy.

Right now transactions are needlessly complicated. It takes too much to understand and to do, even to know where to start for a vendor. For that matter, it’s too complicated for buyers as well. The question we set out to ask is, how do we make transactions simple and easy?

We send out a white piece of plastic, which we refer to as the iconic square. It’s something you can plug into your phone or iPad. It’s simple and familiar, and it makes it easy to use and to sell.

It’s even possible to buy things hands-free using the square. A buyer can open a tab on their phone so that they can pay by saying their name. Then the merchant taps the buyer’s name on their screen. This makes sense if you are a frequent visitor to a certain store like a coffee shop.

Our goal is to make it easy for sellers to sign up for Square and accept payments. Of course, it’s also possible that somebody may sign up and try to abuse the service. We are therefore very careful at Square to avoid losing money on sellers with fraudulent intentions or bad business models.

The Risk Challenge

At Square we need to balance the following goals:

1. to provide a frictionless and delightful experience for buyers and sellers,
2. to fuel rapid growth, and in particular to avoid inhibiting growth through asking for too much information of new sellers, which adds needless barriers to joining, and
3. to maintain low financial loss.

Today we’ll just focus on the third goal through detection of suspicious activity. We do this by investing in machine learning and viz. We’ll first discuss the machine learning aspects.

Part 1: Detecting suspicious activity using machine learning

First of all, what’s suspicious?
Examples from the class included:

1. lots of micro transactions occurring,
2. signs of money laundering,
3. high frequency or inconsistent frequency of transactions.

Example: Say Rachel has a food truck, but then for whatever reason starts to have $1000 transactions (mathbabe can’t help but insert that Rachel might be a food douche which would explain everything).

On the one hand, if we let money go through, Square is liable in the case it was unauthorized. Technically the fraudster (in this case Rachel) would be liable, but our experience is that fraudsters are usually insolvent, so the loss ends up on Square.

On the other hand, the customer service is bad if we stop payment on what turn out to be real payments. After all, what if she’s innocent and we deny the charges? She will probably hate us, may even sully our reputation, and in any case our trust is lost with her after that.

This example crystallizes the important challenges we face: false positives erode customer trust, false negatives make us lose money.

And since Square processes millions of dollars worth of sales per day, we need to do this systematically and automatically. We need to assess the risk level of every event and entity in our system.

So what do we do?

First of all, we take a look at our data. We’ve got three types:

1. payment data, where the fields are transaction_id, seller_id, buyer_id, amount, success (0 or 1), timestamp,
2. seller data, where the fields are seller_id, sign_up_date, business_name, business_type, business_location,
3. settlement data, where the fields are settlement_id, state, timestamp.

Important fact: we settle to our customers the next day so we don’t have to make our decision within microseconds. We have a few hours. We’d like to do it quickly of course, but in certain cases we have time for a phone call to check on things.

So here’s the process: given a bunch (as in hundreds or thousands) of payment events, we throw each through the risk engine, and then send some iffy looking ones on to a “manual review”. An ops team will then review the cases on an individual basis. Specifically, anything that looks rejectable gets sent to ops, who make phone calls to double check unless it’s super outrageously obviously fraud.
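
To make the routing concrete, here is a minimal sketch of that process. Everything specific in it is an assumption for illustration: the `Payment` fields are a subset of the payment data listed above, the two thresholds are made up, and `toy_score` is a stand-in for whatever model the real risk engine uses, which Ian did not describe.

```python
from dataclasses import dataclass

# Hypothetical thresholds: the talk gave no numbers, these are for illustration only.
REVIEW_THRESHOLD = 0.5    # at or above this, the payment looks "iffy"
REJECT_THRESHOLD = 0.99   # at or above this, we treat it as obvious fraud


@dataclass
class Payment:
    transaction_id: str
    seller_id: str
    amount: float


def route(payment, score_fn):
    """Return 'accept', 'manual_review', or 'reject' for one payment event."""
    score = score_fn(payment)
    if score >= REJECT_THRESHOLD:
        return "reject"
    if score >= REVIEW_THRESHOLD:
        return "manual_review"
    return "accept"


def toy_score(payment):
    """Stand-in risk model: bigger payments look riskier (an assumption, not Square's model)."""
    return min(payment.amount / 1000.0, 1.0)


payments = [
    Payment("t1", "food_truck", 8.50),
    Payment("t2", "food_truck", 700.00),
    Payment("t3", "food_truck", 5000.00),
]
decisions = {p.transaction_id: route(p, toy_score) for p in payments}
```

Only the middle bin costs ops time, so where you put the two thresholds is really a decision about how many phone calls you can afford.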

Also, to be clear, there are actually two kinds of fraud to worry about, seller-side fraud and buyer-side fraud. For the purpose of this discussion, we’ll focus on the former.

So now it’s a question of how we set up the risk engine. Note that we can think of the risk engine as putting things in bins, and those bins each have labels. So we can call this a labeling problem.

But that kind of makes it sound like unsupervised learning, like a clustering problem, and although it shares some properties with that, it’s certainly not that simple – we don’t reject a payment and then merely stand pat with that label, because as we discussed we send it on to an ops team to assess it independently. So in actuality we have a pretty complicated set of labels, including for example:

• initially rejected but ok,
• initially rejected and bad,
• initially accepted but on further consideration might have been bad,
• initially accepted and things seem ok,
• initially accepted and later found to be bad, …

So in other words we have ourselves a semi-supervised learning problem, straddling the worlds of supervised and unsupervised learning. We first check our old labels, and modify them, and then use them to help cluster new events using salient properties and attributes common to historical events whose labels we trust. We are constantly modifying our labels even in retrospect for this reason.

We estimate performance using precision and recall. Note there are very few positive examples so accuracy is not a good metric of success, since the “everything looks good” model is dumb but has good accuracy.
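
A small worked example shows why. The numbers below are made up (1000 payments, 10 of them fraudulent); the point is only that the “everything looks good” model wins on accuracy while being useless on precision and recall.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall_accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0   # of what we flagged, how much was fraud
    recall = tp / (tp + fn) if tp + fn else 0.0      # of the fraud, how much we flagged
    accuracy = (tp + tn) / len(y_true)
    return precision, recall, accuracy


# Made-up data: 10 fraudulent payments followed by 990 legitimate ones.
y_true = [1] * 10 + [0] * 990

# The dumb model: nothing is ever flagged.
everything_ok = [0] * 1000
dumb = precision_recall_accuracy(y_true, everything_ok)

# A model that flags 20 payments and catches 8 of the 10 frauds.
flagged = [1] * 8 + [0] * 2 + [1] * 12 + [0] * 978
useful = precision_recall_accuracy(y_true, flagged)
```

Here the dumb model scores 99% accuracy with zero precision and recall, while the useful model has slightly lower accuracy (98.6%) but precision 0.4 and recall 0.8.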

Labels are what Ian considered to be the “neglected half of the data” (recall T = {(x_i, y_i)}). In undergrad statistics education and in data mining competitions, the availability of labels is often taken for granted. In reality, labels are tough to define and capture. Labels are really important. It’s not just the objective function, it is the objective.

As is probably familiar to people, we have a problem with sparsity of features. This is exacerbated by class imbalance (i.e., there are few positive samples). We also don’t know the same information for all of our sellers, especially when we have new sellers. But if we are too conservative we start off on the wrong foot with new customers.

Also, we might have a data point, say zipcode, for every seller, but we don’t have enough information in knowing the zipcode alone because so few sellers share zipcodes. In this case we want to do some clever binning of the zipcodes, which is something like a sub-model of our model.
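
Ian didn’t say how Square bins zipcodes, so the following is only one plausible sketch: group sellers by the first three digits of the zipcode (an assumption on my part) and compute a fraud rate per bin, which could then be fed to the main model as a feature. The seller data is made up.

```python
from collections import defaultdict


def bin_zipcode(zipcode, prefix_len=3):
    """Coarsen a zipcode to its leading digits (the prefix length is an assumption)."""
    return zipcode[:prefix_len]


def fraud_rate_by_bin(sellers, prefix_len=3):
    """sellers: iterable of (zipcode, is_fraud) pairs. Returns {bin: (count, fraud_rate)}."""
    counts = defaultdict(int)
    frauds = defaultdict(int)
    for zipcode, is_fraud in sellers:
        b = bin_zipcode(zipcode, prefix_len)
        counts[b] += 1
        frauds[b] += int(is_fraud)
    return {b: (counts[b], frauds[b] / counts[b]) for b in counts}


# Made-up sellers: no two share a full zipcode, but the bins pool them.
sellers = [("10027", 0), ("10025", 1), ("10012", 0), ("59715", 0), ("59718", 0)]
rates = fraud_rate_by_bin(sellers)
```

The count is kept next to each rate on purpose: a rate computed from two sellers deserves much less trust than one computed from two thousand.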

Finally, and this is typical for predictive algorithms, we need to tweak our algorithm to optimize it: we need to consider whether features interact linearly or non-linearly, and to account for class imbalance. We also have to be aware of adversarial behavior. An example of adversarial behavior in e-commerce is new buyer fraud, where a given person sets up 10 new accounts with slightly different spellings of their name and address.

Since models degrade over time, as people learn to game them, we need to continually retrain models. The keys to building performant models are as follows:

• It’s not a black box. You can’t build a good model by assuming that the algorithm will take care of everything. For instance, I need to know why I am misclassifying certain people, so I’ll need to roll up my sleeves and dig into my model.
• We need to perform rapid iterations of testing, with experiments like you’d do in a science lab. If you’re not sure whether to try A or B, then try both.
• When you hear someone say, “So which models or packages do you use?” then you’ve got someone who doesn’t get it. Models and/or packages are not a magic potion.

Mathbabe cannot resist paraphrasing Ian here as saying “It’s not about the package, it’s about what you do with it.” But what Ian really thinks it’s about, at least for code, is:

• reusability
• correctness
• structure
• hygiene

So, if you’re coding a random forest algorithm and you’ve hardcoded the number of trees: you’re an idiot. Put a friggin parameter there so people can reuse it. Make it tweakable. And write the tests for pity’s sake; clean code and clarity of thought go together.
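
Here is a toy version of what that looks like. To keep it self-contained it is not a real random forest: it bags one-feature decision stumps, which is my simplification, and the names `fit_ensemble`, `n_trees` and `seed` are hypothetical. The point is the shape of the code: the number of trees and the random seed are parameters, and there is a test.

```python
import random


def fit_stump(rows):
    """Fit a threshold rule on one feature; rows are (x, label) pairs with labels 0 or 1."""
    best = None
    for threshold, _ in rows:
        for label_above in (0, 1):   # the label predicted when x > threshold
            errors = sum(
                1 for x, y in rows
                if (label_above if x > threshold else 1 - label_above) != y
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, label_above)
    return best[1], best[2]


def fit_ensemble(rows, n_trees=100, seed=0):
    """Bagging: each stump is fit on its own bootstrap resample of the rows."""
    rng = random.Random(seed)
    return [fit_stump([rng.choice(rows) for _ in rows]) for _ in range(n_trees)]


def predict(ensemble, x):
    """Majority vote over the stumps."""
    votes = sum(label_above if x > threshold else 1 - label_above
                for threshold, label_above in ensemble)
    return 1 if 2 * votes > len(ensemble) else 0


def test_ensemble():
    rows = [(x, 0) for x in range(10)] + [(x, 1) for x in range(10, 20)]
    assert len(fit_ensemble(rows, n_trees=7)) == 7            # the parameter is respected
    assert fit_ensemble(rows, n_trees=7, seed=1) == fit_ensemble(rows, n_trees=7, seed=1)
    model = fit_ensemble(rows, n_trees=25, seed=1)
    assert predict(model, 0) == 0 and predict(model, 19) == 1


test_ensemble()
```

Because `seed` is a parameter too, the test is reproducible, which is what makes it worth writing.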

At Square we try to maintain reusability and readability — we structure our code in different folders with distinct, reusable components that provide semantics around the different parts of building a machine learning model: model, signal, error, experiment.

We only write scripts in the experiments folder where we either tie together components from model, signal and error or we conduct exploratory data analysis. It’s more than just a script, it’s a way of thinking, a philosophy of approach.

What does such a discipline give you? Every time you run an experiment you should incrementally increase your knowledge. This discipline helps you make sure you don’t do the same work again. Without it you can’t even figure out the things you or someone else has already attempted.

For more on what every project directory should contain, see Project Template, written by John Myles White.

We had a brief discussion of how reading other people’s code is a huge problem, especially when we don’t even know what clean code looks like. Ian stayed firm on his claim that “if you don’t write production code then you’re not productive.”

In this light, Ian suggests exploring and actively reading Github’s repository of R code. He says to try writing your own R package after reading this. Also, he says that developing an aesthetic sense for code is analogous to acquiring the taste for beautiful proofs; it’s done through rigorous practice and feedback from peers and mentors. The problem is, he says, that statistics instructors in schools usually do not give feedback on code quality, nor are they qualified to.

For extra credit, Ian suggests the reader contrast the implementations of the caret package (poor code) with scikit-learn (clean code).

Important things Ian skipped

• how is a model “productionized”?
• how are features computed in real-time to support these models?
• how do we make sure “what we see is what we get”, meaning the features we build in a training environment will be the ones we see in real-time. Turns out this is a pretty big problem.
• how do you test a risk engine?

Next Ian talked to us about how Square uses visualization.

Data Viz at Square

Ian talked to us about a bunch of different ways the Inference Team at Square uses visualizations to monitor the transactions going on at any given time. He mentioned that these monitors aren’t necessarily trying to predict fraud per se but rather provide a way of keeping an eye on things to look for trends and patterns over time, and serve as the kind of “data exoskeleton” that he mentioned at the beginning. People at Square believe in ambient analytics, which means passively ingesting data constantly so you develop a visceral feel for it.

After all, it is only by becoming very familiar with our data that we even know what kind of patterns are unusual or deserve their own model. To go further into the philosophy of this approach, he said two things:

“What gets measured gets managed,” and “You can’t improve what you don’t measure.”

He described a workflow tool to review users, which shows features of the seller, including the history of sales and geographical information, reviews, contact info, and more. Think mission control.

In addition to the raw transactions, there are risk metrics that Ian keeps a close eye on. So for example he monitors the “clear rates” and “freeze rates” per day, as well as how many events needed to be reviewed. Using his fancy viz system he can get down to which analysts froze the most today and how long each account took to review, and what attributes indicate a long review process.

In general people at Square are big believers in visualizing business metrics (sign-ups, activations, active users, etc.) in dashboards; they think it leads to more accountability and better improvement of models as they degrade. They run a kind of constant EKG of their business through ambient analytics.

Ian ended with his data scientist profile. He thinks it should be on a logarithmic scale, since it doesn’t take very long to be okay at something (good enough to get by) but it takes lots of time to get from good to great. He believes that productivity should also be measured in log-scale, and his argument is that leading software contributors crank out packages at a much higher rate than other people.

Ian’s advice to aspiring data scientists

1. play with real data
2. build a good foundation in school
3. get an internship
4. be literate, not just in statistics
5. stay curious

Ian’s thought experiment

Suppose you know about every single transaction in the world as it occurs. How would you use that data?

## Strata: one down, one to go

Yesterday I gave a talk called “Finance vs. Machine Learning” at Strata. It was meant to be a smack-down, but for whatever reason I couldn’t engage people to personify the two disciplines and have a wrestling match on stage. For the record, I offered to be on either side. Either they were afraid to hurt a girl or they were afraid to lose to a girl, you decide.

Unfortunately I didn’t actually get to the main motivation for the genesis of this talk, namely the realization I had a while ago that when machine learners talk about “ridge regression” or “Tikhonov regularization” or even “L2 regularization” it comes down to the same thing that quants call a very simple bayesian prior that your coefficients shouldn’t be too large. I talked about this here.

What I did have time for: I talked about “causal modeling” in the finance-y sense (discussion of finance vs. statistician definition of causal here), exponential downweighting with a well-chosen decay, storytelling as part of feature selection, and always choosing to visualize everything, and always visualizing the evolution of a statistic rather than a snapshot statistic.

They videotaped me but I don’t see it on the strata website yet. I’ll update if that happens.

This morning, at 9:35, I’ll be in a keynote discussion with Julie Steele for 10 minutes entitled “You Can’t Learn That in School”, which will be live streamed. It’s about whether data science can and should be taught in academia.

For those of you wondering why I haven’t blogged the Columbia Data Science class like I usually do Thursday, these talks are why. I’ll get to it soon, I promise! Last night’s talks by Mark Hansen, data vizzer extraordinaire and Ian Wong, Inference Scientist from Square, were really awesome.

## Columbia Data Science course, week 7: Hunch.com, recommendation engines, SVD, alternating least squares, convexity, filter bubbles

Last night in Rachel Schutt’s Columbia Data Science course we had Matt Gattis come and talk to us about recommendation engines. Matt graduated from MIT in CS, worked at SiteAdvisor, and co-founded Hunch as its CTO, which recently got acquired by eBay. Here’s what Matt had to say about his company:

Hunch

Hunch is a website that gives you recommendations of any kind. When we started out it worked like this: we’d ask you a bunch of questions (people seem to love answering questions), and then you could ask the engine questions like, what cell phone should I buy? or, where should I go on a trip? and it would give you advice. We use machine learning to learn and to give you better and better advice.

Later we expanded into more of an API where we crawled the web for data rather than asking people direct questions. We can also be used by third parties to personalize content for a given site, a nice business proposition which led eBay to acquire us. My role there was doing the R&D for the underlying recommendation engine.

Matt has been building code since he was a kid, so he considers software engineering to be his strong suit. Hunch is a cross-domain experience so he doesn’t consider himself a domain expert in any focused way, except for recommendation systems themselves.

The best quote Matt gave us yesterday was this: “Forming a data team is kind of like planning a heist.” He meant that you need people with all sorts of skills, and that one person probably can’t do everything by herself. Think Ocean’s Eleven but sexier.

A real-world recommendation engine

You have users, and you have items to recommend. Each user and each item has a node to represent it. Generally users like certain items. We represent this as a bipartite graph. The edges are “preferences”. They could have weights: they could be positive, negative, or on a continuous scale (or discontinuous but many-valued like a star system). The implications of this choice can be heavy but we won’t get too into them today.

So you have all this training data in the form of preferences. Now you wanna predict other preferences. You can also have metadata on users (i.e. know they are male or female, etc.) or on items (a product for women).

For example, imagine users came to your website. You may know each user’s gender, age, whether they’re liberal or conservative, and their preferences for up to 3 items.

We represent a given user as a vector of features, sometimes including only their meta data, sometimes including only their preferences (which would lead to a sparse vector since you don’t know all their opinions) and sometimes including both, depending on what you’re doing with the vector.

Nearest Neighbor Algorithm?

Let’s review the nearest neighbor algorithm (discussed here): if we want to predict whether a user A likes something, we just look at the user B closest to user A who has an opinion and we assume A’s opinion is the same as B’s.

To implement this you need a definition of a metric so you can measure distance. One example: Jaccard distance, i.e. one minus the number of preferences two users have in common divided by the total number of things either of them has a preference for. Other examples: cosine similarity or euclidean distance. Note: you might get a different answer depending on which metric you choose.
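
For concreteness, here is a minimal sketch of the three measures in plain Python. I’m using the usual definition of Jaccard distance as one minus the shared fraction, treating each user as the set of items they like; the users and items are made up.

```python
import math


def jaccard_distance(a, b):
    """a, b: sets of item ids. 0 means identical tastes, 1 means no overlap."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def cosine_similarity(u, v):
    """u, v: equal-length numeric vectors. Larger means more similar."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def euclidean_distance(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


alice = {"item1", "item2", "item3"}
bob = {"item2", "item3", "item4"}
d = jaccard_distance(alice, bob)   # 2 shared out of 4 total, so 1 - 2/4
```

Note that cosine similarity is a similarity rather than a distance, so “closest” means largest there and smallest for the other two.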

What are some problems using nearest neighbors?

• There are too many dimensions, so the closest neighbors are too far away from each other. There are tons of features, moreover, that are highly correlated with each other. For example, you might imagine that as you get older you become more conservative. But then counting both age and politics would mean you’re double counting a single feature in some sense. This would lead to bad performance, because you’re using redundant information. So we need to build in an understanding of the correlation and project onto a smaller dimensional space.
• Some features are more informative than others. Weighting features may therefore be helpful: maybe your age has nothing to do with your preference for item 1. Again you’d probably use something like covariances to choose your weights.
• If your vector (or matrix, if you put together the vectors) is too sparse, or you have lots of missing data, then most things are unknown and the Jaccard distance means nothing because there’s no overlap.
• There’s measurement (reporting) error: people may lie.
• There’s a calculation cost – computational complexity.
• Euclidean distance also has a scaling problem: age differences, measured in years, outweigh other differences if those are reported as 0 (for don’t like) or 1 (for like). Essentially this means that raw euclidean distance doesn’t explicitly optimize how much each feature should count.
• Also, old and young people might think one thing but middle-aged people something else. We seem to be assuming a linear relationship but it may not exist.
• User preferences may also change over time, which falls outside the model. For example, at eBay, a user might buy a printer, which makes them want ink only for a short time.
• Overfitting is also a problem. The one guy is closest, but it could be noise. How do you adjust for that? One idea is to use k-nearest neighbor, with say k=5.
• It’s also expensive to update the model as you add more data.
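
The overfitting point in the list above is easy to see in a toy example. This is a sketch with made-up users: each has two features (age divided by 100, and politics as 0 or 1, both my choices) and an opinion of 0 or 1 on some item. The single nearest neighbor happens to be a noisy one, and a vote among k=5 neighbors overrules it.

```python
import math
from collections import Counter


def knn_predict(target, neighbors, k=5):
    """neighbors: list of (feature_vector, opinion) pairs for users who have an opinion."""
    ranked = sorted(neighbors, key=lambda pair: math.dist(target, pair[0]))
    votes = Counter(opinion for _, opinion in ranked[:k])
    return votes.most_common(1)[0][0]


# Made-up users: (age / 100, politics), opinion on the item.
neighbors = [
    ([0.25, 0.0], 1),
    ([0.27, 0.0], 1),
    ([0.24, 0.0], 1),
    ([0.26, 0.0], 0),   # the closest user, but an outlier among similar users
    ([0.30, 1.0], 0),
    ([0.60, 1.0], 0),
    ([0.65, 1.0], 0),
]
target = [0.26, 0.0]

one_neighbor = knn_predict(target, neighbors, k=1)    # copies the outlier
five_neighbors = knn_predict(target, neighbors, k=5)  # majority of similar users
```

Dividing age by 100 is also a crude answer to the scaling problem above: without it, a few years of age difference would swamp the 0/1 politics feature.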

Matt says the biggest issues are overfitting and the “too many dimensions” problem. He’ll explain how he deals with them.

Going beyond nearest neighbor: machine learning/classification

In its most basic form, we can model each item separately using a linear regression. Denote by $f_{i, j}$ user $i$’s preference for item $j$ (or attribute, if item $j$ is a metadata item). Say we want to model a given user’s preferences for a given item using only the 3 metadata properties of that user, which we assume are numeric. Writing $p_i$ for user $i$’s preference for the item in question and $f_{i, 1}, f_{i, 2}, f_{i, 3}$ for that user’s three metadata values, we can look for the best choice of $\beta_k$ as follows:

$p_i = \beta_1 f_{i, 1} + \beta_2 f_{i, 2} + \beta_3 f_{i, 3} + \epsilon$

Remember, this model only works for one item. We need to build as many models as we have items. We know how to solve the above per item by linear algebra. Indeed one of the drawbacks is that we’re not using other items’ information at all to create the model for a given item.
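
Here is a minimal sketch of “one regression per item” using NumPy. The data is simulated, and for simplicity I assume every user has rated every item, which is exactly what you don’t have in practice (there, each item’s regression would only use the users who rated it).

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items = 200, 4

# Rows are users, columns are the 3 numeric metadata features.
F = rng.normal(size=(n_users, 3))

# Made-up "true" coefficients, one column per item, plus noise in the preferences.
true_B = rng.normal(size=(3, n_items))
P = F @ true_B + 0.1 * rng.normal(size=(n_users, n_items))

# One independent least-squares fit per item: item j only ever sees column j of P.
B_hat = np.column_stack([
    np.linalg.lstsq(F, P[:, j], rcond=None)[0] for j in range(n_items)
])
```

The loop makes the drawback visible: nothing learned about item 0 can help with item 1.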

This solves the “weighting of the features” problem we discussed above, but overfitting is still a problem, and it comes in the form of having huge coefficients when we don’t have enough data (i.e. not enough opinions on given items). We have a bayesian prior that these weights shouldn’t be too far out of whack, and we can implement this by adding a penalty term for really large coefficients.

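This ends up being equivalent to adding a prior matrix to the covariance matrix: for the standard squared penalty weighted by a parameter $\lambda,$ the prior matrix is $\lambda$ times the identity. How do you choose $\lambda$? Experimentally: use some data as your training set, evaluate how well you did on the rest using particular values of $\lambda$, and adjust.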
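
In matrix form, and assuming the standard squared (L2) penalty, the penalized problem is $\min_\beta \|p - F\beta\|^2 + \lambda \|\beta\|^2$ with solution $\beta = (F^{\tau} F + \lambda I)^{-1} F^{\tau} p,$ where $F$ is the user-by-feature matrix. The sketch below implements that and picks $\lambda$ from a hypothetical candidate list using a held-out set; the data is simulated.

```python
import numpy as np


def ridge_fit(F, p, lam):
    """Closed-form ridge solution: (F^T F + lam * I)^{-1} F^T p."""
    return np.linalg.solve(F.T @ F + lam * np.eye(F.shape[1]), F.T @ p)


def choose_lambda(F_train, p_train, F_val, p_val, candidates):
    """Pick the candidate with the lowest squared error on held-out data."""
    def val_error(lam):
        beta = ridge_fit(F_train, p_train, lam)
        return float(np.mean((p_val - F_val @ beta) ** 2))
    return min(candidates, key=val_error)


rng = np.random.default_rng(1)
F = rng.normal(size=(12, 3))   # only 12 opinions on this item: easy to overfit
p = F @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=12)

candidates = [0.0, 0.1, 1.0, 10.0, 100.0]   # hypothetical grid
norms = [float(np.linalg.norm(ridge_fit(F, p, lam))) for lam in candidates]
best_lam = choose_lambda(F[:8], p[:8], F[8:], p[8:], candidates)
```

The list `norms` shrinks as $\lambda$ grows, which is the prior “these weights shouldn’t be too far out of whack” doing its job.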

Important technical note: You can’t use this penalty term for large coefficients and assume the “weighting of the features” problem is still solved, because in fact you’re implicitly penalizing some coefficients more than others. The easiest way to get around this is to normalize your variables before entering them into the model, similar to how we did it in this earlier class.

The dimensionality problem

We still need to deal with this very large problem. We typically use both Principal Component Analysis (PCA) and Singular Value Decomposition (SVD).

To understand how this works, let’s talk about how we reduce dimensions and create “latent features” internally every day. For example, we invent concepts like “coolness” – but I can’t directly measure how cool someone is, like I could weigh them or something. Different people exhibit patterns of behavior which we internally map to our one dimension of “coolness”.

We let the machines do the work of figuring out what the important “latent features” are. We expect them to explain the variance in the answers to the various questions. The goal is to build a model which has a representation in a lower dimensional subspace which gathers “taste information” to generate recommendations.

SVD

Given a matrix $X,$ decompose it into three matrices:

$X = U S V^{\tau}.$

Here $X$ is $m \times n, U$ is $m \times k, S$ is $k\times k,$ and $V^{\tau}$ is $k\times n$ (so $V$ is $n \times k$), where $m$ is the number of users, $n$ is the number of items, and $k$ is the rank of $X.$

The rows of $U$ correspond to users, whereas $V$ has a row for each item. The square matrix $S$ is diagonal, and each entry is a singular value, which measures the importance of the corresponding dimension. If we put them in decreasing order, which we do, then the dimensions are ordered by importance from highest to lowest. Every matrix has such a decomposition.

Important properties:

• The columns of $U$ and $V$ are orthogonal to each other.
• So we can order the columns by singular values.
• We can take a lower rank approximation of $X$ by throwing away part of $S.$ In this way we might have $k$ much smaller than either $n$ or $m$, and this is what we mean by compression.
• There is an important interpretation to the values in the matrices $U$ and $V.$ For example, we can see, by using SVD, that “the most important latent feature” is often something like seeing if you’re a man or a woman.

[Question: did you use domain expertise to choose questions at Hunch? Answer: we tried to make them as fun as possible. Then, of course, we saw things needing to be asked which would be extremely informative, so we added those. In fact we found that we could ask merely 20 questions and then predict the rest of them with 80% accuracy. They were questions that you might imagine and some that surprised us, like competitive people v. uncompetitive people, introverted v. extroverted, thinking v. perceiving, etc., not unlike MBTI.]

More details on our encoding:

• Most of the time the questions are binary (yes/no).
• We create a separate variable for every question.
• Comparison questions may be better at granular understanding, and get to revealed preferences, but we don’t use them.

Note if we have a rank $k$ matrix $X$ and we use the SVD above, we can take the approximation with only the top $k-3$ singular values of the middle matrix $S,$ so in other words we take the top $k-3$ most important latent features, and the corresponding columns of $U$ and $V,$ and we get back something very close to $X,$ provided the three singular values we dropped are small.
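
A quick numerical sketch of this truncation, using `numpy.linalg.svd`. The matrix is simulated: 30 users by 20 items, built from 5 latent features plus a little noise (all of these numbers are made up), so keeping the top 5 singular values should recover almost everything.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, n_latent = 30, 20, 5

# A users-by-items matrix that is secretly rank 5, plus small noise.
X = rng.normal(size=(m, n_latent)) @ rng.normal(size=(n_latent, n))
X += 0.01 * rng.normal(size=(m, n))

# s holds the singular values in decreasing order; Vt is V transposed.
U, s, Vt = np.linalg.svd(X, full_matrices=False)


def truncate(U, s, Vt, d):
    """Keep only the top d singular values and the matching columns of U and V."""
    return U[:, :d] @ np.diag(s[:d]) @ Vt[:d, :]


errors = {
    d: float(np.linalg.norm(X - truncate(U, s, Vt, d)) / np.linalg.norm(X))
    for d in (1, 2, 5, 20)
}
```

The relative error in `errors` drops sharply once `d` reaches the number of real latent features and is essentially zero when all 20 singular values are kept.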

Note that the problem of sparsity or missing data is not fixed by the above SVD approach, nor is the computational complexity problem; SVD is expensive.

PCA

Now we’re still looking for $U$ and $V$ as above, but we don’t have $S$ anymore, so $X = U \cdot V^{\tau},$ and we have a more general optimization problem. Specifically, we want to minimize:

$\mathrm{argmin}_{U, V} \sum_{(i, j) \in P} (p_{i, j} - u_i \cdot v_j)^2.$

Let me explain. We denote by $u_i$ the row of $U$ corresponding to user $i,$ and similarly we denote by $v_j$ the row of $V$ corresponding to item $j.$ Items can include meta-data information (so the age vectors of all the users will be a row in $V$).

Then the dot product $u_i \cdot v_j$ is taken to mean the predicted value of user $i$’s preference for item $j,$ and we compare that to the actual preference $p_{i, j}$. The set $P$ is just the set of all actual known preferences or meta-data attribution values.

So, we want to find the best choices of $U$ and $V$ which overall minimize the squared differences between prediction and observation on everything we actually know, and the idea is that if it’s really good on stuff we know, it will also be good on stuff we’re guessing.

Now we have a parameter, namely the number $D$ which is how many latent features we want to use. The matrix $U$ will have a row for each user and a column for each latent feature, and the matrix $V$ will have a row for each item and a column for each latent feature.

How do we choose $D?$ It’s typically about 100, since it’s more than 20 (we already know we had a pretty good grasp on someone if we ask them 20 questions) and it’s as much as we care to add before it’s computationally too much work. Note the resulting latent features will be uncorrelated, since they are solving an efficiency problem (not a proof).

But how do we actually find $U$ and $V?$

Alternating Least Squares

This optimization doesn’t have a nice closed formula like ordinary least squares with one set of coefficients. Instead, we use an iterative algorithm, as with gradient descent. As long as your problem is convex you’ll converge ok (i.e. you won’t find yourself at a local but not global minimum), and we will force our problem to be convex using regularization.

Algorithm:

• Pick a random $V$
• Optimize $U$ while $V$ is fixed
• Optimize $V$ while $U$ is fixed
• Keep doing the above two steps until you’re not changing very much at all.

Example: Fix $V$ and update $U.$

The way we do this optimization is user by user. So for user $i,$ we want to find

$\mathrm{argmin}_{u_i} \sum_{j \in P_i} (p_{i, j} - u_i \cdot v_j)^2,$

where $v_j$ is fixed. In other words, we just care about this user for now.

But wait a minute, this is the same as linear least squares, and has a closed form solution! In other words, set:

$u_i = (V_{*, i}^{\tau} V_{*, i})^{-1} V_{*, i}^{\tau} P_{* i},$

where $V_{*, i}$ is the subset of $V$ for which we have preferences coming from user $i.$ Taking the inverse is easy since it’s $D \times D,$ which is small. And there aren’t that many preferences per user, so solving this many times is really not that hard. Overall we’ve got a do-able update for $U.$
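To make the closed form concrete, here is a minimal NumPy sketch of the update for a single user. The function name `update_user` and the toy numbers are made up for illustration, and the sketch assumes the user has at least $D$ known preferences so that the $D \times D$ matrix can be inverted.

```python
import numpy as np

def update_user(V, item_ids, prefs):
    """Least-squares solution for one user's latent vector u_i.

    V        : (num_items, D) item matrix, held fixed
    item_ids : indices j of the items this user has known preferences for
    prefs    : the known preferences p_{i,j}, in the same order
    """
    V_i = V[item_ids]                       # rows of V for this user's items
    gram = V_i.T @ V_i                      # D x D, so cheap to solve
    rhs = V_i.T @ np.asarray(prefs, dtype=float)
    return np.linalg.solve(gram, rhs)       # (V_i^T V_i)^{-1} V_i^T p

# Toy data (made up): 4 items, D = 2 latent features.
V = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [2.0, 1.0]])

# The user rated items 0, 1 and 2; these numbers are fit exactly by u = (3, 1).
u = update_user(V, item_ids=[0, 1, 2], prefs=[3.0, 1.0, 4.0])
print(u)
```

Solving the linear system with `np.linalg.solve` rather than forming the inverse explicitly is a design choice of the sketch, not something from the lecture; the result is the same.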

When you fix U and optimize V, it’s analogous; you only ever have to consider the users that rated that movie, which may be pretty large, but you’re only ever inverting a $D \times D$ matrix.

Another cool thing: since each user is only dependent on their item’s preferences, we can parallelize this update of $U$ or $V.$ We can run it on as many different machines as we want to make it fast.

There are lots of different versions of this. Sometimes you need to extend it to make it work in your particular case.

Note: as stated this is not actually convex, but similar to the regularization we did for least squares, we can add a penalty for large entries in $U$ and $V,$ depending on some parameter $\lambda,$ which again translates to the same thing, i.e. adding a diagonal matrix to the covariance matrix, when you solve least squares. This makes the problem convex if $\lambda$ is big enough.
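Putting the pieces together, the whole loop might look roughly like the sketch below. It is only an illustration under assumptions: the function names, the dictionary format for the known preferences, the fixed number of sweeps (instead of a "not changing very much" test), and the toy data are all made up here. The penalty shows up as `lam * np.eye(D)` added to the $D \times D$ matrix before solving.

```python
import numpy as np

def als(known, num_users, num_items, D=2, lam=0.1, sweeps=20, seed=0):
    """known: dict mapping (i, j) -> p_ij for the observed preferences only."""
    rng = np.random.default_rng(seed)
    U = np.zeros((num_users, D))
    V = rng.normal(size=(num_items, D))     # pick a random V

    # Index the known preferences by user and by item.
    by_user = {i: [] for i in range(num_users)}
    by_item = {j: [] for j in range(num_items)}
    for (i, j), p in known.items():
        by_user[i].append((j, p))
        by_item[j].append((i, p))

    def solve_rows(target, fixed, index):
        # Regularized least squares, one row (user or item) at a time.
        for r, pairs in index.items():
            if not pairs:
                continue                    # nothing known about this row
            ids = [k for k, _ in pairs]
            p = np.array([v for _, v in pairs], dtype=float)
            A = fixed[ids]
            gram = A.T @ A + lam * np.eye(D)    # diagonal term from the penalty
            target[r] = np.linalg.solve(gram, A.T @ p)

    for _ in range(sweeps):
        solve_rows(U, V, by_user)           # optimize U while V is fixed
        solve_rows(V, U, by_item)           # optimize V while U is fixed
    return U, V

def squared_error(known, U, V):
    return sum((p - U[i] @ V[j]) ** 2 for (i, j), p in known.items())

# Toy data (made up): 3 users, 3 items, 6 known preferences.
known = {(0, 0): 5.0, (0, 1): 1.0, (1, 0): 4.0,
         (1, 2): 2.0, (2, 1): 1.0, (2, 2): 5.0}
U, V = als(known, num_users=3, num_items=3)
print(squared_error(known, U, V))
```

Each row update only touches its own row of `U` or `V`, which is why the inner loop could be farmed out to separate machines.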

You can add new users, new data, keep optimizing U and V. You can choose which users you think need more updating. Or if they have enough ratings, you can decide not to update the rest of them.

As with any machine learning model, you should perform cross-validation for this model – leave out a bit and see how you did. This is a way of testing overfitting problems.

Thought experiment – filter bubbles

What are the implications of using error minimization to predict preferences? How does presentation of recommendations affect the feedback collected?

For example, can we end up in local maxima with rich-get-richer effects? In other words, does showing certain items at the beginning “give them an unfair advantage” over other things? And so do certain things just get popular or not based on luck?

How do we correct for this?

## Causal versus causal

Today I want to talk about the different ways the word “causal” is thrown around by statisticians versus finance quants, because it’s both confusing and really interesting.

But before I do, can I just take a moment to be amazed at how pervasive Gangnam Style has become? When I first posted the video on August 1st, I had no idea how much of a sensation it was destined to become. Here’s the Google trend graph for “Gangnam” versus “Obama”:

It really hit home last night as I was reading a serious Bloomberg article take on the economic implications of Gangnam Style whilst the song was playing in the background at the playoff game between the Cardinals and the Giants.

Back to our regularly scheduled program. I’m first going to talk about how finance quants think about “causal models” and second how statisticians do. This has come out of conversations with Suresh Naidu and Rachel Schutt.

Causal modeling in finance

Causal modeling in statistics

By contrast, when statisticians talk about a causal model, they mean something very different. Namely, they mean whether the model shows that something caused something else to happen. For example, if we saw certain plants in a certain soil all died but those in a different soil lived, then they’d want to know if the soil caused the death of the plants. Usually, to answer this kind of question in an ideal situation, statisticians set up randomized experiments where the only difference between the treatments is that one condition (e.g. the type of soil, but not how often you water it or the type of sunlight it gets). When they can’t set it up perfectly (say because it involves people dying instead of plants) they do the best they can.

The differences and commonalities

On the one hand both concepts refer to and depend on time. There’s no way X caused Y to happen if X happened after Y. But whereas in finance we only care about time, in statistics there’s more to it.

So for example, if there’s a third underlying thing that causes both X and Y, but X happens before Y, then the finance people are psyched because they have a way of betting on the direction of Y: just keep an eye on X! But the statisticians are not amused, since there’s no way to prove causality in this case unless you get your hands on that third thing.

Although I understand wanting to know the underlying reasons things happen, I have a personal preference for the finance definition, which is just plain easier to understand and test, and usually the best we can do with real world data. In my experience the most interesting questions relate to things that you can’t set up experiments for. So, for example, it’s hard to know whether blue-collar presidents would impose less elitist policy than millionaires, because we only have millionaires.

Moreover, it usually is interesting to know what you can predict for the future knowing what you know now, even if there’s no proof of causation, and not only because you can maybe make money betting on something (but that’s part of it).

Categories: data science, statistics

## Columbia Data Science course, week 6: Kaggle, crowd-sourcing, decision trees, random forests, social networks, and experimental design

Yesterday we had two guest lecturers, who took up approximately half the time each. First we welcomed William Cukierski from Kaggle, a data science competition platform.

Will went to Cornell for a B.A. in physics and to Rutgers to get his Ph.D. in biomedical engineering. He focused on cancer research, studying pathology images. While working on writing his dissertation, he got more and more involved in Kaggle competitions, finishing very near the top in multiple competitions, and now works for Kaggle. Here’s what Will had to say.

Crowd-sourcing in Kaggle

What is a data scientist? Some say it’s someone who is better at stats than an engineer and better at engineering than a statistician. But one could argue it’s actually someone who is worse at stats than a statistician. Being a data scientist is when you learn more and more about more and more until you know nothing about everything.

Kaggle uses prizes to induce the public to do stuff. This is not a new idea:

There are two kinds of crowdsourcing models. First, we have the distributive crowdsourcing model, like Wikipedia, which asks for relatively easy but large amounts of contributions. Then, there are the singular, focused, difficult problems that Kaggle, DARPA, InnoCentive and other companies specialize in.

Some of the problems with some crowdsourcing projects include:

• they don’t always evaluate your submission objectively. Instead they have a subjective measure, so they might just decide your design is bad or something. This leads to high barrier to entry, since people don’t trust the evaluation criterion.
• Also, one doesn’t get recognition until after they’ve won or ranked highly. This leads to high sunk costs for the participants.
• Also, bad competitions often conflate participants with mechanical turks: in other words, they assume you’re stupid. This doesn’t lead anywhere good.
• Also, the competitions sometimes don’t chunk the work into bite size pieces, which means it’s too big to do or too small to be interesting.

A good competition has a do-able, interesting question, with an evaluation metric which is transparent and entirely objective. The problem is given, the data set is given, and the metric of success is given. Moreover, prizes are established up front.

The participants are encouraged to submit their models up to twice a day during the competitions, which last on the order of a few days. This encourages a “leapfrogging” between competitors, where one ekes out a 5% advantage, giving others incentive to work harder. It also establishes a band of accuracy around a problem which you generally don’t have- in other words, given no other information, you don’t know if your 75% accurate model is the best possible.

The test set y’s are hidden, but the x’s are given, so you just use your model to get your predicted y’s for the test set and upload them into the Kaggle machine to see your evaluation score. This way you don’t share your actual code with Kaggle unless you win the prize (and Kaggle doesn’t have to worry about which version of python you’re running).

Note this leapfrogging effect is good and bad. It encourages people to squeeze out better performing models but it also tends to make models much more complicated as they get better. One reason you don’t want competitions lasting too long is that, after a while, the only way to inch up performance is to make things ridiculously complicated. For example, the original Netflix Prize lasted two years and the final winning model was too complicated for them to actually put into production.

The hole that Kaggle is filling is the following: there’s a mismatch between those who need analysis and those with skills. Even though companies desperately need analysis, they tend to hoard data; this is the biggest obstacle for success.

They have had good results so far. Allstate, with a good actuarial team, challenged their data science competitors to improve their actuarial model, which, given attributes of drivers, approximates the probability of a car crash. The 202 competitors improved Allstate’s internal model by 271%.

There were other examples, including one where the prize was $1,000 and it benefited the company $100,000.

A student then asked, is that fair? There are actually two questions embedded in that one. First, is it fair to the data scientists working at the companies that engage with Kaggle? Some of them might lose their job, for example. Second, is it fair to get people to basically work for free and ultimately benefit a for-profit company? Does it result in data scientists losing their fair market price?

Of course Kaggle charges a fee for hosting competitions, but is it enough?

[Mathbabe interjects her view: personally, I suspect this is a model which seems like an arbitrage opportunity for companies but only while the data scientists of the world haven't realized their value and have extra time on their hands. As soon as they price their skills better they'll stop working for free, unless it's for a cause they actually believe in.]

Facebook is hiring data scientists; they hosted a Kaggle competition where the prize was an interview. There were 422 competitors.

[Mathbabe can't help but insert her view: it's a bit too convenient for Facebook to have interviewees for data science positions in such a posture of gratitude for the mere interview. This distracts them from asking hard questions about what the data policies are and the underlying ethics of the company.]

There’s a final project for the class, namely an essay grading contest. The students will need to build it, train it, and test it, just like any other Kaggle competition. Group work is encouraged.

Thought Experiment: What are the ethical implications of a robo-grader?

Some of the students’ thoughts:

• It depends on how much you care about your grade.
• Actual human graders aren’t fair anyway.
• Is this the wrong question? The goal of a test is not to write a good essay but rather to do well in a standardized test. The real profit center for standardized testing is, after all, to sell books to tell you how to take the tests. It’s a screening, you follow the instructions, and you get a grade depending on how well you follow instructions.
• There are really two questions: 1) Is it wise to move from the human to the machine version of the same thing for any given thing? and 2) Are machines making things more structured, and is this inhibiting creativity? One thing is for sure, robo-grading prevents me from being compared to someone more creative.
• People want things to be standardized. It gives us a consistency that we like. People don’t want artistic cars, for example.
• Will: We used machine learning to research cancer, where the stakes are much higher. In fact this whole field of data science has to be thinking about these ethical considerations sooner or later, and I think it’s sooner. In the case of doctors, you could give the same doctor the same slide two months apart and get different diagnoses. We aren’t consistent ourselves, but we think we are. Let’s keep that in mind when we talk about the “fairness” of using machine learning algorithms in tricky situations.

Introduction to Feature Selection

“Feature extraction and selection are the most important but underrated step of machine learning. Better features are better than better algorithms.” – Will

“We don’t have better algorithms, we just have more data” -Peter Norvig

Will claims that Norvig really wanted to say we have better features.

We are getting bigger and bigger data sets, but that’s not always helpful. The danger is if the number of features is larger than the number of samples or if we have a sparsity problem.

We improve our feature selection process to try to improve performance of predictions. A criticism of feature selection is that it’s no better than data dredging. If we just take whatever answer we get that correlates with our target, that’s not good.

There’s a well known bias-variance tradeoff: a model is “high bias” if it’s too simple (the features aren’t encoding enough information). In this case lots more data doesn’t improve your model. On the other hand, if your model is too complicated, then “high variance” leads to overfitting. In this case you want to reduce the number of features you are using.

We will take some material from a famous paper by Isabelle Guyon published in 2003 entitled “An Introduction to Variable and Feature Selection”.

There are three categories of feature selection methods: filters, wrappers, and embedded methods. Filters order variables (i.e. possible features) with respect to some ranking (e.g. correlation with target). This is sometimes good on a first pass over the space of features. Filters take account of the predictive power of individual features, and estimate mutual information or what have you. However, the problem with filters is that you get correlated features. In other words, the filter doesn’t care about redundancy.
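As a rough illustration of a filter, the sketch below ranks features by the absolute value of their Pearson correlation with the target. The function names and the toy columns are made up; correlation is just one possible ranking, as noted above. Notice that the redundant copy of a feature scores exactly as well as the original, which is the redundancy problem in miniature.

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def filter_rank(features, target):
    """features: dict name -> list of values. Returns names, best first."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy data (made up).
features = {
    "a": [1, 2, 3, 4, 5],
    "a_copy": [2, 4, 6, 8, 10],     # redundant with "a"
    "noise": [3, 1, 4, 1, 5],
}
target = [1.1, 1.9, 3.2, 3.9, 5.1]

ranking = filter_rank(features, target)
print(ranking)
```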

This isn’t always bad and it isn’t always good. On the one hand, two redundant features can be more powerful when they are both used, and on the other hand something that appears useless alone could actually help when combined with another possibly useless-looking feature.

Wrapper feature selection tries to find subsets of features that will do the trick. However, as anyone who has studied the binomial coefficients knows, the number of possible size $k$ subsets of $n$ things, written ${n \choose k}$, grows exponentially. So there’s a nasty opportunity for overfitting by doing this. Most subset methods are capturing some flavor of minimum-redundancy-maximum-relevance. So, for example, we could have a greedy algorithm which starts with the best feature, takes a few more highly ranked, removes the worst, and so on. This is a hybrid approach with a filter method.
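Here is a small sketch of the greedy idea. It only does the "add the best remaining feature" half (it never removes a feature), and the scoring function is a made-up stand-in for "train a model on this subset and see how it does": relevance minus a penalty for redundant pairs. All names and numbers are hypothetical.

```python
from math import comb

# How many subsets would an exhaustive wrapper have to try?
print(comb(100, 5))     # size-5 subsets of 100 features

def greedy_forward(candidates, score, k):
    """Grow a feature subset one feature at a time, keeping the best addition."""
    chosen = []
    while len(chosen) < k:
        remaining = [f for f in candidates if f not in chosen]
        if not remaining:
            break
        best = max(remaining, key=lambda f: score(chosen + [f]))
        if score(chosen + [best]) <= score(chosen):
            break                       # no candidate improves the score
        chosen.append(best)
    return chosen

# Made-up relevance and redundancy numbers, for illustration only.
relevance = {"a": 0.9, "a_copy": 0.9, "b": 0.5, "c": 0.1}
redundancy = {frozenset(("a", "a_copy")): 0.9}

def toy_score(subset):
    total = sum(relevance[f] for f in subset)
    for i, f in enumerate(subset):
        for g in subset[i + 1:]:
            total -= redundancy.get(frozenset((f, g)), 0.0)
    return total

chosen = greedy_forward(list(relevance), toy_score, k=2)
print(chosen)
```

With these toy numbers the redundant copy is skipped in favor of a less relevant but complementary feature, which is the minimum-redundancy-maximum-relevance flavor in action.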

We don’t have to retrain models at each step of such an approach, because there are fancy ways to see how the objective function changes as we change the subset of features we are trying out. These are called “finite differences” and rely essentially on Taylor Series expansions of the objective function.

One last word: if you have a domain expert on hand, don’t go into the machine learning rabbit hole of feature selection unless you’ve tapped into your expert completely!

Decision Trees

We’ve all used decision trees. They’re easy to understand and easy to use. How do we construct them? Choosing a feature to pick at each step is like playing 20 questions. We take whatever the most informative thing is first. For the sake of this discussion, assume we break compound questions into multiple binary questions, so the answer is “+” or “-”.

To quantify “what is the most informative feature”, we first define entropy for a random variable $X$ to mean:

$H(X) = - p(x_+) \log_2(p(x_+)) - p(x_-) \log_2(p(x_-)).$

Note when $p(x_*) = 0,$ we define the term to vanish. This is consistent with the fact that

$\lim_{t\to 0} t \log(t) = 0.$

In particular, if either option has probability zero, the entropy is 0. For binary variables it is maximized when $p(x_+) = 0.5$:

which we can easily compute using the fact that in the binary case, $p(x_+) = 1- p(x_-)$ and a bit of calculus.
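For anyone who would rather check this numerically than do the calculus, here is a tiny sketch of the binary entropy, with the zero-probability term defined to vanish as above. The function name is made up.

```python
from math import log2

def binary_entropy(p):
    """H(X) for a binary variable with p(x_+) = p and p(x_-) = 1 - p."""
    total = 0.0
    for q in (p, 1.0 - p):
        if q > 0:                   # the term is defined to vanish at q = 0
            total -= q * log2(q)
    return total

for p in (0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    print(p, binary_entropy(p))
```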

Using this definition, we define the information gain for a given feature as the entropy we lose if we know the value of that feature.

To make a decision tree, then, we want to maximize information gain, and make a split on that. We keep going until all the points at the end are in the same class or we end up with no features left. In this case we take the majority vote. Optionally we prune the tree to avoid overfitting.
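As a sketch of how the split gets chosen, the code below computes the information gain of a feature as the entropy of the labels minus the weighted entropy of the labels within each value of the feature. The function names and the four-row toy data set are made up for illustration.

```python
from math import log2

def entropy(labels):
    """Entropy of a list of "+" / "-" labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(1 for y in labels if y == "+") / n
    return -sum(q * log2(q) for q in (p, 1 - p) if q > 0)

def information_gain(feature, labels):
    """Entropy lost once we know the value of the feature."""
    n = len(labels)
    gain = entropy(labels)
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        gain -= len(subset) / n * entropy(subset)
    return gain

# Toy data (made up): one feature separates the classes, the other doesn't.
labels = ["+", "+", "-", "-"]
perfect = [1, 1, 0, 0]
useless = [1, 0, 1, 0]

print(information_gain(perfect, labels))
print(information_gain(useless, labels))
```

A tree builder would compute this for every remaining feature at a node and split on the one with the largest gain.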

This is an example of an embedded feature selection algorithm. We don’t need to use a filter here because the “information gain” method is doing our feature selection for us.

How do you handle continuous variables?

In the case of continuous variables, you need to ask for the correct threshold of value so that it can be thought of as a binary variable. So you could partition a user’s spend into “less than $5” and “at least $5” and you’d be getting back to the binary variable case. In this case it takes some extra work to decide on the information gain because it depends on the threshold as well as the feature.
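One way to do that extra work is to try a handful of candidate thresholds and keep the one with the largest information gain. The sketch below uses the midpoints between consecutive distinct values as candidates, which is an assumption of this example rather than something from the lecture; the names and the toy spend data are made up too.

```python
from math import log2

def entropy(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(1 for y in labels if y == "+") / n
    return -sum(q * log2(q) for q in (p, 1 - p) if q > 0)

def information_gain(feature, labels):
    n = len(labels)
    gain = entropy(labels)
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        gain -= len(subset) / n * entropy(subset)
    return gain

def best_threshold(values, labels):
    """Return (gain, threshold) for the best "at least t" binary split."""
    candidates = sorted(set(values))
    best = (0.0, None)
    for lo, hi in zip(candidates, candidates[1:]):
        t = (lo + hi) / 2                       # midpoint between neighbors
        split = [v >= t for v in values]        # back to a binary variable
        gain = information_gain(split, labels)
        if gain > best[0]:
            best = (gain, t)
    return best

# Toy data (made up): dollars spent and whether the user converted.
spend = [1, 2, 4, 6, 8, 9]
labels = ["-", "-", "-", "+", "+", "+"]

gain, threshold = best_threshold(spend, labels)
print(gain, threshold)
```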

Random Forests

Random forests are cool. They incorporate “bagging” (bootstrap aggregating) and trees to make stuff better. Plus they’re easy to use: you just need to specify the number of trees you want in your forest, as well as the number of features to randomly select at each node.

A bootstrap sample is a sample with replacement, which we usually take to be 80% of the actual data, but of course can be adjusted depending on how much data we have.

To construct a random forest, we construct a bunch of decision trees (we decide how many). For each tree, we take a bootstrap sample of our data, and for each node we randomly select (a second point of bootstrapping actually) a few features, say 5 out of the 100 total features. Then we use our entropy-information-gain engine to decide which among those features we will split our tree on, and we keep doing this, choosing a different set of five features for each node of our tree.
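The two random ingredients can be sketched in a few lines. This is not a full random forest, just the sampling steps, using the 80% convention and the "5 out of 100" example from above; the function names are made up.

```python
import random

def bootstrap_sample(rows, fraction=0.8, rng=random):
    """Sample with replacement, sized as a fraction of the data."""
    size = int(len(rows) * fraction)
    return [rng.choice(rows) for _ in range(size)]

def candidate_features(num_features, per_node=5, rng=random):
    """Pick the few features a single node is allowed to split on."""
    return rng.sample(range(num_features), per_node)

rng = random.Random(0)              # seeded so the run is repeatable
rows = list(range(1000))            # stand-in for 1000 data points

sample = bootstrap_sample(rows, rng=rng)        # one of these per tree
features = candidate_features(100, rng=rng)     # one of these per node

print(len(sample), len(set(sample)))    # fewer distinct rows than draws
print(features)
```

Each tree would then be grown on its own `sample`, calling `candidate_features` afresh at every node and splitting on whichever of those features has the best information gain.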

Note we could decide beforehand how deep the tree should get, but we typically don’t prune the trees, since a great feature of random forests is that it incorporates idiosyncratic noise.

Here’s what a decision tree looks like for surviving on the Titanic.

David Huffaker, Google: Hybrid Approach to Social Research

David is one of Rachel’s collaborators at Google. They started with complementary skill sets, and an explosion of goodness ensued when they were put together to work on Google+ with a bunch of other people, especially engineers. David brings a social scientist perspective to the analysis of social networks. He’s strong in quantitative methods for understanding and analyzing online social behavior. He got a Ph.D. in Media, Technology, and Society from Northwestern.

Google does a good job of putting people together. They blur the lines between research and development. The researchers are embedded on product teams. The work is iterative, and the engineers on the team strive to have near-production code from day 1 of a project. They leverage cloud infrastructure to deploy experiments to their mass user base and to rapidly deploy a prototype at scale.

Note that, considering the scale of Google’s user base, redesigning as they scale up is not a viable option. They instead do experiments with smaller groups of users.

David suggested that we, as data scientists, consider how to move into an experimental design so as to move to a causal claim between variables rather than a descriptive relationship. In other words, to move from the descriptive to the predictive.

As an example, he talked about the genesis of the “circle of friends” feature of Google+. They know people want to selectively share; they’ll send pictures to their family, whereas they’d probably be more likely to send inside jokes to their friends. They came up with the idea of circles, but it wasn’t clear if people would use them. How do they answer the question: will they use circles to organize their social network? It’s important to know what motivates them when they decide to share.

They took a mixed-method approach, so they used multiple methods to triangulate on findings and insights. Given a random sample of 100,000 users, they set out to determine the popular names and categories of names given to circles. They identified 168 active users who filled out surveys and they had longer interviews with 12.

They found that the majority were engaging in selective sharing, that most people used circles, that the circle names were most often work-related or school-related, and that they had elements of a strong-link (“epic bros”) or a weak-link (“acquaintances from PTA”).

They asked the survey participants why they share content. The answers primarily came in three categories: first, the desire to share about oneself – personal experiences, opinions, etc. Second, discourse: people wanna participate in a conversation. Third, evangelism: people wanna spread information.

Next they asked participants why they choose their audiences. Again, three categories: first, privacy – many people were public or private by default. Second, relevance – they wanted to share only with those who may be interested, and they don’t wanna pollute other people’s data stream. Third, distribution – some people just want to maximize their potential audience.

The takeaway from this study was this: people do enjoy selectively sharing content, depending on context, and the audience. So we have to think about designing features for the product around content, context, and audience.

Network Analysis

We can use large data and look at connections between actors like a graph. For Google+, the users are the nodes and the edges (directed) are “in the same circle”.

Other examples of networks:

After you define and draw a network, you can hopefully learn stuff by looking at it or analyzing it.

As you may have noticed, “social” is a layer across all of Google. Search now incorporates this layer: if you search for something you might see that your friend “+1”’ed it. This is called a social annotation. It turns out that people care more about annotation when it comes from someone with domain expertise rather than someone you’re very close to. So you might care more about the opinion of a wine expert at work than the opinion of your mom when it comes to purchasing wine.

Note that sounds obvious but if you started the other way around, asking who you’d trust, you might start with your mom. In other words, “close ties,” even if you can determine those, are not the best feature to rank annotations. But that begs the question, what is? Typically in a situation like this we use click-through rate, or how long it takes to click.

In general we need to always keep in mind a quantitative metric of success. This defines success for us, so we have to be careful.

Privacy

Human facing technology has thorny issues of privacy which makes stuff hard. We took a survey of how people felt uneasy about content. We asked, how does it affect your engagement? What is the nature of your privacy concerns?

Turns out there’s a strong correlation between privacy concern and low engagement, which isn’t surprising. It’s also related to how well you understand what information is being shared, and the question of when you post something, where does it go and how much control do you have over it. When you are confronted with a huge pile of complicated settings, you tend to start feeling passive.

Again, we took a survey and found broad categories of concern as follows:

identity theft

• financial loss

digital world

• really private stuff I searched on
• unwanted spam
• provocative photo (oh shit my boss saw that)
• unwanted solicitation
• unwanted ad targeting

physical world

• offline threats
• harm to my family
• stalkers
• employment risks
• hassle

What is the best way to decrease concern and increase understanding and control?

Possibilities:

• Write and post a manifesto of your data policy (tried that, nobody likes to read manifestos)
• Educate users on our policies a la the Netflix feature “because you liked this, we think you might like this”
• Get rid of all stored data after a year

Rephrase: how do we design settings to make it easier for people? How do you make it transparent?

• make a picture or graph of where data is going.
• give people a privacy switchboard
• give people access to quick settings
• make the settings you show them categorized by things you don’t have a choice about vs. things you do
• make reasonable default settings so people don’t have to worry about it.

David left us with these words of wisdom: as you move forward and have access to big data, you really should complement them with qualitative approaches. Use mixed methods to come to a better understanding of what’s going on. Qualitative surveys can really help.

## Suresh Naidu: analyzing the language of political partisanship

I was lucky enough to attend Suresh Naidu‘s lecture last night on his recent work analyzing congressional speeches with co-authors Jacob Jensen, Ethan Kaplan, and Laurence Wilse-Samson.

Namely, along with his co-authors, he found popular three-word phrases, measured and ranked their partisanship (by how often a democrat uttered the phrase versus a republican), and measured the extent to which those phrases were being used in the public discussion before congress started using them or after congress started using them.

Note this means that phrases that were uttered often by both parties were ignored. Only phrases that were uttered more by one party than the other like “free market system” were counted. Also, the words were reduced to their stems and small common words were ignored, so the phrase “united states of america” was reduced to “unite.state.america”. So if parties were talking about the same issue but insisted on using certain phrases (“death tax” for example), then it would show up. This certainly jibes with my sense of how partisanship is established by politicians, and for the sake of the paper it can be taken to be the definition.
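To get a feel for the mechanics, here is a toy sketch of counting three-word phrases by party. It is not the authors’ method: it drops a tiny made-up list of common words but does no stemming (so it would produce “united.states.america” rather than “unite.state.america”), and the score, which runs from +1 for a phrase used only by democrats to -1 for one used only by republicans, is an assumption of this example. The speeches are invented.

```python
from collections import Counter

STOP_WORDS = {"of", "the", "a", "an", "and", "to", "in"}    # tiny illustrative list

def phrases(speech, n=3):
    """Dot-joined n-word phrases after dropping small common words."""
    words = [w for w in speech.lower().split() if w not in STOP_WORDS]
    return [".".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def partisanship(dem_speeches, rep_speeches):
    """Score each phrase in [-1, 1]; 0 means both parties use it equally."""
    dem = Counter(p for s in dem_speeches for p in phrases(s))
    rep = Counter(p for s in rep_speeches for p in phrases(s))
    scores = {}
    for p in set(dem) | set(rep):
        scores[p] = (dem[p] - rep[p]) / (dem[p] + rep[p])
    return scores

# Invented speeches, for illustration only.
dem_speeches = ["protect the united states of america"]
rep_speeches = ["protect the united states of america",
                "repeal the death tax now",
                "repeal the death tax now"]

scores = partisanship(dem_speeches, rep_speeches)
for phrase, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(phrase, score)
```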

The first data set he used was a digitized version of all of the speeches from the House since the end of the Civil War, which was also the beginning of the “two-party” system as we know it. Third party politicians were ignored. The proxy for “the public discussion” was taken from Google Book N-grams. It consists of books that were published in English in a given year.

Some of the conclusions that I can remember are as follows:

1. The three-word phrases themselves are a super interesting data set; their prevalence, how they move from one side of the aisle to the other over time, and what they discuss (so for example, they don’t discuss international issues that much – which doesn’t mean the politicians don’t discuss international issues, but that it’s not a particularly partisan issue or at least their language around this issue is similar).
2. When the issue is economic and highly partisan, it tends to show up “in the public” via Google Books before it shows up in Congress. Which is to say, there’s been a new book written by some economist, presumably, who introduces language into the public discussion that later gets picked up by Congress.
3. When the issue is non-economic or only somewhat partisan, it tends to show up in Congress before or at the same time as in the public domain. Members of Congress seem to feel comfortable making up their own phrases and repeating them in such circumstances.

So the cult of the economic expert has been around for a while now.

Suresh and his crew also made an overall measurement of the partisanship of a given 2-year session of congress. It was interesting to discuss how this changed over time, and how having large partisanship, in terms of language, did not necessarily correlate with having stalemate congresses. Indeed if I remember correctly, a moment of particularly high partisanship, as defined above via language, was during the time the New Deal was passed.

Also, as we also discussed (it was a lively audience), language may be a marker of partisan identity without necessarily pointing to underlying ideological differences. For example, the phrase “Martin Luther King” has been ranked high as a partisan democratic phrase since the civil rights movement but then again it’s customary (I’ve been told) for democrats to commemorate MLK’s birthday, but not for republicans to do so.

Given their speech, this analysis did a good job identifying which party a politician belonged to, but the analysis was not causal in the sense of time: we needed to know the top partisan phrases of that session of Congress to be able to predict the party of a given politician. Indeed the “top phrases” changed so quickly that the predictive power may be mostly lost between sessions.

Not that this is a big deal, since of course we know what party a politician is from, but it would be interesting to use this as a measure of how radical or centered a given politician is or will be.

Even if you aren’t interested in the above results and discussion, the methodology is very cool. Suresh and his co-authors view text as its own data set and analyze it as such.

And after all, the words historical politicians spoke are what we have on record – we can’t look into their brains and see what they were thinking. It’s of course interesting and important to have historians (domain experts) inform the process as well, e.g. for the “Martin Luther King” phrase above, but barring expert knowledge this is lots better than nothing. One thing it tells us, just in case we didn’t study political history, is that we’ve seen way worse partisanship in the past than we see now, although things have consistently been getting worse since the 1980′s.

Here’s a wordcloud from the 2007 session; blue and red are what you think, and bigger means more partisan:

## Columbia Data Science course, week 5: GetGlue, time series, financial modeling, advanced regression, and ethics

I was happy to be giving Rachel Schutt’s Columbia Data Science course this week, where I discussed time series, financial modeling, and ethics. I blogged previous classes here.

The first few minutes of class were for a case study with GetGlue, a New York-based start-up that won the mashable breakthrough start-up of the year in 2011 and is backed by some of the VCs that also fund big names like Tumblr, etsy, foursquare, etc. GetGlue is part of the social TV space. Lead Scientist, Kyle Teague, came to tell the class a little bit about GetGlue, and some of what he worked on there. He also came to announce that GetGlue was giving the class access to a fairly large data set of user check-ins to tv shows and movies. Kyle’s background is in electrical engineering, he placed in the 2011 KDD cup (which we learned about last week from Brian), and he started programming when he was a kid.

GetGlue’s goal is to address the problem of content discovery within the movie and tv space, primarily. The usual model for finding out what’s on TV is the 1950′s TV Guide schedule, and that’s still how we’re supposed to find things to watch. There are thousands of channels and it’s getting increasingly difficult to find out what’s good on. GetGlue wants to change this model, by giving people personalized TV recommendations and personalized guides. There are other ways GetGlue uses Data Science but for the most part we focused on how the recommendation system works.

Users “check-in” to tv shows, which means they can tell people they’re watching a show. This creates a time-stamped data point. They can also do other actions such as like, or comment on the show. So this is a 3-tuple: {user, action, object} where the object is a tv show or movie. This induces a bi-partite graph. A bi-partite graph or network contains two types of nodes: users and tv shows. Edges exist between users and tv shows, but not between users and users or tv shows and tv shows. So Bob and Mad Men are connected because Bob likes Mad Men, and Sarah and Mad Men and Lost are connected because Sarah liked Mad Men and Lost. But Bob and Sarah aren’t connected, nor are Mad Men and Lost. A lot can be learned from this graph alone.
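As a small illustration of what can be read off the graph alone, the sketch below builds the bi-partite graph from {user, action, object} tuples using the Bob and Sarah example, and then finds the users two hops away from a given user, i.e. the people who share at least one show. The variable and function names are made up, and this is not GetGlue’s actual data format.

```python
from collections import defaultdict

# {user, action, object} tuples from the example above.
events = [
    ("Bob", "like", "Mad Men"),
    ("Sarah", "like", "Mad Men"),
    ("Sarah", "like", "Lost"),
]

# Edges only ever go between a user and a show.
shows_by_user = defaultdict(set)
users_by_show = defaultdict(set)
for user, action, obj in events:
    shows_by_user[user].add(obj)
    users_by_show[obj].add(user)

def co_watchers(user):
    """Users two hops away: people who share at least one show."""
    others = set()
    for show in shows_by_user[user]:
        others |= users_by_show[show]
    return others - {user}

print(co_watchers("Bob"))
print(sorted(shows_by_user["Sarah"]))
```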

But GetGlue finds ways to create edges between users and between objects (tv shows or movies). Users can follow each other or be friends on GetGlue, and also GetGlue can learn that two people are similar [do they do this?]. GetGlue also hires human evaluators to make connections or directional edges between objects. So True Blood and Buffy the Vampire Slayer might be similar for some reason and so the humans create an edge in the graph between them. There were nuances around the edge being directional. They may draw an arrow pointing from Buffy to True Blood but not vice versa, for example, so their notion of “similar” or “close” captures both content and popularity. (That’s a made-up example.) Pandora does something like this too.

Another important aspect is time. The user checked-in or liked a show at a specific time, so the 3-tuple extends to a 4-tuple with a time-stamp: {user, action, object, timestamp}. This is essentially the data set the class has access to, although it’s slightly more complicated and messy than that. Their first assignment with this data will be to explore it, try to characterize it and understand it, gain intuition around it and visualize what they find.

Students in the class asked him about the value of formal education in becoming a data scientist (do you need one? Kyle’s time spent doing signal processing in research labs was valuable, but so was his time spent coding for fun as a kid), what would be messy about a data set and why (often bugs in the code), how they would know (their QA, and values that don’t make sense), what language he uses to prototype algorithms (python), and how he knows his algorithm is good.

Then it was my turn. I started out with my data scientist profile:

As you can see, I feel like I have the most weakness in CS. Although I can use python pretty proficiently, and in particular I can scrape and parse data, prototype models, and use matplotlib to draw pretty pictures, I am no java map-reducer and I bow down to those people who are. I am also completely untrained in data visualization but I know enough to get by and give presentations that people understand.

Thought Experiment

I asked the students the following question:

What do you lose when you think of your training set as a big pile of data and ignore the timestamps?

They had some pretty insightful comments. One thing they mentioned off the bat is that you won’t know cause and effect if you don’t have any sense of time. Of course that’s true but it’s not quite what I meant, so I amended the question to allow you to collect relative time differentials, so “time since user last logged in” or “time since last click” or “time since last insulin injection”, but not absolute timestamps.

What I was getting at, and what they came up with, was that when you ignore the passage of time through your data, you ignore trends altogether, as well as seasonality. So for the insulin example, you might note that 15 minutes after your insulin injection your blood sugar goes down consistently, but you might not notice an overall trend of your rising blood sugar over the past few months if your dataset for the past few months has no absolute timestamp on it.

This idea, of keeping track of trends and seasonalities, is very important in financial data, and essential to keep track of if you want to make money, considering how small the signals are.

How to avoid overfitting when you model with time series

After discussing seasonality and trends in the various financial markets, we started talking about how to avoid overfitting your model.

Specifically, I started out with having a strict concept of in-sample (IS) and out-of-sample (OOS) data. Note the OOS data is not meant as testing data; testing all happens inside the IS data. The OOS data is meant to be the data you use after finalizing your model so that you have some idea how the model will perform in production.

Next, I discussed the concept of causal modeling. Namely, we should never use information in the future to predict something now. Similarly, when we have a set of training data, we don’t know the “best fit coefficients” for that training data until after the last timestamp on all the data. As we move forward in time from the first timestamp to the last, we expect to get different sets of coefficients as more events happen.

One consequence of this is that, instead of getting one set of coefficients, we actually get an evolution of each coefficient. This is helpful because it gives us a sense of how stable those coefficients are. In particular, if one coefficient has changed sign 10 times over the training set, then we expect a good estimate for it is zero, not the so-called “best fit” at the end of the data.
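Here is a rough sketch of what “an evolution of each coefficient” could look like for the simplest case, a univariate regression refit at each time step using only the data seen so far. The function names and the toy data are assumptions for illustration, not the code used in class.

```python
def expanding_slopes(x, y, min_points=3):
    """Slope of y on x (with intercept), refit using only data up to each time t."""
    slopes = []
    for t in range(min_points, len(x) + 1):
        xs, ys = x[:t], y[:t]
        mx, my = sum(xs) / t, sum(ys) / t
        sxx = sum((a - mx) ** 2 for a in xs)
        sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
        slopes.append(sxy / sxx)
    return slopes

def sign_changes(values):
    """How many times consecutive (nonzero) estimates flip sign."""
    signs = [v > 0 for v in values if v != 0]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [0.5, -0.4, 0.3, -0.6, 0.2, -0.1, 0.4, -0.3]  # made-up noisy data
path = expanding_slopes(x, y)
print(path)                # one slope estimate per time step
print(sign_changes(path))  # lots of flips suggest the coefficient is really about zero
```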

One last word on causal modeling and IS/OOS. It is consistent with production code. Namely, you are always acting, in the training and in the OOS simulation, as if you’re running your model in production and you’re seeing how it performs. Of course you fit your model in sample, so you expect it to perform better there than in production.

Another way to say this is that, once you have a model in production, you will have to make decisions about the future based only on what you know now (so it’s causal) and you will want to update your model whenever you gather new data. So your coefficients of your model are living organisms that continuously evolve.

Submodels of Models

We often “prepare” the data before putting it into a model. Typically the way we prepare it has to do with the mean or the variance of the data, or sometimes the log (and then the mean or the variance of that transformed data).

But to be consistent with the causal nature of our modeling, we need to make sure our running estimates of mean and variance are also causal. Once we have causal estimates of our mean $\overline{y}$ and variance $\sigma_y^2$, we can normalize the next data point with these estimates, just like we do to get from a general gaussian distribution to the standard normal distribution:

$y \mapsto \frac{y - \overline{y}}{\sigma_y}$
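As a sketch of what “causal” means in practice, the following normalizes each point using only the mean and variance of strictly earlier points. The `warmup` parameter and the choice to return `None` until there is enough history are my assumptions.

```python
import math

def causal_zscores(values, warmup=2):
    """Normalize each point with the mean/variance of strictly earlier points only."""
    out = []
    n, total, total_sq = 0, 0.0, 0.0
    for y in values:
        if n >= warmup:
            mean = total / n
            var = total_sq / n - mean ** 2
            out.append((y - mean) / math.sqrt(var) if var > 0 else None)
        else:
            out.append(None)  # not enough history yet
        # Only now does the current point enter the running estimates.
        n += 1
        total += y
        total_sq += y * y
    return out

print(causal_zscores([1.0, 3.0, 5.0]))  # [None, None, 3.0]
```

The check that you aren’t cheating is simple: appending future data must never change a score you already computed.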

Of course we may have other things to keep track of as well to prepare our data, and we might run other submodels of our model. For example we may choose to consider only the “new” part of something, which is equivalent to trying to predict something like $y_t - y_{t-1}$ instead of $y_t.$ Or we may train a submodel to figure out what part of $y_{t-1}$ predicts $y_t,$ so a submodel which is a univariate regression or something.

There are lots of choices here, but the point is it’s all causal, so you have to be careful when you train your overall model how to introduce your next data point and make sure the steps are all in order of time, and that you’re never ever cheating and looking ahead in time at data that hasn’t happened yet.

Financial time series

In finance we consider returns, say daily. And it’s not percent returns, actually it’s log returns: if $F_t$ denotes a close on day $t,$ then the return that day is defined as $\log(F_t/F_{t-1}).$ See more about this here.
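A quick sketch of the definition, with made-up closing levels. One reason to like log returns is that they add up across days, which the last line checks.

```python
import math

def log_returns(closes):
    """Daily log returns log(F_t / F_{t-1}) from a list of closing levels."""
    return [math.log(today / prev) for prev, today in zip(closes, closes[1:])]

closes = [100.0, 110.0, 99.0]  # hypothetical closes, not real S&P data
returns = log_returns(closes)
print(returns)
# The daily log returns sum to the log return over the whole period.
print(abs(sum(returns) - math.log(closes[-1] / closes[0])) < 1e-12)  # True
```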

So if you start with S&P closing levels:

Then you get the following log returns:

What’s that mess? It’s crazy volatility caused by the financial crisis. We sometimes (not always) want to account for that volatility by normalizing with respect to it (described above). Once we do that we get something like this:

Which is clearly better behaved. Note this process is discussed in this post.

We could also normalize with respect to the mean, but we typically assume the mean of daily returns is 0, so as to not bias our models on short term trends.

Financial Modeling

One thing we need to understand about financial modeling is that there’s a feedback loop. If you find a way to make money, it eventually goes away- sometimes people refer to this as the fact that the “market learns over time”.

One way to see this is that, in the end, your model comes down to knowing some price is going to go up in the future, so you buy it before it goes up, you wait, and then you sell it at a profit. But if you think about it, your buying it has actually changed the process, and decreased the signal you were anticipating. That’s how the market learns – it’s a combination of a bunch of algorithms anticipating things and making them go away.

The consequence of this learning over time is that the existing signals are very weak. We are happy with a 3% correlation for models that have a horizon of 1 day (a “horizon” for your model is how long you expect your prediction to be good). This means not much signal, and lots of noise! In particular, lots of the machine learning “metrics of success” for models, such as measurements of precision or accuracy, are not very relevant in this context.

So instead of measuring accuracy, we generally draw a picture to assess models, namely of the (cumulative) PnL of the model. This generalizes to any model as well- you plot the cumulative sum of the product of demeaned forecast and demeaned realized. In other words, you see if your model consistently does better than the “stupidest” model of assuming everything is average.

If you plot this and you drift up and to the right, you’re good. If it’s too jaggedy, that means your model is taking big bets and isn’t stable.
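The picture described above can be computed in a few lines. This is only a sketch: the numbers are invented, and for brevity it demeans with full-sample means, which is not causal; in a real setup you would swap in running means.

```python
from itertools import accumulate

def cumulative_pnl(forecast, realized):
    """Cumulative sum of (demeaned forecast) * (demeaned realized)."""
    f_mean = sum(forecast) / len(forecast)
    r_mean = sum(realized) / len(realized)
    products = [(f - f_mean) * (r - r_mean) for f, r in zip(forecast, realized)]
    return list(accumulate(products))

forecast = [1, -1, 1, -1]   # made-up forecasts
realized = [2, -2, 1, -1]   # made-up outcomes
print(cumulative_pnl(forecast, realized))  # [2.0, 4.0, 5.0, 6.0]
```

Plot the resulting list against time (with matplotlib, say) and look for the drift up and to the right.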

Why regression?

From above we know the signal is weak. If you imagine there’s some complicated underlying relationship between your information and the thing you’re trying to predict, get over knowing what that is – there’s too much noise to find it. Instead, think of the function as possibly complicated, but continuous, and imagine you’ve written it out as a Taylor Series. Then you can’t possibly expect to get your hands on anything but the linear terms.

Don’t think about using logistic regression, either, because you’d need to be ignoring size, which matters in finance- it matters if a stock went up 2% instead of 0.01%. But logistic regression forces you to have an on/off switch, which would be possible but would lose a lot of information. Considering the fact that we are always in a low-information environment, this is a bad idea.

Note that although I’m claiming you probably want to use linear regression in a noisy environment, the actual terms themselves don’t have to be linear in the information you have. You can always take products of various terms as x’s in your regression, but you’re still fitting a linear model in non-linear terms.

The first thing I need to explain is the exponential downweighting of old data, which I already used in a graph above, where I normalized returns by volatility with a decay of 0.97. How do I do this?

Working from this post again, the formula is given by essentially a weighted version of the normal one, where I weight recent data more than older data, and where the weight of older data is a power of some parameter $s$ which is called the decay. The exponent is the number of time intervals since that data was new. Putting that together, the formula we get is:

$V_{old} = (1-s) \cdot \sum_i r_i^2 s^i.$

We are actually dividing by the sum of the weights, but the weights are powers of some number s, so it’s a geometric sum and the sum is given by $1/(1-s).$

One cool consequence of this formula is that it’s easy to update: if we have a new return $r_0$ to add to the series, then it’s not hard to show we just want

$V_{new} = s \cdot V_{old} + (1-s) \cdot r_0^2.$

In fact this is the general rule for updating exponential downweighted estimates, and it’s one reason we like them so much- you only need to keep in memory your last estimate and the number $s.$
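To see that the update rule really does agree with the weighted sum, here is a small check with made-up returns. I’m assuming index 0 is the newest return, so each older return picks up one more power of $s.$

```python
def ew_variance_direct(returns_newest_first, s):
    """V = (1 - s) * sum_i r_i^2 * s^i, with i = 0 for the newest return."""
    return (1 - s) * sum(r * r * s ** i for i, r in enumerate(returns_newest_first))

def ew_variance_update(v_old, r_new, s):
    """One-step update: V_new = s * V_old + (1 - s) * r_new^2."""
    return s * v_old + (1 - s) * r_new * r_new

s = 0.97
old_returns = [0.01, -0.02, 0.015]  # hypothetical, newest first
r_new = 0.03

v_old = ew_variance_direct(old_returns, s)
v_updated = ew_variance_update(v_old, r_new, s)
v_recomputed = ew_variance_direct([r_new] + old_returns, s)
print(abs(v_updated - v_recomputed) < 1e-15)  # True
```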

How do you choose your decay length? This is an art instead of a science, and depends on the domain you’re in. Think about how many days (or time periods) it takes to weight a data point at half of a new data point, and compare that to how fast the market forgets stuff.
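For the “half of a new data point” question there is a closed form: a point that is $n$ periods old gets weight $s^n$ relative to a new one, so you solve $s^n = 1/2.$ A tiny sketch (the function name is mine):

```python
import math

def half_life(s):
    """Number of periods n after which the weight s**n has dropped to 1/2."""
    return math.log(0.5) / math.log(s)

print(round(half_life(0.97), 1))  # 22.8
```

So the decay of 0.97 used above means a return from about a month of trading days ago counts half as much as today’s.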

This downweighting of old data is an example of inserting a prior into your model, where here the prior is “new data is more important than old data”. What are other kinds of priors you can have?

Priors

Priors can be thought of as opinions like the above. Besides “new data is more important than old data,” we may decide our prior is “coefficients vary smoothly.” This is relevant when we decide, say, to use a bunch of old values of some time series to help predict the next one, giving us a model like:

$y = F_t = \alpha_0 + \alpha_1 F_{t-1} + \alpha_2 F_{t-2} + \epsilon,$

which is just the example where we take the last two values of the time series $F$ to predict the next one. But we could use more than two values, of course.

[Aside: in order to decide how many values to use, you might want to draw an autocorrelation plot for your data.]

The way you’d place the prior about the relationship between coefficients (in this case consecutive lagged data points) is by adding a matrix to your covariance matrix when you perform linear regression. See more about this here.
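The linked post isn’t reproduced here, so take the following as one plausible reading rather than the exact recipe: to say “the two lag coefficients should be close,” add $\lambda \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}$ to $X^T X$ before solving, which penalizes $(\alpha_1 - \alpha_2)^2.$ The series, the name `lam`, and dropping the intercept are all assumptions to keep the sketch small.

```python
def fit_two_lags(series, lam=0.0):
    """Regress F_t on F_{t-1}, F_{t-2} (no intercept), penalizing lam * (a1 - a2)^2."""
    rows = [(series[t - 1], series[t - 2], series[t]) for t in range(2, len(series))]
    # Entries of X^T X and X^T y.
    s11 = sum(x1 * x1 for x1, _, _ in rows)
    s12 = sum(x1 * x2 for x1, x2, _ in rows)
    s22 = sum(x2 * x2 for _, x2, _ in rows)
    b1 = sum(x1 * y for x1, _, y in rows)
    b2 = sum(x2 * y for _, x2, y in rows)
    # Add the prior matrix lam * [[1, -1], [-1, 1]] to X^T X.
    a11, a12, a22 = s11 + lam, s12 - lam, s22 + lam
    det = a11 * a22 - a12 * a12
    # Solve the 2x2 system by Cramer's rule.
    return ((b1 * a22 - b2 * a12) / det, (b2 * a11 - b1 * a12) / det)

series = [1.0, 2.0, 1.5, 3.0, 2.0, 4.0, 2.5, 5.0]  # made-up data
print(fit_two_lags(series, lam=0.0))     # plain least squares
print(fit_two_lags(series, lam=1000.0))  # strong prior: coefficients pulled together
```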

Ethics

I then talked about modeling and ethics. My goal is to get this next-gen group of data scientists sensitized to the fact that they are not just nerds sitting in the corner but have increasingly important ethical questions to consider while they work.

People tend to overfit their models. It’s human nature to want your baby to be awesome. They also underestimate the bad news and blame other people for bad news, because nothing their baby has done or is capable of is bad, unless someone else made them do it. Keep these things in mind.

I then described what I call the deathspiral of modeling, a term I coined in this post on creepy model watching.

I counseled the students to

• try to maintain skepticism about their models and how their models might get used,
• shoot holes in their own ideas,
• accept challenges and devise tests as scientists rather than defending their models using words – if someone thinks they can do better, then let them try, and agree on an evaluation method beforehand,
• in general, try to consider the consequences of their models.

I then showed them Emanuel Derman’s Hippocratic Oath of Modeling, which was made for financial modeling but fits perfectly into this framework. I discussed the politics of working in industry, namely that even if they are skeptical of their model there’s always the chance that it will be used the wrong way in spite of the modeler’s warnings. So the Hippocratic Oath is, unfortunately, insufficient in reality (but it’s a good start!).

Finally, there are ways to do good: I mentioned stuff like DataKind. There are also ways to be transparent: I mentioned Open Models, which is so far just an idea, but Victoria Stodden is working on RunMyCode, which is similar and very awesome.

## What is a model?

September 28, 2012

I’ve been thinking a lot recently about mathematical models and how to explain them to people who aren’t mathematicians or statisticians. I consider this increasingly important as more and more models are controlling our lives, such as:

• employment models, which help large employers screen through applications,
• political ad models, which allow political groups to personalize their ads,
• credit scoring models, which allow consumer product companies and loan companies to screen applicants, and,
• if you’re a teacher, the Value-Added Model.
• See more models here and here.

It’s a big job, to explain these, because the truth is they are complicated – sometimes overly so, sometimes by construction.

The truth is, though, you don’t really need to be a mathematician to know what a model is, because everyone uses internal models all the time to make decisions.

For example, you intuitively model everyone’s appetite when you cook a meal for your family. You know that one person loves chicken (but hates hamburgers), while someone else will only eat the pasta (with extra cheese). You even take into account that people’s appetites vary from day to day, so you can’t be totally precise in preparing something – there’s a standard error involved.

To explain modeling at this level, then, you just need to imagine that you’ve built a machine that knows all the facts that you do and knows how to assemble them together to make a meal that will approximately feed your family. If you think about it, you’ll realize that you know a shit ton of information about the likes and dislikes of all of your family members, because you have so many memories of them grabbing seconds of the asparagus or avoiding the string beans.

In other words, it would be actually incredibly hard to give a machine enough information about all the food preferences for all your family members, and yourself, along with the constraints of having not too much junky food, but making sure everyone had something they liked, etc. etc.

So what would you do instead? You’d probably give the machine broad categories of likes and dislikes: this one likes meat, this one likes bread and pasta, this one always drinks lots of milk and puts nutella on everything in sight. You’d dumb it down for the sake of time, in other words. The end product, the meal, may not be perfect but it’s better than no guidance at all.

That’s getting closer to what real-world modeling for people is like. And the conclusion is right too- you aren’t expecting your model to do a perfect job, because you only have a broad outline of the true underlying facts of the situation.

Plus, when you’re modeling people, you have to a priori choose the questions to ask, which will probably come in the form of “does he/she like meat?” instead of “does he/she put nutella on everything in sight?”; in other words, the important but idiosyncratic rules won’t even be seen by a generic one-size-fits-everything model.

Finally, those generic models are hugely scaled- sometimes there’s really only one out there, being used everywhere, and its flaws are compounded that many times over because of its reach.

So, say you’ve got a CV with a spelling error. You’re trying to get a job, but the software that screens for applicants automatically rejects you because of this spelling error. Moreover, the same screening model is used everywhere, and you therefore don’t get any interviews because of this one spelling error, in spite of the fact that you’re otherwise qualified.

I’m not saying this would happen – I don’t know how those models actually work, although I do expect points against you for spelling errors. My point is there’s some real danger in using such models on a very large scale that we know are simplified versions of reality.

One last thing. The model fails in the example above, because the qualified person doesn’t get a job. But it fails invisibly; nobody knows exactly how it failed or even that it failed. Moreover, it only really fails for the applicant who doesn’t get any interviews. For the employer, as long as some qualified applicants survive the model, they don’t see failure at all.

## Columbia Data Science course, week 4: K-means, Classifiers, Logistic Regression, Evaluation

September 27, 2012

This week our guest lecturer for the Columbia Data Science class was Brian Dalessandro. Brian works at Media6Degrees as a VP of Data Science, and he’s super active in the research community. He’s also served as co-chair of the KDD competition.

Before Brian started, Rachel threw us a couple of delicious data science tidbits.

The Process of Data Science

First we have the Real World. Inside the Real World we have:

• Users using Google+
• People competing in the Olympics
• Spammers sending email

From this we draw raw data, e.g. logs, all the olympics records, or Enron employee emails. We want to process this to make it clean for analysis. We use pipelines of data munging, joining, scraping, wrangling or whatever you want to call it and we use tools such as:

• python
• shell scripts
• R
• SQL

We eventually get the data down to a nice format, say something with columns:

name event year gender event time

Note: this is where you typically start in a standard statistics class. But it’s not where we typically start in the real world.

Once you have this clean data set, you should be doing some kind of exploratory data analysis (EDA); if you don’t really know what I’m talking about then look at Rachel’s recent blog post on the subject. You may realize that it isn’t actually clean.

Next, you decide to apply some algorithm you learned somewhere:

• k-nearest neighbor
• regression
• Naive Bayes
• (something else),

depending on the type of problem you’re trying to solve:

• classification
• prediction
• description

You then:

• interpret
• visualize
• report
• communicate

At the end you have a “data product”, e.g. a spam classifier.

K-means

So far we’ve only seen supervised learning. K-means is the first unsupervised learning technique we’ll look into. Say you have data at the user level:

• G+ data
• survey data
• medical data
• SAT scores

Assume each row of your data set corresponds to a person, say each row corresponds to information about the user as follows:

age gender income Geo=state household size

Your goal is to segment them, otherwise known as stratify, or group, or cluster. Why? For example:

• you might want to give different users different experiences. Marketing often does this.
• you might have a model that works better for specific groups
• hierarchical modeling in statistics does something like this.

One possibility is to choose the groups yourself. Bucket users using homemade thresholds. Like by age, 20-24, 25-30, etc. or by income. In fact, say you did this, by age, gender, state, income, marital status. You may have 10 age buckets, 2 gender buckets, and so on, which would result in 10x2x50x10x3 = 30,000 possible bins, which is big.

You can picture a five dimensional space with buckets along each axis, and each user would then live in one of those 30,000 five-dimensional cells. You wouldn’t want 30,000 marketing campaigns so you’d have to bin the bins somewhat.

Wait, what if you want to use an algorithm instead, where you could decide on the number of bins? K-means is a “clustering algorithm”, and k is the number of groups. You pick k, a hyperparameter.

2-d version

Say you have users with #clicks, #impressions (or age and income – anything with just two numerical parameters). Then k-means looks for clusters on the 2-d plane. Here’s a stolen and simplistic picture that illustrates what this might look like:

The general algorithm is just the same picture but generalized to d dimensions, where d is the number of features for each data point.

Here’s the actual algorithm:

• randomly pick K centroids
• assign data to closest centroid.
• move each centroid to the average location of the users assigned to it
• repeat until the assignments don’t change

It’s up to you to interpret if there’s a natural way to describe these groups.
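The four steps above fit in a short function. This is a bare-bones sketch, not a production implementation: the toy points are made up, and leaving an empty cluster’s centroid where it is and capping the loop with `max_iter` are my own choices.

```python
import random

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # randomly pick k centroids
    assignment = None
    for _ in range(max_iter):
        # Assign each point to its closest centroid (squared Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        if new_assignment == assignment:  # assignments stopped changing
            break
        assignment = new_assignment
        # Move each centroid to the average location of its points.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # leave an empty cluster's centroid in place
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignment

# Two obvious blobs in the 2-d plane (think #clicks vs. #impressions).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, assignment = kmeans(points, k=2)
print(centroids)
print(assignment)
```

Which blob gets called cluster 0 depends on the random start, so only the grouping is meaningful, not the label.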

This is unsupervised learning and it has issues:

• choosing an optimal k is also a problem, although we know $1 \leq k \leq n$, where n is the number of data points.
• convergence issues – the solution can fail to exist (the configurations can fall into a loop) or can be “wrong”
• but it’s also fast
• interpretability can be a problem – sometimes the answer isn’t useful
• in spite of this, there are broad applications in marketing, computer vision (partition an image), or as a starting point for other models.

One common tool we use a lot in our systems is logistic regression.

Thought Experiment

Brian now asked us the following:

How would data science differ if we had a “grand unified theory of everything”?

He gave us some thoughts:

• Would we even need data science?
• Theory offers us a symbolic explanation of how the world works.
• What’s the difference between physics and data science?
• Is it just accuracy? After all, Newton wasn’t completely precise, but was pretty close.

If you think of the sciences as a continuum, where physics is all the way on the right, and as you go left, you get more chaotic, then where is economics on this spectrum? Marketing? Finance? As we go left, we’re adding randomness (and as a clever student points out, salary as well).

Bottomline: if we could model this data science stuff like we know how to model physics, we’d know when people will click on what ad. The real world isn’t that well understood, nor do we expect it to be in the future.

Does “data science” deserve the word “science” in its name? Here’s why maybe the answer is yes.

We always have more than one model, and our models are always changing.

The art in data science is this: translating the problem into the language of data science

The science in data science is this: given raw data, constraints and a problem statement, you have an infinite set of models to choose from, which you will use to maximize performance on some evaluation metric that you will have to specify. Every design choice you make can be formulated as a hypothesis, which you will use rigorous testing and experimentation to either validate or refute.

Never underestimate the power of creativity: usually people have vision but no method. As the data scientist, you have to turn it into a model within the operational constraints. You need to optimize a metric that you get to define. Moreover, you do this with a scientific method, in the following sense.

Namely, you hold onto your existing best performer, and once you have a new idea to prototype, then you set up an experiment wherein the two best models compete. You therefore have a continuous scientific experiment, and in that sense you can justify it as a science.

Classifiers

Given

• data
• a problem, and
• constraints,

we need to determine:

• a classifier,
• an optimization method,
• a loss function,
• features, and
• an evaluation metric.

Today we will focus on the process of choosing a classifier.

Classification involves mapping your data points into a finite set of labels or the probability of a given label or labels. Examples of when you’d want to use classification:

• will someone click on this ad?
• what number is this?
• what is this news article about?
• is this spam?
• is this pill good for headaches?

From now on we’ll talk about binary classification only (0 or 1).

Examples of classification algorithms:

• decision tree
• random forests
• naive bayes
• k-nearest neighbors
• logistic regression
• support vector machines
• neural networks

Which one should we use?

One possibility is to try them all, and choose the best performer. This is fine if you have no constraints or if you ignore constraints. But usually constraints are a big deal – you might have tons of data or not much time or both.

If I need to update 500 models a day, I do need to care about runtime. These end up being bidding decisions. Some algorithms are slow – k-nearest neighbors for example. Linear models, by contrast, are very fast.

One under-appreciated constraint of a data scientist is this: your own understanding of the algorithm.

Ask yourself carefully, do you understand it for real? Really? Admit it if you don’t. You don’t have to be a master of every algorithm to be a good data scientist. The truth is, getting the “best-fit” of an algorithm often requires intimate knowledge of said algorithm. Sometimes you need to tweak an algorithm to make it fit your data. A common mistake for people not completely familiar with an algorithm is to overfit.

Another common constraint: interpretability. You often need to be able to interpret your model, for the sake of the business for example. Decision trees are very easy to interpret. Random forests, on the other hand, not so much, even though it’s almost the same thing, but can take exponentially longer to explain in full. If you don’t have 15 years to spend understanding a result, you may be willing to give up some accuracy in order to have it easy to understand.

Note that credit card companies have to be able to explain their models by law, so decision trees make more sense than random forests.

How about scalability? In general, there are three things you have to keep in mind when considering scalability:

• learning time: how much time does it take to train the model?
• scoring time: how much time does it take to give a new user a score once the model is in production?
• model storage: how much memory does the production model use up?

Here’s a useful paper to look at when comparing models: “An Empirical Comparison of Supervised Learning Algorithms”, from which we learn:

• Simpler models are more interpretable but aren’t as good performers.
• The question of which algorithm works best is problem dependent
• It’s also constraint dependent

At M6D, we need to match clients (advertising companies) to individual users. We have logged the sites they have visited on the internet. Different sites collect this information for us. We don’t look at the contents of the page – we take the url and hash it into some random string and then we have, say, the following data about a user we call “u”:

u = <xyz, 123, sdqwe, 13ms>

This means u visited 4 sites and their urls hashed to the above strings. Recall last week we learned about a spam classifier where the features are words. We aren’t looking at the meaning of the words, so they might as well be strings.

At the end of the day we build a giant matrix whose columns correspond to sites and whose rows correspond to users, and there’s a “1” if that user went to that site.

To make this a classifier, we also need to associate the behavior “clicked on a shoe ad”. So, a label.
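Here’s a toy version of that setup. Everything in it is invented for illustration: the urls, the click labels, and the use of a truncated SHA-256 as the hash (the text doesn’t say which hash M6D uses).

```python
import hashlib

def hash_url(url):
    """Stand-in for the hashing step; the actual hash used is not given in the text."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()[:8]

# Hypothetical browsing logs and labels (1 = clicked on a shoe ad).
visits = {
    "u1": ["http://a.example/x", "http://b.example/y"],
    "u2": ["http://b.example/y"],
    "u3": ["http://c.example/z", "http://a.example/x"],
}
clicked = {"u1": 1, "u2": 0, "u3": 1}

users = sorted(visits)
sites = sorted({hash_url(u) for urls in visits.values() for u in urls})
col = {site: j for j, site in enumerate(sites)}

# Rows are users, columns are hashed sites; 1 if the user visited that site.
X = [[0] * len(sites) for _ in users]
for i, user in enumerate(users):
    for url in visits[user]:
        X[i][col[hash_url(url)]] = 1
y = [clicked[user] for user in users]

print(X)
print(y)
```

With real traffic the matrix is enormous and almost entirely zeros, so you would store it in a sparse format rather than as nested lists.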

Once we’ve labeled as above, this looks just like spam classification. We can now rely on well-established methods developed for spam classification – reduction to a previously solved problem.

Logistic Regression

We have three core problems as data scientists at M6D:

• feature engineering,
• user level conversion prediction,
• bidding.

We will focus on the second. We use logistic regression- it’s highly scalable and works great for binary outcomes.

What if you wanted to do something else? You could simply find a threshold so that, above you get 1, below you get 0. Or you could use a linear model like linear regression, but then you’d need to cut off below 0 or above 1.

What’s better: fit a function that is bounded inside [0,1]. For example, the inverse-logit function (also called the logistic function)

$P(t)= \frac{1}{1+ e^{-t}}.$

We want to estimate

$P(c_i | x) = f(x) = \frac{1}{1 + e^{-(\alpha + \beta^t x)}}.$

To get a model that is linear in $x$, we take the log of the odds:

$\ln\left(\frac{P(c_i | x)}{1-P(c_i | x)}\right) = \alpha + \beta^t x.$
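The two formulas above are inverses of each other, which is easy to check numerically. A small sketch (function names are mine):

```python
import math

def logistic(t):
    """Maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))

def log_odds(p):
    """Maps a probability in (0, 1) back to the real line."""
    return math.log(p / (1.0 - p))

print(logistic(0.0))            # 0.5
print(log_odds(logistic(1.7)))  # recovers 1.7 up to rounding
```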

The parameter $\alpha$ keeps the shape of the logit curve but shifts it back and forth. To interpret $\alpha$ further, consider what we call the base rate, the unconditional probability of “1” (so, in the case of ads, the base rate would correspond to the click-through rate, i.e. the overall tendency for people to click on ads; this is typically on the order of 1%).

If you had no information except the base rate, the average prediction would be just that. In a logistic regression, $\alpha$ defines the base rate. Specifically, the base rate is approximately equal to $\frac{1}{1+e^{-\alpha}}.$
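As a worked example of that relationship, here is the $\alpha$ that corresponds to a 1% click-through rate, obtained by inverting $\frac{1}{1+e^{-\alpha}}$:

```python
import math

base_rate = 0.01                                 # the ~1% click-through rate from the text
alpha = math.log(base_rate / (1.0 - base_rate))  # invert p = 1 / (1 + e^{-alpha})
recovered = 1.0 / (1.0 + math.exp(-alpha))

print(round(alpha, 3))  # -4.595
print(recovered)        # back to (about) 0.01
```

So a rare event like an ad click shows up as a fairly negative intercept.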

The slope $\beta$ defines the slope of the logit function. Note in general it’s a vector which is as long as the number of features we are using for each data point.

Our immediate modeling goal is to use our training data to find the best choices for $\alpha$ and $\beta.$ We will use a maximum likelihood estimation or convex optimization to achieve this; we can’t just use derivatives and vector calculus like we did with linear regression because it’s a complicated function of our data.

The likelihood function $L$ is defined by:

$L(\Theta | X_1, X_2, \dots , X_n) = P(X | \Theta) =$ $P(X_1 | \Theta) \cdot \dots \cdot P(X_n | \Theta),$

where we are assuming the data points $X_i$ are independent and where $\Theta = \{\alpha, \beta\}.$

We then search for the parameters that maximize this having observed our data:

$\Theta_{MLE} = \arg\max_{\Theta} \prod_{i=1}^n P(X_i | \Theta).$

The probability of a single observation is

$p_i^{Y_i} \cdot (1-p_i)^{1-Y_i},$

where $p_i = 1/(1+e^{-(\alpha + \beta^t x_i)})$ is the modeled probability of a “1” for the binary outcome $Y_i$ given the features $x_i$ of the $i$-th data point. Taking the product of all of these we get our likelihood function, which we want to maximize.
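In code, the likelihood and its log might look like the sketch below. The function names and the tiny data set are invented for illustration; each data point is a list of features plus a 0/1 outcome.

```python
import math

def predict(x, alpha, beta):
    """Modeled probability of a 1 for one data point with features x."""
    z = alpha + sum(b * xi for b, xi in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-z))

def likelihood(X, Y, alpha, beta):
    """Product over observations of p_i^y_i * (1 - p_i)^(1 - y_i)."""
    result = 1.0
    for x, y in zip(X, Y):
        p = predict(x, alpha, beta)
        result *= (p ** y) * ((1.0 - p) ** (1 - y))
    return result

def log_likelihood(X, Y, alpha, beta):
    """Same quantity on the log scale: a sum instead of a product."""
    total = 0.0
    for x, y in zip(X, Y):
        p = predict(x, alpha, beta)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

# Made-up data: one feature per point.
X = [[0.0], [1.0], [2.0]]
Y = [0, 0, 1]
print(likelihood(X, Y, -2.0, [1.5]), log_likelihood(X, Y, -2.0, [1.5]))
```

With many observations the raw product underflows to zero pretty quickly, which is one practical reason to work with the log.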

Similar to last week, we now take the log and get something concave (equivalently, the negative log likelihood is convex), so any local maximum we find is the global maximum. Finally, we use numerical techniques to find it, which essentially follow the gradient, in the spirit of Newton’s method from calculus. Computer programs can do this pretty well. These algorithms depend on a step size, which we will need to adjust as we get closer to the global max or min – there’s an art to this piece of numerical optimization as well. Each step of the algorithm looks something like this:

$x_{n+1} = x_n + \gamma_n \nabla F(x_n),$

where, remember, we are actually optimizing our parameters $\alpha$ and $\beta$ to maximize the (log) likelihood function, so the $x$ you see above is really the vector of parameters ($\alpha$ together with the $\beta$s) and the function $F$ corresponds to our $\log(L).$
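Here is a minimal sketch of plain gradient ascent on the log likelihood, using the fact that its gradient is the sum of (outcome minus predicted probability) times the features. The fixed step size, the iteration count, and the toy data are all assumptions I picked for illustration; real implementations adapt the step size.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def log_likelihood(X, Y, alpha, beta):
    total = 0.0
    for x, y in zip(X, Y):
        p = sigmoid(alpha + sum(b * xi for b, xi in zip(beta, x)))
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

def fit_gradient_ascent(X, Y, step=0.1, n_iter=2000):
    """Repeatedly move (alpha, beta) a small step along the gradient of log L."""
    n, k = len(X), len(X[0])
    alpha, beta = 0.0, [0.0] * k
    for _ in range(n_iter):
        g_alpha, g_beta = 0.0, [0.0] * k
        for x, y in zip(X, Y):
            p = sigmoid(alpha + sum(b * xi for b, xi in zip(beta, x)))
            residual = y - p  # contribution of this point to the gradient
            g_alpha += residual
            for j in range(k):
                g_beta[j] += residual * x[j]
        # Average the gradient so the step size does not depend on n.
        alpha += step * g_alpha / n
        beta = [b + step * g / n for b, g in zip(beta, g_beta)]
    return alpha, beta

# Toy, non-separable data: larger x makes a 1 more likely.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
Y = [0, 0, 1, 0, 1, 1]
alpha, beta = fit_gradient_ascent(X, Y)
print(alpha, beta, log_likelihood(X, Y, alpha, beta))
```

If you make `step` much bigger the iterates start to overshoot, and if you make it much smaller you wait forever, which is the "art" part.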

“Flavors” of Logistic Regression for convex optimization.

The Newton’s method we described above is also called Iteratively Reweighted Least Squares. It uses the curvature of the log likelihood to choose an appropriate step direction. The actual calculation involves the Hessian matrix, and in particular requires inverting it; it is a $k \times k$ matrix, where $k$ is the number of features. This is bad when there are lots of features, as in 10,000 or something. Typically we don’t have that many features but it’s not impossible.
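To make the Hessian step concrete, here is a sketch of Newton’s method for the smallest interesting case: one feature plus the intercept, so the Hessian is only 2x2 and can be inverted by hand. The function name, the iteration count, and the toy data are my own assumptions; with $k$ features you would solve a $k \times k$ linear system at each step instead.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_newton(xs, ys, n_iter=25):
    """Newton's method for P(y=1 | x) = sigmoid(a + b * x)."""
    a, b = 0.0, 0.0
    for _ in range(n_iter):
        g_a = g_b = 0.0            # gradient of the log likelihood
        h_aa = h_ab = h_bb = 0.0   # entries of minus the Hessian
        for x, y in zip(xs, ys):
            p = sigmoid(a + b * x)
            w = p * (1.0 - p)      # the "reweighting" in IRLS
            g_a += y - p
            g_b += (y - p) * x
            h_aa += w
            h_ab += w * x
            h_bb += w * x * x
        # Solve [[h_aa, h_ab], [h_ab, h_bb]] * d = g using the 2x2 inverse.
        det = h_aa * h_bb - h_ab * h_ab
        d_a = (h_bb * g_a - h_ab * g_b) / det
        d_b = (h_aa * g_b - h_ab * g_a) / det
        a += d_a
        b += d_b
    return a, b

# Same kind of toy, non-separable data as one might use for gradient ascent.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0, 0, 1, 0, 1, 1]
a, b = fit_newton(xs, ys)
print(a, b)
```

Note there is no step size to tune here: the curvature picks the step for you, at the cost of building and inverting that matrix every iteration.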

Another possible method to maximize our likelihood or log likelihood is called Stochastic Gradient Descent. It approximates the gradient using a single observation at a time. The algorithm updates the current best-fit parameters each time it sees a new data point. The good news is that there’s no big matrix inversion, and it works well with huge data and with sparse features; it’s a big deal in Mahout and Vowpal Wabbit. The bad news is it’s not such a great optimizer and it’s very dependent on step size.
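A minimal sketch of the one-observation-at-a-time update follows. The decaying step-size schedule, the number of passes over the data, and the centered toy data are assumptions of mine; this is not how Mahout or Vowpal Wabbit actually implement it.

```python
import math
import random

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def log_likelihood(X, Y, alpha, beta):
    total = 0.0
    for x, y in zip(X, Y):
        p = sigmoid(alpha + sum(b * xi for b, xi in zip(beta, x)))
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

def fit_sgd(X, Y, step0=0.1, decay=0.01, n_epochs=200, seed=0):
    """Update the parameters after every single data point."""
    rng = random.Random(seed)
    k = len(X[0])
    alpha, beta = 0.0, [0.0] * k
    order = list(range(len(X)))
    t = 0
    for _ in range(n_epochs):
        rng.shuffle(order)  # visit the points in a fresh random order
        for i in order:
            step = step0 / (1.0 + decay * t)  # hypothetical decay schedule
            t += 1
            p = sigmoid(alpha + sum(b * xi for b, xi in zip(beta, X[i])))
            residual = Y[i] - p  # one-point estimate of the gradient
            alpha += step * residual
            beta = [b + step * residual * xi for b, xi in zip(beta, X[i])]
    return alpha, beta

# Toy data with the single feature centered around zero.
X = [[-2.5], [-1.5], [-0.5], [0.5], [1.5], [2.5]]
Y = [0, 0, 1, 0, 1, 1]
alpha, beta = fit_sgd(X, Y)
print(alpha, beta, log_likelihood(X, Y, alpha, beta))
```

Each update only touches the features of the current point, which is why sparse data is cheap here. Try changing `step0` or `decay` to see how sensitive the answer is.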

Evaluation

We generally use different evaluation metrics for different kinds of models.

First, for ranking models, where we just want to know a relative rank rather than an absolute score, we’d look to one of:

Second, for classification models, we’d look at the following metrics:

• lift: how much more people are buying or clicking because of a model
• accuracy: how often the correct outcome is being predicted
• f-score
• precision
• recall
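For reference, here is a sketch of how accuracy, precision, recall, and f-score come out of the confusion counts for a binary classifier. I’m using the usual F1 definition of f-score (the harmonic mean of precision and recall), and the labels and predictions are made up.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 from 0/1 labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    # Of everything we predicted as 1, how much really was 1?
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Of everything that really was 1, how much did we catch?
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
print(classification_metrics(y_true, y_pred))
```

Keep in mind that with a 1% base rate, predicting “0” for everyone gets you 99% accuracy, which is why precision and recall are on the list.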

Finally, for density estimation, where we need to know an actual probability rather than a relative score, we’d look to:

In general it’s hard to compare lift curves but you can compare AUC (area under the receiver operating characteristic curve) – it is “base rate invariant.” In other words if you bring the click-through rate from 1% to 2%, that’s 100% lift but if you bring it from 4% to 7% that’s less lift but more effect. AUC does a better job in such a situation when you want to compare.
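Here is a sketch of both ideas: AUC computed directly as the fraction of (positive, negative) pairs that the model ranks in the right order, and the lift arithmetic from the example. The function names and the scores are invented; the pairwise loop is quadratic, so it is for intuition rather than for big data.

```python
def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def relative_lift(rate_before, rate_after):
    """Relative improvement in a rate, e.g. 0.01 -> 0.02 is a lift of 1.0 (100%)."""
    return (rate_after - rate_before) / rate_before

scores = [0.9, 0.4, 0.6, 0.2, 0.7]
labels = [1, 0, 1, 0, 0]
print(auc(scores, labels))
# Rescaling the scores keeps the ranking, so the AUC does not change.
print(auc([s / 2 for s in scores], labels))

print(relative_lift(0.01, 0.02), 0.02 - 0.01)  # big lift, small absolute change
print(relative_lift(0.04, 0.07), 0.07 - 0.04)  # smaller lift, bigger absolute change
```

Since AUC only looks at the ordering of the scores, it says nothing about whether the scores are good probabilities.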

Density estimation tests tell you how well you are fitting the conditional probability. In advertising, this may arise if you have a situation where each ad impression costs $c$ dollars and for each conversion you receive $q$ dollars. You will want to target every potential conversion that has a positive expected value, i.e. whenever

$P(\text{Conversion} | X) \cdot q > c.$

But to do this you need to make sure the probability estimate on the left is accurate, which in this case means something like the mean squared error of the estimator is small. Note a model can give you good rankings but bad P estimates.
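A small sketch of why ranking isn’t enough for this decision: below, a hypothetical model orders three users perfectly but overstates every probability by a factor of ten, so it ends up buying impressions that lose money. All of the numbers (the payoff, the cost, the probabilities) are invented for illustration.

```python
def should_target(p_conversion, q, c):
    """Buy the impression only if the expected payoff beats its cost."""
    return p_conversion * q > c

def mean_squared_error(estimates, truths):
    return sum((e - t) ** 2 for e, t in zip(estimates, truths)) / len(truths)

q, c = 10.0, 0.3                      # hypothetical payoff and cost, in dollars
true_probs = [0.01, 0.02, 0.05]       # hypothetical true conversion probabilities
model_probs = [0.10, 0.20, 0.50]      # same ranking, badly miscalibrated

print([should_target(p, q, c) for p in true_probs])   # what we should do
print([should_target(p, q, c) for p in model_probs])  # what the model tells us
print(mean_squared_error(model_probs, true_probs))
```

In real life you don’t observe the true probabilities, only the 0/1 outcomes, so you would estimate this kind of error on held-out data.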

Similarly, features that rank highly on AUC don’t necessarily rank well with respect to mean absolute error. So feature selection, as well as your evaluation method, is completely context-driven.

## Evaluating professor evaluations

September 24, 2012

I recently read this New York Times “Room for Debate” on professor evaluations. There were some reasonably good points made, with people talking about the trend that students generally give better grades to attractive professors and easy grading professors, and that they were generally more interested in the short-term than in the long-term in this sense.

For these reasons, it was stipulated, it would be better and more informative to have anonymous evaluations, or have students come back after some time to give evaluations, or interesting ideas like that.

Then there was a crazy crazy man named Jeff Sandefer, co-founder and master teacher at the Acton School of Business in Austin, Texas. He likes to call his students “customers” and here’s how he deals with evaluations:

Acton, the business school that I co-founded, is designed and is led exclusively by successful chief executives. We focus intently on customer feedback. Every week our students rank each course and professor, and the results are made public for all to see. We separate the emotional venting from constructive criticism in the evaluations, and make frequent changes in the program in real time.

We also tie teacher bonuses to the student evaluations and each professor signs an individual learning covenant with each student. We have eliminated grade inflation by using a forced curve for student grades, and students receive their grades before evaluating professors. Not only do we not offer tenure, but our lowest rated teachers are not invited to return.

First of all, I’m not crazy about the idea of weekly rankings and public shaming going on here. And how do you separate emotional venting from constructive criticism anyway? Isn’t the customer always right? Overall the experience of the teachers doesn’t sound good – if I have a choice as a teacher, I teach elsewhere, unless the pay and the students are stellar.

On the other hand, I think it’s interesting that they have a curve for student grades. This does prevent the extra good evaluations coming straight from grade inflation (I’ve seen it, it does happen).

Here’s one thing I didn’t see discussed, which is the students themselves and how much they want to be in the class. When I taught first semester calculus at Barnard twice in consecutive semesters, my experience was vastly different in the two classes.

The first time I taught, in the Fall, my students were mostly straight out of high school, bright eyed and bushy tailed, and were happy to be there, and I still keep in touch with some of them. It was a great class, and we all loved each other by the end of it. I got crazy good reviews.

By contrast, the second time I taught the class, which was the next semester, my students were annoyed, bored, and whiny. I had too many students in the class, partly because my reviews were so good. So the class was different on that score, but I don’t think that mattered so much to my teaching.

My theory, which was backed up by all the experienced Profs in the math department, was that I had the students who were avoiding calculus for some reason. And when I thought about it, they weren’t straight out of high school, they were all over the map. They generally were there only because they needed some kind of calculus to fulfill a requirement for their major.

Unsurprisingly, I got mediocre reviews, with some really pretty nasty ones. The nastiest ones, I noticed, all had some giveaway that they had a bad attitude, something like, “Cathy never explains anything clearly, and I hate calculus.” My conclusion is that I get great evaluations from students who want to learn calculus and nasty evaluations from students who resent me asking them to really learn calculus.

What should we do about prof evaluations?

The problem with using evaluations to measure professor effectiveness is that you might be a prof that only has ever taught calculus in the Spring, and then you’d be wrongfully punished. That’s where we are now, and people know it, so instead of using them they just mostly ignore them. Of course, the problem with not ever using these evaluations is that they might actually contain good information that you could use to get better at teaching.

We have a lot of data collected on teacher evaluations, so I figure we should be analyzing it to see if there really is a useful signal or not. And we should use domain expertise from experienced professors to see if there are any other effects besides the “Fall/Spring attitude towards math” effect to keep in mind.

It’s obviously idiosyncratic depending on field and even which class it is, i.e. Calc II versus Calc III. If there even is a signal after you extract the various effects and the “attractiveness” effect, I expect it to be very noisy and so I’d hate to see someone’s entire career depend on evaluations, unless there was something really outrageous going on.

In any case it would be fun to do that analysis.