### Archive

Archive for the ‘data science’ Category

## On trusting experts, climate change research, and scientific translators

Stephanie Tai has written a thoughtful response on Jordan Ellenberg’s blog to my discussion with Jordan regarding trusting experts (see my Nate Silver post and the follow-up post for more context).

Trusting experts

Stephanie asks three important questions about trusting experts, which I paraphrase here:

1. What does it take to look into a model yourself? How deeply must you probe?
2. How do you avoid being manipulated when you do so?
3. Why should we bother since stuff is so hard and we each have a limited amount of time?

I must confess I find the first two questions really interesting and I want to think about them, but I have a very little patience with the last question.

Here’s why:

• I’ve seen too many people (individual modelers) intentionally deflect investigations into models by setting them up as so hard that it’s not worth it (or at least it seems not worth it). They use buzz words and make it seem like there’s a magical layer of their model which makes it too difficult for mere mortals. But my experience (as an arrogant, provocative, and relentless questioner) is that I can always understand a given model if I’m talking to someone who really understands it and actually wants to communicate it.
• It smacks of an excuse rather than a reason. If it’s our responsibility to understand something, then by golly we should do it, even if it’s hard.
• Too many things are left up to people whose intentions are not reasonable using this “too hard” argument, and it gives those people reason to make entire systems seem too difficult to penetrate. For a great example, see the financial system, which is consistently too complicated for regulators to properly regulate.

I’m sure I seem unbelievably cynical here, but that’s where I got by working in finance, where I saw first-hand how manipulative and manipulated mathematical modeling can become. And there’s no reason at all such machinations wouldn’t translate to the world of big data or climate modeling.

Climate research

Speaking of climate modeling: first, it annoys me that people are using my “distrust the experts” line to be cast doubt on climate modelers.

People: I’m not asking you to simply be skeptical, I’m saying you should look into the models yourself! It’s the difference between sitting on a couch and pointing at a football game on TV and complaining about a missed play and getting on the football field yourself and trying to figure out how to throw the ball. The first is entertainment but not valuable to anyone but yourself. You are only adding to the discussion if you invest actual thoughtful work into the matter.

To that end, I invited an expert climate researcher to my house and asked him to explain the climate models to me and my husband, and although I’m not particularly skeptical of climate change research (more on that below when I compare incentives of the two sides), I asked obnoxious, relentless questions about the model until I was satisfied. And now I am satisfied. I am considering writing it up as a post.

As an aside, if climate researchers are annoyed by the skepticism, I can understand that, since football fans are an obnoxious group, but they should not get annoyed by people who want to actually do the work to understand the underlying models.

Another thing about climate research. People keep talking about incentives, and yes I agree wholeheartedly that we should follow the incentives to understand where manipulation might be taking place. But when I followed the incentives with respect to climate modeling, they bring me straight to climate change deniers, not to researchers.

Do we really think these scientists working with their research grants have more at stake than multi-billion dollar international companies who are trying to ignore the effect of their polluting factories on the environment? People, please. The bulk of the incentives are definitely with the business owners. Which is not to say there are no incentives on the other side, since everyone always wants to feel like their research is meaningful, but let’s get real.

Scientific translators

I like this idea Stephanie comes up with:

Some sociologists of science suggest that translational “experts”–that is, “experts” who aren’t necessarily producing new information and research, but instead are “expert” enough to communicate stuff to those not trained in the area–can help bridge this divide without requiring everyone to become “experts” themselves. But that can also raise the question of whether these translational experts have hidden agendas in some way. Moreover, one can also raise questions of whether a partial understanding of the model might in some instances be more misleading than not looking into the model at all–examples of that could be the various challenges to evolution based on fairly minor examples that when fully contextualized seem minor but may pop out to someone who is doing a less systematic inquiry.

First, I attempt to make my blog something like a platform for this, and I also do my best to make my agenda not at all hidden so people don’t have to worry about that.

This raises a few issues for me:

• Right now we depend mostly on press to do our translations, but they aren’t typically trained as scientists. Does that make them more prone to being manipulated? I think it does.
• How do we encourage more translational expertise to emerge from actual experts? Currently, in academia, the translation to the general public of one’s research is not at all encouraged or rewarded, and outside academia even less so.
• Like Stephanie, I worry about hidden agendas and partial understandings, but I honestly think they are secondary to getting a robust system of translation started to begin with, which would hopefully in turn engage the general public with the scientific method and current scientific knowledge. In other words, the good outweighs the bad here.

## Open data is not a panacea

I’ve talked a lot recently about how there’s an information war currently being waged on consumers by companies that troll the internet and collect personal data, search histories, and other “attributes” in data warehouses which then gets sold to the highest bidders.

It’s natural to want to balance out this information asymmetry somehow. One such approach is open data, defined in Wikipedia as the idea that certain data should be freely available to everyone to use and republish as they wish, without restrictions from copyrightpatents or other mechanisms of control.

I’m going to need more than one blog post to think this through, but I wanted to make two points this morning.

The first is my issue with the phrase “freely available to everyone to use”. What does that mean? Having worked in futures trading, where we put trading machines and algorithms in close proximity with exchanges for large fees so we can get to the market data a few nanoseconds before anyone else, it’s clear to me that availability and access to data is an incredibly complicated issue.

And it’s not just about speed. You can have hugely important, rich, and large data sets sitting in a lump on a publicly available website like wikipedia, and if you don’t have fancy parsing tools and algorithms you’re not going to be able to make use of it.

When important data goes public, the edge goes to the most sophisticated data engineer, not the general public. The Goldman Sachs’s of the world will always know how to make use of “freely available to everyone” data before the average guy.

Which brings me to my second point about open data. It’s general wisdom that we should hope for the best but prepare for the worst. My feeling is that as we move towards open data we are doing plenty of the hoping part but not enough of the preparing part.

If there’s one thing I learned working in finance, it’s not to be naive about how information will be used. You’ve got to learn to think like an asshole to really see what to worry about. It’s a skill which I don’t regret having.

So, if you’re giving me information on where public schools need help, I’m going to imagine using that information to cut off credit for people who live nearby. If you tell me where environmental complaints are being served, I’m going to draw a map and see where they aren’t being served so I can take my questionable business practices there.

I’m not saying proponents of open data aren’t well-meaning, they often seem to be. And I’m not saying that the bad outweighs the good, because I’m not sure. But it’s something we should figure out how to measure, and in this information war it’s something we should keep a careful eye on.

## Consumer segmentation taken to the extreme

I’m up in Western Massachusetts with the family, hidden off in a hotel with a pool and a nearby yarn superstore. My blogging may be spotty for the next few days but rest assured I haven’t forgotten about mathbabe (or Aunt Pythia).

I have just enough time this morning to pose a thought experiment. It’s in three steps. First, read this Reuters article which ends with:

Imagine if Starbucks knew my order as I was pulling into the parking lot, and it was ready the second I walked in. Or better yet, if a barista could automatically run it out to my car the exact second I pulled up. I may not pay more for that everyday, but I sure as hell would if I were late to a meeting with a screaming baby in the car. A lot more. Imagine if my neighborhood restaurants knew my local, big-tipping self was the one who wanted a reservation at 8 pm, not just an anonymous user on OpenTable. They might find some room. And odds are, I’d tip much bigger to make sure I got the preferential treatment the next time. This is why Uber’s surge pricing is genius when it’s not gouging victims of a natural disaster. There are select times when I’ll pay double for a cab. Simply allowing me to do so makes everyone happy.

In a world where the computer knows where we are and who we are and can seamlessly charge us, the world might get more expensive. But it could also get a whole lot less annoying. ”This is what big data means to me,” Rosensweig says.

Second, think about just how not “everyone” is happy. It’s a pet peeve of mine that people who like their personal business plan consistently insist that everybody wins, when clearly there are often people (usually invisible) who are definitely losing. In this case the losers are people whose online personas don’t correlate (in a given model) with big tips. Should those people not be able to reserve a table at a restaurant now? How is that model going to work?

And now I’ve gotten into the third step. It used to be true that if you went to a restaurant enough, the chef and the waitstaff would get to know you and might even keep a table open for you. It was old-school personalization.

What if that really did start to happen at every restaurant and store automatically, based on your online persona? On the one hand, how weird would that be, and on the other hand how quickly would we all get used to it? And what would that mean for understanding each other’s perspectives?

Categories: data science, modeling, musing

## Whom can you trust?

My friend Jordan has written a response to yesterday’s post about Nate Silver. He is a major fan of Silver and contends that I’m not fair to him:

I think Cathy’s distrust is warranted, but I think Silver shares it.  The central concern of his chapter on weather prediction is the vast difference in accuracy between federal hurricane forecasters, whose only job is to get the hurricane track right, and TV meteorologists, whose very different incentive structure leads them to get the weather wrong on purpose.  He’s just as hard on political pundits and their terrible, terrible predictions, which are designed to be interesting, not correct.

To this I’d say, Silver mocks TV meteorologists and political pundits in a dismissive way, as not being scientific enough. That’s not the same as taking them seriously and understanding their incentives, and it doesn’t translate to the much more complicated world of finance.

In any case, he could have understood incentives in every field except finance and I’d still be mad, because my direct experience with finance made me understand it, and the outsized effect it has on our economy makes it hugely important.

But Jordan brings up an important question about trust:

But what do you do with cases like finance, where the only people with deep domain knowledge are the ones whose incentive structure is socially suboptimal?  (Cathy would use saltier language here.)  I guess you have to count on mavericks like Cathy, who’ve developed the domain knowledge by working in the financial industry, but who are now separated from the incentives that bind the insiders.

But why do I trust what Cathy says about finance?

Because she’s an expert.

Is Cathy OK with this?

No, Cathy isn’t okay with this. The trust problem is huge, and I address it directly in my post:

This raises a larger question: how can the public possibly sort through all the noise that celebrity-minded data people like Nate Silver hand to them on a silver platter? Whose job is it to push back against rubbish disguised as authoritative scientific theory?

It’s not a new question, since PR men disguising themselves as scientists have been around for decades. But I’d argue it’s a question that is increasingly urgent considering how much of our lives are becoming modeled. It would be great if substantive data scientists had a way of getting together to defend the subject against sensationalist celebrity-fueled noise.

One hope I nurture is that, with the opening of the various data science institutes such as the one at Columbia which was a announced a few months ago, there will be a way to form exactly such a committee. Can we get a little peer review here, people?

I do think domain-expertise-based peer review will help, but not when the entire field is captured, like in some subfields of medical research and in some subfields of economics and finance (for a great example see Glen Hubbard get destroyed in Matt Taibbi’s recent blogpost for selling his economic research).

The truth is, some fields are so yucky that people who want to do serious research just leave because they are disgusted. Then the people who remain are the “experts”, and you can’t trust them.

The toughest part is that you don’t know which fields are like this until you try to work inside them.

Bottomline: I’m telling you not to trust Nate Silver, and I would also urge you not to trust any one person, including me. For that matter don’t necessarily trust crowds of people either. Instead, carry a healthy dose of skepticism and ask hard questions.

This is asking a lot, and will get harder as time goes on and as the world becomes more complicated. On the one hand, we need increased transparency for scientific claims like projects such as runmycode provide. On the other, we need to understand the incentive structure inside a field like finance to make sure it is aligned with its stated mission.

Categories: data science, finance, modeling

## When accurate modeling is not good

I liked Andrew Gelman’s recent post (hat tip Suresh Naidu) about predatory modeling going on in casinos, specifically Caesars in Iowa. The title of the post is already good, and is a riff on Caesars Entertainment CEO Gary Loveman said:

There are four ways to get fired from Caesars: (1) theft, (2) sexual harassment, (3) running an experiment without a control group, and (4) keeping a gambling addict away from the casino

He tells a story about a woman who loses lots of money at the casino, but who moreover gets manipulated to come back and lose more and more based on the data the people collected at Caesars and based on the models built by the quants there. You should read the whole thing, which as usual with Gelman is quirky and fun. His main point comes here (emphasis mine):

The Caesars case (I keep wanting to write Caesar’s but apparently no, it’s Caesars, just like Starbucks) interested me because of the role of statistics. I’m used to thinking of probability and statistics as a positive social force (helping medical research or, in earlier days, helping the allies in World War 2), or mildly positive (for example, helping design measures to better evaluate employees), or maybe neutral (exotic financial instruments which serve no redeeming social value but presumably don’t do much harm) or moderately negative (“Moneyball”-style strategies such as going for slow sluggers who foul off endless pitches and walk a lot; it may win games but it makes for boring baseball). And then there are statisticians who do fishy analyses, for example trying to hide that some drug causes damage so it can stay on the market. But that’s a bit different because such a statistical analysis, no matter how crafty, is inherently a bad analysis, trying to obscure rather than learn.

The Caesars case seems different, in that there is a very direct tradeoff: the better the statistics and the better the science, the worse the human outcomes. These guys are directly optimizing their ability to ruin some people’s lives.

It’s not the only one, but they are not usually this clear-cut.

It’s time we started scoring models on various dimensions. Accuracy is one, predatoriness is another. They’re distinct.

Categories: data science

## Fighting the information war (but only on behalf of rich people)

There’s an information war out there which we have to be prepared for. Actually there a few of them.

And according to this New York Times piece, there’s now a way to fight against the machine, for a fee. Companies like Reputation.com will try to scour the web and remove data you don’t want floating around about you, and when that’s impossible they’ll flood the web with other good data to balance out the bad stuff.

At least that’s what I’m assuming they do, because they of course don’t really explain their techniques. And that’s the other information war, where they scare rich people with technical sounding jargon and tell them unlikely stories to get their money.

I’m not claiming predatory information-gatherers aren’t out there. But this is the wrong way to deal with it.

First of all, most of the data out there systematically being used for nefarious purposes, at least in this country, is used against the poor, denying them reasonable terms on their loans and other services. So the idea that people will need to pay for a service to protect their information is weird. It’s like saying the air quality is bad for poor people, so let’s charge rich people for better air.

So what kind of help is Reputation.com actually providing? Here’s my best guess.

First it targets people to get overly scared in the spirit of this recent BusinessWeek article, which explains that cosmetic companies have gone to China and started a campaign to convince Chinese women they are too hairy so they’ll start buying products to remove hair. From that article, which is guaranteed to make you understand something about American beauty culture too:

Despite such plays on women’s fears of embarrassment, Reckitt Benckiser’s Sehgal says that Chinese women are too “independent-minded” to be coaxed into using a product they don’t really need. Others aren’t so sure. Veet’s Chinese marketing “plays a role that is very similar to that of the apple in the Bible,” says Benjamin Voyer, a social psychologist and assistant professor of marketing at ESCP Europe business school. “It creates an awareness, which subsequently creates a feeling of shame and need.”

Second, Reputation.com gets their clients off nuisance lists, like the modern version of a do-not-call program (which, importantly, is run by the government). This is probably equivalent to setting up a bunch of email filters and clearing their cookies every now and then, but they can’t tell their clients that.

Finally, for those rich people who are also super vain, they will try to do things like replace the unflattering photos of them that come up in a google image search with better-looking ones they choose. Things like that, image issues.

I just want to point out one more salient fact about Reputation.com. It’s not just in their interest to scare-monger, it’s actually in their interest to make the data warehouses more complete (they have themselves amassed an enormous database on people), and to have people who don’t pay for their services actually need their services more. They could well create a problem to produce a market for their product.

What drives me nuts about this is how elitist it is.

There are very real problems in the information-gathering space, and we need to address them, but one of the most important issues is that the very people who can’t afford to pay for their reputation to be kept clean are the real victims of the system.

There is literally nobody who will make good money off of actually solving this problem: I challenge any libertarian to explain how the free market will address this. It has to be addressed through policy, and specifically through legislating what can and cannot be done with personal data.

Probably the worst part is that, through using the services from companies Reputation.com and because of the nature of the personalization of internet usage, the very legislators who need to act on behalf of their most vulnerable citizens won’t even see the problem since they don’t share it.

Categories: data science, rant

## Columbia Data Science course, week 14: Presentations

In the final week of Rachel Schutt’s Columbia Data Science course we heard from two groups of students as well as from Rachel herself.

Data Science; class consciousness

The first team of presenters consisted of Yegor, Eurry, and Adam. Many others whose names I didn’t write down contributed to the research, visualization, and writing.

First they showed us the very cool graphic explaining how self-reported skills vary by discipline. The data they used came from the class itself, which did this exercise on the first day:

so the star in the middle is the average for the whole class, and each star along the side corresponds to the average (self-reported) skills of people within a specific discipline. The dotted lines on the outside stars shows the “average” star, so it’s easier to see how things vary per discipline compared to the average.

Surprises: Business people seem to think they’re really great at everything except communication. Journalists are better at data wrangling than engineers.

We will get back to the accuracy of self-reported skills later.

We were asked, do you see your reflection in your star?

Also, take a look at the different stars. How would you use them to build a data science team? Would you want people who are good at different skills? Is it enough to have all the skills covered? Are there complementary skills? Are the skills additive, or do you need overlapping skills among team members?

Thought Experiment

If all data which had ever been collected were freely available to everyone, would we be better off?

Some ideas were offered:

• all nude photos are included. [Mathbabe interjects: it's possible to not let people take nude pics of you. Just sayin'.]
• so are passwords, credit scores, etc.
• how do we make secure transactions between a person and her bank considering this?
• what does it mean to be “freely available” anyway?

The data of power; the power of data

You see a lot of people posting crap like this on Facebook:

But here’s the thing: the Berner Convention doesn’t exist. People are posting this to their walls because they care about their privacy. People think they can exercise control over their data but they can’t. Stuff like this give one a false sense of security.

In Europe the privacy laws are stricter, and you can request data from Irish Facebook and they’re supposed to do it, but it’s still not easy to successfully do.

And it’s not just data that’s being collected about you – it’s data you’re collecting. As scientists we have to be careful about what we create, and take responsibility for our creations.

As Francois Rabelais said,

Wisdom entereth not into a malicious mind, and science without conscience is but the ruin of the soul.

Or as Emily Bell from Columbia said,

Every algorithm is editorial.

We can’t be evil during the day and take it back at hackathons at night. Just as journalists need to be aware that the way they report stories has consequences, so do data scientists. As a data scientist one has impact on people’s lives and how they think.

Here are some takeaways from the course:

• We’ve gained significant powers in this course.
• In the future we may have the opportunity to do more.
• With data power comes data responsibility.

Who does data science empower?

The second presentation was given by Jed and Mike. Again, they had a bunch of people on their team helping out.

Thought experiment

“Anything which uses science as part of its name isn’t political science, creation science, computer science.”

- Hal Abelson, MIT CS prof

Keeping this in mind, if you could re-label data science, would you? What would you call it?

Some comments from the audience:

• Let’s call it “modellurgy,” the craft of beating mathematical models into shape instead of metal
• Let’s call it “statistics”

Does it really matter what data science is? What should it end up being?

Chris Wiggins from Columbia contends there are two main views of what data science should end up being. The first stems from John Tukey, inventor of the fast fourier transform and the box plot, and father of exploratory data analysis. Tukey advocated for a style of research he called “data analysis”, emphasizing the primacy of data and therefore computation, which he saw as part of statistics. His descriptions of data analysis, which he saw as part of doing statistics, are very similar to what people call data science today.

The other prespective comes from Jim Gray, Computer Scientist from Microsoft. He saw the scientific ideals of the enlightenment age as expanding and evolving. We’ve gone from the theories of Darwin and Newton to experimental and computational approaches of Turing. Now we have a new science, a data-driven paradigm. It’s actually the fourth paradigm of all the sciences, the first three being experimental, theoretical, and computational. See more about this here.

Wait, can data science be both?

Note it’s difficult to stick Computer Science and Data Science on this line.

Statistics is a tool that everyone uses. Data science also could be seen that way, as a tool rather than a science.

Who does data science?

Here’s a graphic showing the make-up of Kaggle competitors. Teams of students collaborated to collect, wrangle, analyze and visualize this data:

The size of the blocks correspond to how many people in active competitions have an education background in a given field. We see that almost a quarter of competitors are computer scientists. The shading corresponds to how often they compete. So we see the business finance people do more competitions on average than the computer science people.

Consider this: the only people doing math competitions are math people. If you think about it, it’s kind of amazing how many different backgrounds are represented above.

We got some cool graphics created by the students who collaborated to get the data, process it, visualize it and so on.

Which universities offer courses on Data Science?

There will be 26 universities in total by 2013 that offer data science courses. The balls are centered at the center of gravity of a given state, and the balls are bigger if there are more in that state.

Where are data science jobs available?

Observations:

• We see more professional schools offering data science courses on the west coast.
• It would also would be interesting to see this corrected for population size.
• Only two states had no jobs.
• Massachusetts #1 per capita, then Maryland

McKinsey says there will be hundreds of thousands of data science jobs in the next few years. There’s a massive demand in any case. Some of us will be part of that. It’s up to us to make sure what we’re doing is really data science, rather than validating previously held beliefs.

We need to advance human knowledge if we want to take the word “scientist” seriously.

How did this class empower you?

You are one of the first people to take a data science class. There’s something powerful there.

Thank you Rachel!

Last Day of Columbia Data Science Class, What just happened? from Rachel’s perspective

Recall the stated goals of this class were:

• learn about what it’s like to be a data scientists
• be able to do some of what a data scientist does

Hey we did this! Think of all the guest lectures; they taught you a lot of what it’s like to be a data scientist, which was goal 1. Here’s what I wanted you guys to learn before the class started based on what a data scientist does, and you’ve learned a lot of that, which was goal 2:

Mission accomplished! Mission accomplished?

Thought experiment that I gave to myself last Spring

How would you design a data science class?

• It’s not a well-defined body of knowledge, subject, no textbook!
• It’s popularized and celebrated in the press and media, but there’s no “authority” to push back
• I’m intellectually disturbed by idea of teaching a course when the body of knowledge is ill-defined
• I didn’t know who would show up, and what their backgrounds and motivations would be
• Could it become redundant with a machine learning class?

My process

I asked questions of myself and from other people. I gathered information, and endured existential angst about data science not being a “real thing.” I needed to give it structure.

Then I started to think about it this way: while I recognize that data science has the potential to be a deep research area, it’s not there yet, and in order to actually design a class, let’s take a pragmatic approach: Recognize that data science exists. After all, there are jobs out there. I want to help students to be qualified for them. So let me teach them what it takes to get those jobs. That’s how I decided to approach it.

In other words, from this perspective, data science is what data scientists do. So it’s back to the list of what data scientists do. I needed to find structure on top of that, so the structure I used as a starting point were the data scientist profiles.

Data scientist profiles

This was a way to think about your strengths and weaknesses, as well as a link between speakers. Note it’s easy to focus on “technical skills,” but it can also be problematic in being too skills-based, as well as being problematic because it has no scale, and no notion of expertise. On the other hand it’s good in that it allows for and captures variability among data scientists.

I assigned weekly guest speakers topics related to their strengths. We held lectures, labs, and (optional) problem sessions. From this you got mad skillz:

• programming in R
• some python
• you learned some best practices about coding

From the perspective of machine learning,

• you know a bunch of algorithms like linear regression, logistic regression, k-nearest neighbors, k-mean, naive Bayes, random forests,
• you know what they are, what they’re used for, and how to implement them
• you learned machine learning concepts like training sets, test sets, over-fitting, bias-variance tradeoff, evaluation metrics, feature selection, supervised vs. unsupervised learning
• you learned about recommendation systems
• you’ve entered a Kaggle competition

Importantly, you now know that if there is an algorithm and model that you don’t know, you can (and will) look it up and figure it out. I’m pretty sure you’ve all improved relative to how you started.

You’ve learned some data viz by taking flowing data tutorials.

You’ve learned statistical inference, because we discussed

• observational studies,
• causal inference, and
• experimental design.
• We also learned some maximum likelihood topics, but I’d urge you to take more stats classes.

In the realm of data engineering,

• we showed you map reduce and hadoop
• we worked with 30 separate shards
• we used an api to get data
• we spent time cleaning data
• we’ve processed different kinds of data

As for communication,

• you wrote thoughts in response to blog posts
• you observed how different data scientists communicate or present themselves, and have different styles
• your final project required communicating among each other

As for domain knowledge,

• lots of examples were shown to you: social networks, advertising, finance, pharma, recommender systems, dallas art museum

I heard people have been asking the following: why didn’t we see more data science coming from non-profits, governments, and universities? Note that data science, the term, was born in for-profits. But the truth is I’d also like to see more of that. It’s up to you guys to go get that done!

How do I measure the impact of this class I’ve created? Is it possible to incubate awesome data science teams in the classroom? I might have taken you from point A to point B but you might have gone there anyway without me. There’s no counterfactual!

Can we set this up as a data science problem? Can we use a causal modeling approach? This would require finding students who were more or less like you but didn’t take this class and use propensity score matching. It’s not a very well-defined experiment.

But the goal is important: in industry they say you can’t learn data science in a university, that it has to be on the job. But maybe that’s wrong, and maybe this class has proved that.

What has been the impact on you or to the outside world? I feel we have been contributing to the broader discourse.

Does it matter if there was impact? and does it matter if it can be measured or not? Let me switch gears.

What is data science again?

Data science could be defined as:

• A set of best practices used in tech companies, which is how I chose to design the course
• A space of problems that could be solved with data
• A science of data where you can think of the data itself as units

The bottom two have the potential to be the basis of a rich and deep research discipline, but in many cases, the way the term is currently used is:

• Pure hype

But it doesn’t matter how we define it, as much as that I want for you:

• to be problem solvers
• to be question askers
• to use data responsibly and make the world better, not worse.

More on being problem solvers: cultivate certain habits of mind

Here’s a possible list of things to strive for, taken from here:

Here’s the thing. Tons of people can implement k-nearest neighbors, and many do it badly. What matters is that you cultivate the above habits, remain open to continuous learning.

In education in traditional settings, we focus on answers. But what we probably should focus on is how a student behaves when they don’t know the answer. We need to have qualities that help us find the answer.

Thought experiment

How would you design a data science class around habits of mind rather than technical skills? How would you quantify it? How would you evaluate? What would students be able to write on their resumes?

Comments from the students:

• You’d need to keep making people doing stuff they don’t know how to do while keeping them excited about it.
• have people do stuff in their own domains so we keep up wonderment and awe.
• You’d use case studies across industries to see how things work in different contexts

More on being question-askers

Some suggestions on asking questions of others:

• start with assumption that you’re smart
• don’t assume the person you’re talking to knows more or less. You’re not trying to prove anything.
• be curious like a child, not worried about appearing stupid
• ask for clarification around notation or terminology
• ask for clarification around process: where did this data come from? how will it be used? why is this the right data to use? who is going to do what? how will we work together?

Some questions to ask yourself

• does it have to be this way?
• what is the problem?
• how can I measure this?
• what is the appropriate algorithm?
• how will I evaluate this?
• do I have the skills to do this?
• how can I learn to do this?
• who can I work with? Who can I ask?
• how will it impact the real world?

Data Science Processes

In addition to being problem-solvers and question-askers, I mentioned that I want you to think about process. Here are a couple processes we discussed in this course:

(1) Real World –> Generates Data –>
–> Collect Data –> Clean, Munge (90% of your time)
–> Exploratory Data Analysis –>
–> Feature Selection –>
–> Build Model, Build Algorithm, Visualize
–> Evaluate –>Iterate–>
–> Impact Real World

(2) Asking questions of yourselves and others –>
Identifying problems that need to be solved –>
Gathering information, Measuring –>
Learning to find structure in unstructured situations–>
Framing Problem –>
Creating Solutions –> Evaluating

Thought experiment

Come up with a business that improves the world and makes money and uses data

Comments from the students:

• autonomous self-driving cars you order with a smart phone
• find all the info on people and then show them how to make it private
• social network with no logs and no data retention

10 Important Data Science Ideas

Of all the blog posts I wrote this semester, here’s one I think is important:

10 Important Data Science Ideas

Confidence and Uncertainty

Let’s talk about confidence and uncertainty from a couple perspectives.

First, remember that statistical inference is extracting information from data, estimating, modeling, explaining but also quantifying uncertainty. Data Scientists could benefit from understanding this more. Learn more statistics and read Ben’s blog post on the subject.

Second, we have the Dunning-Kruger Effect.
Have you ever wondered why don’t people say “I don’t know” when they don’t know something? This is partly explained through an unconscious bias called the Dunning-Kruger effect.

Basically, people who are bad at something have no idea that they are bad at it and overestimate their confidence. People who are super good at something underestimate their mastery of it. Actual competence may weaken self-confidence.

Thought experiment

Design an app to combat the dunning-kruger effect.

What are you optimizing for? What do you value?

• money, need some minimum to live at the standard of living you want to, might even want a lot.
• time with loved ones and friends
• doing good in the world
• personal fulfillment, intellectual fulfillment
• goals you want to reach or achieve
• being famous, respected, acknowledged
• ?
• some weighted function of all of the above. what are the weights?

What constraints are you under?

• external factors (factors outside of your control)
• your resources: money, time, obligations
• who you are, your education, strengths & weaknesses
• things you can or cannot change about yourself

There are many possible solutions that optimize what you value and take into account the constraints you’re under.

So what should you do with your life?

Remember that whatever you decide to do is not permanent so don’t feel too anxious about it, you can always do something else later –people change jobs all the time

But on the other hand, life is short, so always try to be moving in the right direction (optimizing for what you care about).

If you feel your way of thinking or perspective is somehow different than what those around you are thinking, then embrace and explore that, you might be onto something.

I’m always happy to talk to you about your individual case.

Next Gen Data Scientists

The second blog post I think is important is this “manifesto” that I wrote:

Next-Gen Data Scientists. That’s you! Go out and do awesome things, use data to solve problems, have integrity and humility.

Here’s our class photo!

## How math departments hire faculty

I just got back from a stimulating trip to Stony Brook to give the math colloquium there. I had a great time thanks to my gracious host Jason Starr (this guy, not this guy), and besides giving my talk (which I will give again in San Diego at the joint meetings next month) I enjoyed two conversations about the field of math which I think could be turned into data science projects. Maybe Ph.D. theses or something.

First, a system for deciding whether a paper on the arXiv is “good.” I will post about that on another day because it’s actually pretty involved and possible important.

Second is the way people hire in math departments. This conversation will generalize to other departments, some more than others.

So first of all, I want to think about how the hiring process actually works. There are people who look at folders of applicants, say for tenure-track jobs. Since math is a pretty disjointed field, a majority of the folders will only be understood well enough for evaluation purposes by a few people in the department.

So in other words, the department naturally splits into clusters more or less along field lines: there are the number theorists and then there are the algebraic geometers and then there are the low-dimensional topologists, say.

Each group of people reads the folders from the field or fields that they have enough expertise in to understand. Then from among those they choose some they want to go to bat for. It becomes a political battle, where each group tries to convince the other groups that their candidates are more qualified. But of course it’s really hard to know who’s telling the honest truth. There are probably lots of biases in play too, so people could be overstating their cases unconsciously.

Some potential problems with this system:

1. if you are applying to a department where nobody is in your field, nobody will read your folder, and nobody will go to bat for you, even if you are really great. An exaggeration but kinda true.
2. in order to be convincing that “your guy is the best applicant,” people use things like who the advisor is or which grad school this person went to more than the underlying mathematical content.
3. if your department grows over time, this tends to mean that you get bigger clusters rather than more clusters. So if you never had a number theorist, you tend to never get one, even if you get more positions. This is a problem for grad students who want to become number theorists, but that probably isn’t enough to affect the politics of hiring.

So here’s my data science plan: test the above hypotheses. I said them because I think they are probably true, but it would be not be impossible to create the dataset to test them thoroughly and measure the effects.

The easiest and most direct one to test is the third: cluster departments by subject by linking the people with their published or arXiv’ed papers. Watch the department change over time and see how the clusters change and grow versus how it might happen randomly. Easy peasy lemon squeazy if you have lots of data. Start collecting it now!

The first two are harder but could be related to the project of ranking papers. In other words, you have to define “is really great” to do this. It won’t mean you can say with confidence that X should have gotten a job at University Y, but it would mean you could say that if X’s subject isn’t represented in University Y’s clusters, then X’s chances of getting a job there, all other things being equal, is diminished by Z% on average. Something like that.

There are of course good things about the clustering. For example, it’s not that much fun to be the only person representing a field in your department. I’m not actually passing judgment on this fact, and I’m also not suggesting a way to avoid it (if it should be avoided).

Categories: data science, math, statistics

## How do we quantitatively foster leadership?

I was really impressed with yesterday’s Tedx Women at Barnard event yesterday, organized by Nathalie Molina, who organizes the Athena Mastermind group I’m in at Barnard. I went to the morning talks to see my friend and co-author Rachel Schutt‘s presentation and then came home to spend the rest of the day with my kids, but they other three I saw were also interesting and food for thought.

Unfortunately the videos won’t be available for a month or so, and I plan to blog again when they are for content, but I wanted to  discuss an issue that came up during the Q&A session, namely:

what we choose to quantify and why that matters, especially to women.

This may sound abstract but it isn’t. Here’s what I mean. The talks were centered around the following 10 themes:

• Inspiration: Motivate, and nurture talented people and build collaborative teams
• Advocacy: Speak up for yourself and on behalf of others
• Communication: Listen actively; speak persuasively and with authority
• Vision: Develop strategies, make decisions and act with purpose
• Leverage: Optimize your networks, technology, and financing to meet strategic goals; engage mentors and sponsors
• Entrepreneurial Spirit: Be innovative, imaginative, persistent, and open to change
• Ambition: Own your power, expertise and value
• Courage: Experiment and take bold, strategic risks
• Negotiation: Bridge differences and find solutions that work effectively for all parties
• Resilience: Bounce back and learn from adversity and failure

The speakers were extraordinary and embodied their themes brilliantly. So Rachel spoke about advocating for humanity through working with data, and this amazing woman named Christa Bell spoke about inspiration, and so on. Again, the actual content is for another time, but you get the point.

A high school teacher was there with five of her female students. She spoke eloquently of how important and inspiring it was that these girls saw these talk. She explained that, at their small-town school, there’s intense pressure to do well on standardized tests and other quantifiable measures of success, but that there’s essentially no time in their normal day to focus on developing the above attributes.

Ironic, considering that you don’t get to be a “success” without ambition and courage, communication and vision, or really any of the themes.

In other words, we have these latent properties that we really care about and are essential to someone’s success, but we don’t know how to measure them so we instead measure stuff that’s easy to measure, and reward people based on those scores.

By the way, I’m not saying we don’t also need to be good at content, and tasks, which are easier to measure. I’m just saying that, by focusing on content and tasks, and rewarding people good at that, we’re not developing people to be more courageous, or more resilient, or especially be better advocates of others.

And that’s where the women part comes in. Women, especially young women, are sensitive to the expectations of the culture. If they are getting scored on X, they tend to focus on getting good at X. That’s not a bad thing, because they usually get really good at X, but we have to understand the consequences of it. We have to choose our X’s well.

I’d love to see a system evolve wherein young women (and men) are trained to be resilient and are rewarded for that just as they’re trained to do well on the SAT’s and rewarded for that. How do you train people to be courageous? I’m sure it can be done. How crazy would it be to see a world where advocating for others is directly encouraged?

Let’s try to do this, and hell let’s quantify it too, since that desire, to quantify everything, is not going away. Instead of giving up because important things are hard to quantify, let’s just figure out a way to quantify them. After all, people didn’t think their musical tastes could be quantified 15 years ago but now there’s Pandora.

Update: Ok to quantify this, but the resulting data should not be sold or publicly available. I don’t want our sons’ and daughters’ “resilience scores” to be part of their online personas for everyone to see.

Categories: data science, modeling, musing

## Columbia Data Science course, week 13: MapReduce

The week in Rachel Schutt’s Data Science course at Columbia we had two speakers.

The first was David Crawshaw, a Software Engineer at Google who was trained as a mathematician, worked on Google+ in California with Rachel, and now works in NY on search.

David came to talk to us about MapReduce and how to deal with too much data.

Thought Experiment

Let’s think about information permissions and flow when it comes to medical records. David related a story wherein doctors estimated that 1 or 2 patients died per week in a certain smallish town because of the lack of information flow between the ER and the nearby mental health clinic. In other words, if the records had been easier to match, they’d have been able to save more lives. On the other hand, if it had been easy to match records, other breaches of confidence might also have occurred.

What is the appropriate amount of privacy in health? Who should have access to your medical records?

Comments from David and the students:

• We can assume we think privacy is a generally good thing.
• Example: to be an atheist is punishable by death in some places. It’s better to be private about stuff in those conditions.
• But it takes lives too, as we see from this story.
• Many egregious violations happen in law enforcement, where you have large databases of license plates etc., and people who have access abuse it. In this case it’s a human problem, not a technical problem.
• It’s also a philosophical problem: to what extent are we allowed to make decisions on behalf of other people?
• It’s also a question of incentives. I might cure cancer faster with more medical data, but I can’t withhold the cure from people who didn’t share their data with me.
• To a given person it’s a security issue. People generally don’t mind if someone has their data, they mind if the data can be used against them and/or linked to them personally.
• It’s super hard to make data truly anonymous.

MapReduce

What is big data? It’s a buzzword mostly, but it can be useful. Let’s start with this:

You’re dealing with big data when you’re working with data that doesn’t fit into your compute unit. Note that’s an evolving definition: big data has been around for a long time. The IRS had taxes before computers.

Today, big data means working with data that doesn’t fit in one computer. Even so, the size of big data changes rapidly. Computers have experienced exponential growth for the past 40 years. We have at least 10 years of exponential growth left (and I said the same thing 10 years ago).

Given this, is big data going to go away? Can we ignore it?

No, because although the capacity of a given computer is growing exponentially, those same computers also make the data. The rate of new data is also growing exponentially. So there are actually two exponential curves, and they won’t intersect any time soon.

Let’s work through an example to show how hard this gets.

Word frequency problem

Say you’re told to find the most frequent words in the following list: red, green, bird, blue, green, red, red.

The easiest approach for this problem is inspection, of course. But now consider the problem for lists containing 10,000, or 100,000, or $10^9$ words.

The simplest approach is to list the words and then count their prevalence.  Here’s an example code snippet from the language Go:

Since counting and sorting are fast, this scales to ~100 million words. The limit is now computer memory – if you think about it, you need to get all the words into memory twice.

We can modify it slightly so it doesn’t have to have all words loaded in memory. keep them on the disk and stream them in by using a channel instead of a list. A channel is something like a stream: you read in the first 100 items, then process them, then you read in the next 100 items.

Wait, there’s still a potential problem, because if every word is unique your program will still crash; it will still be too big for memory. On the other hand, this will probably work nearly all the time, since nearly all the time there will be repetition. Real programming is a messy game.

But computers nowadays are many-core machines, let’s use them all! Then the bandwidth will be the problem, so let’s compress the inputs… There are better alternatives that get complex. A heap of hashed values has a bounded size and can be well-behaved (a heap seems to be something like a poset, and I guess you can throw away super small elements to avoid holding everything in memory). This won’t always work but it will in most cases.

Now we can deal with on the order of 10 trillion words, using one computer.

Now say we have 10 computers. This will get us 100 trillion words. Each computer has 1/10th of the input. Let’s get each computer to count up its share of the words. Then have each send its counts to one “controller” machine. The controller adds them up and finds the highest to solve the problem.

We can do the above with hashed heaps too, if we first learn network programming.

Now take a hundred computers. We can process a thousand trillion words. But then the “fan-in”, where the results are sent to the controller, will break everything because of bandwidth problem. We need a tree, where every group of 10 machines sends data to one local controller, and then they all send to super controller. This will probably work.

But… can we do this with 1000 machines? No. It won’t work. Because at that scale one or more computer will fail. If we denote by $X$ the variable which exhibits whether a given computer is working, so $X=0$ means it works and $X=1$ means it’s broken, then we can assume

$P(X=0) = 1- \epsilon.$

But this means, when you have 1000 computers, that the chance that no computer is broken is $(1-\epsilon)^{1000},$ which is generally pretty small even if $\epsilon$ is small. So if $\epsilon = 0.001$ for each individual computer, then the probability that all 1000 computers work is 0.37, less than even odds. This isn’t sufficiently robust.

We address this problem by talking about fault tolerance for distributed work. This usually involves replicating the input (the default is to have three copies of everything), and making the different copies available to different machines, so if one blows another one will still have the good data. We might also embed checksums in the data, so the data itself can be audited for erros, and we will automate monitoring by a controller machine (or maybe more than one?).

In general we need to develop a system that detects errors, and restarts work automatically when it detects them. To add efficiency, when some machines finish, we should use the excess capacity to rerun work, checking for errors.

Q: Wait, I thought we were counting things?! This seems like some other awful rat’s nest we’ve gotten ourselves into.

A: It’s always like this. You cannot reason about the efficiency of fault tolerance easily, everything is complicated. And note, efficiency is just as important as correctness, since a thousand computers are worth more than your salary. It’s like this:

1. The first 10 computers are easy,
2. The first 100 computers are hard, and
3. The first 1,000 computers are impossible.

There’s really no hope. Or at least there wasn’t until about 8 years ago. At Google I use 10,000 computers regularly.

In 2004 Jeff and Sanjay published their paper on MapReduce (and here’s one on the underlying file system).

MapReduce allows us to stop thinking about fault tolerance; it is a platform that does the fault tolerance work for us. Programming 1,000 computers is now easier than programming 100. It’s a library to do fancy things.

To use MapReduce, you write two functions: a mapper function, and then a reducer function. It takes these functions and runs them on many machines which are local to your stored data. All of the fault tolerance is automatically done for you once you’ve placed the algorithm into the map/reduce framework.

The mapper takes each data point and produces an ordered pair of the form (key, value). The framework then sorts the outputs via the “shuffle”, and in particular finds all the keys that match and puts them together in a pile. Then it sends these piles to machines which process them using the reducer function. The reducer function’s outputs are of the form (key, new value), where the new value is some aggregate function of the old values.

So how do we do it for our word counting algorithm? For each word, just send it to the ordered with the key that word and the value being the integer 1. So

red —> (“red”, 1)

blue —> (“blue”, 1)

red —> (“red”, 1)

Then they go into the “shuffle” (via the “fan-in”) and we get a pile of (“red”, 1)’s, which we can rewrite as (“red”, 1, 1). This gets sent to the reducer function which just adds up all the 1′s. We end up with (“red”, 2), (“blue”, 1).

Key point: one reducer handles all the values for a fixed key.

Got more data? Increase the number of map workers and reduce workers. In other words do it on more computers. MapReduce flattens the complexity of working with many computers. It’s elegant and people use it even when they “shouldn’t” (although, at Google it’s not so crazy to assume your data could grow by a factor of 100 overnight). Like all tools, it gets overused.

Counting was one easy function, but now it’s been split up into two functions. In general, converting an algorithm into a series of MapReduce steps is often unintuitive.

For the above word count, distribution needs to be uniform. It all your words are the same, they all go to one machine during the shuffle, which causes huge problems. Google has solved this using hash buckets heaps in the mappers in one MapReduce iteration. It’s called CountSketch, and it is built to handle odd datasets.

At Google there’s a realtime monitor for MapReduce jobs, a box with “shards” which correspond to pieces of work on a machine. It indicates through a bar chart how the various machines are doing. If all the mappers are running well, you’d see a straight line across. Usually, however, everything goes wrong in the reduce step due to non-uniformity of the data – lots of values on one key.

The data preparation and writing the output, which take place behind the scenes, take a long time, so it’s good to try to do everything in one iteration. Note we’re assuming distributed file system is already there – indeed we have to use MapReduce to get data to the distributed file system – once we start using MapReduce we can’t stop.

Once you get into the optimization process, you find yourself tuning MapReduce jobs to shave off nanoseconds 10^{-9} whilst processing petabytes of data. These are order shifts worthy of physicists. This optimization is almost all done in C++. It’s highly optimized code, and we try to scrape out every ounce of power we can.

Josh Wills

Our second speaker of the night was Josh Wills. Josh used to work at Google with Rachel, and now works at Cloudera as a Senior Director of Data Science. He’s known for the following quote:

Data Science (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.

Thought experiment

How would you build a human-powered airplane? What would you do? How would you form a team?

Student: I’d run an X prize. Josh: this is exactly what they did, for $50,000 in 1950. It took 10 years for someone to win it. The story of the winner is useful because it illustrates that sometimes you are solving the wrong problem. The first few teams spent years planning and then their planes crashed within seconds. The winning team changed the question to: how do you build an airplane you can put back together in 4 hours after a crash? After quickly iterating through multiple prototypes, they solved this problem in 6 months. Josh had some observations about the job of a data scientist: • I spend all my time doing data cleaning and preparation. 90% of the work is data engineering. • On solving problems vs. finding insights: I don’t find insights, I solve problems. • Start with problems, and make sure you have something to optimize against. • Parallelize everything you do. • It’s good to be smart, but being able to learn fast is even better. • We run experiments quickly to learn quickly. Data abundance vs. data scarcity Most people think in terms of scarcity. They are trying to be conservative, so they throw stuff away. I keep everything. I’m a fan of reproducible research, so I want to be able to rerun any phase of my analysis. I keep everything. This is great for two reasons. First, when I make a mistake, I don’t have to restart everything. Second, when I get new sources of data, it’s easy to integrate in the point of the flow where it makes sense. Designing models Models always turn into crazy Rube Goldberg machines, a hodge-podge of different models. That’s not necessarily a bad thing, because if they work, they work. Even if you start with a simple model, you eventually add a hack to compensate for something. This happens over and over again, it’s the nature of designing the model. Mind the gap The thing you’re optimizing with your model isn’t the same as the thing you’re optimizing for your business. Example: friend recommendations on Facebook doesn’t optimize you accepting friends, but rather maximizing the time you spend on Facebook. Look closely: the suggestions are surprisingly highly populated by attractive people of the opposite sex. How does this apply in other contexts? In medicine, they study the effectiveness of a drug instead of the health of the patients. They typically focus on success of surgery rather than well-being of the patient. Economic interlude When I graduated in 2001, we had two options for file storage. 1) Databases: • structured schemas • intensive processing done where data is stored • somewhat reliable • expensive at scale 2) Filers: • no schemas • no data processing capability • reliable • expensive at scale Since then we’ve started generating lots more data, mostly from the web. It brings up the natural idea of a data economic indicator, return on byte. How much value can I extract from a byte of data? How much does it cost to store? If we take the ratio, we want it to be bigger than one or else we discard. Of course this isn’t the whole story. There’s also a big data economic law, which states that no individual record is particularly valuable, but having every record is incredibly valuable. So for example in any of the following categories, • web index • recommendation systems • sensor data • market basket analysis • online advertising one has an enormous advantage if they have all the existing data. A brief introduction to Hadoop Back before Google had money, they had crappy hardware. They came up with idea of copying data to multiple servers. They did this physically at the time, but then they automated it. The formal automation of this process was the genesis of GFS. There are two core components to Hadoop. First is the distributed file system (HDFS), which is based on the google file system. The data stored in large files, with block sizes of 64MB to 256MB. As above, the blocks are replicated to multiple nodes in the cluster. The master node notices if a node dies. Data engineering on hadoop Hadoop is written in java, Whereas Google stuff is in C++. Writing map reduce in the java API not pleasant. Sometimes you have to write lots and lots of map reduces. However, if you use hadoop streaming, you can write in python, R, or other high-level languages. It’s easy and convenient for parallelized jobs. Cloudera Cloudera is like Red hat for hadoop. It’s done under aegis of the Apache Software Foundation. The code is available for free, but Cloudera packages it together, gives away various distributions for free, and waits for people to pay for support and to keep it up and running. Apache hive is a data warehousing system on top of hadoop. It uses an SQL-based query language (includes some map reduce -specific extensions), and it implements common join and aggregation patterns. This is nice for people who know databases well and are familiar with stuff like this. Workflow 1. Using hive, build records that contain everything I know about an entity (say a person) (intensive mapReduce stuff) 2. Write python scripts to process the records over and over again (faster and iterative, also mapReduce) 3. Update the records when new data arrives Note phase 2 are typically map-only jobs, which makes parallelization easy. I prefer standard data formats: text is big and takes up space. Thrift, Avro, protobuf are more compact, binary formats. I also encourage you to use the code and metadata repository Github. I don’t keep large data files in git. ## How to evaluate a black box financial system I’ve been blogging about evaluation methods for modeling, for example here and here, as part of the book I’m writing with Rachel Schutt based on her Columbia Data Science class this semester. Evaluation methods are important abstractions that allow us to measure models based only on their output. Using various metrics of success, we can contrast and compare two or more entirely different models. And it means we don’t care about their underlying structure – they could be based on neural nets, logistic regression, or decision trees, but for the sake of measuring the accuracy, or the ranking, or the calibration, the evaluation method just treats them like black boxes. It recently occurred to me a that we could generalize this a bit, to systems rather than models. So if we wanted to evaluate the school system, or the political system, or the financial system, we could ignore the underlying details of how they are structured and just look at the output. To be reasonable we have to compare two systems that are both viable; it doesn’t make sense to talk about a current, flawed system relative to perfection, since of course every version of reality looks crappy compared to an ideal. The devil is in the articulated evaluation metric, of course. So for the school system, we can ask various questions: Do our students know how to read? Do they finish high school? Do they know how to formulate an argument? Have they lost interest in learning? Are they civic-minded citizens? Do they compare well to other students on standardized tests? How expensive is the system? For the financial system, we might ask things like: Does the average person feel like their money is safe? Does the system add to stability in the larger economy? Does the financial system mitigate risk to the larger economy? Does it put capital resources in the right places? Do fraudulent players inside the system get punished? Are the laws transparent and easy to follow? The answers to those questions aren’t looking good at all: for example, take note of the recent Congressional report that blames Jon Corzine for MF Global’s collapse, pins him down on illegal and fraudulent activity, and then does absolutely nothing about it. To conserve space I will only use this example but there are hundreds more like this from the last few years. Suffice it to say, what we currently have is a system where the agents committing fraud are actually glad to be caught because the resulting fines are on the one hand smaller than their profits (and paid by shareholders, not individual actors), and on the other hand are cemented as being so, and set as precedent. But again, we need to compare it to another system, we can’t just say “hey there are flaws in this system,” because every system has flaws. I’d like to compare it to a system like ours except where the laws are enforced. That may sounds totally naive, and in a way it is, but then again we once did have laws, that were enforced, and the financial system was relatively tame and stable. And although we can’t go back in a time machine to before Glass-Steagall was revoked and keep “financial innovation” from happening, we can ask our politicians to give regulators the power to simplify the system enough so that something like Glass-Steagall can once again work. Categories: data science, finance, rant ## Systematized racism in online advertising, part 1 November 25, 2012 1 comment There is no regulation of how internet ad models are built. That means that quants can use any information they want, usually historical, to decide what to expect in the future. That includes associating arrests with african-american sounding names. In a recent Reuters article, this practice was highlighted: Instantcheckmate.com, which labels itself the “Internet’s leading authority on background checks,” placed both ads. A statistical analysis of the company’s advertising has found it has disproportionately used ad copy including the word “arrested” for black-identifying names, even when a person has no arrest record. Luckily, Professor Sweeney, a Harvard University professor of government with a doctorate in computer science, is on the case: According to preliminary findings of Professor Sweeney’s research, searches of names assigned primarily to black babies, such as Tyrone, Darnell, Ebony and Latisha, generated “arrest” in the instantcheckmate.com ad copy between 75 percent and 96 percent of the time. Names assigned at birth primarily to whites, such as Geoffrey, Brett, Kristen and Anne, led to more neutral copy, with the word “arrest” appearing between zero and 9 percent of the time. Of course when I say there’s no regulation, that’s an exaggeration. There is some, and if you claim to be giving a credit report, then regulations really do exist. But as for the above, here’s what regulators have to say: “It’s disturbing,” Julie Brill, an FTC commissioner, said of Instant Checkmate’s advertising. “I don’t know if it’s illegal … It’s something that we’d need to study to see if any enforcement action is needed.” Let’s be clear: this is just the beginning. Categories: data science, news, rant ## Columbia Data Science course, week 12: Predictive modeling, data leakage, model evaluation This week’s guest lecturer in Rachel Schutt’s Columbia Data Science class was Claudia Perlich. Claudia has been the Chief Scientist at m6d for 3 years. Before that she was a data analytics group at the IBM center that developed Watson, the computer that won Jeopardy!, although she didn’t work on that project. Claudia got her Ph.D. in information systems at NYU and now teaches a class to business students in data science, although mostly she addresses how to assess data science work and how to manage data scientists. Claudia also holds a masters in Computer Science. Claudia is a famously successful data mining competition winner. She won the KDD Cup in 2003, 2007, 2008, and 2009, the ILP Challenge in 2005, the INFORMS Challenge in 2008, and the Kaggle HIV competition in 2010. She’s also been a data mining competition organizer, first for the INFORMS Challenge in 2009 and then for the Heritage Health Prize in 2011. Claudia claims to be retired from competition. Claudia’s advice to young people: pick your advisor first, then choose the topic. It’s important to have great chemistry with your advisor, and don’t underestimate the importance. Background Here’s what Claudia historically does with her time: • predictive modeling • data mining competitions • publications in conferences like KDD and journals • talks • patents • teaching • digging around data (her favorite part) Claudia likes to understand something about the world by looking directly at the data. Here’s Claudia’s skill set: • plenty of experience doing data stuff (15 years) • data intuition (for which one needs to get to the bottom of the data generating process) • dedication to the evaluation (one needs to cultivate a good sense of smell) • model intuition (we use models to diagnose data) Claudia also addressed being a woman. She says it works well in the data science field, where her intuition is useful and is used. She claims her nose is so well developed by now that she can smell it when something is wrong. This is not the same thing as being able to prove something algorithmically. Also, people typically remember her because she’s a woman, even when she don’t remember them. It has worked in her favor, she says, and she’s happy to admit this. But then again, she is where she is because she’s good. Someone in the class asked if papers submitted for journals and/or conferences are blind to gender. Claudia responded that it was, for some time, typically double-blind but now it’s more likely to be one-sided. And anyway there was a cool analysis that showed you can guess who wrote a paper with 80% accuracy just by knowing the citations. So making things blind doesn’t really help. More recently the names are included, and hopefully this doesn’t make things too biased. Claudia admits to being slightly biased towards institutions – certain institutions prepare better work. Skills and daily life of a Chief Data Scientist Claudia’s primary skills are as follows: • Data manipulation: unix (sed, awk, etc), Perl, SQL • Modeling: various methods (logistic regression, nearest neighbors, k-nearest neighbors, etc) • Setting things up She mentions that the methods don’t matter as much as how you’ve set it up, and how you’ve translated it into something where you can solve a question. More recently, she’s been told that at work she spends: • 40% of time as “contributor”: doing stuff directly with data • 40% of time as “ambassador”: writing stuff, giving talks, mostly external communication to represent m6d, and • 20% of time in “leadership” of her data group At IBM it was much more focused in the first category. Even so, she has a flexible schedule at m6d and is treated well. The goals of the audience She asked the class, why are you here? Do you want to: • become a data scientist? (good career choice!) • work with data scientist? • work for a data scientist? • manage a data scientist? Most people were trying their hands at the first, but we had a few in each category. She mentioned that it matters because the way she’d talk to people wanting to become a data scientist would be different from the way she’d talk to someone who wants to manage them. Her NYU class is more like how to manage one. So, for example, you need to be able to evaluate their work. It’s one thing to check a bubble sort algorithm or check whether a SQL server is working, but checking a model which purports to give the probability of people converting is different kettle of fish. For example, try to answer this: how much better can that model get if you spend another week on it? Let’s face it, quality control is hard for yourself as a data miner, so it’s definitely hard for other people. There’s no easy answer. There’s an old joke that comes to mind: What’s the difference between the scientist and a consultant? The scientists asks, how long does it take to get this right? whereas the consultant asks, how right can I get this in a week? Insights into data A student asks, how do you turn a data analysis into insights? Claudia: this is a constant point of contention. My attitude is: I like to understand something, but what I like to understand isn’t what you’d consider an insight. My message may be, hey you’ve replaced every “a” by a “0″, or, you need to change the way you collect your data. In terms of useful insight, Ori’s lecture from last week, when he talked about causality, is as close as you get. For example, decision trees you interpret, and people like them because they’re easy to interpret, but I’d ask, why does it look like it does? A slightly different data set would give you a different tree and you’d get a different conclusion. This is the illusion of understanding. I tend to be careful with delivering strong insights in that sense. For more in this vein, Claudia suggests we look at Monica Rogati‘s talk “Lies, damn lies, and the data scientist.” Data mining competitions Claudia drew a distinction between different types of data mining competitions. On the one hand you have the ”sterile” kind, where you’re given a clean, prepared data matrix, a standard error measure, and where the features are often anonymized. This is a pure machine learning problem. Examples of this first kind are: KDD Cup 2009 and 2011 (Netflix). In such competitions, your approach would emphasize algorithms and computation. The winner would probably have heavy machines and huge modeling ensembles. On the other hand, you have the ”real world” kind of data mining competition, where you’re handed raw data, which is often in lots of different tables and not easily joined, where you set up the model yourself and come up with task-specific evaluations. This kind of competition simulates real life more. Examples of this second kind are: KDD cup 2007, 2008, and 2010. If you’re competing in this kind of competition your approach would involve understanding the domain, analyzing the data, and building the model. The winner might be the person who best understands how to tailor the model to the actual question. Claudia prefers the second kind, because it’s closer to what you do in real life. In particular, the same things go right or go wrong. How to be a good modeler Claudia claims that data and domain understanding is the single most important skill you need as a data scientist. At the same time, this can’t really be taught – it can only be cultivated. A few lessons learned about data mining competitions that Claudia thinks are overlooked in academics: • Leakage: the contestants best friend and the organizers/practitioners worst nightmare. There’s always something wrong with the data, and Claudia has made an artform of figuring out how the people preparing the competition got lazy or sloppy with the data. • Adapting learning to real-life performance measures beyond standard measures like MSE, error rate, or AUC (profit?) • Feature construction/transformation: real data is rarely flat (i.e. given to you in a beautiful matrix) and good, practical solutions for this problem remains a challenge. Leakage Leakage refers to something that helps you predict something that isn’t fair. It’s a huge problem in modeling, and not just for competitions. Oftentimes it’s an artifact of reversing cause and effect. Example 1: There was a competition where you needed to predict S&P in terms of whether it would go up or go down. The winning entry had a AUC (area under the ROC curve) of 0.999 out of 1. Since stock markets are pretty close to random, either someone’s very rich or there’s something wrong. There’s something wrong. In the good old days you could win competitions this way, by finding the leakage. Example 2: Amazon case study: big spenders. The target of this competition was to predict customers who spend a lot of money among customers using past purchases. The data consisted of transaction data in different categories. But a winning model identified that “Free Shipping = True” was an excellent predictor What happened here? The point is that free shipping is an effect of big spending. But it’s not a good way to model big spending, because in particular it doesn’t work for new customers or for the future. Note: timestamps are weak here. The data that included “Free Shipping = True” was simultaneous with the sale, which is a no-no. We need to only use data from beforehand to predict the future. Example 3: Again an online retailer, this time the target is predicting customers who buy jewelry. The data consists of transactions for different categories. A very successful model simply noted that if sum(revenue) = 0, then it predicts jewelry customers very well? What happened here? The people preparing this data removed jewelry purchases, but only included people who bought something in the first place. So people who had sum(revenue) = 0 were people who only bought jewelry. The fact that you only got into the dataset if you bought something is weird: in particular, you wouldn’t be able to use this on customers before they finished their purchase. So the model wasn’t being trained on the right data to make the model useful. This is a sampling problem, and it’s common. Example 4: This happened at IBM. The target was to predict companies who would be willing to buy “websphere” solutions. The data was transaction data + crawled potential company websites. The winning model showed that if the term ”websphere” appeared on the company’s website, then they were great candidates for the product. What happened? You can’t crawl the historical web, just today’s web. Thought experiment You’re trying to study who has breast cancer. The patient ID, which seemed innocent, actually has predictive power. What happened? In the above image, red means cancerous, green means not. it’s plotted by patient ID. We see three or four distinct buckets of patient identifiers. It’s very predictive depending on the bucket. This is probably a consequence of using multiple databases, some of which correspond to sicker patients are more likely to be sick. A student suggests: for the purposes of the contest they should have renumbered the patients and randomized. Claudia: would that solve the problem? There could be other things in common as well. A student remarks: The important issue could be to see the extent to which we can figure out which dataset a given patient came from based on things besides their ID. Claudia: Think about this: what do we want these models for in the first place? How well can you predict cancer? Given a new patient, what would you do? If the new patient is in a fifth bin in terms of patient ID, then obviously don’t use the identifier model. But if it’s still in this scheme, then maybe that really is the best approach. This discussion brings us back to the fundamental problem that we need to know what the purpose of the model is and how is it going to be used in order to decide how to do it and whether it’s working. Pneumonia During an INFORMS competition on pneumonia predictions in hospital records, where the goal was to predict whether a patient has pneumonia, a logistic regression which included the number of diagnosis codes as a numeric feature (AUC of 0.80) didn’t do as well as the one which included it as a categorical feature (0.90). What’s going on? This had to do with how the person prepared the data for the competition: The diagnosis code for pneumonia was 486. So the preparer removed that (and replaced it by a “-1″) if it showed up in the record (rows are different patients, columns are different diagnoses, there are max 4 diagnoses, “-1″ means there’s nothing for that entry). Moreover, to avoid telling holes in the data, the preparer moved the other diagnoses to the left if necessary, so that only “-1″‘s were on the right. There are two problems with this: 1. If the column has only “-1″‘s, then you know it started out with only pneumonia, and 2. If the column has no “-1″‘s, you know there’s no pneumonia (unless there are actually 5 diagnoses, but that’s less common). This was enough information to win the competition. Note: winning competition on leakage is easier than building good models. But even if you don’t explicitly understand and game the leakage, your model will do it for you. Either way, leakage is a huge problem. How to avoid leakage Claudia’s advice to avoid this kind of problem: • You need a strict temporal cutoff: remove all information just prior to the event of interest (patient admission). • There has to be a timestamp on every entry and you need to keep • Removing columns asks for trouble • Removing rows can introduce inconsistencies with other tables, also causing trouble • The best practice is to start from scratch with clean, raw data after careful consideration • You need to know how the data was created! I only work with data I pulled and prepared myself (or maybe Ori). Evaluations How do I know that my model is any good? With powerful algorithms searching for patterns of models, there is a serious danger of over fitting. It’s a difficult concept, but the general idea is that “if you look hard enough you’ll find something” even if it does not generalize beyond the particular training data. To avoid overfitting, we cross-validate and we cut down on the complexity of the model to begin with. Here’s a standard picture (although keep in mind we generally work in high dimensional space and don’t have a pretty picture to look at): The picture on the left is underfit, in the middle is good, and on the right is overfit. The model you use matters when it concerns overfitting: So for the above example, unpruned decision trees are the most over fitting ones. This is a well-known problem with unpruned decision trees, which is why people use pruned decision trees. Accuracy: meh Claudia dismisses accuracy as a bad evaluation method. What’s wrong with accuracy? It’s inappropriate for regression obviously, but even for classification, if the vast majority is of binary outcomes are 1, then a stupid model can be accurate but not good (guess it’s always “1″), and a better model might have lower accuracy. Probabilities matter, not 0′s and 1′s. Nobody makes decisions on binary outcomes. I want to know the probability I have breast cancer, I don’t want to be told yes or no. It’s much more information. I care about probabilities. How to evaluate a probability model We separately evaluate the ranking and the calibration. To evaluate the ranking, we use the ROC curve and calculate the area under it, typically ranges from 0.5-1.0. This is independent of scaling and calibration. Here’s an example of how to draw an ROC curve: Sometimes to measure rankings, people draw the so-called lift curve: The key here is that the lift is calculated with respect to a baseline. You draw it at a given point, say 10%, by imagining that 10% of people are shown ads, and seeing how many people click versus if you randomly showed 10% of people ads. A lift of 3 means it’s 3 times better. How do you measure calibration? Are the probabilities accurate? If the model says probability of 0.57 that I have cancer, how do I know if it’s really 0.57? We can’t measure this directly. We can only bucket those predictions and then aggregately compare those in that prediction bucket (say 0.50-0.55) to the actual results for that bucket. For example, here’s what you get when your model is an unpruned decision tree, where the blue diamonds are buckets: A good model would show buckets right along the x=y curve, but here we’re seeing that the predictions were much more extreme than the actual probabilities. Why does this pattern happen for decision trees? Claudia says that this is because trees optimize purity: it seeks out pockets that have only positives or negatives. Therefore its predictions are more extreme than reality. This is generally true about decision trees: they do not generally perform well with respect to calibration. Logistic regression looks better when you test calibration, which is typical: Takeaways: • Accuracy is almost never the right evaluation metric. • Probabilities, not binary outcomes. • Separate ranking from calibration. • Ranking you can measure with nice pictures: ROC, lift • Calibration is measured indirectly through binning. • Different models are better than others when it comes to calibration. • Calibration is sensitive to outliers. • Measure what you want to be good at. • Have a good baseline. Choosing an algorithm This is not a trivial question and in particular small tests may steer you wrong, because as you increase the sample size the best algorithm might vary: often decision trees perform very well but only if there’s enough data. In general you need to choose your algorithm depending on the size and nature of your dataset and you need to choose your evaluation method based partly on your data and partly on what you wish to be good at. Sum of squared error is maximum likelihood loss function if your data can be assumed to be normal, but if you want to estimate the median, then use absolute errors. If you want to estimate a quantile, then minimize the weighted absolute error. We worked on predicting the number of ratings of a movie will get in the next year, and we assumed a poisson distributions. In this case our evaluation method doesn’t involve minimizing the sum of squared errors, but rather something else which we found in the literature specific to the Poisson distribution, which depends on the single parameter $\lambda$: Charity direct mail campaign Let’s put some of this together. Say we want to raise money for a charity. If we send a letter to every person in the mailing list we raise about$9000. We’d like to save money and only send money to people who are likely to give – only about 5% of people generally give. How can we do that?

If we use a (somewhat pruned, as is standard) decision tree, we get $0 profit: it never finds a leaf with majority positives. If we use a neural network we still make only$7500, even if we only send a letter in the case where we expect the return to be higher than the cost.

This looks unworkable. But if you model is better, it’s not. A person makes two decisions here. First, they decide whether or not to give, then they decide how much to give. Let’s model those two decisions separately, using:

$E(\|person) = P(response = 'yes'| person) \cdot E(\|response = 'yes', person).$

Note we need the first model to be well-calibrated because we really care about the number, not just the ranking. So we will try logistic regression for first half. For the second part, we train with special examples where there are donations.

Altogether this decomposed model makes a profit of $15,000. The decomposition made it easier for the model to pick up the signals. Note that with infinite data, all would have been good, and we wouldn’t have needed to decompose. But you work with what you got. Moreover, you are multiplying errors above, which could be a problem if you have a reason to believe that those errors are correlated. Parting thoughts We are not meant to understand data. Data are outside of our sensory systems and there are very few people who have a near-sensory connection to numbers. We are instead meant to understand language. We are not mean to understand uncertainty: we have all kinds of biases that prevent this from happening and are well-documented. Modeling people in the future is intrinsically harder than figuring out how to label things that have already happened. Even so we do our best, and this is through careful data generation, careful consideration of what our problem is, making sure we model it with data close to how it will be used, making sure we are optimizing to what we actually desire, and doing our homework in learning which algorithms fit which tasks. ## Whither the fake clicks? November 19, 2012 1 comment My friend Ori explained to me last week about where all those fake clicks are coming from and going to and why. He blogged about it on m6d’s website. As usual, it’s all about incentives. They’re gaming the online advertising model for profit. To understand the scam, the first thing to know is that advertisers bid for placement on websites, and they bid higher if they think high quality people will see their ad and if they think a given website is well-connected in the larger web. Say you want that advertising money. You set up a website for the express purpose of selling ads to the people who come to your website. First, if your website gets little or no traffic, nobody is willing to bid up that advertising. No problem, just invent robots that act like people clicking. Next, this still wouldn’t work if your website seems unconnected to the larger web. So what these guy have done is to create hundreds if not thousands of websites, which just constantly shovel fake people around their own network from place to place. That creates the impression, using certain calculations of centrality, that these are very well connected websites. Finally, you might ask how the bad guys convince advertisers that these robots are “high quality” clicks. Here they rely on the fact that advertisers use different definition of quality. Whereas you might only want to count people as high quality if they actually buy your product, it’s often hard to know if that has happened (especially if it’s a store-bought item) so proxies are used instead. Often it’s as simple as whether the online user visits the website of the product, which of course can be done by robots instead. So there it is, an entire phantom web set up just to game the advertisers’ bid system and collect ad money that should by all rights not exist. Two comments: • First, I’m not sure this is illegal. Which is not to say it’s unavoidable, because it’s not so hard to track if you’re fighting against it. The beginning of a very large war. • Second, even if there weren’t this free-for-the-gaming advertiser money out there, there’d still be another group of people incentivized to create fake clicks. Namely, the advertisers give out bonuses to their campaign-level people based on click-through rates. So these campaign managers have a natural incentive to artificially inflate the click-through rates (which would work against their companies’ best interest to be sure). I’m not saying that those people are the architects of the fake clicks, just that they have incentive to be. In particular, they have no incentive to fix this problem. ## O’Reilly book deal signed for “Doing Data Science” I’m very happy to say I just signed a book contract with my co-author, Rachel Schutt, to publish a book with O’Reilly called Doing Data Science. The book will be based on the class Rachel is giving this semester at Columbia which I’ve been blogging about here. For those of you who’ve been reading along for free as I’ve been blogging it, there might not be a huge incentive to buy it, but I can promise you more and better math, more explicit usable formulas, some sample code, and an overall better and more thought-out narrative. It’s supposed to be published in May with a possible early release coming up at the end of February, in time for the O’Reilly Strata Santa Clara conference, where Rachel will be speaking about it and about other stuff curriculum related. Hopefully people will pick it up in time to teach their data science courses in Fall 2013. Speaking of Rachel, she’s also been selected to give a TedXWomen talk at Barnard on December 1st, which is super exciting. She’s talking about advocating for the social good using data. Unfortunately the event is invitation-only, otherwise I’d encourage you all to go and hear her words of wisdom. Update: word on the street is that it will be video-taped. ## Columbia Data Science course, week 11: Estimating causal effects The week in Rachel Schutt’s Data Science course at Columbia we had Ori Stitelman, a data scientist at Media6Degrees. We also learned last night of a new Columbia course: STAT 4249 Applied Data Science, taught by Rachel Schutt and Ian Langmore. More information can be found here. Ori’s background Ori got his Ph.D. in Biostatistics from UC Berkeley after working at a litigation consulting firm. He credits that job with allowing him to understand data through exposure to tons of different data sets; since his job involved creating stories out of data to let experts testify at trials, e.g. for asbestos. In this way Ori developed his data intuition. Ori worries that people ignore this necessary data intuition when they shove data into various algorithms. He thinks that when their method converges, they are convinced the results are therefore meaningful, but he’s here today to explain that we should be more thoughtful than that. It’s very important when estimating causal parameters, Ori says, to understand the data-generating distributions and that involves gaining subject matter knowledge that allows you to understand if you necessary assumptions are plausible. Ori says the first step in a data analysis should always be to take a step back and figure out what you want to know, write that down, and then find and use the tools you’ve learned to answer those directly. Later of course you have to decide how close you came to answering your original questions. Thought Experiment Ori asks, how do you know if your data may be used to answer your question of interest? Sometimes people think that because they have data on a subject matter then you can answer any question. Students had some ideas: • You need coverage of your parameter space. For example, if you’re studying the relationship between household income and holidays but your data is from poor households, then you can’t extrapolate to rich people. (Ori: but you could ask a different question) • Causal inference with no timestamps won’t work. • You have to keep in mind what happened when the data was collected and how that process affected the data itself • Make sure you have the base case: compared to what? If you want to know how politicians are affected by lobbyists money you need to see how they behave in the presence of money and in the presence of no money. People often forget the latter. • Sometimes you’re trying to measure weekly effects but you only have monthly data. You end up using proxies. Ori: but it’s still good practice to ask the precise question that you want, then come back and see if you’ve answered it at the end. Sometimes you can even do a separate evaluation to see if something is a good proxy. • Signal to noise ratio is something to worry about too: as you have more data, you can more precisely estimate a parameter. You’d think 10 observations about purchase behavior is not enough, but as you get more and more examples you can answer more difficult questions. Ori explains confounders with a dating example Frank has an important decision to make. He’s perusing a dating website and comes upon a very desirable woman – he wants her number. What should he write in his email to her? Should he tell her she is beautiful? How do you answer that with data? You could have him select a bunch of beautiful women and half the time chosen at random, tell them they’re beautiful. Being random allows us to assume that the two groups have similar distributions of various features (not that’s an assumption). Our real goal is to understand the future under two alternative realities, the treated and the untreated. When we randomize we are making the assumption that the treated and untreated populations are alike. OK Cupid looked at this and concluded: But note: • It could say more about the person who says “beautiful” than the word itself. Maybe they are otherwise ridiculous and overly sappy? • The recipients of emails containing the word “beautiful” might be special: for example, they might get tons of email, which would make it less likely for Frank to get any response at all. • For that matter, people may be describing themselves as beautiful. Ori points out that this fact, that she’s beautiful, affects two separate things: 1. whether Frank uses the word “beautiful” or not in his email, and 2. the outcome (i.e. whether Frank gets the phone number). For this reason, the fact that she’s beautiful qualifies as a confounder. The treatment is Frank writing “beautiful” in his email. Causal graphs Denote by $W$ the list of all potential confounders. Note it’s an assumption that we’ve got all of them (and recall how unreasonable this seems to be in epidemiology research). Denote by $A$ the treatment (so, Frank using the word “beautiful” in the email). We usually assume this to have a binary (0/1) outcome. Denote by $Y$ the binary (0/1) outcome (Frank getting the number). We are forming the following causal graph: In a causal graph, each arrow means that the ancestor is a cause of the descendent, where ancestor is the node the arrow is coming out of and the descendent is the node the arrow is going into (see this book for more). In our example with Frank, the arrow from beauty means that the woman being beautiful is a cause of Frank writing “beautiful” in the message. Both the man writing “beautiful” and and the woman being beautiful are direct causes of her probability to respond to the message. Setting the problem up formally The building blocks in understanding the above causal graph are: 1. Ask question of interest. 2. Make causal assumptions (denote these by $P$). 3. Translate question into a formal quantity (denote this by $\Psi(P)$). 4. Estimate quantity (denote this by $\Psi(P_n)$). We need domain knowledge in general to do this. We also have to take a look at the data before setting this up, for example to make sure we may make the Positivity Assumption. We need treatment (i.e. data) in all strata of things we adjust for. So if think gender is a confounder, we need to make sure we have data on women and on men. If we also adjust for age, we need data in all of the resulting bins. Asking causal questions What is the effect of ___ on ___? This is the natural form of a causal question. Here are some examples: 1. What is the effect of advertising on customer behavior? 2. What is the effect of beauty on getting a phone number? 3. What is the effect of censoring on outcome? (censoring is when people drop out of a study) 4. What is the effect of drug on time until viral failure?, and the general case 5. What is the effect of treatment on outcome? Look, estimating causal parameters is hard. In fact the effectiveness of advertising is almost always ignored because it’s so hard to measure. Typically people choose metrics of success that are easy to estimate but don’t measure what they want! Everyone makes decision based on them anyway because it’s easier. This results in people being rewarded for finding people online who would have converted anyway. Accounting for the effect of interventions Thinking about that, we should be concerned with the effect of interventions. What’s a model that can help us understand that effect? A common approach is the (randomized) A/B test, which involves the assumption that two populations are equivalent. As long as that assumption is pretty good, which it usually is with enough data, then this is kind of the gold standard. But A/B tests are not always possible (or they are too expensive to be plausible). Often we need to instead estimate the effects in the natural environment, but then the problem is the guys in different groups are actually quite different from each other. So, for example, you might find you showed ads to more people who are hot for the product anyway; it wouldn’t make sense to test the ad that way without adjustment. The game is then defined: how do we adjust for this? The ideal case Similar to how we did this last week, we pretend for now that we have a “full” data set, which is to say we have god-like powers and we know what happened under treatment as well as what would have happened if we had not treated, as well as vice-versa, for every agent in the test. Denote this full data set by $X:$ $X = (W, A, Y^*(1), Y^*(0)),$ where • $W$ denotes the baseline variables (attributes of the agent) as above, • $A$ denotes the binary treatment as above, • $Y^*(1)$ denotes the binary outcome if treated, and • $Y^*(0)$ denotes the binary outcome if untreated. As a baseline check: if we observed this full data structure how would we measure the effect of A on Y? In that case we’d be all-powerful and we would just calculate: $E(Y^*(1)) - E(Y^*(0)).$ Note that, since $Y^*(0)$ and $Y^*(1)$ are binary, the expected value $E(Y^*(0))$ is just the probability of a positive outcome if untreated. So in the case of advertising, the above is the conversion rate change when you show someone an ad. You could also take the ratio of the two quantities: $E(Y^*(1))/E(Y^*(0)).$ This would be calculating how much more likely someone is to convert if they see an ad. Note these are outcomes you can really do stuff with. If you know people convert at 30% versus 10% in the presence of an ad, that’s real information. Similarly if they convert 3 times more often. In reality people use silly stuff like log odds ratios, which nobody understands or can interpret meaningfully. The ideal case with functions In reality we don’t have god-like powers, and we have to make do. We will make a bunch of assumptions. First off, denote by $U$ exogenous variables, i.e. stuff we’re ignoring. Assume there are functions $f_1, f_2,$ and $f_3$ so that: • $W = f_1(U_W),$ i.e. the attributes $W$ are just functions of some exogenous variables, • $A = f_2(W, U_A),$ i.e. the treatment depends in a nice way on some exogenous variables as well the attributes we know about living in $W$, and • $Y = f_3(A, W, U_Y),$ i.e. the outcome is just a function of the treatment, the attributes, and some exogenous variables. Note the various $U$‘s could contain confounders in the above notation. That’s gonna change. But we want to intervene on this causal graph as though it’s the intervention we actually want to make. i.e. what’s the effect of treatment $A$ on outcome $Y$? Let’s look at this from the point of view of the joint distribution $P(W, A, Y) = P(W)P(A|W)P(Y|A,W).$ These terms correspond to the following in our example: 1. the probability of a woman being beautiful, 2. the probability that Frank writes and email to a her saying that she’s beautiful, and 3. the probability that Frank gets her phone number. What we really care about though is the distribution under intervention: $P_a = P(W) P(Y_a| W),$ i.e. the probability knowing someone either got treated or not. To answer our question, we manipulate the value of $A,$ first setting it to 1 and doing the calculation, then setting it to 0 and redoing the calculation. Assumptions We are making a “Consistency Assumption / SUTVA” which can be expressed like this: We have also assumed that we have no unmeasured confounders, which can be expressed thus: We are also assuming positivity, which we discussed above. Down to brass tacks We only have half the information we need. We need to somehow map the stuff we have to the full data set as defined above. We make use of the following identity: Recall we want to estimate $\Psi(P) = E(Y^*(1))/E(Y^*(0)),$ which by the above can be rewritten $E_W(E(Y|A=1, W))/ E_W(E(Y|A=0, W)).$ We’re going to discuss three methods to estimate this quantity, namely: 1. MLE-based substitution estimator (MLE), 2. Inverse probability estimators (IPTW), 3. Double robust estimating equations (A-IPTW) For the above models, it’s useful to think of there being two machines, called $g$ and $Q$, which generate estimates of the probability of the treatment knowing the attributes (that’s machine $g$) and the probability of the outcome knowing the treatment and the attributes (machine $Q$). IPTW In this method, which is also called importance sampling, we weight individuals that are unlikely to be shown an ad more than those likely. In other words, we up-sample in order to generate the distribution, to get the estimation of the actual effect. To make sense of this, imagine that you’re doing a survey of people to see how they’ll vote, but you happen to do it at a soccer game where you know there are more young people than elderly people. You might want to up-sample the elderly population to make your estimate. This method can be unstable if there are really small sub-populations that you’re up-sampling, since you’re essentially multiplying by a reciprocal. The formula in IPTW looks like this: Note the formula depends on the $g$ machine, i.e. the machine that estimates the treatment probability based on attributes. The problem is that people get the $g$ machine wrong all the time, which makes this method fail. In words, when $a=1$ we are taking the sum of terms whose numerators are zero unless we have a treated, positive outcome, and we’re weighting them in the denominator by the probability of getting treated so each “population” has the same representation. We do the same for $a=0$ and take the difference. MLE This method is based on the $Q$ machine, which as you recall estimates the probability of a positive outcome given the attributes and the treatment, so the$latex P(Y|A,W)$values. This method is straight-forward: shove everyone in the machine and predict how the outcome would look under both treatment and non-treatment conditions, and take difference. Note we don’t know anything about the underlying machine$latex Q$. It could be a logistic regression. Get ready to get worried: A-IPTW What if our machines are broken? That’s when we bring in the big guns: double robust estimators. They adjust for confounding through the two machines we have on hand, $Q$ and $g,$ and one machine augments the other depending on how well it works. Here’s the functional form written in two ways to illustrate the hedge: and Note: you are still screwed if both machines are broken. In some sense with a double robust estimator you’re hedging your bet. “I’m glad you’re worried because I’m worried too.” – Ori Simulate and test I’ve shown you 3 distinct methods that estimate effects in observational studies. But they often come up with different answers. We set up huge simulation studies with known functions, i.e. where we know the functional relationships between everything, and then tried to infer those using the above three methods as well as a fourth method called TMLE (targeted maximal likelihood estimation). As a side note, Ori encourages everyone to simulate data. We wanted to know, which methods fail with respect to the assumptions? How well do the estimates work? We started to see that IPTW performs very badly when you’re adjusting by very small thing. For example we found that the probability of someone getting sick is 132. That’s not between 0 and 1, which is not good. But people use these methods all the time. Moreover, as things get more complicated with lots of nodes in our causal graph, calculating stuff over long periods of time, populations get sparser and sparser and it has an increasingly bad effect when you’re using IPTW. In certain situations your data is just not going to give you a sufficiently good answer. Causal analysis in online display advertising An overview of the process: 1. We observe people taking actions (clicks, visits to websites, purchases, etc.). 2. We use this observed data to build list of “prospects” (people with a liking for the brand). 3. We subsequently observe same user during over the next few days. 4. The user visits a site where a display ad spot exists and bid requests are made. 5. An auction is held for display spot. 6. If the auction is won, we display the ad. 7. We observe the user’s actions after displaying the ad. But here’s the problem: we’ve instituted confounders – if you find people who convert highly they think you’ve done a good job. In other words, we are looking at the treated without looking at the untreated. We’d like to ask the question, what’s the effect of display advertising on customer conversion? As a practical concern, people don’t like to spend money on blank ads. So A/B tests are a hard sell. We performed some what-if analysis stipulated on the assumption that the group of users that sees ad is different. Our process was as follows: 1. Select prospects that we got a bid request for on day 0 2. Observe if they were treated on day 1. For those treated set $A=1$ and those not treated set $A=0.$ collect attributes $W.$ 3. Create outcome window to be the next five days following treatment; observe if outcome event occurs (visit to the website whose ad was shown). 4. Estimate model parameters using the methods previously described (our three methods plus TMLE). Here are some results: Note results vary depending on the method. And there’s no way to know which method is working the best. Moreover, this is when we’ve capped the size of the correction in the IPTW methods. If we don’t then we see ridiculous results: ## Data science in the natural sciences This is a guest post written by Chris Wiggins, crossposted from strata.oreilly.com. I find myself having conversations recently with people from increasingly diverse fields, both at Columbia and in local startups, about how their work is becoming “data-informed” or “data-driven,” and about the challenges posed by applied computational statistics or big data. A view from health and biology in the 1990s In discussions with, as examples, New York City journalists, physicists, or even former students now working in advertising or social media analytics, I’ve been struck by how many of the technical challenges and lessons learned are reminiscent of those faced in the health and biology communities over the last 15 years, when these fields experienced their own data-driven revolutions and wrestled with many of the problems now faced by people in other fields of research or industry. It was around then, as I was working on my PhD thesis, that sequencing technologies became sufficient to reveal the entire genomes of simple organisms and, not long thereafter, the first draft of the human genome. This advance in sequencing technologies made possible the “high throughput” quantification of, for example, • the dynamic activity of all the genes in an organism; or • the set of all protein-protein interactions in an organism; or even • statistical comparative genomics revealing how small differences in genotype correlate with disease or other phenotypes. These advances required formation of multidisciplinary collaborations, multi-departmental initiatives, advances in technologies for dealing with massive datasets, and advances in statistical and mathematical methods for making sense of copious natural data. The fourth paradigm This shift wasn’t just a series of technological advances in biological research; the more important change was a realization that research in which data vastly outstrip our ability to posit models is qualitatively different. Much of science for the last three centuries advanced by deriving simple models from first principles — models whose predictions could then be compared with novel experiments. In modeling complex systems for which the underlying model is not yet known but for which data are abundant, however, as in systems biology or social network analysis, one may turn this process on its head by using the data to learn not only parameters of a single model but to select which among many or an infinite number of competing models is favored by the data. Just over a half-decade ago, the computer scientist Jim Gray described this as a “fourth paradigm” of science, after experimental, theoretical, and computational paradigms. Gray predicted that every sector of human endeavor will soon emulate biology’s example of identifying data-driven research and modeling as a distinct field. In the years since then we’ve seen just that. Examples include data-driven social sciences (often leveraging the massive data now available through social networks) and even data-driven astronomy (cf., Astronomy.net). I’ve personally enjoyed seeing many students from Columbia’s School of Engineering and Applied Science (SEAS), trained in applications of big data to biology, go on to develop and apply data-driven models in these fields. As one example, a recent SEAS PhD student spent a summer as a “hackNY Fellow” applying machine learning methods at the data-driven dating NYC startup OKCupid. [Disclosure: I'm co-founder and co-president of hackNY.] He’s now applying similar methods to population genetics as a postdoctoral researcher at the University of Chicago. These students, often with job titles like “data scientist,” are able to translate to other fields, or even to the “real world” of industry and technology-driven startups, methods needed in biology and health for making sense of abundant natural data. Data science: Combining engineering and natural sciences In my research group, our work balances “engineering” goals, e.g., developing models that can make accurate quantitative predictions, with “natural science” goals, meaning building models that are interpretable to our biology and clinical collaborators, and which suggest to them what novel experiments are most likely to reveal the workings of natural systems. For example: • We’ve developed machine-learning methods for modeling the expression of genes — the “on-off” state of the tens of thousands of individual processes your cells execute — by combining sequence data with microarray expression data. These models reveal which genes control which other genes, via what important sequence elements. • We’ve analyzed large biological protein networks and shown how statistical signatures reveal what evolutionary laws can give rise to such graphs. • In collaboration with faculty at Columbia’s chemistry department and NYU’s medical school, we’ve developed hierarchical Bayesian inference methods that can automate the analysis of thousands of time series data from single molecules. These techniques can identify the best model from models of varying complexity, along with the kinetic and biophysical parameters of interest to the chemist and clinician. • Our current projects include, in collaboration with experts at Columbia’s medical school in pathogenic viral genomics, using machine learning methods to reveal whether a novel viral sequence may be carcinogenic or may lead to a pandemic. This research requires an abundant corpus of training data as well as close collaboration with the domain experts to ensure that the models exploit — and are interpretable in light of — the decades of bench work that has revealed what we now know of viral pathogenic mechanisms. Throughout, our goals balance building models that are not only predictive but interpretable, e.g., revealing which sequence elements convey carcinogenicity or permit pandemic transmissibility. Data science in health More generally, we can apply big data approaches not only to biological examples as above but also to health data and health records. These approaches offer the possibility of, for example, revealing unknown lethal drug-drug interactions or forecasting future patient health problems; such models could have consequences for both public health policies and individual patent care. As one example, the Heritage Health Prize is a$3 million challenge ending in April 2013 “to identify patients who will be admitted to a hospital within the next year, using historical claims data.” Researchers at Columbia, both in SEAS and at Columbia’s medical school, are building the technologies needed for answering such big questions from big data.

The need for skilled data scientists

In 2011, the McKinsey Global Institute estimated that between 140,000 and 190,000 additional data scientistswill need to be trained by 2018 in order to meet the increased demand in academia and industry in the United States alone. The multidisciplinary skills required for data science applied to such fields as health and biology will include:

• the computational skills needed to work with large datasets usually shared online;
• the ability to format these data in a way amenable to mathematical modeling;
• the curiosity to explore these data to identify what features our models may be built on;
• the technical skills which apply, extend, and validate statistical and machine learning methods; and most importantly,
• the ability to visualize, interpret, and communicate the resulting insights in a way which advances science. (As the mathematician Richard Hamming said, “The purpose of computing is insight, not numbers.”)

More than a decade ago the statistician William Cleveland, then at Bell Labs, coined the term “data science” for this multidisciplinary set of skills and envisioned a future in which these skills would be needed for more fields of technology. The term has had a recent explosion in usage as more and more fields — both in academia and in industry — are realizing precisely this future.

Categories: data science, guest post

## Medical research needs an independent modeling panel

I am outraged this morning.

I spent yesterday morning writing up David Madigan’s lecture to us in the Columbia Data Science class, and I can hardly handle what he explained to us: the entire field of epidemiological research is ad hoc.

This means that people are taking medication or undergoing treatments that may do they harm and probably cost too much because the researchers’ methods are careless and random.

Of course, sometimes this is intentional manipulation (see my previous post on Vioxx, also from an eye-opening lecture by Madigan). But for the most part it’s not. More likely it’s mostly caused by the human weakness for believing in something because it’s standard practice.

In some sense we knew this already. How many times have we read something about what to do for our health, and then a few years later read the opposite? That’s a bad sign.

And although the ethics are the main thing here, the money is a huge issue. It required $25 million dollars for Madigan and his colleagues to implement the study on how good our current methods are at detecting things we already know. Turns out they are not good at this – even the best methods, which we have no reason to believe are being used, are only okay. Okay,$25 million dollars is a lot, but then again there are literally billions of dollars being put into the medical trials and research as a whole, so you might think that the “due diligence” of such a large industry would naturally get funded regularly with such sums.

But you’d be wrong. Because there’s no due diligence for this industry, not in a real sense. There’s the FDA, but they are simply not up to the task.

One article I linked to yesterday from the Stanford Alumni Magazine, which talked about the work of John Ioannidis (I blogged about his work here called “Why Most Published Research Findings Are False“), summed the situation up perfectly (emphasis mine):

When it comes to the public’s exposure to biomedical research findings, another frustration for Ioannidis is that “there is nobody whose job it is to frame this correctly.” Journalists pursue stories about cures and progress—or scandals—but they aren’t likely to diligently explain the fine points of clinical trial bias and why a first splashy result may not hold up. Ioannidis believes that mistakes and tough going are at the essence of science. ”In science we always start with the possibility that we can be wrong. If we don’t start there, we are just dogmatizing.”

It’s all about conflict of interest, people. The researchers don’t want their methods examined, the pharmaceutical companies are happy to have various ways to prove a new drug “effective”, and the FDA is clueless.

Another reason for an AMS panel to investigate public math models. If this isn’t in the public’s interest I don’t know what is.

## Columbia Data Science course, week 10: Observational studies, confounders, epidemiology

This week our guest lecturer in the Columbia Data Science class was David Madigan,  Professor and Chair of Statistics at Columbia. He received a bachelors degree in Mathematical Sciences and a Ph.D. in Statistics, both from Trinity College Dublin. He has previously worked for AT&T Inc., Soliloquy Inc., the University of Washington, Rutgers University, and SkillSoft, Inc. He has over 100 publications in such areas as Bayesian statistics, text mining, Monte Carlo methods, pharmacovigilance and probabilistic graphical models.

So Madigan is an esteemed guest, but I like to call him an “apocalyptic leprechaun”, for reasons which you will know by the end of this post. He’s okay with that nickname, I asked his permission.

Madigan came to talk to us about observation studies, of central importance in data science. He started us out with this:

Thought Experiment

We now have detailed, longitudinal medical data on tens of millions of patients. What can we do with it?

To be more precise, we have tons of phenomenological data: this is individual, patient-level medical record data. The largest of the databases has records on 80 million people: every prescription drug, every condition ever diagnosed, every hospital or doctor’s visit, every lab result, procedures, all timestamped.

But we still do things like we did in the Middle Ages; the vast majority of diagnosis and treatment is done in a doctor’s brain. Can we do better? Can you harness these data to do a better job delivering medical care?

Students responded:

1) There was a prize offered on Kaggle, called “Improve Healthcare, Win $3,000,000.” predicting who is going to go to the hospital next year. Doesn’t that give us some idea of what we can do? Madigan: keep in mind that they’ve coarsened the data for proprietary reasons. Hugely important clinical problem, especially as a healthcare insurer. Can you intervene to avoid hospitalizations? 2) We’ve talked a lot about the ethical uses of data science in this class. It seems to me that there are a lot of sticky ethical issues surrounding this 80 million person medical record dataset. Madigan: Agreed! What nefarious things could we do with this data? We could gouge sick people with huge premiums, or we could drop sick people from insurance altogether. It’s a question of what, as a society, we want to do. What is modern academic statistics? Madigan showed us Drew Conway’s Venn Diagram that we’d seen in week 1: Madigan positioned the modern world of the statistician in the green and purple areas. It used to be the case, say 20 years ago, according to Madigan, that academic statistician would either sit in their offices proving theorems with no data in sight (they wouldn’t even know how to run a t-test) or sit around in their offices and dream up a new test, or a new way of dealing with missing data, or something like that, and then they’d look around for a dataset to whack with their new method. In either case, the work of an academic statistician required no domain expertise. Nowadays things are different. The top stats journals are more deep in terms of application areas, the papers involve deep collaborations with people in social sciences or other applied sciences. Madigan is setting an example tonight by engaging with the medical community. Madigan went on to make a point about the modern machine learning community, which he is or was part of: it’s a newish academic field, with conferences and journals, etc., but is characterized by what stats was 20 years ago: invent a method, try it on datasets. In terms of domain expertise engagement, it’s a step backwards instead of forwards. Comments like the above make me love Madigan. Very few academic statisticians have serious hacking skills, with Mark Hansen being an unusual counterexample. But if all three is what’s required to be called data science, then I’m all for data science, says Madigan. Madigan’s timeline Madigan went to college in 1980, specialized on day 1 on math for five years. In final year, he took a bunch of stats courses, and learned a bunch about computers: pascal, OS, compilers, AI, database theory, and rudimentary computing skills. Then came 6 years in industry, working at an insurance company and a software company where he specialized in expert systems. It was a mainframe environment, and he wrote code to price insurance policies using what would now be described as scripting languages. He also learned about graphics by creating a graphic representation of a water treatment system. He learned about controlling graphics cards on PC’s, but he still didn’t know about data. Then he got a Ph.D. and went into academia. That’s when machine learning and data mining started, which he fell in love with: he was Program Chair of the KDD conference, among other things, before he got disenchanted. He learned C and java, R and S+. But he still wasn’t really working with data yet. He claims he was still a typical academic statistician: he had computing skills but no idea how to work with a large scale medical database, 50 different tables of data scattered across different databases with different formats. In 2000 he worked for AT&T labs. It was an “extreme academic environment”, and he learned perl and did lots of stuff like web scraping. He also learned awk and basic unix skills. It was life altering and it changed everything: having tools to deal with real data rocks! It could just as well have been python. The point is that if you don’t have the tools you’re handicapped. Armed with these tools he is afraid of nothing in terms of tackling a data problem. In Madigan’s opinion, statisticians should not be allowed out of school unless they know these tools. He then went to a internet startup where he and his team built a system to deliver real-time graphics on consumer activity. Since then he’s been working in big medical data stuff. He’s testified in trials related to medical trials, which was eye-opening for him in terms of explaining what you’ve done: “If you’re gonna explain logistical regression to a jury, it’s a different kind of a challenge than me standing here tonight.” He claims that super simple graphics help. Carrotsearch As an aside he suggests we go to this website, called carrotsearch, because there’s a cool demo on it. What is an observational study? Madigan defines it for us: An observational study is an empirical study in which the objective is to elucidate cause-and-effect relationships in which it is not feasible to use controlled experimentation. In tonight’s context, it will involve patients as they undergo routine medical care. We contrast this with designed experiment, which is pretty rare. In fact, Madigan contends that most data science activity revolves around observational data. Exceptions are A/B tests. Most of the time, the data you have is what you get. You don’t get to replay a day on the market where Romney won the presidency, for example. Observational studies are done in contexts in which you can’t do experiments, and they are mostly intended to elucidate cause-and-effect. Sometimes you don’t care about cause-and-effect, you just want to build predictive models. Madigan claims there are many core issues in common with the two. Here are some examples of tests you can’t run as designed studies, for ethical reasons: • smoking and heart disease (you can’t randomly assign someone to smoke) • vitamin C and cancer survival • DES and vaginal cancer • aspirin and mortality • cocaine and birthweight • diet and mortality Pitfall #1: confounders There are all kinds of pitfalls with observational studies. For example, look at this graph, where you’re finding a best fit line to describe whether taking higher doses of the “bad drug” is correlated to higher probability of a heart attack: It looks like, from this vantage point, the more drug you take the fewer heart attacks you have. But there are two clusters, and if you know more about those two clusters, you find the opposite conclusion: Note this picture was rigged it so the issue is obvious. This is an example of a “confounder.” In other words, the aspirin-taking or non-aspirin-taking of the people in the study wasn’t randomly distributed among the people, and it made a huge difference. It’s a general problem with regression models on observational data. You have no idea what’s going on. Madigan: “It’s the wild west out there.” Wait, and it gets worse. It could be the case that within each group there males and females and if you partition by those you see that the more drugs they take the better again. Since a given person either is male or female, and either takes aspirin or doesn’t, this kind of thing really matters. This illustrates the fundamental problem in observational studies, which is sometimes called Simpson’s Paradox. [Remark from someone in the class: if you think of the original line as a predictive model, it's actually still the best model you can obtain knowing nothing more about the aspirin-taking habits or genders of the patients involved. The issue here is really that you're trying to assign causality.] The medical literature and observational studies As we may not be surprised to hear, medical journals are full of observational studies. The results of these studies have a profound effect on medical practice, on what doctors prescribe, and on what regulators do. For example, in this paper, entitled “Oral bisphosphonates and risk of cancer of oesophagus, stomach, and colorectum: case-control analysis within a UK primary care cohort,” Madigan report that we see the very same kind of confounding problem as in the above example with aspirin. The conclusion of the paper is that the risk of cancer increased with 10 or more prescriptions of oral bisphosphonates. It was published on the front page of new york times, the study was done by a group with no apparent conflict of interest and the drugs are taken by millions of people. But the results were wrong. There are thousands of examples of this, it’s a major problem and people don’t even get that it’s a problem. Randomized clinical trials One possible way to avoid this problem is randomized studies. The good news is that randomization works really well: because you’re flipping coins, all other factors that might be confounders (current or former smoker, say) are more or less removed, because I can guarantee that smokers will be fairly evenly distributed between the two groups if there are enough people in the study. The truly brilliant thing about randomization is that randomization matches well on the possible confounders you thought of, but will also give you balance on the 50 million things you didn’t think of. So, although you can algorithmically find a better split for the ones you thought of, that quite possible wouldn’t do as well on the other things. That’s why we really do it randomly, because it does quite well on things you think of and things you don’t. But there’s bad news for randomized clinical trials as well. First off, it’s only ethically feasible if there’s something called clinical equipoise, which means the medical community really doesn’t know which treatment is better. If you know have reason to think treating someone with a drug will be better for them than giving them nothing, you can’t randomly not give people the drug. The other problem is that they are expensive and cumbersome. It takes a long time and lots of people to make a randomized clinical trial work. In spite of the problems, randomized clinical trials are the gold standard for elucidating cause-and-effect relationships. Rubin causal model The Rubin causal model is a mathematical framework for understanding what information we know and don’t know in observational studies. It’s meant to investigate the confusion when someone says something like “I got lung cancer because I smoked”. Is that true? If so, you’d have to be able to support the statement, “If I hadn’t smoked I wouldn’t have gotten lung cancer,” but nobody knows that for sure. Define: • $Z_i$ to be the treatment applied to unit $i$ (0 = control, 1= treatment), • $Y_i(1)$ to be the response for unit $i$ if $Z_i = 1$, • $Y_i(0)$ to be the response for unit $i$ if $Z_i = 0$. Then the unit level causal effect is $Y_i(1)-Y_i(0)$, but we only see one of $Y_i(0)$ and $Y_i(1).$ Example: $Z_i$ is 1 if I smoked, 0 if I didn’t (I am the unit). $Y_i(1)$ is 1 or 0 if I got cancer and I smoked, and $Y_i(0)$ is 1 or 0 depending on whether I got cancer while not smoking. The overall causal effect on me is the difference $Y_i(1)-Y_i(0).$ This is equal to 1 if I got really got cancer because I smoked, it’s 0 if I got cancer (or didn’t) independent of smoking, and it’s -1 if I avoided cancer by smoking. But I’ll never know my actual value since I only know one term out of the two. Of course, on a population level we do know how to infer that there are quite a few “1″‘s among the population, but we will never be able to assign a given individual that number. This is sometimes called the fundamental problem of causal inference. Confounding and Causality Let’s say we have a population of 100 people that takes some drug, and we screen them for cancer. Say 30 out of them get cancer, which gives them a cancer rate of 0.30. We want to ask the question, did the drug cause the cancer? To answer that, we’d have to know what would’ve happened if they hadn’t taken the drug. Let’s play God and stipulate that, had they not taken the drug, we would have seen 20 get cancer, so a rate of 0.20. We typically say the causal effect is the ration of these two numbers (i.e. the increased risk of cancer), so 1.5. But we don’t have God’s knowledge, so instead we choose another population to compare this one to, and we see whether they get cancer or not, whilst not taking the drug. Say they have a natural cancer rate of 0.10. Then we would conclude, using them as a proxy, that the increased cancer rate is the ratio 0.30 to 0.10, so 3. This is of course wrong, but the problem is that the two populations have some underlying differences that we don’t account for. If these were the “same people”, down to the chemical makeup of each other molecules, this “by proxy” calculation would work of course. The field of epidemiology attempts to adjust for potential confounders. The bad news is that it doesn’t work very well. One reason is that they heavily rely on stratification, which means partitioning the cases into subcases and looking at those. But there’s a problem here too. Stratification can introduce confounding. The following picture illustrates how stratification could make the underlying estimates of the causal effects go from good to bad: In the top box, the values of b and c are equal, so our causal effect estimate is correct. However, when you break it down by male and female, you get worse estimates of causal effects. The point is, stratification doesn’t just solve problems. There are no guarantees your estimates will be better if you stratify and all bets are off. What do people do about confounding things in practice? In spite of the above, experts in this field essentially use stratification as a major method to working through studies. They deal with confounding variables by essentially stratifying with respect to them. So if taking aspirin is believed to be a potential confounding factor, they stratify with respect to it. For example, with this study, which studied the risk of venous thromboembolism from the use of certain kinds of oral contraceptives, the researchers chose certain confounders to worry about and concluded the following: After adjustment for length of use, users of oral contraceptives were at least twice the risk of clotting compared with user of other kinds of oral contraceptives. This report was featured on ABC, and it was a big hoo-ha. Madigan asks: wouldn’t you worry about confounding issues like aspirin or something? How do you choose which confounders to worry about? Wouldn’t you worry that the physicians who are prescribing them are different in how they prescribe? For example, might they give the newer one to people at higher risk of clotting? Another study came out about this same question and came to a different conclusion, using different confounders. They adjusted for a history of clots, which makes sense when you think about it. This is an illustration of how you sometimes forget to adjust for things, and the outputs can then be misleading. What’s really going on here though is that it’s totally ad hoc, hit or miss methodology. Another example is a study on oral bisphosphonates, where they adjusted for smoking, alcohol, and BMI. But why did they choose those variables? There are hundreds of examples where two teams made radically different choices on parallel studies. We tested this by giving a bunch of epidemiologists the job to design 5 studies at a high level. There was zero consistency. And an addition problem is that luminaries of the field hear this and say: yeah yeah yeah but I would know the right way to do it. Is there a better way? Madigan and his co-authors examined 50 studies, each of which corresponds to a drug and outcome pair, e.g. antibiotics with GI bleeding. They ran about 5,000 analyses for every pair. Namely, they ran every epistudy imaginable on, and they did this all on 9 different databases. For example, they looked at ACE inhibitors (the drug) and swelling of the heart (outcome). They ran the same analysis on the 9 different standard databases, the smallest of which has records of 4,000,000 patients, and the largest of which has records of 80,000,000 patients. In this one case, for one database the drug triples the risk of heart swelling, but for another database it seems to have a 6-fold increase of risk. That’s one of the best examples, though, because at least it’s always bad news – it’s consistent. On the other hand, for 20 of the 50 pairs, you can go from statistically significant in one direction (bad or good) to the other direction depending on the database you pick. In other words, you can get whatever you want. Here’s a picture, where the heart swelling example is at the top: Note: the choice of database is never discussed in any of these published epidemiology papers. Next they did an even more extensive test, where they essentially tried everything. In other words, every time there was a decision to be made, they did it both ways. The kinds of decisions they tweaker were of the following types: which database you tested on, the confounders you accounted for, the window of time you care about examining (spoze they have a heart attack a week after taking the drug, is it counted? 6 months?) What they saw was that almost all the studies can get either side depending on the choices. Final example, back to oral bisphosphonates. A certain study concluded that it causes esophageal cancer, but two weeks later JAMA published a paper on same issue which concluded it is not associated to elevated risk of esophageal cancer. And they were even using the same database. This is not so surprising now for us. OMOP Research Experiment Here’s the thing. Billions upon billions of dollars are spent doing these studies. We should really know if they work. People’s lives depend on it. Madigan told us about his “OMOP 2010.2011 Research Experiment” They took 10 large medical databases, consisting of a mixture of claims from insurance companies and EHR (electronic health records), covering records of 200 million people in all. This is big data unless you talk to an astronomer. They mapped the data to a common data model and then they implemented every method used in observational studies in healthcare. Altogether they covered 14 commonly used epidemiology designs adapted for longitudinal data. They automated everything in sight. Moreover, there were about 5000 different “settings” on the 14 methods. The idea was to see how well the current methods do on predicting things we actually already know. To locate things they know, they took 10 old drug classes: ACE inhibitors, beta blockers, warfarin, etc., and 10 outcomes of interest: renal failure, hospitalization, bleeding, etc. For some of these the results are known. So for example, warfarin is a blood thinner and definitely causes bleeding. There were 9 such known bad effects. There were also 44 known “negative” cases, where we are super confident there’s just no harm in taking these drugs, at least for these outcomes. The basic experiment was this: run 5000 commonly used epidemiological analyses using all 10 databases. How well do they do at discriminating between reds and blues? This is kind of like a spam filter test. We have training emails that are known spam, and you want to know how well the model does at detecting spam when it comes through. Each of the models output the same thing: a relative risk (causal effect estimate) and an error. This was an attempt to empirically evaluate how well does epidemiology work, kind of the quantitative version of John Ioannidis’s work. we did the quantitative thing to show he’s right. Why hasn’t this been done before? There’s conflict of interest for epidemiology – why would they want to prove their methods don’t work? Also, it’s expensive, it cost$25 million dollars (of course that pales in comparison to the money being put into these studies). They bought all the data, made the methods work automatically, and did a bunch of calculations in the Amazon cloud. The code is open source.

In the second version, we zeroed in on 4 particular outcomes. Here’s the \$25,000,000 ROC curve:

To understand this graph, we need to define a threshold, which we can start with at 2. This means that if the relative risk is estimated to be above 2, we call it a “bad effect”, otherwise call it a “good effect.” The choice of threshold will of course matter.

If it’s high, say 10, then you’ll never see a 10, so everything will be considered a good effect. Moreover these are old drugs and it wouldn’t be on the market. This means your sensitivity will be low, and you won’t find any real problem. That’s bad! You should find, for example, that warfarin causes bleeding.

There’s of course good news too, with low sensitivity, namely a zero false-positive rate.

What if you set the threshold really low, at -10? Then everything’s bad, and you have a 100% sensitivity but very high false positive rate.

As you vary the threshold from very low to very high, you sweep out a curve in terms of sensitivity and false-positive rate, and that’s the curve we see above. There is a threshold (say 1.8) for which your false positive rate is 30% and your sensitivity is 50%.

This graph is seriously problematic if you’re the FDA. A 30% false-positive rate is out of control. This curve isn’t good.

The overall “goodness” of such a curve is usually measured as the area under the curve: you want it to be one, and if your curve lies on diagonal the area is 0.5. This is tantamount to guessing randomly. So if your area under the curve is less than 0.5, it means your model is perverse.

The area above is 0.64. Moreover, of the 5000 analysis we ran, this is the single best analysis.

But note: this is the best if I can only use the same method for everything. In that case this is as good as it gets, and it’s not that much better than guessing.

But no epidemiology would do that!

So what they did next was to specialize the analysis to the database and the outcome. And they got better results: for the medicare database, and for acute kidney injury, their optimal model gives them an AUC of 0.92. They can achieve 80% sensitivity with a 10% false positive rate.

They did this using a cross-validation method. Different databases have different methods attached to them. One winning method is called “OS”, which compares within a given patient’s history (so compares times when patient was on drugs versus when they weren’t). This is not widely used now.

The epidemiologists in general don’t believe the results of this study.

If you go to http://elmo/omop.org, you can see the AUM for a given database and a given method.

Note the data we used was up to mid-2010. To update this you’d have to get latest version of database, and rerun the analysis. Things might have changed.

Moreover, an outcome for which nobody has any idea on what drugs cause what outcomes you’re in trouble. This only applies to when we have things to train on where we know the outcome pretty well.

Parting remarks

Keep in mind confidence intervals only account for sampling variability. They don’t capture bias at all. If there’s bias, the confidence interval or p-value can be meaningless.

What about models that epidemiologists don’t use? We have developed new methods as well (SCCS). we continue to do that, but it’s a hard problem.

Challenge for the students: we ran 5000 different analyses. Is there a good way of combining them to do better? weighted average? voting methods across different strategies?

Note the stuff is publicly available and might make a great Ph.D. thesis.

## The zit model

When my mom turned 42, I was 12 and a total wise-ass. For her present I bought her a coffee mug that had on it the phrase “Things could be worse. You could be old and still have zits”, to tease her about her bad skin. Considering how obnoxious that was, she took it really well and drank out of the mug for years.

Well, I’m sure you can all see where this is going. I’m now 40 and I have zits. I was contemplating this in the bath yesterday, wondering if I’d ever get rid of my zits and wondering if taking long hot baths helps or not. They come and go, so it seems vaguely controllable.

Then I had a thought: well, I could collect data and see what helps. After all, I don’t always have zits. I could keep a diary of all the things that I think might affect the situation: what I eat (I read somewhere that eating cheese makes you have zits), how often I take baths vs. showers, whether I use zit cream, my hormones, etc. and certainly whether or not I have zits on a given day or not.

The first step would be to do some research on the theories people have about what causes zits, and then set up a spreadsheet where I could efficiently add my daily data. Maybe a google form! I’m wild about google forms.

After collecting this data for some time I could build a model which tries to predict zittage, to see which of those many inputs actually have signal for my personal zit model.

Of course I expect a lag between the thing I do or eat or use and the actual resulting zit, and I don’t know what that lag is (do you get zits the day after you eat cheese? or three days after eating cheese?), so I’ll expect some difficulty with this or even over fitting.

Even so, this just might work!

Then I immediately felt tired because, if you think about spending your day collecting information like that about your potential zits, then you must be totally nuts.

I mean, I can imagine doing it just for fun, or to prove a point, or on a dare (there are few things I won’t do on a dare), but when it comes down to it I really don’t care that much about my zits.

Then I started thinking about technology and how it could help me with my zit model. I mean, you know about those bracelets you can wear that count your steps and then automatically record them on your phone, right? Well, how long until those bracelets can be trained to collect any kind of information you can imagine?

• Baths? No problem. I’m sure they can detect moisture and heat.
• Cheese eating? Maybe you’d have to say out loud what you’re eating, but again not a huge problem.
• Hormones? I have no idea but let’s stipulate plausible: they already have an ankle bracelet that monitors blood alcohol levels.
• Whether you have zits? Hmmm. Let’s say you could add any variable you want with voice command.

In other words, in 5 years this project will be a snap when I have my handy dandy techno bracelet which collects all the information I want. And maybe whatever other information as well, because information storage is cheap. I’ll have a bounty of data for my zit model.

This is exciting stuff. I’m looking forward to building the definitive model, from which I can conclude that eating my favorite kind of cheese does indeed give me zits. And I’ll say to myself, worth it!