Archive for November, 2012

How to build a model that will be gamed

I can’t help but think that the new Medicare readmissions penalty, as described by the New York Times, is going to lead to wide-spread gaming. It has all the elements of a perfect gaming storm. First of all, a clear economic incentive:

Medicare last month began levying financial penalties against 2,217 hospitals it says have had too many readmissions. Of those hospitals, 307 will receive the maximum punishment, a 1 percent reduction in Medicare’s regular payments for every patient over the next year, federal records show.

It also has the element of unfairness:

“Many of us have been working on this for other reasons than a penalty for many years, and we’ve found it’s very hard to move,” Dr. Lynch said. He said the penalties were unfair to hospitals with the double burden of caring for very sick and very poor patients.

“For us, it’s not a readmissions penalty,” he said. “It’s a mission penalty.”

And the smell of politics:

In some ways, the debate parallels the one on education — specifically, whether educators should be held accountable for lower rates of progress among children from poor families.

“Just blaming the patients or saying ‘it’s destiny’ or ‘we can’t do any better’ is a premature conclusion and is likely to be wrong,” said Dr. Harlan Krumholz, director of the Center for Outcomes Research and Evaluation at Yale-New Haven Hospital, which prepared the study for Medicare. “I’ve got to believe we can do much, much better.”

Oh wait, we already have weird side effects of the new rule:

With pressure to avert readmissions rising, some hospitals have been suspected of sending patients home within 24 hours, so they can bill for the services but not have the stay counted as an admission. But most hospitals are scrambling to reduce the number of repeat patients, with mixed success.

Note, the new policy is already a kind of reaction to gaming that’s already there, namely because of the stupid way Medicare decides how much to pay for treatment (emphasis mine):

Hospitals’ traditional reluctance to tackle readmissions is rooted in Medicare’s payment system. Medicare generally pays hospitals a set fee for a patient’s stay, so the shorter the visit, the more revenue a hospital can keep. Hospitals also get paid when patients return. Until the new penalties kicked in, hospitals had no incentive to make sure patients didn’t wind up coming back.

How about, instead of adding a weird rule that compromises people’s health and especially punishes poor sick people and the hospitals that treat them, we instead improve the original billing system? Otherwise we are certain to see all sorts of weird effects in the coming years with people being stealth readmitted under different names or something, or having to travel to different hospitals to be seen for their congestive heart failure.

Categories: modeling, news

Columbia Data Science course, week 13: MapReduce

The week in Rachel Schutt’s Data Science course at Columbia we had two speakers.

The first was David Crawshaw, a Software Engineer at Google who was trained as a mathematician, worked on Google+ in California with Rachel, and now works in NY on search.

David came to talk to us about MapReduce and how to deal with too much data.

Thought Experiment

Let’s think about information permissions and flow when it comes to medical records. David related a story wherein doctors estimated that 1 or 2 patients died per week in a certain smallish town because of the lack of information flow between the ER and the nearby mental health clinic. In other words, if the records had been easier to match, they’d have been able to save more lives. On the other hand, if it had been easy to match records, other breaches of confidence might also have occurred.

What is the appropriate amount of privacy in health? Who should have access to your medical records?

Comments from David and the students:

  • We can assume we think privacy is a generally good thing.
  • Example: to be an atheist is punishable by death in some places. It’s better to be private about stuff in those conditions.
  • But it takes lives too, as we see from this story.
  • Many egregious violations happen in law enforcement, where you have large databases of license plates etc., and people who have access abuse it. In this case it’s a human problem, not a technical problem.
  • It’s also a philosophical problem: to what extent are we allowed to make decisions on behalf of other people?
  • It’s also a question of incentives. I might cure cancer faster with more medical data, but I can’t withhold the cure from people who didn’t share their data with me.
  • To a given person it’s a security issue. People generally don’t mind if someone has their data, they mind if the data can be used against them and/or linked to them personally.
  • It’s super hard to make data truly anonymous.


What is big data? It’s a buzzword mostly, but it can be useful. Let’s start with this:

You’re dealing with big data when you’re working with data that doesn’t fit into your compute unit. Note that’s an evolving definition: big data has been around for a long time. The IRS had taxes before computers.

Today, big data means working with data that doesn’t fit in one computer. Even so, the size of big data changes rapidly. Computers have experienced exponential growth for the past 40 years. We have at least 10 years of exponential growth left (and I said the same thing 10 years ago).

Given this, is big data going to go away? Can we ignore it?

No, because although the capacity of a given computer is growing exponentially, those same computers also make the data. The rate of new data is also growing exponentially. So there are actually two exponential curves, and they won’t intersect any time soon.

Let’s work through an example to show how hard this gets.

Word frequency problem

Say you’re told to find the most frequent words in the following list: red, green, bird, blue, green, red, red.

The easiest approach for this problem is inspection, of course. But now consider the problem for lists containing 10,000, or 100,000, or 10^9 words.

The simplest approach is to list the words and then count their prevalence.  Here’s an example code snippet from the language Go:

Since counting and sorting are fast, this scales to ~100 million words. The limit is now computer memory – if you think about it, you need to get all the words into memory twice.

We can modify it slightly so it doesn’t have to have all words loaded in memory. keep them on the disk and stream them in by using a channel instead of a list. A channel is something like a stream: you read in the first 100 items, then process them, then you read in the next 100 items.

Wait, there’s still a potential problem, because if every word is unique your program will still crash; it will still be too big for memory. On the other hand, this will probably work nearly all the time, since nearly all the time there will be repetition. Real programming is a messy game.

But computers nowadays are many-core machines, let’s use them all! Then the bandwidth will be the problem, so let’s compress the inputs… There are better alternatives that get complex. A heap of hashed values has a bounded size and can be well-behaved (a heap seems to be something like a poset, and I guess you can throw away super small elements to avoid holding everything in memory). This won’t always work but it will in most cases.

Now we can deal with on the order of 10 trillion words, using one computer.

Now say we have 10 computers. This will get us 100 trillion words. Each computer has 1/10th of the input. Let’s get each computer to count up its share of the words. Then have each send its counts to one “controller” machine. The controller adds them up and finds the highest to solve the problem.

We can do the above with hashed heaps too, if we first learn network programming.

Now take a hundred computers. We can process a thousand trillion words. But then the “fan-in”, where the results are sent to the controller, will break everything because of bandwidth problem. We need a tree, where every group of 10 machines sends data to one local controller, and then they all send to super controller. This will probably work.

But… can we do this with 1000 machines? No. It won’t work. Because at that scale one or more computer will fail. If we denote by X the variable which exhibits whether a given computer is working, so X=0 means it works and X=1 means it’s broken, then we can assume

P(X=0) = 1- \epsilon.

But this means, when you have 1000 computers, that the chance that no computer is broken is (1-\epsilon)^{1000}, which is generally pretty small even if \epsilon is small. So if \epsilon = 0.001 for each individual computer, then the probability that all 1000 computers work is 0.37, less than even odds. This isn’t sufficiently robust.

We address this problem by talking about fault tolerance for distributed work. This usually involves replicating the input (the default is to have three copies of everything), and making the different copies available to different machines, so if one blows another one will still have the good data. We might also embed checksums in the data, so the data itself can be audited for erros, and we will automate monitoring by a controller machine (or maybe more than one?).

In general we need to develop a system that detects errors, and restarts work automatically when it detects them. To add efficiency, when some machines finish, we should use the excess capacity to rerun work, checking for errors.

Q: Wait, I thought we were counting things?! This seems like some other awful rat’s nest we’ve gotten ourselves into.

A: It’s always like this. You cannot reason about the efficiency of fault tolerance easily, everything is complicated. And note, efficiency is just as important as correctness, since a thousand computers are worth more than your salary. It’s like this:

  1. The first 10 computers are easy,
  2. The first 100 computers are hard, and
  3. The first 1,000 computers are impossible.

There’s really no hope. Or at least there wasn’t until about 8 years ago. At Google I use 10,000 computers regularly.

In 2004 Jeff and Sanjay published their paper on MapReduce (and here’s one on the underlying file system).

MapReduce allows us to stop thinking about fault tolerance; it is a platform that does the fault tolerance work for us. Programming 1,000 computers is now easier than programming 100. It’s a library to do fancy things.

To use MapReduce, you write two functions: a mapper function, and then a reducer function. It takes these functions and runs them on many machines which are local to your stored data. All of the fault tolerance is automatically done for you once you’ve placed the algorithm into the map/reduce framework.

The mapper takes each data point and produces an ordered pair of the form (key, value). The framework then sorts the outputs via the “shuffle”, and in particular finds all the keys that match and puts them together in a pile. Then it sends these piles to machines which process them using the reducer function. The reducer function’s outputs are of the form (key, new value), where the new value is some aggregate function of the old values.

So how do we do it for our word counting algorithm? For each word, just send it to the ordered with the key that word and the value being the integer 1. So

red —> (“red”, 1)

blue —> (“blue”, 1)

red —> (“red”, 1)

Then they go into the “shuffle” (via the “fan-in”) and we get a pile of (“red”, 1)’s, which we can rewrite as (“red”, 1, 1). This gets sent to the reducer function which just adds up all the 1’s. We end up with (“red”, 2), (“blue”, 1).

Key point: one reducer handles all the values for a fixed key.

Got more data? Increase the number of map workers and reduce workers. In other words do it on more computers. MapReduce flattens the complexity of working with many computers. It’s elegant and people use it even when they “shouldn’t” (although, at Google it’s not so crazy to assume your data could grow by a factor of 100 overnight). Like all tools, it gets overused.

Counting was one easy function, but now it’s been split up into two functions. In general, converting an algorithm into a series of MapReduce steps is often unintuitive.

For the above word count, distribution needs to be uniform. It all your words are the same, they all go to one machine during the shuffle, which causes huge problems. Google has solved this using hash buckets heaps in the mappers in one MapReduce iteration. It’s called CountSketch, and it is built to handle odd datasets.

At Google there’s a realtime monitor for MapReduce jobs, a box with “shards” which correspond to pieces of work on a machine. It indicates through a bar chart how the various machines are doing. If all the mappers are running well, you’d see a straight line across. Usually, however, everything goes wrong in the reduce step due to non-uniformity of the data – lots of values on one key.

The data preparation and writing the output, which take place behind the scenes, take a long time, so it’s good to try to do everything in one iteration. Note we’re assuming distributed file system is already there – indeed we have to use MapReduce to get data to the distributed file system – once we start using MapReduce we can’t stop.

Once you get into the optimization process, you find yourself tuning MapReduce jobs to shave off nanoseconds 10^{-9} whilst processing petabytes of data. These are order shifts worthy of physicists. This optimization is almost all done in C++. It’s highly optimized code, and we try to scrape out every ounce of power we can.

Josh Wills

Our second speaker of the night was Josh Wills. Josh used to work at Google with Rachel, and now works at Cloudera as a Senior Director of Data Science. He’s known for the following quote:

Data Science (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.

Thought experiment

How would you build a human-powered airplane? What would you do? How would you form a team?

Student: I’d run an X prize. Josh: this is exactly what they did, for $50,000 in 1950. It took 10 years for someone to win it. The story of the winner is useful because it illustrates that sometimes you are solving the wrong problem.

The first few teams spent years planning and then their planes crashed within seconds. The winning team changed the question to: how do you build an airplane you can put back together in 4 hours after a crash? After quickly iterating through multiple prototypes, they solved this problem in 6 months.

Josh had some observations about the job of a data scientist:

  • I spend all my time doing data cleaning and preparation. 90% of the work is data engineering.
  • On solving problems vs. finding insights: I don’t find insights, I solve problems.
  • Start with problems, and make sure you have something to optimize against.
  • Parallelize everything you do.
  • It’s good to be smart, but being able to learn fast is even better.
  • We run experiments quickly to learn quickly.

Data abundance vs. data scarcity

Most people think in terms of scarcity. They are trying to be conservative, so they throw stuff away.

I keep everything. I’m a fan of reproducible research, so I want to be able to rerun any phase of my analysis. I keep everything.

This is great for two reasons. First, when I make a mistake, I don’t have to restart everything. Second, when I get new sources of data, it’s easy to integrate in the point of the flow where it makes sense.

Designing models

Models always turn into crazy Rube Goldberg machines, a hodge-podge of different models. That’s not necessarily a bad thing, because if they work, they work. Even if you start with a simple model, you eventually add a hack to compensate for something. This happens over and over again, it’s the nature of designing the model.

Mind the gap

The thing you’re optimizing with your model isn’t the same as the thing you’re optimizing for your business.

Example: friend recommendations on Facebook doesn’t optimize you accepting friends, but rather maximizing the time you spend on Facebook. Look closely: the suggestions are surprisingly highly populated by attractive people of the opposite sex.

How does this apply in other contexts? In medicine, they study the effectiveness of a drug instead of the health of the patients. They typically focus on success of surgery rather than well-being of the patient.

Economic interlude

When I graduated in 2001, we had two options for file storage.

1) Databases:

  • structured schemas
  • intensive processing done where data is stored
  • somewhat reliable
  • expensive at scale

2) Filers:

  • no schemas
  • no data processing capability
  • reliable
  • expensive at scale

Since then we’ve started generating lots more data, mostly from the web. It brings up the natural idea of a data economic indicator, return on byte. How much value can I extract from a byte of data? How much does it cost to store? If we take the ratio, we want it to be bigger than one or else we discard.

Of course this isn’t the whole story. There’s also a big data economic law, which states that no individual record is particularly valuable, but having every record is incredibly valuable. So for example in any of the following categories,

  • web index
  • recommendation systems
  • sensor data
  • market basket analysis
  • online advertising

one has an enormous advantage if they have all the existing data.

A brief introduction to Hadoop

Back before Google had money, they had crappy hardware. They came up with idea of copying data to multiple servers. They did this physically at the time, but then they automated it. The formal automation of this process was the genesis of GFS.

There are two core components to Hadoop. First is the distributed file system (HDFS), which is based on the google file system. The data stored in large files, with block sizes of 64MB to 256MB. As above, the blocks are replicated to multiple nodes in the cluster. The master node notices if a node dies.

Data engineering on hadoop

Hadoop is written in java, Whereas Google stuff is in C++.

Writing map reduce in the java API not pleasant. Sometimes you have to write lots and lots of map reduces. However, if you use hadoop streaming, you can write in python, R, or other high-level languages. It’s easy and convenient for parallelized jobs.


Cloudera is like Red hat for hadoop. It’s done under aegis of the Apache Software Foundation. The code is available for free, but Cloudera packages it together, gives away various distributions for free, and waits for people to pay for support and to keep it up and running.

Apache hive is a data warehousing system on top of hadoop. It uses an SQL-based query language (includes some map reduce -specific extensions), and it implements common join and aggregation patterns. This is nice for people who know databases well and are familiar with stuff like this.


  1. Using hive, build records that contain everything I know about an entity (say a person) (intensive mapReduce stuff)
  2. Write python scripts to process the records over and over again (faster and iterative, also mapReduce)
  3. Update the records when new data arrives

Note phase 2 are typically map-only jobs, which makes parallelization easy.

I prefer standard data formats: text is big and takes up space. Thrift, Avro, protobuf are more compact, binary formats. I also encourage you to use the code and metadata repository Github. I don’t keep large data files in git.

Rolling Jubilee is a better idea than the lottery

Yesterday there was a reporter from CBS Morning News looking around for a quirky fun statistician or mathematician to talk about the Powerball lottery, which is worth more than $500 million right now. I thought about doing it and accumulated some cute facts I might want to say on air:

  • It costs $2 to play.
  • If you took away the grand prize, a ticket is worth 36 cents in expectation (there are 9 ways to win with prizes ranging from $4 to $1 million).
  • The chance of winning grand prize is one in about 175,000,000.
  • So when the prize goes over $175 million, that’s worth $1 in expectation.
  • So if the prize is twice that, at $350 million, that’s worth $2 in expectation.
  • Right now the prize is $500 million, so the tickets are worth more than $2 in expectation.
  • Even so, the chances of being hit by lightening in a given year is something like 1,000,000, so 175 times more likely than winning the lottery

In general, the expected payoff for playing the lottery is well below the price. And keep in mind that if you win, almost half goes to taxes. I am super busy trying to write, so I ended up helping find someone else for the interview: Jared Lander. I hope he has fun.

If you look a bit further into the lottery system, you’ll find some questionable information. For example, lotteries are super regressive: poor people spend more money than rich people on lotteries, and way more if you think of it as a percentage of their income.

One thing that didn’t occur to me yesterday but would have been nice to try, and came to me via my friend Aaron, is to suggest that instead of “investing” their $2 in a lottery, people might consider investing it in the Rolling Jubilee. Here are some reasons:

  • The payoff is larger than the investment by construction. You never pay more than $1 for $1 of debt.
  • It’s similar to the lottery in that people are anonymously chosen and their debts are removed.
  • The taxes on the benefits are nonexistent, at least as we understand the taxcode, because it’s a gift.

It would be interesting to see how the mindset would change if people were spending money to anonymously remove debt from each other rather than to win a jackpot. Not as flashy, perhaps, but maybe more stimulative to the economy. Note: an estimated $50 billion was spent on lotteries in 2010. That’s a lot of debt.

Categories: Uncategorized

How to evaluate a black box financial system

I’ve been blogging about evaluation methods for modeling, for example here and here, as part of the book I’m writing with Rachel Schutt based on her Columbia Data Science class this semester.

Evaluation methods are important abstractions that allow us to measure models based only on their output.

Using various metrics of success, we can contrast and compare two or more entirely different models. And it means we don’t care about their underlying structure – they could be based on neural nets, logistic regression, or decision trees, but for the sake of measuring the accuracy, or the ranking, or the calibration, the evaluation method just treats them like black boxes.

It recently occurred to me a that we could generalize this a bit, to systems rather than models. So if we wanted to evaluate the school system, or the political system, or the financial system, we could ignore the underlying details of how they are structured and just look at the output. To be reasonable we have to compare two systems that are both viable; it doesn’t make sense to talk about a current, flawed system relative to perfection, since of course every version of reality looks crappy compared to an ideal.

The devil is in the articulated evaluation metric, of course. So for the school system, we can ask various questions: Do our students know how to read? Do they finish high school? Do they know how to formulate an argument? Have they lost interest in learning? Are they civic-minded citizens? Do they compare well to other students on standardized tests? How expensive is the system?

For the financial system, we might ask things like: Does the average person feel like their money is safe? Does the system add to stability in the larger economy? Does the financial system mitigate risk to the larger economy? Does it put capital resources in the right places? Do fraudulent players inside the system get punished? Are the laws transparent and easy to follow?

The answers to those questions aren’t looking good at all: for example, take note of the recent Congressional report that blames Jon Corzine for MF Global’s collapse, pins him down on illegal and fraudulent activity, and then does absolutely nothing about it. To conserve space I will only use this example but there are hundreds more like this from the last few years.

Suffice it to say, what we currently have is a system where the agents committing fraud are actually glad to be caught because the resulting fines are on the one hand smaller than their profits (and paid by shareholders, not individual actors), and on the other hand are cemented as being so, and set as precedent.

But again, we need to compare it to another system, we can’t just say “hey there are flaws in this system,” because every system has flaws.

I’d like to compare it to a system like ours except where the laws are enforced.

That may sounds totally naive, and in a way it is, but then again we once did have laws, that were enforced, and the financial system was relatively tame and stable.

And although we can’t go back in a time machine to before Glass-Steagall was revoked and keep “financial innovation” from happening, we can ask our politicians to give regulators the power to simplify the system enough so that something like Glass-Steagall can once again work.

Categories: data science, finance, rant

On Reuters talking about Occupy

I was interviewed a couple of weeks ago and it just got posted here:

Categories: #OWS

Systematized racism in online advertising, part 1

November 25, 2012 1 comment

There is no regulation of how internet ad models are built. That means that quants can use any information they want, usually historical, to decide what to expect in the future. That includes associating arrests with african-american sounding names.

In a recent Reuters article, this practice was highlighted:, which labels itself the “Internet’s leading authority on background checks,” placed both ads. A statistical analysis of the company’s advertising has found it has disproportionately used ad copy including the word “arrested” for black-identifying names, even when a person has no arrest record.

Luckily, Professor Sweeney, a Harvard University professor of government with a doctorate in computer science, is on the case:

According to preliminary findings of Professor Sweeney’s research, searches of names assigned primarily to black babies, such as Tyrone, Darnell, Ebony and Latisha, generated “arrest” in the ad copy between 75 percent and 96 percent of the time. Names assigned at birth primarily to whites, such as Geoffrey, Brett, Kristen and Anne, led to more neutral copy, with the word “arrest” appearing between zero and 9 percent of the time.

Of course when I say there’s no regulation, that’s an exaggeration. There is some, and if you claim to be giving a credit report, then regulations really do exist. But as for the above, here’s what regulators have to say:

“It’s disturbing,” Julie Brill, an FTC commissioner, said of Instant Checkmate’s advertising. “I don’t know if it’s illegal … It’s something that we’d need to study to see if any enforcement action is needed.”

Let’s be clear: this is just the beginning.

Categories: data science, news, rant

Aunt Pythia’s advice and a request for cool math books

First, my answer to last week’s question which you guys also answered:

Aunt Pythia,

My loving, wonderful, caring boyfriend slurps his food. Not just soup — everything (even cereal!). Should I just deal with it, or say something? I think if I comment on it he’ll be offended, but I find it distracting during our meals together.

Food (Consumption) Critic


You guys did well with answering the question, and I’d like to nominate the following for “most likely to actually make the problem go away”, from Richard:

I’d go with blunt but not particularly bothered – halfway through his next bowl of cereal, exclaim “Wow, you really slurp your food, don’t you?! I never noticed that before.”

But then again, who says we want this problem to go away? My firm belief is that every relationship needs to have an unimportant thing that bugs the participants. Sometimes it’s how high the toaster is set, sometimes it’s how the other person stacks the dishes in the dishwasher, but there’s always that thing. And it’s okay: if we didn’t have the thing we’d invent it. In fact having the thing prevents all sorts of other things from becoming incredible upsetting. My theory anyway.

So my advice to Food Consumption Critic is: don’t do anything! Cherish the slurping! Enjoy something this banal and inconsequential being your worst criticism of this lovely man.

Unless you’re like Liz, also a commenter from last week, who left her husband because of the way he breathed. If it’s driving you that nuts, you might want to go with Richard’s advice.


Aunt Pythia,

Dear Aunt Pythia, I want to write to an advice column, but I don’t know whether or not to trust the advice I will receive. What do you recommend?

Perplexed in SoPo

Dear PiSP,

I hear you, and you’re right to worry. Most people only ask things they kind of know the answer to, or to get validation that they’re not a total jerk, or to get permission to do something that’s kind of naughty. If the advice columnist tells them something they disagree with, they ignore it entirely anyway. It’s a total waste of time if you think about it.

However, if your question is super entertaining and kind of sexy, then I suggest you write in ASAP. That’s the very kind of question that columnists know how to answer in deep, meaningful and surprising ways.




Aunt Pythia,

With global warming and hot summers do you think it’s too early to bring the toga back in style?

John Doe

Dear John,

It’s never to early to wear sheets. Think about it: you get to wear the very same thing you sleep in. It’s like you’re a walking bed.



Aunt Pythia,

Is it unethical not to tell my dad I’m starting a business? I doubt he’d approve and I’m inclined to wait until it’s successful to tell him about it.

Angsty New Yorker

Dear ANY,

Wait, what kind of business is this? Are we talking hedge fund or sex toy shop?

In either case, I don’t think you need to tell your parents anything about your life if you are older than 18 and don’t want to, it’s a rule of american families. Judging by my kids, this rule actually starts when they’re 11.

Of course it depends on your relationship with your father how easy that will be and what you’d miss out on by being honest, but the fear of his disapproval is, to me, a bad sign: you’re gonna have to be tough as nails to be a business owner, so get started by telling, not asking, your dad. Be prepared for him to object, and if he does, tell him he’ll get used to it with time.

Aunt Pythia


Aunt Pythia,

I’m a philosophy grad school dropout turned programmer who hasn’t done math since high school. But I want to learn, partly for professional reasons but mainly out of curiosity. I recently bought *Proofs From the Book* but found that I lacked the requisite mathematical maturity to work through much of it. Where should I start? What should I read? (p.s. Thanks for the entertaining blog!)

Confused in Brooklyn

Readers, this question is for you! I don’t know of too many good basic math books, so Confused in Brooklyn is counting on you. There have actually been lots of people asking similar questions, so you’d be helping them too. If I get enough good suggestions I’ll create a separate reading list for cool math page on mathbabe. Thanks in advance for your suggestions!


Please take a moment to ask me a question:

Categories: Aunt Pythia