Search Results

Keyword: ‘death spiral’

Another death spiral of modeling: e-scores

Yesterday my friend and fellow Occupier Suresh sent me this article from the New York Times.

It’s something I knew was already happening somewhere, but I didn’t know the perpetrators would be quite so proud of themselves as they are; on the other hand I’m also not surprised, because people making good money on mathematical models rarely take the time to consider the ramifications of those models. At least that’s been my experience.

So what have these guys created? It’s basically a modern internet version of a credit score, without all the burdensome regulation that comes with it. Namely, they collect all kinds of information about people on the web, anything they can get their hands on, which includes personal information like physical and web addresses, phone number, google searches, purchases, and clicks of each person, and from that they create a so-called “e-score” which evaluates how much you are worth to a given advertiser or credit card company or mortgage company or insurance company.

Some important issues I want to bring to your attention:

  1. Credit scores are regulated, and in particular the regulations disallow the use of racial information, whereas these e-scores are completely unregulated and can use whatever information they can gather (which is a lot). Not that credit score models are open source: they aren’t, so we don’t know if they are using variables correlated to race (like zip code). But still, there is some effort to protect people from outrageous and unfair profiling. I never thought I’d be thinking of credit scoring companies as the good guys, but it is what it is.
  2. These e-scores are only going for max pay-out, not default risk. So, for the sake of a credit card company, the ideal customer is someone who pays the minimum balance month after month, never finishing off the balance. That person would have a higher e-score than someone who pays off their balance every month, although presumably the minimum payer would have a lower credit score, since they are living more on the edge of insolvency.
  3. Not that I need to mention this, but this is the ultimate in predatory modeling: every person is scored based on their ability to make money for the advertiser/ insurance company in question, based on any kind of ferreted-out information available. It’s really time for everyone to have two accounts, one for normal use, including filling out applications for mortgages and credit cards and buying things, and the second for sensitive google searches on medical problems and such.
  4. Finally, and I’m happy to see that the New York Times article noticed this and called it out, this is the perfect setup for the death spiral of modeling that I’ve mentioned before: people considered low value will be funneled away from good deals, which will give them bad deals, which will put them into an even tighter pinch with money because they’re being nickeled and dimed and paying high interest rates, which will make them even lower value.
  5. A model like this is hugely scalable and valuable for a given advertiser.
  6. Therefore, this model can seriously contribute to our problem of increasing inequality.
  7. How can we resist this? It’s time for some rules on who owns personal information.
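To make the feedback loop in point 4 concrete, here is a toy sketch in Python. Everything in it is an assumption made up for illustration: the function names `interest_rate` and `simulate_death_spiral`, the rule that turns a score into an interest rate, and all of the numbers. It is not how any real e-score works; it only shows how a score that sets the price of credit can feed back into itself.

```python
def interest_rate(score):
    """Hypothetical pricing rule: a lower score means a worse deal.

    score is in [0, 1]; the rate runs from 5% (score 1) to 35% (score 0).
    """
    return 0.05 + 0.30 * (1.0 - score)


def simulate_death_spiral(score, debt=10_000.0, income=40_000.0, years=10):
    """Track a made-up score as interest costs eat into a fixed income."""
    history = [score]
    for _ in range(years):
        cost = debt * interest_rate(score)   # interest paid this year
        burden = cost / income               # share of income lost to interest
        # Assumed feedback: a burden above 5% of income lowers the score,
        # a burden below 5% raises it. The score is clamped to [0, 1].
        score = min(1.0, max(0.0, score - (burden - 0.05)))
        history.append(score)
    return history


if __name__ == "__main__":
    print("starts low: ", [round(s, 3) for s in simulate_death_spiral(0.3)])
    print("starts high:", [round(s, 3) for s in simulate_death_spiral(0.7)])
```

With these invented parameters, a person who starts with a low score is charged more, falls further behind, and is scored lower again each year, while a person who starts with a high score drifts upward. The two people differ only in their starting score.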

The modeling death spiral for public schools

There was recently a New York Times article about how the public schools have become super segregated by race.

I’m wondering how much of this can be explained by income rather than by race, in combination with the obsession we all have with test scores. Let me explain.

If I’m living in a neighborhood with a neighborhood school and the school seems pretty good, then depending on how picky I am I might just stay living there and let my kids go there.

Now assume that suddenly there are test scores available for all the schools in the area, and it turns out my neighborhood school doesn’t do as well as the school in a surrounding neighborhood. Then, depending on how much I think those test scores matter to my children’s futures, and how many resources I have, I will be tempted to move to that neighborhood for the “better schools” (read: better test scores).

Over time, people with good resources will move to the new neighborhood, which will become more expensive because there’s competition to get in, which in turn will make it easier for that town to raise local taxes to improve the school, and will also attract parents who really care about the quality of the schools, which will improve the school and presumably the test scores of that school, exacerbating the original difference of test scores.

And of course that’s just what’s happened in this country. My parents moved to Lexington, Massachusetts for the schools, and they paid a premium for their house for the location and the school system. So I went to a public school but one that increasingly was attended by richer and richer kids.

Income segregated public schools are the new private schools.

In New York City, where there is more to consider than just your neighborhood, because you can get your kids into schools in other neighborhoods, and there’s a whole network of gifted and talented schools as well, it’s a much more complicated dynamic, but the underlying reasons are the same, and they again have to do with segmentation modeling: we know which schools do well on tests and we avoid poorly testing schools if we can.

The availability of the test scores is huge: if I’m thinking of moving to a new city I can just look up the SAT scores of the high schools in the area and try to find a place to live which is in one of the highest-scoring towns.

This is what I call a death spiral of modeling, and it’s the same idea I described here when insurance companies have too much information about you and deny you coverage because you need insurance so badly. And it’s very difficult to get out of a death spiral, because to do so you need to reset the whole system and re-pool resources, but in this case people have already moved out of town.

Questions I am thinking about:

  • Is it dumb to care so much about test scores? On the one hand I don’t want to take chances on my kids, so I will opt for the conservative route, which is to think they should be surrounded by kids who test well, because certainly in extreme cases that kind of thing is likely to be contagious behavior. But maybe we have exaggerated ideas about how contagious these things are or how important test scores really are to our kids’ futures. How would we test that and how would we disseminate the results? And what if we found out that everybody has been acting totally rationally?
  • Which raises the other question, namely what realistic change would get this system to work better overall for the average student?
  • Note that in the above discussion I haven’t talked about the teachers at all, which is strange. But from my perspective, our system is all about concentrating kids who test well together, and it’s not all that clear that the teachers matter, although I’m sure they do actually. What am I missing? Is there a way of solving this death spiral problem through awesome teachers?
Categories: musing

Will Demographics Solve the College Tuition Problem? (A: I Don’t Know)

November 14, 2014

I’ve got two girls in middle school. They are lovely and (in my opinion as a proud dad) smart. I wonder, on occasion, what college they will go to and what their higher education experience will be like. No matter how lovely or smart my daughters are, though, it will be hard to fork over all of that tuition money.  It sure would be nice if college somehow got cheaper by the time my daughters are ready in 6 or 8 years!

How likely is this? There has been plenty of coverage about how the cost of college has risen so dramatically over the past decades. A number of smart people have argued that the reason tuition has increased so much is because of all of the amenities that schools have built in recent years. Others are unconvinced that’s the reason, pointing out that spending by universities grew at a lower rate than tuition did.  Perhaps schools have been buoyed by a rising demographic trend – but it’s clear tuition increases have had a great run.

One way colleges have been able to keep increasing tuitions is by competing aggressively for wealthy students who can pay the full price of tuition (which also enables the schools to offer more aid to less than wealthy students).  The children of the wealthy overseas are particularly desirable targets, apparently.  I heard a great quote yesterday about this by Brad DeLong – that his school, Berkeley, and other top universities presumably had become “finishing school[s] for the superrich of Asia.”  It’s an odd sort of competition, though, where schools are competing for a particular customer (wealthy students) by raising prices.  Presumably, this suggests that colleges have had pricing power to raise tuition due to increased demand (perhaps aided by an increase in student loans, but that’s an argument for another day).

Will colleges continue to have this pricing power?  For the optimistic future tuition payer, there are some signs that university pricing power may be eroding.   Tuition increased at a slower rate this year (a bit more than 3%) but still at a rate that well exceeds inflation.   And law schools are already resorting to price cutting after precipitous declines in applications – down 37% in 2014 compared to 2010!

College enrollment trends are a mixed bag and frequently obscured by studies from in-industry sources.  Clearly, the 1990s and 2000s were a time of great growth for colleges – college enrollment grew by 48% from 1990 (12 million students) to 2012 (17.7 million).  But 2010 appears to be the recent peak and enrollment fell by 2% from 2010 to 2012. In addition, overall college enrollment declined by 2.3% in 2014, although this decline is attributed to the 9.6% decline in two-year colleges while 4-year college enrollment actually increased by 1.2%.

It makes sense that the recent college enrollment trend would be down – the number of high school graduates appears to have peaked in 2010 at 3.3 million or so and is projected to decline to about 3.1 million in 2016 and stay lowish for the next few years. The US Census reports that there was a bulge of kids that are college age now (i.e. there were 22.04 million 14-19 year olds at the 2010 Census), but there are about 1.7 million fewer kids that are my daughters’ age (i.e., 5-9 year olds in the 2010 Census).  That’s a pretty steep drop off (about 8%) in this pool of potential college students.  These demographic trends have got some people worried.  Moody’s, which rates the debt of a lot of colleges, has been downgrading a lot of smaller schools and says that this type of school has already been hit by declining enrollment and revenue. One analyst went so far as to warn of a “death spiral” at some schools due to declining enrollment.  Moody’s analysis of declining revenue is an interesting factor, in light of reports of ever-increasing tuition. Last year Moody’s reported that 40% of colleges or universities (that were rated) faced stagnant or declining net tuition revenue.
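For readers who want to check the percentages, here is a small Python sketch that recomputes them from the figures quoted in the last two paragraphs. The helper name `pct_change` is just an illustrative choice, and the inputs are taken at face value from the text rather than from the underlying data sources.

```python
def pct_change(old, new):
    """Percent change going from old to new."""
    return 100.0 * (new - old) / old


# College enrollment: 12 million (1990) to 17.7 million (2012).
enrollment_growth = pct_change(12.0, 17.7)

# 2010 Census cohorts: 22.04 million 14-19 year olds, and about
# 1.7 million fewer kids in the younger cohort quoted in the text.
cohort_drop = pct_change(22.04, 22.04 - 1.7)

# High school graduates: about 3.3 million (2010) to about
# 3.1 million (2016, projected).
grads_drop = pct_change(3.3, 3.1)

if __name__ == "__main__":
    print(f"enrollment 1990-2012: {enrollment_growth:+.1f}%")
    print(f"younger cohort vs older: {cohort_drop:+.1f}%")
    print(f"high school grads 2010-2016: {grads_drop:+.1f}%")
```

The first two results come out close to the 48% and 8% cited above; the third is the implied drop of roughly 6% in high school graduates.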

Speaking strictly, again, as a future payer of my daughters’ college tuition, falling college age population and falling enrollment would seem to point to the possibility that tuition will be lower for my kids when the time comes. Plus there are a lot of other factors that seem to be lining up against the prospects for college tuition –  like continued flat or declining wages, the enormous student loan bubble (it can’t keep growing, right?), the rise of online education…

And yet, I’m not feeling that confident.  Elite universities (and it certainly would be nice if my girls could get into such a school) seem to have found a way to collect a lot of tuition from foreign students (it’s hard to find a good data source for that though) which protects them from the adverse demographic and economic trends.  I’ve wondered if US students could get turned off by the perception that top US schools have too many foreign students and are too much, as DeLong says, elite finishing schools.  But that’s hard to predict and may take many years to reach a tipping point.  Plus if tuition and enrollment drop a lot, that may cripple the schools that have taken out a lot of debt to build all of those nice amenities. A Harvard Business School professor rather bearishly projects that as many as half of the 4,000 US colleges and universities may fail in the next 15 years.  Would a sharp decrease in the number of colleges due to falling enrollment have the effect of reducing competition at the remaining schools?  If so, what impact would that have on tuition?

Both college tuition and student loans have been described as bubbles thanks to their recent rate of growth.  At some point, bubbles burst (in theory).  As someone who watched, first hand and with great discomfort, the growth of the subprime and housing bubbles before the crisis, I’ve painfully learned that bubbles can last much longer than you would rationally expect.  And despite all sorts of analysis and calculation about what should happen, the thing that triggers the bursting of the bubble is really hard to predict. As is when it will happen.  To the extent I’ve learned a lesson from mortgage land, it’s that you shouldn’t do anything stupid in anticipation of the bubble either bursting or continuing.  So, as much as I hope and even expect that the trend for increased college tuition will reverse in the coming years, I guess I’ll have to keep on trying to save for when my daughters will be heading off to college.

Categories: data science, education

The creepy mindset of online credit scoring

Usually I like to think through abstract ideas – thought experiments, if you will – and not get too personal. I take exceptions for certain macroeconomists who are already public figures but most of the time that’s it.

Here’s a new category of people I’ll call out by name: CEOs who defend creepy models using the phrase “People will trade their private information for economic value.”

That’s a quote from Douglas Merrill, CEO of Zest Finance, taken from this video of a recent data conference in Berkeley (hat tip Rachel Schutt). It was a panel discussion, the putative topic of which was something like “Attacking the structure of everything”, whatever that’s supposed to mean (I’m guessing it has something to do with being proud of “disrupting shit”).

Do you know the feeling you get when you’re with someone who’s smart, articulate, who probably buys organic eggs from a nice farmer’s market, but who doesn’t show an ounce of sympathy for people who aren’t successful entrepreneurs? When you’re with someone who has benefitted so entirely and so consistently from the system that they have an almost religious belief that the system is perfect and they’ve succeeded through merit alone?

It’s somewhere between the feeling that maybe you’re just naive because you’ve led such a blessed life, and the feeling that maybe you’re actually incapable of human empathy; I don’t know which because it’s never been tested.

That’s the creepy feeling I get when I hear Douglas Merrill speak, but it actually started earlier, when I got the following email almost exactly one year ago via LinkedIn:

Hi Catherine,

Your profile looked interesting to me.

I’m seeking stellar, creative thinkers like you, for our team in Hollywood, CA. If you would consider relocating for the right opportunity, please read on.

You will use your math wizardry to develop radically new methods for data access, manipulation, and modeling. The outcome of your work will result in game-changing software and tools that will disrupt the credit industry and better serve millions of Americans.

You would be working alongside people like Douglas Merrill – the former CIO of Google – along with a handful of other ex-Googlers and Capital One folks. More info can be found on our LinkedIn company profile or at

At ZestFinance we’re bringing social responsibility to the consumer loan industry.

Do you have a few moments to talk about this? If you are not interested, but know someone else who might be a fit, please send them my way!

I hope to hear from you soon. Thank you for your time.


Wow, let’s “better serve millions of Americans” through manipulation of their private data, and then let’s call it being socially responsible! And let’s work with Capital One which is known to be practically a charity.


Message to ZestFinance: “getting rich with predatory lending” doesn’t mean “being socially responsible” unless you have a really weird definition of that term.

Going back to the video, I have a few more tasty quotes from Merrill:

  1. First when he’s describing how he uses personal individual information scraped from the web: “All data is credit data.”
  2. Second, when he’s comparing ZestFinance to FICO credit scoring: “Context is developed by knowing thousands of things about you. I know you as a person, not just you via five or six variables.”

I’d like to remind people that, in spite of the creepiness here, and the fact that his business plan is a death spiral of modeling, everything this guy is talking about is totally legal. And as I said in this post, I’d like to see some pushback to guys like Merrill as well as to the NSA.

Categories: data science, rant

On being a data science skeptic: due out soon

A few months ago, at the end of January, I wrote a post about Bill Gates’s naive views on the objectivity of data. One of the commenters, “CitizensArrest,” asked me to take a look at a related essay written by Susan Webber entitled “Management’s Great Addiction: It’s time we recognized that we just can’t measure everything.”

Webber’s essay is really excellent, not to mention impressively prescient considering it was published in 2006, before the credit crisis. The format of the essay is simple: it brings up and explains various dangers in the context of measurement and modeling of business data, and calls for finding a space in business for skepticism. What an idea! Imagine if that had actually happened in finance when it should have back in 2006.

Please go read her essay, it’s short.

Recently, when O’Reilly asked me to write an essay, I thought back to this short piece and decided to use it as a template for explaining why I think there’s a just-as-desperate need for skepticism in 2013 here in the big data world as there was back then in finance.

Whereas most of Webber’s essay talks about people blindly accepting numbers as true, objective, precise, and important, and the related tragic consequences, I’ve added a small wrinkle to this discussion. Namely, I also devote attention to the people who underestimate the power of data.

Most of this disregard for unintended consequences is blithe and unintentional (and some of it isn’t), but even so it can be hugely damaging, especially to the individuals being modeled: think foreclosed homes due to crappy housing-related models in the past, and think creepy models and the death spiral of modeling for the present and future.

Anyhoo, I’m actively writing it now, and it’ll be coming out soon. Stay tuned!

Categories: data science, finance, modeling

Quantifying the pull of poverty traps

In yesterday’s New York Times Science section, there was an article called “Life in the Red” (hat tip Becky Jaffe) about people’s behavior when they are in debt, summed up by this:

The usual explanations for reckless borrowing focus on people’s character, or social norms that promote free spending and instant gratification. But recent research has shown that scarcity by itself is enough to cause this kind of financial self-sabotage.

“When we put people in situations of scarcity in experiments, they get into poverty traps,” said Eldar Shafir, a professor of psychology and public affairs at Princeton. “They borrow at high interest rates that hurt them, in ways they knew to avoid when there was less scarcity.”

The psychological burden of debt not only saps intellectual resources, it also reinforces the reckless behavior, and quickly, Dr. Shafir and other experts said. Millions of Americans have been keeping the lights on through hard times with borrowed money, running a kind of shell game to keep bill collectors away.

So what we’ve got here is a feedback loop of poverty, which certainly jibes with my observations of friends and acquaintances who are in debt.

I’m guessing the experiments described in the article are not as bad as real life, however.

I say that because I’ve been talking on this blog as well as in my recent math talks about a separate feedback loop involving models, namely the feedback loop whereby people who are judged poor by the model are offered increasingly bad terms on their loans. I call it the death spiral of modeling.

If you think about how these two effects work together – the array of offers gets worse as your vulnerability to bad deals increases – then you start to understand what half of our country is actually living through on a day-to-day basis.

As an aside, I have an enormous amount of empathy for people experiencing this poverty trap. I don’t think it’s a moral issue to be in debt: nobody wants to be poor, and nobody plans it that way.

This opinion article (hat tip Laura Strausfeld), also in yesterday’s New York Times, makes the important point that listening to a bunch of rich, judgmental people like David Bach, Dave Ramsey, and Suze Orman telling us it’s our fault we haven’t finished saving for retirement isn’t actually useful, and suggests we individually choose a money issue to take charge of and sort out.

So my empathetic nerd take on poverty traps is this: how can we quantitatively measure this phenomenon, or more precisely these phenomena, since we’ve identified at least two feedback loops?

One reason it’s hard is that it’d be hard to perform natural tests where some people are subjected to the toxic environment but other people aren’t – it’s the “people who aren’t” category that’s the hard part, of course.

For the vulnerability to bad terms, the article describes the level of harassment that people receive from bill collectors as a factor in how they react, which doesn’t surprise anyone who’s ever dealt with a bill collector. Are there certain people who don’t get harassed for whatever reason, and do they fall prey to bad deals at a different rate? Are there local laws in some places prohibiting certain harassment? Can we go to another country where the bill collectors are reined in and see how people in debt behave there?

Also, in terms of availability of loans, it might be relatively easy to start out with people who live in states with payday loans versus people who don’t, and see how much faster the poverty spiral overtakes people with worse options. Of course, as crappy loans get more and more available online, this proximity study will become moot.

It’s also going to be tricky to tease out the two effects from each other. One is a question of supply and the other is a question of demand, and as we know those two are related.

I’m not answering these questions today, it’s a long-term project that I need your help on, so please comment below with ideas. Maybe if we have a few good ideas and if we find some data we can plan a data hackathon.

The complexity feedback loop of modeling

Yesterday I was interviewed by a tech journalist about the concept of feedback loops in consumer-facing modeling. We ended up talking for a while about the death spiral of modeling, a term I coined for the tendency of certain public-facing models, like credit scoring models, to have such strong effects on people that they arguably create the future rather than forecast it. Of course this is generally presented from the perspective of the winners of this effect, but I care more about who is being forecast to fail.

Another feedback loop that we talked about was one that consumers have basically inherited from the financial system, namely the “complexity feedback loop”.

In the example she and I discussed, which had to do with consumer-facing financial planning software, the complexity feedback loop refers to the fact that we are urged, as consumers, to keep track of our finances one way or another, including our cash flows, which leads to us worrying that we won’t be able to meet our obligations, which leads to us getting convinced we need to buy some kind of insurance (like overdraft insurance), which in turn has a bunch of complicated conditions on it.

The end result is increased complexity along with an increasing need for a complicated model to keep track of finances – in other words, a feedback loop.

Of course this sounds a lot like what happened in finance, where derivatives were invented to help disperse unwanted risk, but in turn complicated the portfolios so much that nobody understands them anymore, so we have endless discussions about how to measure the risk of the instruments that were created to remove risk.

The complexity feedback loop is generalizable outside of the realm of money as well.

In general models take certain things into account and ignore others, by their nature; models are simplified versions of the world, especially when they involve human behavior. So certain risks, or effects, are sufficiently small that the original model simply doesn’t see them – it may not even collect the data to measure them at all. Sometimes this omission is intentional, sometimes it isn’t.

But once the model is widely used, then the underlying approximation to the world is in some sense assumed, and then the remaining discrepancy is what we need to start modeling: the previously invisible becomes visible, and important. This leads to a second model tacked onto the first, or a modified version of the first. In either case it’s more complicated as it becomes more widely used.

This is not unlike saying that we’ve seen more vegetarian options on menus as restaurateurs realize they are losing out on a subpopulation of diners by ignoring their needs. From this example we can see that the complexity feedback loop can be good or bad, depending on your perspective. I think it’s something we should at least be aware of, as we increasingly interact with and depend on models.

Categories: data science, modeling, rant

Best case/ worst case: Medicine 50 years from now

Best Case

The scientific models and, when possible, the data have been made available to the wider scientific community for vetting. Incorrect or non-robust results are questioned and thrown out by that community; interesting and surprising new results are re-tested on larger data sets, iteratively and under different conditions, to test for universality.

The result is that a person, with the help of their doctor and a thorough exam and information-gathering session, and with their informed consent to use this data for their benefit, will have a better idea of what to watch out for in terms of health risks, how to prevent certain diseases that they may be vulnerable to, and how the tried-and-true medicines would affect them.

For example, in spite of the fact that Vioxx gives some people heart attacks, it also really helps other people with joint pain that aspirin or ibuprofen can’t touch. But which people? In the future we may know the answer to this through segmentation models, which group people by their attributes (which could come under the category of daily life conditions, such as how much someone exercises, or under the category of genetic profile).

For example, we recently learned that exercise is not always good for everyone. But instead of using that unlikely possibility as an excuse not to do any exercise, we could be able to look at a given profile and tell a person if they are in the clear and what kind of exercises would be most beneficial to their health.

It wouldn’t solve every problem; people would still die, after all. But it could help people live happier and healthier lives. It depends on the open exchange of ideas among scientists as well as strong regulation about who owns personal data and how it can be used.

Worst Case

The scientific community continues its practice of essentially private data collection and models. Scientific journals become more and more places where, backed by pharmaceutical companies and insurance companies, paid Ph.D.s boast about their latest breakthrough with no cultural standard of evidence.

Indeed there is progress in segmentation models for disease and medicine, but the data, models, and results are owned exclusively by corporations, specifically insurance companies. This leads to a death spiral in modeling, where the very people who are vulnerable to disease and need medicine or treatment the most are priced out of the insurance system and no longer have access to anything resembling reasonable medical care, even for chronic diseases such as diabetes.

And you won’t need to give your consent for those insurance companies to use your data – they will have already bought all the data that they need to know about you from data collectors, which have been gleaning information about you from your online presence since birth. These companies will know everything about you; they control and sell your data for extra profit. To them, you represent a potential customer and a potential cost, a risk/return profile like any other investment.

Categories: data science

Creepy model watch

I really feel like I can’t keep up with all of the creepy models coming out and the news articles about them, so I think I’ll just start making a list. I would appreciate readers adding to my list in the comment section. I think I’ll move this to a separate page on my blog if it comes out nice.

  1. I recently blogged about a model that predicts student success in for-profit institutions, which I claim is really mostly about student debt and default,
  2. but here’s a model which actually goes ahead and predicts default directly: it’s a new payday-like loan model. Oh good, because the old payday models didn’t make enough money or something.
  3. Of course there’s the teacher value-added model which I’ve blogged about multiple times, most recently here. And here’s a paper I’d like everyone to read before they listen to anyone argue one way or the other about the model (h/t Joshua Batson). The abstract is stunning: “Recently, educational researchers and practitioners have turned to value-added models to evaluate teacher performance. Although value-added estimates depend on the assessment used to measure student achievement, the importance of outcome selection has received scant attention in the literature. Using data from a large, urban school district, I examine whether value-added estimates from three separate reading achievement tests provide similar answers about teacher performance. I find moderate-sized rank correlations, ranging from 0.15 to 0.58, between the estimates derived from different tests. Although the tests vary to some degree in content, scaling, and sample of students, these factors do not explain the differences in teacher effects. Instead, test timing and measurement error contribute substantially to the instability of value-added estimates across tests.” Just in case that didn’t come through, they are saying that the results of the teacher value-added test scores are very very noisy.
  4. That reminds me, credit scoring models are old but very very creepy, wouldn’t you agree? What’s in them that they want to conceal them?
  5. Did you read about how Target predicts pregnancy? Extremely creepy.
  6. I’m actually divided about whether it’s the creepiest though, because I think the sheer volume of information that Facebook collects about us is the most depressing thing of all.

Before I became a modeler, I wasn’t personally offended by the idea that people could use my information. I thought, I’ve got nothing to hide, and in fact maybe it will make my life easier and more efficient for the machine to know me and my habits.

But here’s how I think now that I’m a modeler and I see how this stuff gets made and I see how it gets applied. That we are each giving up our data, and it’s so easy to do we don’t think about it, and it’s being used to funnel people into success or failure in a feedback loop. And the modelers, the people responsible for creating these things and implementing them, are always already the successes, they are educated and are given good terms on their credit cards and mortgages because they have a nifty high tech job. So the makers get to think of how much easier and more convenient their lives are now that the models see how dependable they are as consumers.

But when there are funnels, there’s always someone who gets funneled down.

Think about how it works with insurance. The idea of insurance is to pool people so that when one person gets sick, the medical costs for that person are paid from the common fund. Everyone pays a bit so it doesn’t break the bank.

But if we have really good information, we begin to see how likely people are to get sick. So we can stratify the pool. Since I almost never get sick, and when I do it’s just strep throat, I get put into a very nice pool with other people who never get sick, and we pay very very little and it works out great for us. But other people have worse luck of the DNA draw and they get put into the “pretty sick” pool and their premium gets bigger as their pool gets sicker until they are really sick and the premium is actually unaffordable. We are left with a system where the people who need insurance the most can’t be part of the system anymore. Too much information ruins the whole idea of insurance and pooled risk.

I think modern modeling is analogous. When people offer deals, they can first check to see if the people they are offering deals to are guaranteed to pay back everything. In other words, the businesses (understandably) want to make very certain they are going to profit from each and every customer, and they are getting more and more able to do this. That’s great for customers with perfect credit scores, and it makes it easier for people with perfect credit scores to keep their perfect credit scores, because they are getting the best deals.

But for people with bad credit scores, they get the rottenest deals, which makes a larger and larger percentage of their takehome pay (if they even get a job considering their credit scores) go towards fees and high interest rates. This of course creates an environment in which it’s difficult to improve their credit score- so they default and their credit score gets worse instead of better.

So there you have it, a vicious, self-reinforcing feedback loop and a death spiral of modeling.

Categories: data science

Where’s the outrage over private snooping?

There’s been a tremendous amount of hubbub recently surrounding the data collection and data mining that the NSA has been discovered to be doing.

For me what’s weird is that so many people are up in arms about what our government knows about us but not, seemingly, about what private companies know about us.

I’m not suggesting that we should be sanguine about the NSA program – it’s outrageous, and it’s outrageous that we didn’t know about it. I’m glad it’s come out into the open and I’m glad it’s spawned an immediate and public debate about the citizen’s rights to privacy. I just wish that debate extended to privacy in general, and not just the right to be anonymous with respect to the government.

What gets to me are the countless articles that make a big deal of Facebook or Google sharing private information directly with the government, while never mentioning that Acxiom buys and sells from Facebook on a daily basis much more specific and potentially damning information about people (most people in this country) than the metadata that the government purports to have.

Of course, we really don’t have any idea what the government has or doesn’t have. Let’s assume they are also an Acxiom customer, for that matter, which stands to reason.

It raises the question, at least for me, of why we distrust the government with our private data but trust private companies with it. I have a few theories; tell me if you agree.

Theory 1: people think about worst case scenarios, not probabilities

When the government is spying on you, worst case you get thrown into jail or Guantanamo Bay for no good reason, left to rot. That’s horrific but not, for the average person, very likely (although, of course, a world where that does become likely is exactly what we want to prevent by having some concept of privacy).

When private companies are spying on you, they don’t have the power to put you in jail. They do increasingly have the power, however, to deny you a job, a student loan, a mortgage, and life insurance. And, depending on who you are, those things are actually pretty likely.

Theory 2: people think private companies are only after our money

Private companies who hold our private data are only profit-seeking, so the worst thing they can do is try to get us to buy something, right? I don’t think so, as I pointed out above. But maybe people think so in general, and that’s why we’re not outraged about how our personal data and profiles are used all the time on the web.

Theory 3: people are more afraid of our rights being taken away than good things not happening to them

As my friend Suresh pointed out to me when I discussed this with him, people hold on to what they have (constitutional rights) and they fear those things being taken away (by the government). They spend less time worrying about what they don’t have (a house) and how they might be prevented from getting it (by having a bad e-score).

So even though private snooping can (and increasingly does) close off all sorts of options in people’s lives, if they don’t think about those options, they don’t notice. It’s hard to know why you get denied a job, especially if you’ve been getting worse and worse credit card terms and conditions over the years. In general it’s hard to notice when things don’t happen.

Theory 4: people think the government protects them from bad things, but who’s going to protect them from the government?

This I totally get, but the fact is the U.S. government isn’t protecting us from data collectors, and has even recently gotten together with Facebook and Google to prevent the European Union from enacting pretty good privacy laws. Let’s not hold our breath for them to understand what’s at stake here.

(Updated) Theory 5: people think they can opt out of private snooping but can’t opt out of being a citizen

Two things. First, can you really opt out? You can clear your cookies and not be on gmail and not go on Facebook and Acxiom will still track you. Believe it.

Second, I’m actually not worried about you (you reader of mathbabe) or myself for that matter. I’m not getting denied a mortgage any time soon. It’s the people who don’t know to protect themselves, who don’t know to opt out, and who will get down-scored and funneled into bad options, that I worry about.

Theory 6: people just haven’t thought about it enough to get pissed

This is the one I’m hoping for.

I’d love to see this conversation expand to include privacy in general. What’s so bad about asking for data about ourselves to be automatically forgotten, say by Verizon, if we’ve paid our bills and 6 months have gone by? What’s so bad about asking for any personal information about us to have a similar time limit? I for one do not wish mistakes my children make when they’re impetuous teenagers to haunt them when they’re trying to start a family.

Categories: data science, rant

Columbia Data Science course, week 5: GetGlue, time series, financial modeling, advanced regression, and ethics

October 5, 2012

I was happy to be giving Rachel Schutt’s Columbia Data Science course this week, where I discussed time series, financial modeling, and ethics. I blogged previous classes here.

The first few minutes of class were for a case study with GetGlue, a New York-based start-up that won the Mashable breakthrough start-up of the year award in 2011 and is backed by some of the VCs that also fund big names like Tumblr, Etsy, Foursquare, etc. GetGlue is part of the social TV space. Lead Scientist Kyle Teague came to tell the class a little bit about GetGlue and some of what he worked on there. He also came to announce that GetGlue was giving the class access to a fairly large data set of user check-ins to tv shows and movies. Kyle’s background is in electrical engineering; he placed in the 2011 KDD cup (which we learned about last week from Brian), and he started programming when he was a kid.

GetGlue’s goal is to address the problem of content discovery, primarily within the movie and tv space. The usual model for finding out what’s on TV is the 1950’s TV Guide schedule, and that’s still how we’re supposed to find things to watch. There are thousands of channels and it’s getting increasingly difficult to find out what’s good on. GetGlue wants to change this model by giving people personalized TV recommendations and personalized guides. There are other ways GetGlue uses Data Science, but for the most part we focused on how the recommendation system works. Users “check-in” to tv shows, which means they can tell people they’re watching a show. This creates a time-stamped data point. They can also do other actions such as like or comment on the show. So this is a 3-tuple: {user, action, object}, where the object is a tv show or movie. This induces a bi-partite graph. A bi-partite graph or network contains two types of nodes: users and tv shows. Edges exist between users and tv shows, but not between users and users or tv shows and tv shows. So Bob and Mad Men are connected because Bob likes Mad Men, and Sarah is connected to both Mad Men and Lost because Sarah liked Mad Men and Lost. But Bob and Sarah aren’t connected, nor are Mad Men and Lost. A lot can be learned from this graph alone.
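To make the bi-partite structure concrete, here is a minimal sketch in Python. The records, the `build_bipartite` function and the `shows_in_common` helper are all made up for illustration; this is not GetGlue's actual schema or code, just one simple way to hold such a graph as two adjacency maps.

```python
from collections import defaultdict

# Hypothetical check-in records in the {user, action, object} form described above.
events = [
    ("Bob", "like", "Mad Men"),
    ("Sarah", "like", "Mad Men"),
    ("Sarah", "like", "Lost"),
]

def build_bipartite(events):
    """Return two adjacency maps: user -> shows and show -> users."""
    user_to_shows = defaultdict(set)
    show_to_users = defaultdict(set)
    for user, _action, show in events:
        # Edges only ever join a user to a show, never user-user or show-show.
        user_to_shows[user].add(show)
        show_to_users[show].add(user)
    return user_to_shows, show_to_users

user_to_shows, show_to_users = build_bipartite(events)

def shows_in_common(a, b):
    """Shows that both users are connected to (a two-step path in the graph)."""
    return user_to_shows[a] & user_to_shows[b]

print(sorted(show_to_users["Mad Men"]))
print(shows_in_common("Bob", "Sarah"))
```

Even though Bob and Sarah share no edge, the graph still tells us they overlap on Mad Men, which is the kind of thing a recommender can build on.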

But GetGlue finds ways to create edges between users and between objects (tv shows, or movies.) Users can follow each other or be friends on GetGlue, and also GetGlue can learn that two people are similar[do they do this?]. GetGlue also hires human evaluators to make connections or directional edges between objects. So True Blood and Buffy the Vampire Slayer might be similar for some reason and so the humans create an edge in the graph between them. There were nuances around the edge being directional. They may draw an arrow pointing from Buffy to True Blood but not vice versa, for example, so their notion of “similar” or “close” captures both content and popularity. (That’s a made-up example.) Pandora does something like this too.

Another important aspect is time. The user checked-in or liked a show at a specific time, so the 3-tuple extends to have a time-stamp: {user, action, object, timestamp}. This is essentially the data set the class has access to, although it’s slightly more complicated and messy than that. Their first assignment with this data will be to explore it, try to characterize it and understand it, gain intuition around it and visualize what they find.

Students in the class asked him questions around topics of the value of formal education in becoming a data scientist (do you need one? Kyle’s time spent doing signal processing in research labs was valuable, but so was his time spent coding for fun as a kid), what would be messy about a data set, why would the data set be messy (often bugs in the code), how would they know? (their QA and values that don’t make sense), what language does he use to prototype algorithms (python), how does he know his algorithm is good.

Then it was my turn. I started out with my data scientist profile:

As you can see, I feel like I have the most weakness in CS. Although I can use python pretty proficiently, and in particular I can scrape and parse data, prototype models, and use matplotlib to draw pretty pictures, I am no java map-reducer and I bow down to those people who are. I am also completely untrained in data visualization but I know enough to get by and give presentations that people understand.

Thought Experiment

I asked the students the following question:

What do you lose when you think of your training set as a big pile of data and ignore the timestamps?

They had some pretty insightful comments. One thing they mentioned off the bat is that you won’t know cause and effect if you don’t have any sense of time. Of course that’s true but it’s not quite what I meant, so I amended the question to allow you to collect relative time differentials, so “time since user last logged in” or “time since last click” or “time since last insulin injection”, but not absolute timestamps.

What I was getting at, and what they came up with, was that when you ignore the passage of time through your data, you ignore trends altogether, as well as seasonality. So for the insulin example, you might note that 15 minutes after your insulin injection your blood sugar goes down consistently, but you might not notice an overall trend of your rising blood sugar over the past few months if your dataset for the past few months has no absolute timestamp on it.

This idea, of keeping track of trends and seasonalities, is very important in financial data, and essential to keep track of if you want to make money, considering how small the signals are.

How to avoid overfitting when you model with time series

After discussing seasonality and trends in the various financial markets, we started talking about how to avoid overfitting your model.

Specifically, I started out with having a strict concept of in-sample (IS) and out-of-sample (OOS) data. Note the OOS data is not meant as testing data; that all happens inside the IS data. The OOS data is meant to be the data you use after finalizing your model so that you have some idea how the model will perform in production.

Next, I discussed the concept of causal modeling. Namely, we should never use information in the future to predict something now. Similarly, when we have a set of training data, we don’t know the “best fit coefficients” for that training data until after the last timestamp on all the data. As we move forward in time from the first timestamp to the last, we expect to get different sets of coefficients as more events happen.

One consequence of this is that, instead of getting one set of coefficients, we actually get an evolution of each coefficient. This is helpful because it gives us a sense of how stable those coefficients are. In particular, if one coefficient has changed sign 10 times over the training set, then we expect a good estimate for it is zero, not the so-called “best fit” at the end of the data.
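Here is a small sketch of what a coefficient path can look like, assuming the simplest possible setup: a univariate regression with no intercept, refit at every time step using only data seen so far. The function names, the `min_obs` warm-up parameter and the synthetic data are my own illustrative choices, not anything from the class.

```python
import numpy as np

def expanding_betas(x, y, min_obs=20):
    """Univariate no-intercept regression coefficient refit at each time step.

    betas[t] uses only observations 0..t, so it never looks ahead.
    Entries before min_obs observations are NaN.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    sxy = np.cumsum(x * y)   # running sum of x*y up to and including t
    sxx = np.cumsum(x * x)   # running sum of x^2 up to and including t
    betas = np.full(len(x), np.nan)
    valid = np.arange(len(x)) >= min_obs - 1
    betas[valid] = sxy[valid] / sxx[valid]
    return betas

def sign_changes(betas):
    """Count how often the coefficient path flips sign."""
    signs = np.sign(betas[~np.isnan(betas)])
    signs = signs[signs != 0]
    return int(np.sum(signs[1:] != signs[:-1]))

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.05 * x + rng.normal(size=500)   # synthetic weak signal, purely illustrative
betas = expanding_betas(x, y)
print(betas[-1], sign_changes(betas))
```

If `sign_changes` comes back large relative to the length of the series, that is a hint the final "best fit" value deserves little trust.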

One last word on causal modeling and IS/OOS. It is consistent with production code. Namely, you are always acting, in the training and in the OOS simulation, as if you’re running your model in production and you’re seeing how it performs. Of course you fit your model in sample, so you expect it to perform better there than in production.

Another way to say this is that, once you have a model in production, you will have to make decisions about the future based only on what you know now (so it’s causal) and you will want to update your model whenever you gather new data. So your coefficients of your model are living organisms that continuously evolve.

Submodels of Models

We often “prepare” the data before putting it into a model. Typically the way we prepare it has to do with the mean or the variance of the data, or sometimes the log (and then the mean or the variance of that transformed data).

But to be consistent with the causal nature of our modeling, we need to make sure our running estimates of mean and variance are also causal. Once we have causal estimates of our mean \overline{y} and variance \sigma_y^2, we can normalize the next data point with these estimates, just as we would to get from a general gaussian distribution to the standard normal distribution:

y \mapsto \frac{y - \overline{y}}{\sigma_y}
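As a sketch of the normalization above done causally, the following assumes the plainest possible running estimates: the sample mean and standard deviation of all strictly earlier points. The function name `causal_zscore` and the `min_obs` parameter are hypothetical, and in practice you might prefer downweighted estimates instead of equal weights.

```python
import numpy as np

def causal_zscore(y, min_obs=2):
    """Normalize y[t] using the mean and std of y[0..t-1] only."""
    y = np.asarray(y, dtype=float)
    out = np.full(len(y), np.nan)
    for t in range(min_obs, len(y)):
        past = y[:t]                 # strictly earlier points, no look-ahead
        sigma = past.std(ddof=1)
        if sigma > 0:
            out[t] = (y[t] - past.mean()) / sigma
    return out

z = causal_zscore([1.0, 3.0, 2.0, 6.0])
print(z)
```

The first `min_obs` entries stay NaN because there is not yet enough history to estimate anything, and changing a later data point never changes an earlier normalized value.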

Of course we may have other things to keep track of as well to prepare our data, and we might run other submodels of our model. For example we may choose to consider only the “new” part of something, which is equivalent to trying to predict something like y_t - y_{t-1} instead of y_t. Or we may train a submodel to figure out what part of y_{t-1} predicts y_t, so a submodel which is a univariate regression or something.

There are lots of choices here, but the point is it’s all causal, so you have to be careful when you train your overall model how to introduce your next data point and make sure the steps are all in order of time, and that you’re never ever cheating and looking ahead in time at data that hasn’t happened yet.

Financial time series

In finance we consider returns, say daily. And it’s not percent returns, actually it’s log returns: if F_t denotes a close on day t, then the return that day is defined as log(F_t/F_{t-1}). See more about this here.
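A minimal sketch of the log return definition, using made-up closing levels (the `log_returns` name is just for illustration):

```python
import numpy as np

def log_returns(closes):
    """Daily log returns log(F_t / F_{t-1}) from a series of closes."""
    closes = np.asarray(closes, dtype=float)
    return np.log(closes[1:] / closes[:-1])

closes = [100.0, 101.0, 99.0, 99.0]   # made-up closing levels
r = log_returns(closes)

# Log returns add across days: their sum is the log return over the whole period.
print(r, r.sum(), np.log(closes[-1] / closes[0]))
```

That additivity is one practical reason to prefer log returns over percent returns.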

So if you start with S&P closing levels:

Then you get the following log returns:

What’s that mess? It’s crazy volatility caused by the financial crisis. We sometimes (not always) want to account for that volatility by normalizing with respect to it (described above). Once we do that we get something like this:

Which is clearly better behaved. Note this process is discussed in this post.

We could also normalize with respect to the mean, but we typically assume the mean of daily returns is 0, so as to not bias our models on short term trends.

Financial Modeling

One thing we need to understand about financial modeling is that there’s a feedback loop. If you find a way to make money, it eventually goes away- sometimes people refer to this as the fact that the “market learns over time”.

One way to see this is that, in the end, your model comes down to knowing some price is going to go up in the future, so you buy it before it goes up, you wait, and then you sell it at a profit. But if you think about it, your buying it has actually changed the process, and decreased the signal you were anticipating. That’s how the market learns – it’s a combination of a bunch of algorithms anticipating things and making them go away.

The consequence of this learning over time is that the existing signals are very weak. We are happy with a 3% correlation for models that have a horizon of 1 day (a “horizon” for your model is how long you expect your prediction to be good). This means not much signal, and lots of noise! In particular, lots of the machine learning “metrics of success” for models, such as measurements of precision or accuracy, are not very relevant in this context.

So instead of measuring accuracy, we generally draw a picture to assess models, namely of the (cumulative) PnL of the model. This generalizes to any model as well: you plot the cumulative sum of the product of the demeaned forecast and the demeaned realized value. In other words, you see if your model consistently does better than the “stupidest” model of assuming everything is average.

If you plot this and you drift up and to the right, you’re good. If it’s too jaggedy, that means your model is taking big bets and isn’t stable.
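The curve described above can be computed in a couple of lines. This sketch assumes the forecasts and realized values are already aligned in time, and for simplicity it demeans with the full-sample mean; a strictly causal version would use running means instead. The name `cumulative_pnl` and the toy numbers are mine.

```python
import numpy as np

def cumulative_pnl(forecast, realized):
    """Cumulative sum of (demeaned forecast) * (demeaned realized)."""
    f = np.asarray(forecast, dtype=float)
    r = np.asarray(realized, dtype=float)
    return np.cumsum((f - f.mean()) * (r - r.mean()))

# Toy example: the forecast gets the sign right every period.
curve = cumulative_pnl([1.0, -1.0, 1.0, -1.0], [2.0, -2.0, 1.0, -1.0])
print(curve)
```

Passing `curve` to matplotlib's `plt.plot` gives the picture: here it climbs steadily, which is the "up and to the right" case.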

Why regression?

From above we know the signal is weak. If you imagine there’s some complicated underlying relationship between your information and the thing you’re trying to predict, get over knowing what that is – there’s too much noise to find it. Instead, think of the function as possibly complicated, but continuous, and imagine you’ve written it out as a Taylor Series. Then you can’t possibly expect to get your hands on anything but the linear terms.

Don’t think about using logistic regression, either, because you’d need to be ignoring size, which matters in finance- it matters if a stock went up 2% instead of 0.01%. But logistic regression forces you to have an on/off switch, which would be possible but would lose a lot of information. Considering the fact that we are always in a low-information environment, this is a bad idea.

Note that although I’m claiming you probably want to use linear regression in a noisy environment, the actual terms themselves don’t have to be linear in the information you have. You can always take products of various terms as x’s in your regression, but you’re still fitting a linear model in non-linear terms.
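For instance, here is a sketch of a regression that is linear in its coefficients but includes a product term. The data is synthetic and built so that only the product matters; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
x1 = rng.normal(size=1000)
x2 = rng.normal(size=1000)
# Synthetic target that depends only on the product x1 * x2, plus noise.
y = 0.5 * x1 * x2 + 0.1 * rng.normal(size=1000)

# Design matrix: intercept, x1, x2, and the non-linear term x1 * x2.
X = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)
```

Ordinary least squares recovers a coefficient near 0.5 on the product column and near zero on the others, even though the relationship is not linear in x1 or x2 individually.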

Advanced regression

The first thing I need to explain is the exponential downweighting of old data, which I already used in a graph above, where I normalized returns by volatility with a decay of 0.97. How do I do this?

Working from this post again, the formula is given by essentially a weighted version of the normal one, where I weight recent data more than older data, and where the weight of older data is a power of some parameter s which is called the decay. The exponent is the number of time intervals since that data was new. Putting that together, the formula we get is:

V_{old} = (1-s) \cdot \sum_i r_i^2 s^i.

We are actually dividing by the sum of the weights, but the weights are powers of some number s, so it’s a geometric sum and the sum is given by 1/(1-s).

One cool consequence of this formula is that it’s easy to update: if we have a new return r_0 to add to the series, then it’s not hard to show we just want

V_{new} = s \cdot V_{old} + (1-s) \cdot r_0^2.

In fact this is the general rule for updating exponential downweighted estimates, and it’s one reason we like them so much- you only need to keep in memory your last estimate and the number s.
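To check that the update rule really agrees with the weighted sum, here is a short sketch on synthetic returns. The function names are hypothetical, and the running estimate is started at zero, so on a finite sample the weights sum to slightly less than 1.

```python
import numpy as np

def ew_variance_direct(returns_newest_first, s):
    """(1 - s) * sum_i r_i^2 * s^i, with i = 0 the most recent return."""
    r = np.asarray(returns_newest_first, dtype=float)
    weights = s ** np.arange(len(r))
    return (1.0 - s) * np.sum(weights * r**2)

def ew_variance_update(v_old, r_new, s):
    """V_new = s * V_old + (1 - s) * r_new^2."""
    return s * v_old + (1.0 - s) * r_new**2

s = 0.97
rng = np.random.default_rng(1)
history = rng.normal(scale=0.01, size=250)   # synthetic returns, oldest first

v = 0.0
for r in history:                            # feed returns in time order
    v = ew_variance_update(v, r, s)

direct = ew_variance_direct(history[::-1], s)
print(v, direct)
```

The two numbers match, but the recursive version only ever holds one float in memory.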

How do you choose your decay length? This is an art instead of a science, and depends on the domain you’re in. Think about how many days (or time periods) it takes to weight a data point at half of a new data point, and compare that to how fast the market forgets stuff.
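The "half of a new data point" idea translates directly into a formula: a data point that is n periods old has weight s^n, so the half-life is the n solving s^n = 1/2. A quick sketch (function names are mine):

```python
import math

def half_life(s):
    """Number of periods n such that s**n == 0.5."""
    return math.log(0.5) / math.log(s)

def decay_for_half_life(n):
    """Inverse: the decay s whose half-life is n periods."""
    return 0.5 ** (1.0 / n)

print(half_life(0.97))            # roughly 22.8 periods
print(decay_for_half_life(10))
```

So with daily data, a decay of 0.97 means a return from about a month of trading days ago counts half as much as today's.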

This downweighting of old data is an example of inserting a prior into your model, where here the prior is “new data is more important than old data”. What are other kinds of priors you can have?


Priors can be thought of as opinions like the above. Besides “new data is more important than old data,” we may decide our prior is “coefficients vary smoothly.” This is relevant when we decide, say, to use a bunch of old values of some time series to help predict the next one, giving us a model like:

y = F_t = \alpha_0 + \alpha_1 F_{t-1} + \alpha_2 F_{t-2} + \epsilon,

which is just the example where we take the last two values of the time series $F$ to predict the next one. But we could use more than two values, of course.

[Aside: in order to decide how many values to use, you might want to draw an autocorrelation plot for your data.]

The way you’d place the prior about the relationship between coefficients (in this case consecutive lagged data points) is by adding a matrix to your covariance matrix when you perform linear regression. See more about this here.
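As a rough sketch of "adding a matrix", one common choice (an assumption on my part, and not necessarily the construction in the linked post) is to penalize the differences between consecutive coefficients. With D the first-difference matrix and a strength parameter `lam` that you choose, you solve (X^T X + lam D^T D) beta = X^T y. The sketch omits the intercept, so it assumes demeaned data, and uses random columns as a stand-in for lagged values.

```python
import numpy as np

def smooth_lag_regression(X, y, lam):
    """Least squares with a penalty on differences of neighboring coefficients.

    Solves (X^T X + lam * D^T D) beta = X^T y, where D takes first
    differences of consecutive coefficients. lam = 0 is plain least squares.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    k = X.shape[1]
    D = np.diff(np.eye(k), axis=0)   # (k-1) x k first-difference matrix
    return np.linalg.solve(X.T @ X + lam * D.T @ D, X.T @ y)

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3))        # stand-in for three lagged columns
y = X @ np.array([0.3, 0.1, -0.2]) + rng.normal(size=200)

beta_ols = smooth_lag_regression(X, y, lam=0.0)
beta_smooth = smooth_lag_regression(X, y, lam=1e6)
print(beta_ols, beta_smooth)
```

Turning `lam` up pulls the coefficients toward each other; at the extreme they become nearly equal, which is the prior completely overriding the data.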


I then talked about modeling and ethics. My goal is to get this next-gen group of data scientists sensitized to the fact that they are not just nerds sitting in the corner but have increasingly important ethical questions to consider while they work.

People tend to overfit their models. It’s human nature to want your baby to be awesome. They also underestimate the bad news and blame other people for bad news, because nothing their baby has done or is capable of is bad, unless someone else made them do it. Keep these things in mind.

I then described what I call the death spiral of modeling, a term I coined in this post on creepy model watching.

I counseled the students to

  • try to maintain skepticism about their models and how their models might get used,
  • shoot holes in their own ideas,
  • accept challenges and devise tests as scientists rather than defending their models using words – if someone thinks they can do better, then let them try, and agree on an evaluation method beforehand,
  • in general, try to consider the consequences of their models.

I then showed them Emanuel Derman’s Hippocratic Oath of Modeling, which was made for financial modeling but fits perfectly into this framework. I discussed the politics of working in industry, namely that even if they are skeptical of their model there’s always the chance that it will be used the wrong way in spite of the modeler’s warnings. So the Hippocratic Oath is, unfortunately, insufficient in reality (but it’s a good start!).

Finally, there are ways to do good: I mentioned stuff like DataKind. There are also ways to be transparent: I mentioned Open Models, which is so far just an idea, but Victoria Stodden is working on RunMyCode, which is similar and very awesome.