Archive for the ‘modeling’ Category

Guest post: Open-Source Loan-Level Analysis of Fannie and Freddie

This is a guest post by Todd Schneider. You can read the full post with additional analysis on Todd’s personal site.

[M]ortgages were acknowledged to be the most mathematically complex securities in the marketplace. The complexity arose entirely out of the option the homeowner has to prepay his loan; it was poetic that the single financial complexity contributed to the marketplace by the common man was the Gordian knot giving the best brains on Wall Street a run for their money. Ranieri’s instincts that had led him to build an enormous research department had been right: Mortgages were about math.

The money was made, therefore, with ever more refined tools of analysis.

—Michael Lewis, Liar’s Poker (1989)

Fannie Mae and Freddie Mac began reporting loan-level credit performance data in 2013 at the direction of their regulator, the Federal Housing Finance Agency. The stated purpose of releasing the data was to “increase transparency, which helps investors build more accurate credit performance models in support of potential risk-sharing initiatives.”

The GSEs went through a nearly $200 billion government bailout during the financial crisis, motivated in large part by losses on loans that they guaranteed, so I figured there must be something interesting in the loan-level data. I decided to dig in with some geographic analysis, an attempt to identify the loan-level characteristics most predictive of default rates, and more. The code for processing and analyzing the data is all available on GitHub.


The “medium data” revolution

In the not-so-distant past, an analysis of loan-level mortgage data would have cost a lot of money. Between licensing data and paying for expensive computers to analyze it, you could have easily incurred costs north of a million dollars per year. Today, in addition to Fannie and Freddie making their data freely available, we’re in the midst of what I might call the “medium data” revolution: personal computers are so powerful that my MacBook Air is capable of analyzing the entire 215 GB of data, representing some 38 million loans, 1.6 billion observations, and over $7.1 trillion of origination volume. Furthermore, I did everything with free, open-source software.

What can we learn from the loan-level data?

Loans originated from 2005-2008 performed dramatically worse than loans that came before them! That should be an extraordinarily unsurprising statement to anyone who was even slightly aware of the U.S. mortgage crisis that began in 2007:

[Figure: cumulative default rates by origination year]

About 4% of loans originated from 1999 to 2003 became seriously delinquent at some point in their lives. The 2004 vintage showed some performance deterioration, and then the vintages from 2005 through 2008 show significantly worse performance: more than 15% of all loans originated in those years became distressed.
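The vintage comparison above boils down to a grouped ratio: for each origination year, the share of loans that ever became seriously delinquent. Here is a minimal sketch of that calculation in plain Python. It is not the code from the GitHub project; the record layout, the toy data, and the function name are all made up for illustration, and "ever seriously delinquent" is assumed to be available as a per-loan flag.

```python
from collections import defaultdict

# Hypothetical toy records: (loan_id, origination_year, ever_seriously_delinquent).
# The real data set has tens of millions of loans; these rows are invented.
loans = [
    ("A1", 2002, False),
    ("A2", 2002, False),
    ("A3", 2006, True),
    ("A4", 2006, False),
    ("A5", 2010, False),
]


def cumulative_default_rate_by_vintage(loans):
    """Return {origination_year: share of loans ever seriously delinquent}."""
    totals = defaultdict(int)
    defaults = defaultdict(int)
    for _loan_id, year, ever_delinquent in loans:
        totals[year] += 1
        if ever_delinquent:
            defaults[year] += 1
    # Sort by year so the result reads like a vintage table.
    return {year: defaults[year] / totals[year] for year in sorted(totals)}


rates = cumulative_default_rate_by_vintage(loans)
for year, rate in rates.items():
    print(f"{year}: {rate:.0%}")
```

Because the rate is cumulative over each loan's life so far, younger vintages will look better than they eventually turn out, which is worth keeping in mind when comparing recent years to older ones.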

From 2009 through the present, the performance has been much better, with fewer than 2% of loans defaulting. Of course, part of that is that it takes time for a loan to default, so the most recent vintages will tend to have lower cumulative default rates while their loans are still young. But there has also been a dramatic shift in lending standards, so that the loans made since 2009 have been of much higher credit quality: the average FICO score used to be 720, but since 2009 it has been more like 765. Furthermore, if we look 2 standard deviations from the mean, we see that the low end of the FICO spectrum used to reach down to about 600, but since 2009 there have been very few loans with FICO less than 680:

[Figure: FICO score distribution by origination year]
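The "2 standard deviations from the mean" band for FICO scores can be sketched as follows. This is an illustrative example, not the analysis code behind the chart: the scores are invented, the function name is hypothetical, and it uses the population standard deviation, which is an assumption about how the band was computed.

```python
from statistics import mean, pstdev

# Hypothetical FICO scores grouped by origination period (made-up numbers).
fico_by_period = {
    "before 2009": [600, 680, 720, 760, 840],
    "2009 and later": [700, 750, 765, 780, 830],
}


def fico_band(scores, k=2.0):
    """Return (low, mean, high), with low/high k standard deviations from the mean."""
    mu = mean(scores)
    sigma = pstdev(scores)  # population standard deviation (an assumption)
    return mu - k * sigma, mu, mu + k * sigma


for period, scores in fico_by_period.items():
    low, mu, high = fico_band(scores)
    print(f"{period}: mean {mu:.0f}, 2-sigma band {low:.0f} to {high:.0f}")
```

A rising lower edge of the band is what "very few loans with FICO less than 680" looks like numerically: the mean moves up and the spread narrows at the same time.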

Tighter agency standards, coupled with a complete shutdown in the non-agency mortgage market, including both subprime and Alt-A lending, mean that there is very little credit available to borrowers with low credit scores (a far more difficult question is whether this is a good or bad thing!).

Geographic performance

Default rates increased everywhere during the bubble years, but some states fared far worse than others. I took every loan originated between 2005 and 2007, broadly considered to be the height of reckless mortgage lending, bucketed loans by state, and calculated the cumulative default rate of loans in each state:

[Figure: cumulative default rates by state for loans originated from 2005 to 2007]

4 states in particular jump out as the worst performers: California, Florida, Arizona, and Nevada. Just about every state experienced significantly higher than normal default rates during the mortgage crisis, but these 4 states, often labeled the “sand states”, experienced the worst of it.
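The state comparison described above (filter to the 2005 to 2007 vintages, bucket by state, compute a cumulative default rate per bucket) can be sketched in a few lines. Again, this is only an illustration with invented records and a hypothetical function name, not the project's actual code.

```python
# Hypothetical records: (state, origination_year, defaulted). Invented data.
state_loans = [
    ("NV", 2006, True), ("NV", 2007, True), ("NV", 2005, False),
    ("TX", 2006, False), ("TX", 2005, False), ("TX", 2007, True),
    ("TX", 2006, False),
    ("NV", 2010, False),  # outside the 2005-2007 window, so it is ignored
]


def default_rate_by_state(loans, first_year=2005, last_year=2007):
    """Return [(state, default_rate)] for the chosen vintages, worst state first."""
    counts = {}  # state -> (total loans, defaulted loans)
    for state, year, defaulted in loans:
        if first_year <= year <= last_year:
            total, bad = counts.get(state, (0, 0))
            counts[state] = (total + 1, bad + int(defaulted))
    rates = {state: bad / total for state, (total, bad) in counts.items()}
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


ranking = default_rate_by_state(state_loans)
for state, rate in ranking:
    print(f"{state}: {rate:.0%}")
```

Sorting worst-first makes outliers like the "sand states" easy to spot at the top of the list.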

Read more

If you’re interested in more technical discussion, including an attempt to identify which loan-level variables are most correlated with default rates (the number one being the home-price-adjusted loan-to-value ratio), read the full post on Todd’s personal site, and be sure to check out the project on GitHub if you’d like to do your own data analysis.

China announces it is scoring its citizens using big data

Please go read the article in the Dutch newspaper de Volkskrant entitled China rates its own citizens – including online behavior (hat tip Ernie Davis).

The article describes China’s plan to use big data techniques to score all of its citizens – with the help of Chinese internet giants Alibaba, Baidu, and Tencent – in a kind of expanded credit score that includes behavior and reputation. So what you buy, who you’re friends with, and whether you seem sufficiently “socialist” are factors that affect your overall score.

Here’s an incredibly creepy quote from a person working on the model, from the Chinese Academy of Social Sciences:

When people’s behavior isn’t bound by their morality, a system must be used to restrict their actions

And here’s another quote from Rogier Creemers, an academic at Oxford who specializes in China:

Government and big internet companies in China can exploit ‘Big Data’ together in a way that is unimaginable in the West

I guess I’m wondering whether that’s really true. Given my research over the past couple of years, I see this kind of “social credit scoring” being widely implemented here in the United States.

Looking for big data reading suggestions

I have been told by my editor to take a look at the books already out there on big data to make sure my book hasn’t already been written. For example, today I’m set to read Robert Scheer’s They Know Everything About You: how data-collecting corporations and snooping government agencies are destroying democracy.

This book, like others I’ve already read and written about (Bruce Schneier’s Data and Goliath, Frank Pasquale’s Black Box Society, and Julia Angwin’s Dragnet Nation), is primarily concerned with individual freedom and privacy, whereas my book is primarily concerned with social justice issues, and each chapter gives an example of how big data is being used as a tool against the poor, against minorities, against the mentally ill, or against public school teachers.

Not that my book is entirely different from the above books, but the relationship is something like what I spelled out last week when I discussed the four political camps in the big data world. So far the books I’ve found are focused on the corporate angle or the privacy angle. There may also be books focused on the open data angle, but I’m guessing they have even less in common with my book, which focuses on the ways big data increase inequality and further alienate already alienated populations.

If any of you know of a book I should be looking at, please tell me!

The Police State is already here.

The thing that people like Snowden are worried about with respect to mass surveillance has already happened. It’s being carried out by police departments, though, not the NSA, and its targets are black men, not the general population.

Take a look at this incredible Guardian article written by Rose Hackman. Her title is “Is the online surveillance of black teenagers the new stop-and-frisk?”, but honestly that’s a pretty tame comparison if you think about the kinds of permanent electronic information that the police are collecting about black boys in Harlem as young as 10 years old.

Some facts about the program:

  • 28,000 residents are being surveilled
  • 300 “crews,” a designation that rises to “gangs” when there are arrests
  • Officers trawl Facebook, Instagram, Twitter, YouTube, and other social media for incriminating posts
  • They pose as young women to gain access to “private” accounts
  • Parents are not notified
  • People never get off these surveillance lists
  • In practice, half of court cases actually use social media data to put people away
  • NYPD cameras are located all over Harlem as well

We need to limit the kind of information police can collect, and put limits on how discriminatory their collection practices are. As the article points out, white fraternity brothers two blocks away at Columbia University are not on the lists, even though there was a big drug bust in 2010.

Anyone who wonders what a truly scary police surveillance state looks like need look no further than what’s already happening to certain Harlem residents.

Workplace Personality Tests: a Cynical View

There’s a frightening article in the Wall Street Journal by Lauren Weber about personality tests people are now forced to take to get shitty jobs in customer calling centers and the like. Some statistics from the article include: 8 out of 10 of the top private employers use such tests, and 57% of employers overall in 2013, a steep rise from previous years.

The questions are meant to be ambiguous so you can’t game them if you are an applicant. For example, yes or no: “I have never understood why some people find abstract art appealing.”

At the end of the test, you get a red light, a yellow light, or a green light. Red lighted people never get an interview, and yellow lighted may or may not. Companies cited in the article use the tests to disqualify more than half their applicants without ever talking to them in person.

The argument for these tests is that, after deploying them, turnover has gone down by 25% since 2000. The people who make and sell personality tests say this is because they’re controlling for personality type and “company fit.”

I have another theory about why people no longer leave shitty jobs, though. First of all, the recession has made people’s economic lives extremely precarious. Nobody wants to lose a job. Second of all, now that everyone is using arbitrary personality tests, the power of the worker to walk off the job and get another job the next week has gone down. By the way, the usage of personality tests seems to correlate with a longer waiting period between applying and starting work, so there’s that disincentive as well.

Workplace personality tests are nothing more than voodoo management tools that empower employers. In fact I’ve compared them in the past to modern day phrenology, and I haven’t seen any reason to change my mind since then. The real “metric of success” for these models is the fact that employers who use them can fire a good portion of their HR teams.

Categories: data science, modeling, rant

Fingers crossed – book coming out next May

As it turns out, it takes a while to write a book, and then another few months to publish it.

I’m very excited today to tentatively announce that my book, which is tentatively entitled Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy, will be published in May 2016, in time to appear on summer reading lists and well before the election.

Fuck yeah! I’m so excited.

p.s. Fight for 15 is happening now.

The achievement gap: whose problem is it?

On Monday night I went to see Boston College professor Henry Braun speak about the Value-Added Model for teachers (VAM) at Teachers College, right here in my hood (hat tip Sendhil Revuluri).

I wrote about VAM recently, and I’m not a fan, so I was excited for the event. Here’s the poster from Monday:



The room was not entirely filled with anti-VAM activists such as myself, even though it was an informed audience. In fact one of the people I found myself talking to before the talk started mentioned that he’d worked on Wall Street, where they “culled” 10% of the workforce regularly – during downsizing phases – and how fantastic it was, how it kept standards high.

I mentioned that the question is, who gets to decide which 10% and why, and he responded that it was all about profit, naturally. Being an easily provoked person, I found myself saying, well right, that’s the definition of success for Wall Street, and we can see how that’s turned out for everyone. He stared blankly at me.

I told that story because it irks me, still, how utterly unscathed the individuals who were or are part of the Wall Street culture feel. They don’t see any lesson to learn from that whole mess.

But even more than that, the same mindset which served the country so poorly is now somehow being held up as a success story, and applied to other fields like public education.

That brings me to the talk itself. Professor Braun did a very good job of explaining the VAM, and the inconsistencies, and the smallish correlations and unaccountable black box nature of the test.

But he then did more: he drew up a (necessarily vague) picture of the entire process by which a teacher is “assessed,” of which VAM plays a varying role, and he asked some important questions: how does this process affect the teaching profession? Does the scrutiny of each teacher in this way make students learn more? Does it make bad teachers get better? Does it make good teachers stay in the profession?

Great questions, but he didn’t even stop there. He went on to point something out that I’d never directly considered. Namely, why do we think individual responsibility – i.e. finger pointing at individual teachers – is going to improve the overall system? Here he suggested that there’s been a huge split in the profession between those who want to improve educational systems and those who want to assess teachers (and think that will “close the achievement gap”). The people who want to improve education talk about increasing communication between teachers in a school or between schools in a district, and they talk about improving and strengthening communities and cultures of learning.

By contrast the “assess the teachers” crowd is convinced that holding teachers individually accountable for the achievement of their students is the only possible approach. Fuck the school culture, fuck communicating with other teachers in the school. Fuck differences in curriculum or having old books or not having enough books due to unequal funding.

It got me thinking, especially since I read that book last week, The New Prophets of Capitalism (review here). That book explained how hollow Oprah’s urging to live a perfect life is to people whose situations are beyond their control. The problem with Oprah’s reasoning is that it ignores real systemic problems and issues that radically affect certain parts of the population and make it much harder to take her advice. It’s context free in a world where context is more and more meaningful.

So, whose problem is the achievement gap? Is it owned in tiny pieces by every teacher who dares to enter the profession? Is it owned by schools or school systems? Or is it owned by all of us, by the country as a whole? And if it is, how are we going to start working together to solve it?

Categories: education, modeling
