Archive for the ‘modeling’ Category

Duke deans drop the ball on scientific misconduct

Former Duke University cancer researcher Anil Potti was found guilty of research misconduct yesterday by the federal Office of Research Integrity (ORI), after a multi-year investigation. You can read the story in Science, for example. His punishment is that he won’t do research without government-sponsored supervision for the next five years. Not exactly stiff.

This article also covers the ORI decision, and describes some of the people who suffered from poor cancer treatment because of his lies. Here’s an excerpt:

Shoffner, who had Stage 3 breast cancer, said she still has side effects from the wrong chemotherapy given to her in the Duke trial. Her joints were damaged, she said, and she suffered blood clots that prevent her from having knee surgery now. Of the eight patients who sued, Shoffner said, she is one of two survivors.

What’s interesting to me this morning is that both articles above mention the same reason for the initial investigation into his work: namely, that he had padded his resume, pretending to be a Rhodes Scholar when he wasn’t. That fact was reported by a website called Cancer Letter in 2010.

But here’s the thing: back in 2008 a third-year medical student named Bradford Perez sent the deans at Duke (according to Cancer Letter) a letter explaining that Potti’s lab was fabricating results. And for those of you who can read nerd, please go ahead and read his letter; it is extremely convincing. An excerpt:

Fifty-nine cell line samples with mRNA expression data from NCI-60 with associated radiation sensitivity were split in half to designate sensitive and resistant phenotypes. Then in developing the model, only those samples which fit the model best in cross validation were included. Over half of the original samples were removed. It is very possible that using these methods two samples with very little if any difference in radiation sensitivity could be in separate phenotypic categories. This was an incredibly biased approach which does little more than give the appearance of a successful cross validation.
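To see why this kind of sample selection is biased, here is a minimal sketch in Python. It is not a reconstruction of the lab's actual pipeline: the data are pure noise, the classifier is a simple nearest-centroid rule, and every name and parameter (`loo_hits`, `simulate`, 50 fake genes, 10 trials) is an assumption made for illustration. The only thing borrowed from the letter is the procedure: split 59 samples in half, cross validate, keep only the samples that fit, and cross validate again.

```python
import random
from statistics import mean, median

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def centroid(rows):
    """Coordinate-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def loo_hits(X, y):
    """Leave-one-out nearest-centroid classification.

    Returns one True/False per sample: was it classified correctly
    by a model fit on all the other samples?
    """
    hits = []
    for i in range(len(y)):
        train = [(x, lab) for j, (x, lab) in enumerate(zip(X, y)) if j != i]
        class0 = [x for x, lab in train if lab == 0]
        class1 = [x for x, lab in train if lab == 1]
        if not class0 or not class1:  # cannot form both centroids
            hits.append(False)
            continue
        pred = int(sq_dist(X[i], centroid(class1)) < sq_dist(X[i], centroid(class0)))
        hits.append(pred == y[i])
    return hits

def accuracy(hits):
    return sum(hits) / len(hits) if hits else 0.0

def simulate(n_samples=59, n_genes=50, n_trials=10, seed=0):
    """Average honest vs. 'keep only what fits' accuracy on pure noise."""
    rng = random.Random(seed)
    honest, biased = [], []
    for _ in range(n_trials):
        # Fake expression data and fake sensitivities, independent by construction.
        X = [[rng.gauss(0, 1) for _ in range(n_genes)] for _ in range(n_samples)]
        sensitivity = [rng.gauss(0, 1) for _ in range(n_samples)]
        cut = median(sensitivity)
        y = [int(s > cut) for s in sensitivity]  # split in half: sensitive vs. resistant

        hits = loo_hits(X, y)
        honest.append(accuracy(hits))

        # The biased step: drop every sample the model got wrong, then re-validate.
        X_kept = [x for x, h in zip(X, hits) if h]
        y_kept = [lab for lab, h in zip(y, hits) if h]
        biased.append(accuracy(loo_hits(X_kept, y_kept)))
    return {"honest": mean(honest), "biased": mean(biased)}

result = simulate()
print(result)
```

On noise like this, the honest leave-one-out accuracy should hover around chance, while the accuracy on the hand-picked samples should look much better, even though there is nothing to learn. That gap is the "appearance of a successful cross validation" the letter describes.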

Instead of taking up the matter seriously, the deans pressured Perez to keep quiet. And nothing more happened for two more years.

The good news: Bradford Perez seems to have gotten a perfectly good job.

The bad news: the deans at Duke suck. Unfortunately I don’t know exactly which deans and what their job titles are, but still: why are they not under investigation? What would deans have to do – or not do – to get in trouble? Is there any kind of accountability here?

Guest post: Open-Source Loan-Level Analysis of Fannie and Freddie

This is a guest post by Todd Schneider. You can read the full post with additional analysis on Todd’s personal site.

[M]ortgages were acknowledged to be the most mathematically complex securities in the marketplace. The complexity arose entirely out of the option the homeowner has to prepay his loan; it was poetic that the single financial complexity contributed to the marketplace by the common man was the Gordian knot giving the best brains on Wall Street a run for their money. Ranieri’s instincts that had led him to build an enormous research department had been right: Mortgages were about math.

The money was made, therefore, with ever more refined tools of analysis.

—Michael Lewis, Liar’s Poker (1989)

Fannie Mae and Freddie Mac began reporting loan-level credit performance data in 2013 at the direction of their regulator, the Federal Housing Finance Agency. The stated purpose of releasing the data was to “increase transparency, which helps investors build more accurate credit performance models in support of potential risk-sharing initiatives.”

The GSEs went through a nearly $200 billion government bailout during the financial crisis, motivated in large part by losses on loans that they guaranteed, so I figured there must be something interesting in the loan-level data. I decided to dig in with some geographic analysis, an attempt to identify the loan-level characteristics most predictive of default rates, and more. The code for processing and analyzing the data is all available on GitHub.


The “medium data” revolution

In the not-so-distant past, an analysis of loan-level mortgage data would have cost a lot of money. Between licensing data and paying for expensive computers to analyze it, you could have easily incurred costs north of a million dollars per year. Today, in addition to Fannie and Freddie making their data freely available, we’re in the midst of what I might call the “medium data” revolution: personal computers are so powerful that my MacBook Air is capable of analyzing the entire 215 GB of data, representing some 38 million loans, 1.6 billion observations, and over $7.1 trillion of origination volume. Furthermore, I did everything with free, open-source software.

What can we learn from the loan-level data?

Loans originated from 2005-2008 performed dramatically worse than loans that came before them! That should be an extraordinarily unsurprising statement to anyone who was even slightly aware of the U.S. mortgage crisis that began in 2007:

[Chart: share of loans that became seriously delinquent, by origination year]

About 4% of loans originated from 1999 to 2003 became seriously delinquent at some point in their lives. The 2004 vintage showed some performance deterioration, and then the vintages from 2005 through 2008 show significantly worse performance: more than 15% of all loans originated in those years became distressed.
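For readers who want to see how a number like this is computed, here is a minimal sketch in Python on toy data. The record layout, the `cumulative_default_rates` helper, and the choice to treat three or more months past due as "seriously delinquent" are all assumptions for illustration; the real files have their own field names and definitions. The idea is to flag each loan that ever crossed the threshold at any point in its life, then take the share of flagged loans within each origination year.

```python
from collections import defaultdict

# Toy monthly observations: (loan_id, origination_year, months_delinquent).
# These rows are made up; they are not drawn from the Fannie/Freddie data.
observations = [
    ("A", 2002, 0), ("A", 2002, 1), ("A", 2002, 0),
    ("B", 2002, 0), ("B", 2002, 0),
    ("C", 2006, 1), ("C", 2006, 2), ("C", 2006, 3),
    ("D", 2006, 0), ("D", 2006, 0),
    ("E", 2006, 2), ("E", 2006, 3), ("E", 2006, 4),
    ("F", 2010, 0),
]

SERIOUS_MONTHS = 3  # assumed threshold: 90+ days past due

def cumulative_default_rates(rows, threshold=SERIOUS_MONTHS):
    """Share of loans in each group that were ever `threshold`+ months late."""
    ever_defaulted = {}  # loan_id -> True once the loan has crossed the threshold
    group_of = {}        # loan_id -> grouping key (here, origination year)
    for loan_id, group, months_late in rows:
        group_of[loan_id] = group
        ever_defaulted[loan_id] = (
            ever_defaulted.get(loan_id, False) or months_late >= threshold
        )

    totals, defaults = defaultdict(int), defaultdict(int)
    for loan_id, group in group_of.items():
        totals[group] += 1
        defaults[group] += ever_defaulted[loan_id]
    return {g: defaults[g] / totals[g] for g in sorted(totals)}

rates = cumulative_default_rates(observations)
for year, rate in rates.items():
    print(year, f"{rate:.0%}")
```

The grouping key is just the second field of each row, so the same function would work for any other bucket, such as the property's state.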

From 2009 through the present, the performance has been much better, with fewer than 2% of loans defaulting. Of course part of that is that it takes time for a loan to default, so the most recent vintages will tend to have lower cumulative default rates while their loans are still young. But there has also been a dramatic shift in lending standards so that the loans made since 2009 have been of much higher credit quality: the average FICO score used to be 720, but since 2009 it has been more like 765. Furthermore, if we look 2 standard deviations from the mean, we see that the low end of the FICO spectrum used to reach down to about 600, but since 2009 there have been very few loans with FICO less than 680:

[Chart: FICO scores by origination year, mean and 2-standard-deviation range]
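The "2 standard deviations from the mean" band is simple arithmetic. Here is a small sketch in Python, using made-up scores chosen so the means come out to 720 and 765; the scores, the years used as labels, and the `two_sigma_band` helper are hypothetical and not taken from the dataset.

```python
from statistics import mean, pstdev

# Hypothetical FICO scores for two origination years (toy numbers).
scores_by_year = {
    2006: [630, 690, 720, 750, 810],
    2012: [705, 735, 765, 795, 825],
}

def two_sigma_band(scores):
    """Return (mean - 2*sd, mean, mean + 2*sd) using the population sd."""
    mu, sigma = mean(scores), pstdev(scores)
    return mu - 2 * sigma, mu, mu + 2 * sigma

bands = {year: two_sigma_band(s) for year, s in scores_by_year.items()}
for year, (low, mu, high) in bands.items():
    print(year, round(low), mu, round(high))
```

A higher mean combined with a tighter spread is what pulls the low end of the band up so sharply.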

Tighter agency standards, coupled with a complete shutdown in the non-agency mortgage market, including both subprime and Alt-A lending, mean that there is very little credit available to borrowers with low credit scores (a far more difficult question is whether this is a good or bad thing!).

Geographic performance

Default rates increased everywhere during the bubble years, but some states fared far worse than others. I took every loan originated between 2005 and 2007, broadly considered to be the height of reckless mortgage lending, bucketed loans by state, and calculated the cumulative default rate of loans in each state:

[Chart: cumulative default rate by state, loans originated 2005 to 2007]

Four states in particular jump out as the worst performers: California, Florida, Arizona, and Nevada. Just about every state experienced significantly higher than normal default rates during the mortgage crisis, but these four states, often labeled the “sand states”, experienced the worst of it.

Read more

If you’re interested in more technical discussion, including an attempt to identify which loan-level variables are most correlated with default rates (the number one being the home price adjusted loan to value ratio), read the full post on Todd’s personal site, and be sure to check out the project on GitHub if you’d like to do your own data analysis.

China announces it is scoring its citizens using big data

Please go read the article in the Dutch newspaper de Volkskrant entitled China rates its own citizens – including online behavior (hat tip Ernie Davis).

The article describes China’s plan to use big data techniques to score all of its citizens – with the help of Chinese internet giants Alibaba, Baidu, and Tencent – in a kind of expanded credit score that includes behavior and reputation. So what you buy, who you’re friends with, and whether you seem sufficiently “socialist” are factors that affect your overall score.

Here’s an incredibly creepy quote from a person working on the model, from the Chinese Academy of Social Sciences:

When people’s behavior isn’t bound by their morality, a system must be used to restrict their actions

And here’s another quote from Rogier Creemers, an academic at Oxford who specializes in China:

Government and big internet companies in China can exploit ‘Big Data’ together in a way that is unimaginable in the West

I guess I’m wondering whether that’s really true. Given my research over the past couple of years, I see this kind of “social credit scoring” being widely implemented here in the United States.

Looking for big data reading suggestions

I have been told by my editor to take a look at the books already out there on big data to make sure my book hasn’t already been written. For example, today I’m set to read Robert Scheer’s They Know Everything About You: how data-collecting corporations and snooping government agencies are destroying democracy.

This book, like others I’ve already read and written about (Bruce Schneier’s Data and Goliath, Frank Pasquale’s Black Box Society, and Julia Angwin’s Dragnet Nation), is primarily concerned with individual freedom and privacy, whereas my book is primarily concerned with social justice issues, and each chapter gives an example of how big data is being used as a tool against the poor, against minorities, against the mentally ill, or against public school teachers.

Not that my book is entirely different from the above books, but the relationship is something like what I spelled out last week when I discussed the four political camps in the big data world. So far the books I’ve found are focused on the corporate angle or the privacy angle. There may also be books focused on the open data angle, but I’m guessing they have even less in common with my book, which focuses on the ways big data increases inequality and further alienates already alienated populations.

If any of you know of a book I should be looking at, please tell me!

The Police State is already here.

The thing that people like Snowden are worried about with respect to mass surveillance has already happened. It’s being carried out by police departments, though, not the NSA, and its targets are black men, not the general population.

Take a look at this incredible Guardian article written by Rose Hackman. Her title is, Is the online surveillance of black teenagers the new stop-and-frisk? but honestly that’s a pretty tame comparison if you think about the kinds of permanent electronic information that the police are collecting about black boys in Harlem as young as 10 years old.

Some facts about the program:

  • 28,000 residents are being surveilled
  • 300 “crews,” a designation that rises to “gangs” when there are arrests
  • Officers trawl Facebook, Instagram, Twitter, YouTube, and other social media for incriminating posts
  • They pose as young women to gain access to “private” accounts
  • Parents are not notified
  • People never get off these surveillance lists
  • In practice, half of court cases actually use social media data to put people away
  • NYPD cameras are located all over Harlem as well

We need to limit the kind of information police can collect, and put limits on how discriminatory their collection practices are. As the article points out, white fraternity brothers two blocks away at Columbia University are not on the lists, even though there was a big drug bust in 2010.

Anyone who wonders what a truly scary police surveillance state looks like need look no further than what’s already happening for certain Harlem residents.

Workplace Personality Tests: a Cynical View

There’s a frightening article in the Wall Street Journal by Lauren Weber about personality tests people are now forced to take to get shitty jobs in customer calling centers and the like. Some statistics from the article: 8 of the top 10 private employers use such tests, as did 57% of employers overall in 2013, a steep rise from previous years.

The questions are meant to be ambiguous so you can’t game them if you are an applicant. For example, yes or no: “I have never understood why some people find abstract art appealing.”

At the end of the test, you get a red light, a yellow light, or a green light. Red-lighted people never get an interview, and yellow-lighted people may or may not. Companies cited in the article use the tests to disqualify more than half their applicants without ever talking to them in person.

The argument for these tests is that, after deploying them, turnover has gone down by 25% since 2000. The people who make and sell personality tests say this is because they’re controlling for personality type and “company fit.”

I have another theory about why people no longer leave shitty jobs, though. First of all, the recession has made people’s economic lives extremely precarious. Nobody wants to lose a job. Second of all, now that everyone is using arbitrary personality tests, the power of the worker to walk off the job and get another job the next week has gone down. By the way, the usage of personality tests seems to correlate with a longer waiting period between applying and starting work, so there’s that disincentive as well.

Workplace personality tests are nothing more than voodoo management tools that empower employers. In fact I’ve compared them in the past to modern day phrenology, and I haven’t seen any reason to change my mind since then. The real “metric of success” for these models is the fact that employers who use them can fire a good portion of their HR teams.


Fingers crossed – book coming out next May

As it turns out, it takes a while to write a book, and then another few months to publish it.

I’m very excited today to tentatively announce that my book, which is tentatively entitled Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy, will be published in May 2016, in time to appear on summer reading lists and well before the election.

Fuck yeah! I’m so excited.

p.s. Fight for 15 is happening now.

