modeling | mathbabe

MAA Distinguished Lecture Series: Start Your Own Netflix

October 16, 2013 Cathy O'Neil, mathbabe 4 comments

I’m on my way to D.C. today to give an alleged “distinguished lecture” to a group of mathematics enthusiasts. I misspoke in a previous post where I characterized the audience to consist of math teachers. In fact, I’ve been told it will consist primarily of people with some mathematical background, with typically a handful of high school teachers, a few interested members of the public, and a number of high school and college students included in the group.

So I’m going to try my best to explain three different ways of approaching recommendation engine building for services such as Netflix. I’ll be giving high-level descriptions of the latent factor model (this movie is violent and we’ve noticed you like violent movies), the co-visitation model (lots of people who’ve seen stuff you’ve seen also saw this movie), and the latent topic model (we’ve noticed you like movies about the Hungarian 1956 Revolution). Then I’m going to give some indication of the issues in doing these massive-scale calculations and how they can be worked out.
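To make the co-visitation idea concrete, here is a minimal sketch in Python. The viewing histories, the function names and the scoring rule (add up co-view counts over everything you’ve already seen) are my own toy assumptions, not anything Netflix actually does, and a real system would need to handle the scale issues mentioned above.

```python
from collections import Counter
from itertools import combinations

def covisitation_counts(histories):
    """Count, for each ordered pair (a, b), how many users saw both a and b."""
    counts = Counter()
    for seen in histories.values():
        for a, b in combinations(sorted(set(seen)), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts

def recommend(user, histories, top_n=3):
    """Suggest unseen movies that were most often co-viewed with the user's movies."""
    seen = set(histories[user])
    scores = Counter()
    for (a, b), n in covisitation_counts(histories).items():
        if a in seen and b not in seen:
            scores[b] += n
    return [movie for movie, _ in scores.most_common(top_n)]

# Hypothetical viewing histories, purely for illustration.
histories = {
    "ann": ["Alien", "Heat"],
    "bob": ["Alien", "Heat", "Ronin"],
    "cat": ["Alien", "Ronin", "Amelie"],
    "dan": ["Heat", "Ronin"],
}
print(recommend("ann", histories))
```

In this toy data, “Ronin” was co-viewed with both of ann’s movies by several people, so it ranks ahead of “Amelie”.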

And yes, I double-checked with those guys over at Netflix, I am allowed to use their name as long as I make sure people know there’s no affiliation.

In addition to the actual lecture, the MAA is having me give a 10-minute TED-like talk for their website as well as an interview. I am psyched by how easy it is to prepare my slides for that short version using prezi, since I just removed a bunch of nodes on the path of the material without removing the material itself. I will make that short version available when it comes online, and I also plan to share the longer prezi publicly.

[As an aside, and not to sound like an advertiser for prezi (no affiliation with them either!), but they have a free version and the resulting slides are pretty cool. If you want to be able to keep your prezis private you have to pay, but not as much as you’d need to pay for powerpoint. Of course there’s always Open Office.]

Train reading: Wrong Answer: the case against Algebra II, by Nicholson Baker, which was handed to me emphatically by my friend Nick. Apparently I need to read this and have an opinion.

Categories: math, math education, modeling

Are PayDay lenders better than banks? #OWS

October 15, 2013 Cathy O'Neil, mathbabe 10 comments

Sometimes my plan of getting up super early to write on my blog fails, and this is one of those days. But I’m still going to ask you to read this article from the New Yorker written by Lisa Servon and entitled, “The High Cost, For The Poor, Of Using A Bank.” Here’s a key passage, but the whole thing is amazing, and yes, I’ve invited her to my Occupy group already:

To understand why, consider loans of small amounts. People criticize payday loans for their high annual percentage rates (APR), which range from three hundred per cent to six hundred per cent. Payday lenders argue that APR is the wrong measure: the loans, they say, are designed to be repaid in as little as two weeks. Consumer advocates counter that borrowers typically take out nine of these loans each year, and end up indebted for more than half of each year.

But what alternative do low-income borrowers have? Banks have retreated from small-dollar credit, and many payday borrowers do not qualify anyway. It happens that banks offer a de-facto short-term, high-interest loan. It’s called an overdraft fee. An overdraft is essentially a short-term loan, and if it had a repayment period of seven days, the APR for a typical incident would be over five thousand per cent.

It makes me wonder whether, if someone did a careful analysis with all-in costs including time and travel, PayDay lenders might not actually be a totally rational choice for the poor.
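For anyone who wants to see where APR numbers like these come from, here is a rough sketch. It uses a simple, non-compounding annualization, and the dollar amounts are hypothetical numbers I made up for illustration, not figures from the article.

```python
def simple_apr(fee, principal, days):
    """Annualize a one-off fee as a simple (non-compounding) rate."""
    return fee / principal * 365 / days

# Hypothetical payday loan: $15 fee on $100 borrowed for 14 days.
payday = simple_apr(fee=15, principal=100, days=14)

# Hypothetical overdraft: $35 fee on a $35 overdraft repaid within 7 days.
overdraft = simple_apr(fee=35, principal=35, days=7)

print(f"payday loan: {payday:.0%}, overdraft: {overdraft:.0%}")
```

Under these made-up numbers the payday loan comes out just under four hundred per cent and the overdraft over five thousand per cent, which is the kind of gap the passage describes. A proper all-in comparison would also have to price in time and travel.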

Categories: #OWS, finance, modeling, news

Cumulative covariance plots

October 11, 2013 Cathy O'Neil, mathbabe 4 comments

One thing I do a lot when I work with data is figure out how to visualize my signals, especially with respect to time.

Lots of things change over time – relationships between variables, for example – and it’s often crucial to get deeply acquainted with how exactly that works with your in-sample data.

Say I am trying to predict “y”: so for a data point at time t, we’ll say we try to predict y(t). I’ll take an “x”, a variable that is expected to predict “y”, and I’ll demean both series x and y, hopefully in a causal way, and I will rename them x’ and y’, and then, making sure I’ve ordered everything with respect to time, I’ll plot the cumulative sum of the product x'(t) * y'(t).

In the case that both x'(t) and y'(t) have the same sign – so they’re both bigger than average or they’re both smaller than average – this product is positive, and otherwise it’s negative. So if you plot the cumulative sum, you get an upwards trend if things are positively correlated and a downwards trend if things are negatively correlated. If you think about it, you are computing the numerator of the correlation function, so it is indeed just an unscaled version of total correlation.

Plus, since you ordered everything by time first, you can see how the relationship between these variables evolved over time.
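Here is a minimal sketch of that calculation in plain Python. The “causal” demeaning below subtracts the mean of strictly earlier points, which is one reasonable reading of the idea and my own assumption; a rolling window or a decayed mean would work just as well. The function names are made up for this example.

```python
from itertools import accumulate

def causal_demean(series):
    """Subtract from each point the mean of the points strictly before it."""
    demeaned, running_sum = [], 0.0
    for t, value in enumerate(series):
        # The first point has no history, so it contributes nothing.
        demeaned.append(value - running_sum / t if t > 0 else 0.0)
        running_sum += value
    return demeaned

def cumulative_covariance(x, y):
    """Running sum of x'(t) * y'(t), with both series already ordered by time."""
    products = [a * b for a, b in zip(causal_demean(x), causal_demean(y))]
    return list(accumulate(products))

# Toy series: x and y move together, so the curve should trend upwards.
curve = cumulative_covariance([1, 2, 3, 4], [1, 2, 3, 4])
print(curve)
```

To actually look at it, plot `curve` against your time axis, for example with matplotlib’s `plt.plot`. The shape of the line over time matters as much as where it ends up.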

Also, in the case that you are working with financial models, you can make a simplifying assumption that both x and y are pretty well demeaned already (especially at short time scales) and this gives you the cumulative PnL plot of your model. In other words, it tells you how much money your model is making.

So I was doing this exercise of plotting the cumulative covariance with some data the other day, and I got a weird picture. It kind of looked like a “U” plot: it went down dramatically at the beginning, then was pretty flat but trending up, then it went straight up at the end. It ended up not quite as high as it started, which is to say that in terms of straight-up overall correlation, I was calculating something negative but not very large.

But what could account for that U-shape? After some time I realized that the data had been extracted from the database in such a way that, after ordering my data by date, it was hugely biased in the beginning and at the end, in different directions, and that this was unavoidable, and the picture helped me determine exactly which data to exclude from my set.

After getting rid of the biased data at the beginning and the end, I concluded that I had a positive correlation here, even though if I’d trusted the overall “dirty” correlation I would have thought it was negative.

This is good information, and confirmed my belief that it’s always better to visualize data over time than it is to believe one summary statistic like correlation.

Categories: data science, modeling

Data Skeptic post

October 10, 2013 Cathy O'Neil, mathbabe 3 comments

I wrote a blog post for O’Reilly’s website to accompany my essay, On Being a Data Skeptic. Here’s an excerpt:

I left finance pretty disgusted with the whole thing, and because I needed to make money and because I’m a nerd, I pretty quickly realized I could rebrand myself a “data scientist” and get a pretty cool job, and that’s what I did. Once I started working in the field, though, I was kind of shocked by how positive everyone was about the “big data revolution” and the “power of data science.”

Not to underestimate the power of data––it’s clearly powerful! And big data has the potential to really revolutionize the way we live our lives for the better––or sometimes not. It really depends.

From my perspective, this was, in tenor if not in the details, the same stuff we’d been doing in finance for a couple of decades and that fields like advertising were slow to pick up on. And, also from my perspective, people needed to be way more careful and skeptical of their powers than they currently seem to be. Because whereas in finance we need to worry about models manipulating the market, in data science we need to worry about models manipulating people, which is in fact scarier. Modelers, if anything, have a bigger responsibility now than ever before.

Categories: data science, finance, modeling

Guest post: Rage against the algorithms

October 8, 2013 Cathy O'Neil, mathbabe 8 comments

This is a guest post by Nicholas Diakopoulos, a Tow Fellow at the Columbia University Graduate School of Journalism where he is researching the use of data and algorithms in the news. You can find out more about his research and other projects on his website or by following him on Twitter. Crossposted from engenhonetwork with permission from the author.

How can we know the biases of a piece of software? By reverse engineering it, of course.

When was the last time you read an online review about a local business or service on a platform like Yelp? Of course you want to make sure the local plumber you hire is honest, or that even if the date is a dud, at least the restaurant isn’t lousy. A recent survey found that 76 percent of consumers check online reviews before buying, so a lot can hinge on a good or bad review. Such sites have become so important to local businesses that it’s not uncommon for scheming owners to hire shills to boost themselves or put down their rivals.

To protect users from getting duped by fake reviews, Yelp employs an algorithmic review reviewer which constantly scans reviews and relegates suspicious ones to a “filtered reviews” page, effectively de-emphasizing them without deleting them entirely. But of course that algorithm is not perfect, and it sometimes de-emphasizes legitimate reviews and leaves actual fakes intact—oops. Some businesses have complained, alleging that the filter can incorrectly remove all of their most positive reviews, leaving them with a lowly one- or two-star average.

This is just one example of how algorithms are becoming ever more important in society, for everything from search engine personalization, discrimination, defamation, and censorship online, to how teachers are evaluated, how markets work, how political campaigns are run, and even how something like immigration is policed. Algorithms, driven by vast troves of data, are the new power brokers in society, both in the corporate world as well as in government.

They have biases like the rest of us. And they make mistakes. But they’re opaque, hiding their secrets behind layers of complexity. How can we deal with the power that algorithms may exert on us? How can we better understand where they might be wronging us?

Transparency is the vogue response to this problem right now. The big “open data” transparency-in-government push that started in 2009 was largely the result of an executive memo from President Obama. And of course corporations are on board too; Google publishes a biannual transparency report showing how often they remove or disclose information to governments. Transparency is an effective tool for inculcating public trust and is even the way journalists are now trained to deal with the hole where mighty Objectivity once stood.

But transparency knows some bounds. For example, though the Freedom of Information Act facilitates the public’s right to relevant government data, it has no legal teeth for compelling the government to disclose how that data was algorithmically generated or used in publicly relevant decisions (extensions worth considering).

Moreover, corporations have self-imposed limits on how transparent they want to be, since exposing too many details of their proprietary systems may undermine a competitive advantage (trade secrets), or leave the system open to gaming and manipulation. Furthermore, whereas transparency of data can be achieved simply by publishing a spreadsheet or database, transparency of an algorithm can be much more complex, resulting in additional labor costs both in creation as well as consumption of that information—a cognitive overload that keeps all but the most determined at bay. Methods for usable transparency need to be developed so that the relevant aspects of an algorithm can be presented in an understandable way.

Given the challenges to employing transparency as a check on algorithmic power, a new and complementary alternative is emerging. I call it algorithmic accountability reporting. At its core it’s really about reverse engineering—articulating the specifications of a system through a rigorous examination drawing on domain knowledge, observation, and deduction to unearth a model of how that system works.

As interest grows in understanding the broader impacts of algorithms, this kind of accountability reporting is already happening in some newsrooms, as well as in academic circles. At the Wall Street Journal a team of reporters probed e-commerce platforms to identify instances of potential price discrimination in dynamic and personalized online pricing. By polling different websites they were able to spot several, such as Staples.com, that were adjusting prices dynamically based on the location of the person visiting the site. At the Daily Beast, reporter Michael Keller dove into the iPhone spelling correction feature to help surface patterns of censorship and see which words, like “abortion,” the phone wouldn’t correct if they were misspelled. In my own investigation for Slate, I traced the contours of the editorial criteria embedded in search engine autocomplete algorithms. By collecting hundreds of autocompletions for queries relating to sex and violence I was able to ascertain which terms Google and Bing were blocking or censoring, uncovering mistakes in how these algorithms apply their editorial criteria.

All of these stories share a more or less common method. Algorithms are essentially black boxes, exposing an input and output without betraying any of their inner organs. You can’t see what’s going on inside directly, but if you vary the inputs in enough different ways and pay close attention to the outputs, you can start piecing together some likeness for how the algorithm transforms each input into an output. The black box starts to divulge some secrets.
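As a rough sketch of that input-varying method: hold everything about a query fixed except one field, run each variant through the black box, and group the variants by the output they produce. The pricing function here is a hypothetical stand-in invented for illustration, not a model of any real site, and the helper name is an assumption as well.

```python
from collections import defaultdict

def probe_black_box(black_box, base_input, field, values):
    """Vary one input field, hold the rest fixed, and group the values by output."""
    outputs = defaultdict(list)
    for value in values:
        query = dict(base_input, **{field: value})
        outputs[black_box(query)].append(value)
    return dict(outputs)

def toy_pricer(query):
    """Hypothetical opaque system: quietly charges less in some zip codes."""
    return 14.29 if query["zip_code"].startswith("1") else 15.79

base = {"item": "stapler", "zip_code": "10001"}
groups = probe_black_box(toy_pricer, base, "zip_code", ["10001", "94110", "10027"])
print(groups)
```

If every variant lands in one group, that field probably doesn’t matter to the system; if the variants split, as they do here, you have a lead worth reporting out.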

Algorithmic accountability is also gaining traction in academia. At Harvard, Latanya Sweeney has looked at how online advertisements can be biased by the racial association of names used as queries. When you search for “black names” as opposed to “white names,” ads using the word “arrest” appear more often for online background check service Instant Checkmate. She thinks the disparity in the use of “arrest” suggests a discriminatory connection between race and crime. Her method, as with all of the other examples above, does point to a weakness though: Is the discrimination caused by Google, by Instant Checkmate, or simply by pre-existing societal biases? We don’t know, and correlation does not equal intention. As much as algorithmic accountability can help us diagnose the existence of a problem, we have to go deeper and do more journalistic-style reporting to understand the motivations or intentions behind an algorithm. We still need to answer the question of why.

And this is why it’s absolutely essential to have computational journalists not just engaging in the reverse engineering of algorithms, but also reporting and digging deeper into the motives and design intentions behind algorithms. Sure, it can be hard to convince companies running such algorithms to open up in detail about how their algorithms work, but interviews can still uncover details about larger goals and objectives built into an algorithm, better contextualizing a reverse-engineering analysis. Transparency is still important here too, as it adds to the information that can be used to characterize the technical system.

Despite the fact that forward thinkers like Larry Lessig have been writing for some time about how code is a lever on behavior, we’re still in the early days of developing methods for holding that code and its influence accountable. “There’s no conventional or obvious approach to it. It’s a lot of testing or trial and error, and it’s hard to teach in any uniform way,” noted Jeremy Singer-Vine, a reporter and programmer who worked on the WSJ price discrimination story. It will always be a messy business with lots of room for creativity, but given the growing power that algorithms wield in society it’s vital to continue to develop, codify, and teach more formalized methods of algorithmic accountability. In the absence of new legal measures, it may just provide a novel way to shed light on such systems, particularly in cases where transparency doesn’t or can’t offer much clarity.

Categories: data science, guest post, modeling

A.F.R. Transparency Panel coming up on Friday in D.C.

October 7, 2013 Cathy O'Neil, mathbabe 6 comments

I’m preparing for a short trip to D.C. this week to take part in a day-long event held by Americans for Financial Reform. You can get the announcement here online, but I’m not sure what the finalized schedule of the day is going to be. Also, I believe it will be recorded, but I don’t know the details yet.

In any case, I’m psyched to be joining this, and the AFR are great guys doing important work in the realm of financial reform.

——

Opening Wall Street’s Black Box: Pathways to Improved Financial Transparency

Sponsored By Americans for Financial Reform and Georgetown University Law Center

Keynote Speaker: Gary Gensler Chair, Commodity Futures Trading Commission

October 11, 2013 10 AM – 3 PM

Georgetown Law Center, Gewirz Student Center, 12th Floor

120 F Street NW, Washington, DC (Judiciary Square Metro) (Space is limited. Please RSVP to AFRtransparencyrsvp@gmail.com)

The 2008 financial crisis revealed that regulators and many sophisticated market participants were in the dark about major risks and exposures in our financial system. The lack of financial transparency enabled large-scale fraud and deception of investors, weakened the stability of the financial system, and contributed to the market failure after the collapse of Lehman Brothers. Five years later, despite regulatory efforts, it’s not clear how much the situation has improved.

Join regulators, market participants, and academic experts for an exploration of the progress made – and the work that remains to be done – toward meaningful transparency on Wall Street. How can better information and disclosure make the financial system both fairer and safer?

Panelists include:

Jesse Eisinger, Pulitzer Prize-winning reporter for the New York Times and Pro Publica

Zach Gast, Head of financial sector research, Center on Financial Research and Analysis

Amias Gerety, Deputy Assistant Secretary for the FSOC, United States Treasury

Henry Hu, Allan Shivers Chair in the Law of Banking and Finance, University of Texas Law School

Albert “Pete” Kyle, Charles E. Smith Professor of Finance, University of Maryland

Adam Levitin, Professor of Law, Georgetown University Law Center

Antoine Martin, Vice President, New York Federal Reserve Bank

Brad Miller, Former Representative from North Carolina; Of Counsel, Grais & Ellsworth

Cathy O’Neil, Senior Data Scientist, Johnson Research Labs; Occupy Alternative Banking

Gene Phillips, Director, PF2 Securities Evaluation

Greg Smith, Author of “Why I Left Goldman Sachs”; former Goldman Sachs Executive Director

Categories: finance, modeling, open source tools

I think I understand the revolving door problem

October 3, 2013 Cathy O'Neil, mathbabe 18 comments

I was reading this Bloomberg article about the internal risk models at JP Morgan versus Goldman Sachs, and it hit me: I too had an urge for the SEC to hire the insiders at Goldman Sachs to help them “understand risk” at every level. Why not hire a small team of Goldman Sachs experts to help the SEC combat bullshit like what happened with the London Whale?

After all, Goldman people know risk. They probably knew risk even better before 1999, when they went IPO and the partners stopped being personally liable for losses. But even now, of all the big players on the street, Goldman is known for being a few steps ahead of everyone else when it comes to a losing trade.

So it’s natural to want someone from deeply within that culture to come spread their technical risk wisdom to the other side, the regulators.

Unfortunately that’s never what actually happens. Instead of getting the technical knowledge of how to think about risk, how to model a portfolio to squirrel out black holes of mystery, the revolving door instead keeps outputting crazy freaks like Jon Corzine, who blow up firms through, ironically, taking ridiculous risks at the first opportunity.

So, why does this happen? Some possibilities:

Goldman Sachs promotes crazy freaks because they make great leaders while constrained inside a disciplined culture of calculated risks, but when they get outside they go nuts. This is kind of the model of Mormon children who are finally allowed out into the world and engage in tons of sex and drugs.
On the flip side, perhaps Goldman Sachs keeps the people who actually understand the technical part of risk very deep in the machine and these guys never get to leave the building at all.
Or maybe, people who understand risk sometimes do go through the revolving door, but they don’t share their knowledge with the other side, because their incentives have changed once they’re outside.
In other words, they don’t help the regulators understand how banks lie and cheat to regulators, because they’re too busy watering down regulation so their buddies can continuously lie and cheat to regulators.

Whatever the case, for whatever reason we keep using the revolving door in hopes that someone will eventually tell us the magic that Goldman Sachs knows, but we never quite get anyone like that, and that means the SEC and other regulators are woefully unprepared for the kind of tricks that banks have up their sleeves.

Categories: finance, modeling, musing

New Essay, On Being a Data Skeptic, now out

October 1, 2013 Cathy O'Neil, mathbabe 5 comments

It is available here and is based on a related essay written by Susan Webber entitled “Management’s Great Addiction: It’s time we recognized that we just can’t measure everything.” It is being published by O’Reilly as an e-book.

No, I don’t know who that woman is looking skeptical on the cover. I wish they’d asked me for a picture of a skeptical person; I think my 11-year-old son would’ve done a better job.

Categories: data science, modeling, musing

“Here and Now” is shilling for the College Board

September 30, 2013 Cathy O'Neil, mathbabe 13 comments

Did you think public radio doesn’t have advertising? Think again.

Last week Here and Now’s host Jeremy Hobson set up College Board’s James Montoya for a perfect advertisement regarding a story on SAT scores going down. The transcript and recording are here (hat tip Becky Jaffe).

To set it up, they talk about how GPA’s are going up on average over the country but how, at the same time, the average SAT score went down last year.

Somehow the interpretation of this is that there’s grade inflation and that kids must be in need of more test prep because they’re dumber.

What is the College Board?

You might think, especially if you listen to this interview, that the College Board is a thoughtful non-profit dedicated to getting kids prepared for college.

Make no mistake about it: the College Board is a big business, and much of their money comes from selling test prep stuff on top of administering tests. Here are a couple of things you might want to know about College Board through its wikipedia page:

Consumer rights organization Americans for Educational Testing Reform (AETR) has criticized College Board for violating its non-profit status through excessive profits and exorbitant executive compensation; nineteen of its executives make more than $300,000 per year, with CEO Gaston Caperton earning $1.3 million in 2009 (including deferred compensation).^[10]^[11] AETR also claims that College Board is acting unethically by selling test preparation materials, directly lobbying legislators and government officials, and refusing to acknowledge test-taker rights.^[12]

Anyhoo, let’s just say it this way: College Board has the ability to create an “emergency” about SAT scores, by, say, changing the test or making it harder, and then the only “reasonable response” is to pay for yet more test prep. And somehow Here and Now’s host Jeremy Hobson didn’t see this coming at all.

The interview

Here’s an excerpt:

HOBSON: It also suggests, when you look at the year-over-year scores, the averages, that things are getting worse, not better, because if I look at, for example, in critical reading in 2006, the average being 503, and now it’s 496. Same deal in math and writing. They’ve gone down.

MONTOYA: Well, at the same time that we have seen the scores go down, what’s very interesting is that we have seen the average GPAs reported going up. So, for example, when we look at SAT test takers this year, 48 percent reported having a GPA in the A range compared to 45 percent last year, compared to 44 percent in 2011, I think, suggesting that there simply have to be more rigor in core courses.

HOBSON: Well, and maybe that there’s grade inflation going on.

MONTOYA: Well, clearly, that there is grade inflation. There is no question about that. And it’s one of the reasons why standardized test scores are so important in the admission office. I know that, as a former dean of admission, test scores help gauge the meaning of a GPA, particularly given the fact that nearly half of all SAT takers are reporting a GPA in the A range.

Just to be super clear about the shilling, here’s Hobson a bit later in the interview:

HOBSON: Well – and we should say that your report noted – since you mentioned practice – that as is the case with the ACT, the students who take the rigorous prep courses do better on the SAT.

What does it really mean when SAT scores go down?

Here’s the thing. SAT scores are fucked with ALL THE TIME. Traditionally, they had to make SAT’s harder since people were getting better at them. As test-makers, they want a good bell curve, so they need to adjust the test as the population changes and as their habits of test prep change.

The result is that SAT tests are different every year, so just saying that the scores went down from year to year is meaningless. Even if the same group of kids took those two different tests in the same year, they’d have different scores.

Also, according to my friend Becky who works with kids preparing for the SAT, they really did make substantial changes recently in the math section, changing the function notation, which makes it much harder for kids to parse the questions. In other words, they switched something around to give kids reason to pay for more test prep.

Important: this has nothing to do with their knowledge, it has to do with their training for this specific test.

If you want to understand the issues outside of math, take for example the essay. According to this critique, the number one criterion for essay grade is length. Length trumps clarity of expression, relevance of the supporting arguments to the thesis, mechanics, and all other elements of quality writing. As my friend Becky says:

I have coached high school students on the SAT for years and have found time and again, much to my chagrin, that students receive top scores for long essays even if they are desultory, tangent-filled and riddled with sentence fragments, run-ons, and spelling errors.

Similarly, I have consistently seen students receive low scores for shorter essays that are thoughtful and sophisticated, logical and coherent, stylish and articulate.

As long as the number one criterion for receiving a high score on the SAT essay is length, students will be confused as to what constitutes successful college writing and scoring well on the written portion of the exam will remain essentially meaningless. High-scoring students will have to unlearn the strategies that led to success on the SAT essay and relearn the fundamentals of written expression in a college writing class.

If the College Board (the makers of the SAT) is so concerned about the dumbing down of American children, they should examine their own role in lowering and distorting the standards for written expression.

Conclusion

Two things. First, shame on College Board and James Montoya for acting like SAT scores are somehow beacons of truth without acknowledging the fiddling that goes on time and time again by his company. And second, shame on Here and Now and Jeremy Hobson for being utterly naive and buying in entirely to this scare tactic.

Categories: Becky Jaffe, math education, modeling, rant

A Code of Conduct for data scientists from the Bellagio Fellows

September 25, 2013 Cathy O'Neil, mathbabe 3 comments

The 2013 PopTech & Rockefeller Foundation Bellagio Fellows – Kate Crawford, Patrick Meier, Claudia Perlich, Amy Luers, Gustavo Faleiros and Jer Thorp – yesterday published “Seven Principles for Big Data and Resilience Projects” on Patrick Meier’s blog iRevolution.

Although they claim that these principles are meant for “best practices for resilience building projects that leverage Big Data and Advanced Computing,” I think they’re more general than that (although I’m not sure exactly what a resilience building project is) and I really like them. They are looking for public comments too. Go to the post for the full description of each, but here is a summary:

1. Open Source Data Tools

Wherever possible, data analytics and manipulation tools should be open source, architecture independent and broadly prevalent (R, python, etc.).

2. Transparent Data Infrastructure

Infrastructure for data collection and storage should operate based on transparent standards to maximize the number of users that can interact with the infrastructure.

3. Develop and Maintain Local Skills

Make “Data Literacy” more widespread. Leverage local data labor and build on existing skills.

4. Local Data Ownership

Use Creative Commons and licenses that state that data is not to be used for commercial purposes.

5. Ethical Data Sharing

Adopt existing data sharing protocols like the ICRC’s (2013). Permission for sharing is essential. How the data will be used should be clearly articulated. An opt in approach should be the preference wherever possible, and the ability for individuals to remove themselves from a data set after it has been collected must always be an option.

6. Right Not To Be Sensed

Local communities have a right not to be sensed. Large scale city sensing projects must have a clear framework for how people are able to be involved or choose not to participate.

7. Learning from Mistakes

Big Data and Resilience projects need to be open to face, report, and discuss failures.

Categories: data science, modeling, news, open source tools

Interactive scoring models: why hasn’t this happened yet?

September 12, 2013 Cathy O'Neil, mathbabe 10 comments

My friend Suresh just reminded me about this article written a couple of years ago by Malcolm Gladwell and published in the New Yorker.

It concerns various scoring models that claim to be both comprehensive (which means they cover the whole thing, not just one aspect of the thing) and heterogeneous (which means they are broad enough to cover all things in a category), say for cars or for colleges.

Weird things happen when you try to do this, like not caring much about price or exterior detailing for sports cars.

Two things. First, this stuff is actually really hard to do well. I like how Gladwell addresses this issue:

At no point, however, do the college guides acknowledge the extraordinary difficulty of the task they have set themselves.

Second of all, I think the issue of combining heterogeneity and comprehensiveness is addressable, but it has to be addressed interactively.

Specifically, what if instead of a single fixed score, there was a place where a given car-buyer or college-seeker could go to fill out a form of preferences? For each defined and rated aspect, the user would answer a question about how much they cared about that aspect. They’d assign a weight to each aspect. A given question would look something like this:

For colleges, some people care a lot about whether their college has a ton of alumni giving, other people care more about whether the surrounding town is urban or rural. Let’s let people create their own scoring system. It’s technically easy.

I’ve suggested this before when I talked about rating math articles on various dimensions (hard, interesting, technical, well-written) and then letting people come and search based on weighting those dimensions and ranking. But honestly we can start even dumber, with car ratings and college ratings.
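To show how little machinery this takes, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the function name `rank_items`, the aspect names, the 0-10 rating scale, and the ratings themselves are made up, and a real site would need to think harder about how ratings get normalized.

```python
def rank_items(ratings, weights):
    """Rank items by a user-weighted average of per-aspect ratings.

    ratings: {item: {aspect: rating}} -- hypothetical ratings on a 0-10 scale
    weights: {aspect: weight}         -- how much this particular user cares
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one aspect needs a positive weight")
    scores = {
        # Aspects the user gave no weight to simply count for nothing.
        item: sum(weights.get(aspect, 0) * rating
                  for aspect, rating in aspects.items()) / total
        for item, aspects in ratings.items()
    }
    # Highest personalized score first.
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


# Made-up ratings, purely for illustration.
colleges = {
    "College A": {"alumni_giving": 9, "urban_setting": 2},
    "College B": {"alumni_giving": 4, "urban_setting": 9},
}

# Two users, same ratings, different priorities.
donor_minded = rank_items(colleges, {"alumni_giving": 5, "urban_setting": 1})
city_lover = rank_items(colleges, {"alumni_giving": 1, "urban_setting": 5})
```

The same underlying ratings produce a different winner for each user, which is the whole point: the ranking belongs to the person asking, not to the magazine.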

Categories: data science, modeling

Working in the NYC Mayor’s Office

September 10, 2013 Cathy O'Neil, mathbabe 7 comments

I recently took a job in the NYC Mayor’s Office as an unpaid consultant. It’s an interesting time to be working for the Mayor, to be sure – everyone’s waiting to see what happens this week with the election, and all sorts of things are up in the air. Planning essentially stops at December 31st.

Note the expiration date.

I’m working in a data group which deals with social service agency data. That means Child Services, Homeless Services, and the like. Any agency where there is direct contact with lots of people and their data. The idea is for me to help them out with a project that, if successful, I might be able to take to another city as a product. I’m still working full-time at the same job.

Specifically, my goal is to figure out a way to use data to help the people involved – the homeless, for example – get connected to better services. As a side effect I think this should make the agency more efficient. Far too many data studies only care about efficiency – how to make do with fewer police or fewer ambulances – with no thought or care about whether the people experiencing the services are being affected. I want to start with the people, and hope for efficiency gains, which I believe will come.

One thing that has already amazed me about this job, which I’ve just started, is the conversations people have about the ethics of data privacy.

It is a well-known fact that, as you link more and more data about people together, you can predict their behavior better. So for example, you could theoretically link all the different agency data for a given person into a profile, including crime data, health data, education and the like.

This might help you profile that person, and that might help you offer them better services. But it also might not be what that person wants you to do, especially if you start adding social media information. There’s a tension between the best model and reasonable limits of privacy and decency, even when the model is intended to be used in a primarily helpful manner. It’s more obvious when you’re attempting something insidious like predictive policing, of course.

Now, it shouldn’t shock me to have such conversations, because after all we are talking about some of the most vulnerable populations here. But even so, it does.

In all my time as a predictive modeler, I’ve never been in that kind of conversation, about the malicious things people could do with such-and-such profile information, or with this or that model, unless I started it myself.

When you work as a quant in finance, the data you work with is utterly sanitized to the point where, although it eventually trickles down to humans, you are asked to think of it as generated by some kind of machine, which we call “the market.”

Similarly, when you work in ad tech or other internet modeling, you think of users as the targets of your predatory goals: click on this, user, or buy that, user! They are prey, and the more we know about them the better our aim will be. If we can buy their profiles from Acxiom, all the better for our purposes.

This is the opposite of all of that. Super interesting, and glad I am being given this opportunity.

Categories: data science, modeling

Experimentation in education – still a long way to go

September 5, 2013 Cathy O'Neil, mathbabe 13 comments

Yesterday’s New York Times ran a piece by Gina Kolata on randomized experiments in education. Namely, they’ve started to use randomized experiments like they do in medical trials. Here’s what’s going on:

… a little-known office in the Education Department is starting to get some real data, using a method that has transformed medicine: the randomized clinical trial, in which groups of subjects are randomly assigned to get either an experimental therapy, the standard therapy, a placebo or nothing.

They have preliminary results:

The findings could be transformative, researchers say. For example, one conclusion from the new research is that the choice of instructional materials — textbooks, curriculum guides, homework, quizzes — can affect achievement as profoundly as teachers themselves; a poor choice of materials is at least as bad as a terrible teacher, and a good choice can help offset a bad teacher’s deficiencies.

So far, the office — the Institute of Education Sciences — has supported 175 randomized studies. Some have already concluded; among the findings are that one popular math textbook was demonstrably superior to three competitors, and that a highly touted computer-aided math-instruction program had no effect on how much students learned.

Other studies are under way. Cognitive psychology researchers, for instance, are assessing an experimental math curriculum in Tampa, Fla.

If you go to any of the above links, you’ll see that the metric of success is consistently defined as a standardized test score. That’s the only gauge of improvement. So any “progress” that’s made is by definition measured by such a test.

In other words, if we optimize to this system, we will optimize for textbooks which raise standardized test scores. If it doesn’t improve kids’ test scores, it might as well not be in the book. In fact it will probably “waste time” with respect to raising scores, so there will effectively be a penalty for, say, fun puzzles, or understanding why things are true, or learning to write.

Now, if scores are all we cared about, this could and should be considered progress. Certainly Gina Kolata, the NYTimes journalist, didn’t mention that we might not care only about this – she recorded it as an unfettered good, as she was expected to by the Education Department, no doubt. But, as a data scientist who gets paid to think about the feedback loops and side effects of choices like “metrics of success,” I have a problem with it.

I don’t have a thing against randomized tests – using them is a good idea, and will maybe even quiet some noise around all the different curriculums, online and in person. I do think, though, that we need to have more ways of evaluating an educational experience than a test score.

After all, if I take a pill once a day to prevent a disease, then what I care about is whether I get the disease, not which pill I took or what color it was. Medicine is a very outcome-focused discipline in a way that education is not. Of course, there are exceptions, say when the treatment has strong and negative side-effects, and the overall effect is net negative. Kind of like when the teacher raises his or her kids’ scores but also causes them to lose interest in learning.

If we go the way of the randomized trial, why not give the students some self-assessments and review capabilities of their text and their teacher (which is not to say teacher evaluations give clean data, because we know from experience they don’t)? Why not ask the students how they liked the book and how much they care about learning? Why not track the students’ attitudes, self-assessment, and goals for a subject for a few years, since we know longer-term effects are sometimes more important than immediate test score changes?

In other words, I’m calling for collecting more and better data beyond one-dimensional test scores. If you think about it, teenagers get treated better by their cell phone companies or Netflix than by their schools.

I know what you’re thinking – that students are all lazy and would all complain about anyone or anything that gave them extra work. My experience is that kids actually aren’t like this, know the difference between rote work and real learning, and love the learning part.

Another complaint I hear coming – long-term studies take too long and are too expensive. But ultimately these things do matter in the long term, and as we’ve seen in medicine, skimping on experiments often leads to bigger and more expensive problems. Plus, we’re not going to improve education overnight.

And by the way, if and/or when we do this, we need to implement strict privacy policies for the students’ answers – you don’t want a 7-year-old’s attitude about math held against him when he or she applies to college.

Categories: data science, math education, modeling, musing

Short your kids, go long your neighbor: betting on people is coming soon

September 2, 2013 Cathy O'Neil, mathbabe 15 comments

Yet another aspect of Gary Shteyngart’s dystopian fiction novel Super Sad True Love Story is coming true for reals this week.

Besides anticipating Occupy Wall Street, as well as Bloomberg’s sweep of Zuccotti Park (although getting it wrong on how utterly successful such sweeping would be), Shteyngart proposed the idea of instant, real-time and broadcast credit ratings.

Anyone walking around the streets of New York, as they’d pass a certain type of telephone pole – the kind that identifies you via your cell phone and communicates with data warehousing services and databases – would have their credit rating flashed onto a screen. If you went to a party, depending on how you impressed the other party go-ers, your score could plummet or rise in real time, and everyone would be able to keep track and treat you accordingly.

I mean, there were other things about the novel too, but as a data person these details certainly stuck with me since they are both extremely gross and utterly plausible.

And why do I say they are coming true now? I base my claim on two news stories I’ve been sent by my various blog readers recently.

[Aside: if you read my blog and find an awesome article that you want to send me, by all means do! My email address is available on my “About” page.]

First, coming via Suresh and Marcos, we learn that data broker Acxiom is letting people see their warehoused data. A few caveats, bien sûr:

You get to see your own profile, here, starting in 2 days, but only your own.
And actually, you only get to see some of your data. So they won’t tell you if you’re a suspected gambling addict, for example. It’s a curated view, and they want your help curating it more. You know, for your own good.
And they’re doing it so that people have clarity on their business.
Haha! Just kidding. They’re doing it because they’re trying to avoid regulations and they feel like this gesture of transparency might make people less suspicious of them.
And they’re counting on people’s laziness. They’re allowing people to opt out, but of course the people who should opt out would likely never even know about that possibility.
Just keep in mind that, as an individual, you won’t know what they really think they know about you, but as a corporation you can buy complete information about anyone who hasn’t opted out.

In any case those credit scores that Shteyngart talks about are already happening. The only issue is who gets flashed those numbers and when. Instead of the answers being “anyone walking down the street” and “when you walk by a pole” it’s “any corporation on the interweb” and “whenever you browse”.

After all, why would they give something away for free? Where’s the profit in showing the credit scores of anyone to everyone? Hmmmm….

That brings me to my second news story of the morning coming to me via Constantine, namely this TechCrunch story which explains how a startup called Fantex is planning to allow individuals to invest in celebrity athletes’ stocks. Yes, you too can own a tiny little piece of someone famous, for a price. From the article:

People can then buy shares of that player’s brand, like a stock, in the Fantex-consumer market. Presumably, if San Francisco 49ers tight end Vernon Davis has a monster year and looks like he’s going to get a bigger endorsement deal or a larger contract in a few years, his stock would rise and a fan could sell their Davis stock and cash out with a real, monetary profit. People would own tracking or targeted stocks in Fantex that would depend on the specific brand that they choose; these stocks would then rise and fall based on their own performance, not on the overall performance of Fantex.

Let’s put these two things together. I think it’s not too much of a stretch to acknowledge a reason for everyone to know everyone else’s credit score! Namely, we can bet on each other’s futures!

I can’t think of any set-up more exhilarating to the community of hedge fund assholes than a huge, new open market – containing profit potentials for every single citizen of earth – where you get to make money when someone goes to the wrong college, or when someone enters into an unfortunate marriage and needs a divorce, or when someone gets predictably sick. An orgy in the exact center of tech and finance.

Are you with me peoples?!

I don’t know what your Labor Day plans are, but I’m getting ready my list of people to short in this spanking new market.

Categories: data science, finance, modeling, news, rant

Summers’ Lending Club makes money by bypassing the Equal Credit Opportunity Act

August 29, 2013 Cathy O'Neil, mathbabe 31 comments

Don’t know about you, but for some reason I have a sinking feeling when it comes to the idea of Larry Summers. Word on the CNBC street is that he’s about to be named new Fed Chair, and I am living in a state of cognitive dissonance.

To distract myself, I’m going to try to better explain what I started to explain here, when I talked about the online peer-to-peer lending company Lending Club. Summers sits on the board of Lending Club, and from my perspective it’s a logical continuation of his career of deregulation and/or bypassing of vital regulation to enrich himself.

In this case, it’s a vehicle for bypassing the FTC’s Equal Credit Opportunities Rights. It’s not perfect, but it “prohibits credit discrimination on the basis of race, color, religion, national origin, sex, marital status, age, or because you get public assistance.” It forces credit scores to be relatively behavior based, like you see here. Let me contrast that to Lending Club.

Lending Club also uses mathematical models to score people who want to borrow money. These act as credit scores. But in this case, they use data like browsing history or anything they can grab about you on the web or from data warehousing companies like Acxiom (which I’ve written about here). From this Bloomberg article on Lending Club:

“What we’ve done is radically transform the way consumer lending operates,” Laplanche says in his speech. He says that LendingClub keeps staffing low by using algorithms to screen prospective borrowers for risk — rejecting 90 percent of them – – and has no physical branches like banks. “The savings can be passed on to more borrowers in terms of lower interest rates and investors in terms of attractive returns.”

I’d focus on the benefit for investors. Big money is now involved in this stuff. Turns out that bypassing credit score regulation is great for business, so of course.

For example, such models might look at your circle of friends on Facebook to see if you “run with the right crowd” before loaning you money. You can now blame your friends if you don’t get that loan! From this CNN article on the subject (hat tip David):

“It turns out humans are really good at knowing who is trustworthy and reliable in their community,” said Jeff Stewart, a co-founder and CEO of Lenddo. “What’s new is that we’re now able to measure through massive computing power.”

Moving along from taking out loans to getting jobs, there’s this description of how recruiters work online to perform digital background checks for potential employees. It’s a different set of laws this time that is subject to arbitrage but it’s exactly the same idea:

Non-discrimination laws prohibit employers from asking job applicants certain questions. They’re not supposed to ask about things like age, race, gender, disability, marital, and veteran status. (As you can imagine, sometimes a picture alone can reveal this privileged information. These safeguards against discrimination urge employers to simply not use this knowledge to make hiring decisions.) In addition to protecting people from systemic prejudice, these employment laws intend to shield us from capricious bias and whimsy. While casually snooping, however, a recruiter can’t unsee your Facebook rant on immigration amnesty, the same for your baby bump on Instagram. From profile pics and bios, blog posts and tweets, simple HR reconnaissance can glean tons of off-limits information.

…

Along with forcing recruiters to gaze with eyes wide shut, straddling legal liability and ignorance, invisible employment screens deny American workers the robust protections afforded by the FTC and the Fair Credit Reporting Act. The FCRA ensures that prospective employees are notified before their backgrounds and credit scores are verified. Employees are free to decline the checks, but employers are also free to deny further consideration unless a screening is allowed to take place. What’s important here is that employees must first give consent.

When a report reveals unsavory information about a candidate, and the employer chooses to take what’s called “adverse action,”—like deny a job offer—the employer is required to share the content of the background reports with the candidate. The applicant then has the right to explain or dispute inaccurate and incomplete aspects of the background check. Consent, disclosure, and recourse constitute a straightforward approach to employment screening.

Contrast this citizen-empowering logic with the casual Google search or to the informal, invisible social-media exam. As applicants, we don’t know if employers are looking, we’re not privy to what they see, and we have no way to appeal.

As legal scholars Daniel Solove and Chris Hoofnagle discuss, the amateur Google screens that are now a regular feature of work-life go largely unnoticed. Applicants are simply not called back. And they’ll never know the real reason.

I think the silent failure is the scariest part for me – people who don’t get jobs won’t know why.

Similarly, people denied loans from Lending Club by a secret algorithm don’t know why either. Maybe it’s because I made friends with the wrong person on Facebook? Maybe I should just go ahead and stop being friends with anyone who might put my electronic credit score at risk?

Of course this rant is predicated on the assumption that we think anti-discrimination laws are a good thing. In an ideal world, of course, we wouldn’t need them. But that’s not where we live.

Categories: data science, finance, modeling

College ranking models

August 26, 2013 Cathy O'Neil, mathbabe 19 comments

Last week Obama began making threats regarding a new college ranking system and its connection to federal funding. Here’s an excerpt of what he was talking about, from this WSJ article:

The president called for rating colleges before the 2015 school year on measures such as affordability and graduation rates—”metrics like how much debt does the average student leave with, how easy is it to pay off, how many students graduate on time, how well do those graduates do in the workforce,” Mr. Obama told a crowd at the University at Buffalo, the first stop on a two-day bus tour.

Interesting! This means that Obama is wading directly into the field of modeling. He’s probably sick of the standard college ranking system, put out by US News & World Report. I kind of don’t blame him, since that model is flawed and largely gamed. In fact, I made a case for open sourcing that model recently just so that people would look into it and lose faith in its magical properties.

So I’m with Obama, that model sucks, and it’s high time there were other competing models so that people have more than one thing to think about.

On the other hand, what Obama is focusing on seems narrow. Here’s what he supposedly wants to do with that model (again from the WSJ article):

Once a rating system is in place, Mr. Obama will ask Congress to allocate federal financial aid based on the scores by 2018. Students at top-performing colleges could receive larger federal grants and more affordable student loans. “It is time to stop subsidizing schools that are not producing good results,” he said.

His main goal seems to be “to make college more affordable”.

I’d like to make a few comments on this overall plan. The short version is that he’s suggesting something that will have strong, mostly negative effects, and that won’t solve his problem of college affordability.

Why strong negative effects?

What Obama seems to realize about the existing model is that it’s had side effects because of the way college administrators have gamed the model. Presumably, given that this new proposed model will be directly tied to federal funding, it will be high-impact and will thus be thoroughly gamed by administrators as well.

The first complaint, then, is that Obama didn’t address this inevitable gaming directly – and that doesn’t bode well for his ability to put into place a reasonable model.

But let’s not follow his lead. Let’s think about what kind of gaming will occur once such a model is in place. It’s not pretty.

Here are the attributes he’s planning to use for colleges. I’ve substituted reasonable numerical proxies for his descriptions above:

Cost (less is better)
Percentage of people able to pay off their loans within 10 years (more is better)
Graduation rate (more is better)
Percentage of people graduating within 4 years (more is better)
Percentage of people who get high-paying jobs after graduating (more is better)
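To make the gaming worry concrete, here is a toy sketch in Python of what a score built from these five proxies might look like. To be clear, nothing here comes from the actual proposal: the equal weighting, the `max_cost` cap, the function name `college_score`, and all of the numbers are assumptions invented for illustration.

```python
def college_score(cost, repay_rate, grad_rate, on_time_rate, high_pay_rate,
                  max_cost=60_000):
    """Equal-weight average of five proxies, each scaled to 0-1 (higher is better).

    The equal weights and the max_cost cap are illustrative assumptions,
    not anything specified in the proposal.
    """
    # Cost is the one proxy where less is better, so flip it.
    affordability = 1.0 - min(cost, max_cost) / max_cost
    parts = [affordability, repay_rate, grad_rate, on_time_rate, high_pay_rate]
    return sum(parts) / len(parts)


# A made-up college, scored as-is...
before = college_score(30_000, 0.60, 0.70, 0.50, 0.30)

# ...and the same college after it lowers its standards so that more
# students graduate, and graduate on time. Nothing else has changed.
after = college_score(30_000, 0.60, 0.95, 0.90, 0.30)
```

In this toy version the score jumps from 0.52 to 0.65 without the college teaching anyone anything more, which is exactly the kind of incentive an administrator protecting federal funding would notice.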

Cost

Nobody is going to argue against optimizing for lower cost. Unfortunately, what with the cultural assumption of the need for a college education, combined with the ignorance and naive optimism of young people, not to mention start-ups like Upstart that allow young people to enter indentured servitude, the pressure is upwards, not downwards.

The supply of money for college is large and growing, and the answer to rising tuition costs is not to supply more money. Colleges have already responded to the existence of federal loans, for example, by raising tuition by the amount of the loan. Ironically, much of the rise in tuition cost has gone to administrators, whose job it is to game the system for more money.

Which is to say, you can penalize certain colleges for being at the front of the pack in terms of price, but if the overall cost is rising constantly, you’re not doing much.

If you really wanted to make costs low, then fund state universities and make them really good, and make them basically free. That would actually make private colleges try to compete on cost.

Paying off loans quickly

Here’s where we get to the heart of the problem with Obama’s plan.

What are you going to do, as an administrator tasked with making sure you never lose federal funding under the new regime?

Are you going to give all the students fairer terms on their debt? Or are you going to select for students that are more likely to get finance jobs? I’m guessing the latter.

So much for liberal arts educations. So much for learning about art, philosophy, or for that matter anything that isn’t an easy entrance into the tech or finance sector. Only colleges that don’t care a whit about federal money will even have an art history department.

Graduation rate

Gaming the graduation rate is easy. Just lower your standards for degrees, duh.

How quickly people graduate

Again, a general lowering of standards is quick and easy.

How well graduates do in the workforce

Putting this into your model is toxic, and measures a given field directly in terms of market forces. Economics, Computer Science, and Business majors will be the kings of the hill. We might as well never produce writers, thinkers, or anything else creative again.

Note this pressure already exists today: many of our college presidents are becoming more and more corporate minded and less interested in education itself, mostly as a means to feed their endowments. As an example, I don’t need to look further than across my street to Barnard, where president Debora Spar somehow decided to celebrate Ina Drew as an example of success in front of a bunch of young Barnard students. I can’t help but think that was related to a hoped-for gift.

Obama needs to think this one through. Do we really want to build the college system in this country in the image of Wall Street and Silicon Valley? Do we want to intentionally skew the balance towards those industries even further?

Building a better college ranking model

The problem is that it’s actually really hard to model quality of education. The mathematical models that already exist and are being proposed are just pathetically bad at it, partly because college, ultimately, isn’t only about the facts you learn, or the job you get, or how quickly you get it. It’s actually a life experience which, in the best of cases, enlarges your world view, and gets you to strive for something you might not have known existed before going.

I’d suggest that, instead of building a new ranking system, we on the one hand identify truly fraudulent colleges (which really do exist) and on the other, invest heavily in state schools, giving them enough security so they can do without their army of expensive administrators.

Categories: modeling, news, rant

Staples.com rips off poor people; let’s take control of our online personas

August 22, 2013 Cathy O'Neil, mathbabe 29 comments

You’ve probably heard rumors about this here and there, but the Wall Street Journal convincingly reported yesterday that websites charge certain people more for the exact same thing.

Specifically, poor people were more likely to pay more for, say, a stapler from Staples.com than richer people. Home Depot and Lowe’s do the same for their online customers, and Discover and Capital One make different credit card offers to people depending on where they live (“hey, do you live in a PayDay lender neighborhood? We got the card for you!”).

They got pretty quantitative for Staples.com, and did tests to determine the cost. From the article:

It is possible that Staples’ online-pricing formula uses other factors that the Journal didn’t identify. The Journal tested to see whether price was tied to different characteristics including population, local income, proximity to a Staples store, race and other demographic factors. Statistically speaking, by far the strongest correlation involved the distance to a rival’s store from the center of a ZIP Code. That single factor appeared to explain upward of 90% of the pricing pattern.

As anyone who’s ever seen a census map knows, race is highly segregated by ZIP code, and my guess is we’d see pretty high correlations along racial lines as well, although they didn’t mention it in the article except to say that explicit race-related pricing is illegal. The article does mention that things get more expensive in rural areas, which are also poorer, so there’s that acknowledged correlation.

But wait, how much of a price difference are we talking about? From the article:

Prices varied for about a third of the more than 1,000 randomly selected Staples.com products tested. The discounted and higher prices differed by about 8% on average.

In other words, a really non-trivial amount.

The messed up thing about this, or at least one of them, is that we could actually have way more control over our online personas than we think. It’s invisible to us, typically, so we don’t think about our cookies and our displayed IP addresses. But we could totally manipulate these signatures to our advantage if we set our minds to it.

Hackers, get thyselves to work making this technology easily available.

For that matter, given the 8% difference, there’s money on the line so some straight-up capitalist somewhere should be meeting that need. I for one would be willing to give someone a sliver of the amount saved every time they manipulated my online persona to save me money. You save me $1.00, I’ll give you a dime.

Here’s my favorite part of this plan: it would be easy for Staples to keep track of how much people are manipulating their ZIP codes. So if Staples.com infers a certain ZIP code for me to display a certain price, but then in check-out I ask them to send the package to a different ZIP code, Staples will know after-the-fact that I fooled them. But whatever, last time I looked it didn’t cost more or less to send mail to California or wherever than to Manhattan [Update: they do charge differently for packages, though. That’s the only differential in cost I think is reasonable to pay].

I’d love to see them make a case for how this isn’t fair to them.

Categories: data science, modeling, rant

When big data goes bad in a totally predictable way

August 19, 2013 Cathy O'Neil, mathbabe 10 comments

Three quick examples this morning in the I-told-you-so category. I’d love to hear Kenneth Neil Cukier explain how “objective” data science is when confronted with this stuff.

1. When an unemployed black woman pretends to be white her job offers skyrocket (Urban Intellectuals, h/t Mike Loukides). Excerpt from the article: “Two years ago, I noticed that Monster.com had added a “diversity questionnaire” to the site. This gives an applicant the opportunity to identify their sex and race to potential employers. Monster.com guarantees that this “option” will not jeopardize your chances of gaining employment. You must answer this questionnaire in order to apply to a posted position—it cannot be skipped. At times, I would mark off that I was a Black female, but then I thought, this might be hurting my chances of getting employed, so I started selecting the “decline to identify” option instead. That still had no effect on my getting a job. So I decided to try an experiment: I created a fake job applicant and called her Bianca White.”

2. How big data could identify the next felon – or blame the wrong guy (Bloomberg). From the article: “The use of physical characteristics such as hair, eye and skin color to predict future crimes would raise ‘giant red privacy flags’ since they are a proxy for race and could reinforce discriminatory practices in hiring, lending or law enforcement, said Chi Chi Wu, staff attorney at the National Consumer Law Center.”

3. How algorithms magnify misbehavior (the Guardian, h/t Suresh Naidu). From the article: “For one British university, what began as a time-saving exercise ended in disgrace when a computer model set up to streamline its admissions process exposed – and then exacerbated – gender and racial discrimination.”

This is just the beginning, unfortunately.

Categories: data science, modeling

What’s the difference between big data and business analytics?

August 16, 2013 Cathy O'Neil, mathbabe 24 comments

I offend people daily. People tell me they do “big data” and that they’ve been doing big data for years. Their argument is that they’re doing business analytics on a larger and larger scale, so surely by now it must be “big data”.

No.

There’s an essential difference between true big data techniques, as actually performed at surprisingly few firms but exemplified by Google, and the human-intervention data-driven techniques referred to as business analytics.

No matter how big the data you use is, at the end of the day, if you’re doing business analytics, you have a person looking at spreadsheets or charts or numbers, making a decision after possibly a discussion with 150 other people, and then tweaking something about the way the business is run.

If you’re really doing big data, then those 150 people probably get ~~fired~~ laid off, or even more likely are never hired in the first place, and the computer is programmed to update itself via an optimization method.

That’s not to say it doesn’t also spit out monitoring charts and numbers, and it’s not to say no person takes a look every now and then to make sure the machine is humming along, but there’s no point at which the algorithm waits for human intervention.

In other words, in a true big data setup, the human has stepped outside the machine and lets the machine do its thing. That means, of course, that it takes way more to set up that machine in the first place, and probably people make huge mistakes all the time in doing this, but sometimes they don’t. Google search got pretty good at this early on.

So with a business analytics setup we might keep track of the number of site visitors and a few sales metrics so we can later try to (and fail to) figure out whether a specific email marketing campaign had the intended effect.

But in a big data setup it’s typically much more microscopic and detail oriented, collecting everything it can, maybe 1,000 attributes of a single customer, and figuring out what that customer is likely to do next time, how much they’ll spend, and the magic question, whether there will even be a next time.
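To make the "updates itself via an optimization method" idea concrete, here is a minimal sketch, not a description of any real firm's system. It assumes a hypothetical model with two made-up customer attributes and synthetic events, and uses logistic regression trained by stochastic gradient descent as a stand-in for whatever optimization method a production system would use. The names `OnlineReturnModel` and `simulate_event` are invented for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class OnlineReturnModel:
    """Logistic regression updated one event at a time by stochastic gradient descent."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, features):
        # Estimated probability that the customer comes back.
        z = self.bias + sum(w * x for w, x in zip(self.weights, features))
        return sigmoid(z)

    def update(self, features, came_back):
        # For log loss, the gradient with respect to z is (prediction - label).
        error = self.predict(features) - came_back
        for i, x in enumerate(features):
            self.weights[i] -= self.learning_rate * error * x
        self.bias -= self.learning_rate * error


def simulate_event(rng):
    # Synthetic data (an assumption): customers with a high first attribute
    # return more often; the second attribute is pure noise.
    features = [rng.random(), rng.random()]
    p_return = sigmoid(4.0 * features[0] - 2.0)
    came_back = 1 if rng.random() < p_return else 0
    return features, came_back


rng = random.Random(0)
model = OnlineReturnModel(n_features=2)

# The loop never pauses for a human decision: each observed outcome
# immediately adjusts the model.
for _ in range(5000):
    features, came_back = simulate_event(rng)
    model.update(features, came_back)

print(model.predict([0.9, 0.5]), model.predict([0.1, 0.5]))
```

A person might still glance at the printed numbers as a monitoring check, but nothing in the loop waits for them.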

So the first thing I offend people about is that they’re not really part of the “big data revolution”. And the second thing is that, usually, their job is potentially up for grabs by an algorithm.

Categories: data science, modeling

Larry Summers and the Lending Club

August 12, 2013 Cathy O'Neil, mathbabe 19 comments

So here’s something potential Fed Chair Larry Summers is involved with: a company called Lending Club, which runs a money-lending system that cuts out the middleman banks.

Specifically, people looking for money come to the site and tell their stories, and try to get loans. The investors invest in whichever loans look good to them, for however much money they want. For a perspective on the risks and rewards of this kind of peer-to-peer lending operation, look at this Wall Street Journal article which explains things strictly from the investor’s point of view.

A few red flags go up for me as I learn more about Lending Club.

First, from this NYTimes article, “The company [Lending Club] itself is not regulated as a bank. But it has teamed up with a bank in Utah, one of the states that allows banks to charge high interest rates, and that bank is overseen by state regulators and the Federal Deposit Insurance Corporation.”

I’m not sure how the FDIC is involved exactly, but the Utah connection is good for something, namely allowing high interest rates. According to the same article, 37% of loans carry APRs between 19% and 29%.

Next, Summers is referred to in that article as being super concerned about the ability of consumers to pay back the loans. But I wonder how someone is supposed to be both desperate enough to go for a 25% APR loan and also able to pay back the money. This sounds like loan sharking to me.
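For a rough sense of what those rates mean in dollars, here is a minimal sketch using the standard fixed-payment amortization formula. The $10,000 principal and 36-month term are hypothetical, not figures from the article, and the APR is treated as a nominal annual rate compounded monthly, which ignores fees.

```python
def monthly_payment(principal, apr, months):
    """Fixed monthly payment on a fully amortizing loan."""
    r = apr / 12.0  # monthly rate, assuming monthly compounding
    if r == 0:
        return principal / months
    return principal * r / (1.0 - (1.0 + r) ** -months)


principal = 10_000.0  # hypothetical loan size
months = 36           # hypothetical term

for apr in (0.19, 0.25, 0.29):
    payment = monthly_payment(principal, apr, months)
    total_interest = payment * months - principal
    print(f"APR {apr:.0%}: payment ${payment:,.2f}/month, "
          f"total interest ${total_interest:,.2f}")
```

Changing `principal` and `months` shows how quickly the interest burden grows for larger or longer loans.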

Probably what bothers me most, though, is that Lending Club, in addition to offering credit scores and income when they have that information, also scores people asking for loans with a proprietary model which is, you guessed it, unregulated. Specifically, if it’s anything like ZestFinance, it could use signals more correlated to being uneducated and/or poor than to the willingness or ability to pay back loans.

By the way, I’m not saying this concept is bad for everyone. There are probably winners on the side of the loanees: they might get a loan they otherwise couldn’t get, or better terms, or a more bespoke contract. I’m more worried about the idea of this becoming the new normal of how money changes hands and how that would affect people already squeezed out of the system.

I’d love your thoughts.

Categories: data science, finance, modeling
