Modeling in Plain English

March 17, 2013 Cathy O'Neil, mathbabe

I’ve been enjoying my new job at Johnson Research Labs, where I spend a majority of the time editing my book with my co-author Rachel Schutt. It’s called Doing Data Science (now available for pre-purchase at Amazon), and it’s based on these notes I took last semester at Rachel’s Columbia class.

Recently I’ve been working on Brian Dalessandro’s chapter on logistic regression. Before getting into the brass tacks of that algorithm, which is especially useful when you are trying to predict a binary outcome (i.e. a 0 or 1 outcome like “will click on this ad”), Brian discusses some common constraints on models.
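For readers who haven't met logistic regression, here is a minimal sketch of how it turns features into a 0 or 1 prediction. The two features, the weights, and the 0.5 cutoff are all made up for illustration; nothing here comes from Brian's chapter.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real-valued score to a probability strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def click_probability(features, weights, intercept):
    """Logistic regression prediction: the sigmoid of a weighted sum of features."""
    score = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)

# Hypothetical model with two invented features:
# [visited the advertiser's site before (0 or 1), number of ads already seen today]
WEIGHTS = [1.5, -0.2]
INTERCEPT = -2.0

p = click_probability([1, 3], WEIGHTS, INTERCEPT)
will_click = 1 if p >= 0.5 else 0  # assumed cutoff for the binary outcome
print(f"P(click) = {p:.3f}, predicted outcome = {will_click}")
```

In practice the weights are fitted to historical data rather than typed in by hand; the point is only that the output is a probability that gets thresholded into a binary answer.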

The one that’s particularly interesting to me is what he calls “interpretability”. His example of an interpretability constraint is really good: it turns out that credit card companies have to be able to explain to people why they’ve been rejected. Brian and I tracked down the rule to this FTC website, which explains the rights of consumers who own credit cards. Here’s an excerpt where I’ve emphasized the key sentences:

You Also Have The Right To…

Have credit in your birth name (Mary Smith), your first and your spouse’s last name (Mary Jones), or your first name and a combined last name (Mary Smith Jones).

Get credit without a cosigner, if you meet the creditor’s standards.

Have a cosigner other than your spouse, if one is necessary.

Keep your own accounts after you change your name, marital status, reach a certain age, or retire, unless the creditor has evidence that you’re not willing or able to pay.

Know whether your application was accepted or rejected within 30 days of filing a complete application.

Know why your application was rejected. The creditor must tell you the specific reason for the rejection or that you are entitled to learn the reason if you ask within 60 days. An acceptable reason might be: “your income was too low” or “you haven’t been employed long enough.” An unacceptable reason might be “you didn’t meet our minimum standards.” That information isn’t specific enough.

Learn the specific reason you were offered less favorable terms than you applied for, but only if you reject these terms. For example, if the lender offers you a smaller loan or a higher interest rate, and you don’t accept the offer, you have the right to know why those terms were offered.

Find out why your account was closed or why the terms of the account were made less favorable, unless the account was inactive or you failed to make payments as agreed.

The result of this rule is that credit card companies must use simple models, probably decision trees, to make their rejection decisions.
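To see why a decision tree fits this rule so naturally, here is a toy sketch in which every rejecting leaf carries its own plain-English reason. The thresholds and the two features are hypothetical; the reason strings are the "acceptable" examples from the FTC excerpt above, and no real lender's model is implied.

```python
from typing import Optional, Tuple

# Hypothetical thresholds, for illustration only.
MIN_ANNUAL_INCOME = 20_000
MIN_MONTHS_EMPLOYED = 12

def decide(annual_income: float, months_employed: int) -> Tuple[bool, Optional[str]]:
    """Walk a two-level decision tree; each rejecting leaf returns a specific reason."""
    if annual_income < MIN_ANNUAL_INCOME:
        return False, "your income was too low"
    if months_employed < MIN_MONTHS_EMPLOYED:
        return False, "you haven't been employed long enough"
    return True, None  # accepted, so no reason is needed

print(decide(15_000, 36))
print(decide(45_000, 3))
print(decide(45_000, 36))
```

The path from the root to a leaf is the explanation, which is exactly what a vague reason like "you didn't meet our minimum standards" fails to provide.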

It’s a new way to think about modeling choice, to be sure. It doesn’t necessarily make for “better” decisions from the point of view of the credit card company: random forests, a generalization of decision trees, are known to be more accurate, but are arbitrarily more complicated to explain.

So it matters what you’re optimizing for, and in this case the regulators have decided we’re optimizing for interpretability rather than accuracy. I think this is appropriate, given that consumers are at the mercy of these decisions and relatively powerless to act against them (although the FTC site above gives plenty of advice to people who have been rejected, mostly about how to raise their credit scores).

Three points to make about this. First, I’m reading The Bankers’ New Clothes, written by Anat Admati and Martin Hellwig (h/t Josh Snodgrass), which is absolutely excellent – I’m planning to write up a review soon. One thing they explain very clearly is the cost of regulation (specifically, higher capital requirements) from the bank’s perspective versus from the taxpayer’s perspective, and how it genuinely seems “expensive” to a bank but is actually cost-saving to the general public. I think the same thing could be said above for the credit card interpretability rule.

Second, it makes me wonder what else one could regulate in terms of plain English modeling. For example, what would happen if we added that requirement to, say, the teacher value-added model? Would we get much-needed feedback to teachers like, “You don’t have enough student participation”? Oh wait, no. The model only looks at student test scores, so would only be able to give the following kind of feedback: “You didn’t raise scores enough. Teach to the test more.”

In other words, what I like about the “Modeling in Plain English” idea is that you have to be able to first express and second back up your reasons for making decisions. It may not lead to ideal accuracy on the part of the modeler but it will lead to much greater clarity on the part of the modeled. And we could do with a bit more clarity.

Finally, what about online loans? Do they have any such interpretability rule? I doubt it. In fact, if I’m not wrong, they can use any information they can scrounge up about someone to decide on who gets a loan, and they don’t have to reveal their decision-making process to anyone. That seems unreasonable to me.

Categories: data science, modeling, rant

Comments (8)

mathematrucker

March 17, 2013 at 10:05 am

That’s interesting FTC info. One of my first thoughts was, given a large enough sample of rejections, perhaps much of what’s in the model(s) could be “reverse-engineered”.

Leon Kautsky

March 17, 2013 at 10:18 am

I don’t see how banning SVM and arbitrage pricing helps people in the grand scheme of things. There’s also a weird asymmetry here. If you do get a loan, you don’t care why and in fact, how the bank expects you to behave will be unintelligible to you.

Even worse, I can see a lot of quant depts. at banks using the more complicated/accurate models to make choices and then “rejecting” you with a dumbed down linear model. At least, they would if they were clever.

- Cathy O'Neil, mathbabe
  
  March 17, 2013 at 10:47 am
  
  Not for every model. But for consumer facing high impact models I think it makes sense.
  
justaluckyfool

March 17, 2013 at 11:57 am

Modeling in plain English, could that be “transparency”?
As for your example re banking capital requirements: yes, the PFP (private for-profit) banks absolutely have higher costs with higher capital requirements, because they have to be ‘transparent’ and show they are using their own money and no longer secretly using money that they hold in trust (that of their depositors). And yes, the people would save, because they would not have to replace ‘the appropriated money’ if the private for-profit bank loses the present 90% it does not even own.
Thanks to social media, an AHA moment : Read:
http://aquinums-razor.blogspot.com/2011/11/here-is-how-bankers-game-works.html

alagator2k13

March 17, 2013 at 12:44 pm

Cathy, have you and the publisher discussed publishing as an eBook or Kindle? I’m reading so many books at once that without iPhone/Kindle it would never have a large enough book bag…Also, interesting article in a ‘similar’ vein: http://andrewgelman.com/2013/03/14/everyones-trading-bias-for-variance-at-some-point-its-just-done-at-different-places-in-the-analyses/?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+StatisticalModelingCausalInferenceAndSocialScience+%28Statistical+Modeling%2C+Causal+Inference%2C+and+Social+Science%29

adamobeng

March 17, 2013 at 7:33 pm

I’m not sure that being expressed in ‘plain’ English is always beneficial to interpretability (tongue-in-cheek e.g. http://xkcd.com/1133/). Won’t the explanation given by the decision tree be some complex conjunction — which if written out would be unreadable? Wouldn’t it actually convey more information to say “we used a model which is trained on characteristics X, Y, and Z from this particular population”, without detailing exactly how the model works?

Steven H. Noble

March 19, 2013 at 4:11 am

A few years ago I worked in credit risk modeling (although in Canada) and you have it basically correct. It’s not a secret that credit risk scores are built off of decision trees with logistic regressions on the leaves. Decisions are then made by way of decision trees on scores (and sometimes other basic features).

Reasons for rejection are usually generated by assuming the leaf is invariant and selecting the fewest features that would have to be different for the logistic regression to raise a score enough to give an accept decision.
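The reason-generation step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the coefficients, the 0.5 cutoff, the "good applicant" reference values, and the brute-force subset search are all hypothetical, and the score is assumed to be a probability of repayment for a single fixed leaf.

```python
import math
from itertools import combinations

# Hypothetical logistic-regression coefficients for one leaf of the tree.
WEIGHTS = {"income_k": 0.04, "years_employed": 0.30, "late_payments": -0.80}
INTERCEPT = -3.0
ACCEPT_THRESHOLD = 0.5  # assumed cutoff on the predicted probability of repayment

# Assumed reference values that a feature is "changed to" when testing a reason.
REFERENCE = {"income_k": 60, "years_employed": 5, "late_payments": 0}

def score(applicant):
    """Logistic-regression score for an applicant within this leaf."""
    z = INTERCEPT + sum(WEIGHTS[name] * applicant[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))

def rejection_reasons(applicant):
    """Return a smallest set of features whose change would flip reject to accept."""
    if score(applicant) >= ACCEPT_THRESHOLD:
        return []  # accepted, nothing to explain
    names = sorted(WEIGHTS)
    for size in range(1, len(names) + 1):  # try the smallest subsets first
        for subset in combinations(names, size):
            changed = dict(applicant)
            for name in subset:
                changed[name] = REFERENCE[name]
            if score(changed) >= ACCEPT_THRESHOLD:
                return list(subset)
    return names  # even the reference profile is rejected, so cite every feature

print(rejection_reasons({"income_k": 50, "years_employed": 4, "late_payments": 2}))
print(rejection_reasons({"income_k": 20, "years_employed": 1, "late_payments": 0}))
```

Trying every subset is fine for a handful of features but grows exponentially; a production system would presumably use something cheaper, such as ranking features by how much each one pulls the score down.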

Personally I believe this rule is actually helpful for the banks to follow and I hope most of them would anyhow. First I think that a modeller who can see how their features behave and interact in a model is much more likely to construct better features (automatic feature generation techniques are still pretty bad). Secondly, customers hearing what negative features led to rejections often leads to discovering bugs in data capture. If you reject based on “insufficient credit history” and the customer says “what are you talking about? I have lots of credit history” you may discover that you have a credit bureau partner that is bad at finding the right customer record.

Remember, it’s pretty hard to find out in credit risk if your model is giving you a false positive for a write-off. If your model tells you a customer is highly likely to write off, you’ll end up rejecting him and you never find out if the model was right (it’s very expensive to be constantly testing this). So if you are rejecting based off of faulty data you want your customers to let you know.

- Cathy O'Neil, mathbabe
  
  March 19, 2013 at 7:07 am
  
  Great point about false positives, thanks!
  