Let them game the model

February 3, 2012

One of the most common arguments I hear against making a model more transparent is that, if we did that, people would game the model. I’d like to argue that gaming the model is exactly what people should do, and that it’s not a valid argument against transparency.

Take as an example the value-added model for teachers. I don’t think there’s any excuse for this model to be opaque: it is widely used (in all New York City public middle and high schools, for example), the scores matter to teachers, especially when they are up for tenure, and the community responds to the corresponding school-level scores by pulling their kids out of, or putting their kids into, those schools. There’s a lot at stake.
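
To make the discussion concrete, here is a minimal sketch of the kind of calculation a value-added model performs, with made-up numbers. This is a toy illustration, not the actual NYC model, which controls for many more covariates:

```python
import numpy as np

def value_added(prior, current, teacher_ids):
    """Toy value-added estimate: predict current scores from prior scores
    with a one-variable linear fit, then average each teacher's students'
    residuals. Real VAMs control for many more covariates."""
    slope, intercept = np.polyfit(prior, current, 1)
    residuals = current - (slope * prior + intercept)
    by_teacher = {}
    for t, r in zip(teacher_ids, residuals):
        by_teacher.setdefault(t, []).append(r)
    return {t: float(np.mean(rs)) for t, rs in by_teacher.items()}

# Made-up scores for six students split between two teachers.
prior = np.array([60.0, 70.0, 80.0, 55.0, 65.0, 75.0])
current = np.array([65.0, 72.0, 85.0, 54.0, 70.0, 74.0])
print(value_added(prior, current, ["A", "A", "A", "B", "B", "B"]))
```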

Why would you not want this to be transparent? Don’t we usually like to know how our performance on the job is evaluated? I’d like to know if being 4 minutes late to work is a big deal, or if I need to stay late on Tuesdays in order to be perceived as working hard. In other words, given that the stakes are high, it’s only fair to let people know how they are being measured and, thus, how to “improve” with respect to that measurement.

Instead of calling it “gaming the model”, we should see it as improving our scores, which, if it’s a good model, should mean becoming better teachers (or whatever is being measured). If you tell me that someone who games the model isn’t actually becoming a better teacher, then I’d say that means your model needs to improve, not the teacher. Moreover, if that’s true, then transparent or not, you’re admitting that the model doesn’t measure the right thing. At least when it’s transparent the problems are more obvious, and the modelers have more motivation to make the model measure the right thing.

Another example: credit scoring. Why are these models closed? They affect everyone, all the time. What do Visa or Mastercard gain by not telling us what we need to do to earn a good credit card interest rate? What’s the worst that could happen, that we are told explicitly to pay our bills on time? I don’t see it. Unless the models are using something devious, like people’s race or gender, in which case I’d understand why they’d want to hide the model. I suspect they aren’t, because that would be too obvious, but I also suspect they might be using other kinds of inputs (like zip codes) that are correlated with race and/or gender; a quick check for exactly that kind of proxy is sketched below. That’s the kind of thing that argues for transparency, not against it. When a model is as important as credit scores are, I don’t see an argument for opacity.
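
If you had the applicant-level data, testing whether a “neutral” input proxies for a protected attribute would be straightforward. A minimal sketch, with hypothetical column names and made-up rows (this assumes pandas and scipy, and is an illustration, not anyone’s actual audit procedure):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical applicant data: the model input ("zip") and a protected
# attribute ("race") that the model officially does not use.
df = pd.DataFrame({
    "zip":  ["10001", "10001", "10002", "10002", "10003", "10003"],
    "race": ["a", "b", "a", "a", "b", "b"],
})

# Cramér's V measures association between two categorical variables:
# 0 means independent, 1 means one fully determines the other.
table = pd.crosstab(df["zip"], df["race"])
chi2 = chi2_contingency(table)[0]
n = table.to_numpy().sum()
k = min(table.shape) - 1
print("Cramér's V between zip and race:", np.sqrt(chi2 / (n * k)))
```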

  1. AZ
    February 3, 2012 at 10:48 am

    But isn’t it too high a standard to ask that the model be so robust that it gives a good measure even when people are trying to “game the system”? I suspect that very few models would pass this test. The point of modeling is to approximately capture a complicated general phenomenon in a relatively simple way. This is sometimes possible because of a law of large numbers, but when pathological behavior can happen frequently, it could become much harder.

    In other words, I think it’s quite plausible that certain models (e.g. teacher value-added) fare worse when made transparent. This might be outweighed by the fact that transparency allows people to more accurately judge the usefulness of a model and thus place an appropriate amount of weight on what it says. For things like teaching effectiveness, any objective model would probably be improved by adding a subjective human assessment component. Unfortunately, if you need an objective standard, you don’t have the luxury of weighing the model against something else.* In that case it would seem best to optimize the model itself, which might involve hiding it. Obviously there are downsides to hiding the model as well, but perhaps the optimum lies in some middle ground.

    * There is an argument, of course, that the value-added model should not be used as an objective standard at all.

    • February 3, 2012 at 10:51 am

      Can you be more specific about what models would be better opaque? I thought about this and I really can’t think of an example, which is one reason I made this post. Or even why exactly the teacher value-added model would fare worse when made transparent. In other words, I can’t think of a reason that it would be better not to “know” the shortcomings of a model. If there are shortcomings, and if at the same time the model is being used to deny tenure to people, then we should all know them, right?

      • Dan L
        February 3, 2012 at 11:51 am

        I definitely agree with you that teacher value-added models should be transparent, for a variety of reasons. However, I think you are overlooking the benefits of opacity.

        A really simple example of where opacity helps is with a calculus exam. If your goal is to measure how much calculus students have learned, you do not want to tell them exactly what problems (or even what types of problems) are on the exam, because then the students need only master what they know they will be tested on.

        This might sound like a bad analogy, but it’s not. The calculus exam can only pick up a *sample* of what the students know. Similarly, any teacher evaluation system can only pick up certain pieces of evidence of good teaching. This is precisely why basing everything on standardized test performance is a horrible idea. Even though test performance does provide *a* measure of teacher competence (maybe even a very good one), the moment you evaluate teachers based solely on test scores, everything else that goes into being a good teacher will be neglected. It’s the equivalent of telling your calculus students, “This will not be on the exam.”

        And then the problem is compounded because, in order for the standardized tests to be fair, they have to be predictable. And their predictability leads to a trickle-down of the exact same problem: teachers need only teach students how to solve the problems that they know will be on the exams. Again, this is another situation where “transparency” is actually bad (though probably necessary). A well-written standardized test can be an excellent measurement of what kids really know, IF they are not prepped for it and have no idea what to expect. (To go off on a tangent, this is what bothers me about longitudinal research on standardized test scores. Back when I was a kid, there never seemed to be anything at stake in these tests, so the scores were measuring something very different from what they measure now.)

        • February 3, 2012 at 12:04 pm

          Dan,

          Thanks for your thoughtful comment. I agree that telling kids exactly what’s on an exam is stupid. But I’m not sure it’s the right analogy. I think you’d agree that the students should know which sections of the book will be covered on the test. Right now the teachers don’t even know that, or at least don’t know the relative weights of different aspects of their performance. And the metrics by which they are measured aren’t like calculus problems whose answers can be memorized, either: they’re more like how students do on tests, over which teachers have only indirect influence.

          On the other hand, I kind of agree with you at the same time. I’d like to make another analogy: financial regulation. Financial regulations are typically very transparent, and banks do their very best to game them, and succeed. Does this mean we should make financial regulation opaque? In some sense, yes. I’d like to see regulation made transparently random, so that it’s harder to game. An example: make banks assess the risk of benchmark portfolios that are regularly and randomly changed (sketched in code below). The overall structure is transparent but the result is hard to game. I think we can do this, I think it will work, and moreover it’s fair.

          Cathy
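
          A minimal sketch of that “transparently random” setup, with made-up asset names and a placeholder for the bank’s risk model; this illustrates the structure, not any real regulatory procedure:

          ```python
          import random

          # Hypothetical asset universe; the regulator would publish this rule.
          ASSETS = ["treasuries", "munis", "mbs", "corp_bonds", "equities"]

          def draw_benchmark(rng):
              """Transparent rule, random instance: everyone knows how benchmark
              portfolios are drawn, but not which one they'll face this period."""
              picks = rng.sample(ASSETS, 3)
              weights = [rng.random() for _ in picks]
              total = sum(weights)
              return {a: w / total for a, w in zip(picks, weights)}

          def audit(bank_risk_model, rounds=10, seed=None):
              """Each round, ask the bank's model to assess a freshly drawn
              portfolio. Gaming one fixed benchmark is easy; gaming the whole
              distribution of benchmarks is not."""
              rng = random.Random(seed)
              return [bank_risk_model(draw_benchmark(rng)) for _ in range(rounds)]
          ```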

        • Dan L
          February 3, 2012 at 4:17 pm

          Well, note that I started my comment by agreeing that teacher evaluation should be transparent. (However, I’ve yet to see an evaluation system that seems right to me.) I just thought it was a bit extreme when you said that you couldn’t think of *any* situations in which opacity is good.

          As for financial regulation, I don’t know anything about it, but my naive idea about that is that it should be kind of like the calculus test: The bank should be able to pass pretty much any reasonable “risk assessment,” since if they were responsible, they would be doing a variety of internal risk assessments anyway. This sounds kind of like what you are suggesting.

  2. February 3, 2012 at 10:59 am

    One aspect of the “gaming the model” argument I’m sympathetic to in this context is that, for reasons of expediency, you’re probably going to be using a small number of easy-to-quantify indicators to approximate what you really want to measure. Thus, if the model is known, some people will do only the measured things and not other (equally if not more) valuable things. If people are unsure about what is measured, then hopefully they will just try to do their job well overall, which is what you really want in the end. That said, I still think transparency is usually the way to go, and I think keeping the models hidden is mostly about avoiding criticism of them.

    • February 3, 2012 at 11:06 am

      I agree that that would be the lazy (and common) way to build a model. But if there are more inputs that actually carry signal, it won’t hurt to add them to the model, and we have enough computing power to do so. So it’s not really a great argument; in particular, the work required to improve the model is worth the benefits of transparency.

      • February 3, 2012 at 5:50 pm

        If you already have the raw data, then, sure, you should just add them to the model. But gathering that data in the first place could be quite costly. As a concrete example, consider the academic journal rankings that are based on average citations per article in each journal. Certainly this seems like a reasonable approach, at least within a single discipline. Once they started being commonly used, however, some clever editors realized they could push their journals to the top by adding (a quite small number of) extra references to each article citing papers in the same journal. This made these journal rankings less useful, though you can get around this particular attack by filtering out same-journal citations (which in this case is cheap, since it’s already in your data set).
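
        That filter is easy to implement once you track which journal each article belongs to. A minimal sketch with hypothetical data structures (not any ranking service’s actual code):

        ```python
        def impact_score(citations, journal_of, target, exclude_self=True):
            """Average citations per article in `target`, optionally ignoring
            citations from articles in the same journal, which is exactly the
            self-citation channel the clever editors were exploiting.

            citations:  list of (citing_article, cited_article) pairs
            journal_of: dict mapping article id -> journal name
            """
            counts = {}
            for citing, cited in citations:
                if journal_of[cited] != target:
                    continue
                if exclude_self and journal_of[citing] == target:
                    continue  # drop same-journal citations
                counts[cited] = counts.get(cited, 0) + 1
            articles = [a for a, j in journal_of.items() if j == target]
            return sum(counts.get(a, 0) for a in articles) / len(articles)
        ```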

        Or to use your example of evaluating job performance, one might start with very little quantitative data at all. If one wanted to measure employee effort, say, it would be relatively easy to track tardiness, and it’s reasonable to expect that it correlates with what you’re trying to measure. But if you told people that’s how you’re measuring effort, the correlation would weaken, and the next-easiest-to-measure indicator of effort might be a lot more work to track.

        • February 3, 2012 at 6:11 pm

          Those are good examples. Certainly it would be on you to update your model, so if people started gaming it as you described, we would see certain signals get downplayed (like the number of references). However, I’d say that if you noticed people coming to work on time (assuming that matters a lot to you) in order to “game the model”, then that’s exactly what you want. If you then noticed those people going to lunch early and staying away for 4 hours, you’d realize you need another signal, namely lunch-break length (a toy version below). Overall you’d be getting a more complete picture of what you mean by good job performance as people game the model and as you improve it. It’s all good.
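
          To make that iteration concrete, here is a toy “version 2” of such an effort score; the weights are arbitrary placeholders, not a recommendation:

          ```python
          def effort_score(minutes_late, lunch_minutes):
              """Version 1 used only tardiness; once people gamed it (on time,
              then four-hour lunches), lunch length became a second signal.
              Higher is better; weights are made up for illustration."""
              lunch_overrun = max(0, lunch_minutes - 60)  # first hour is free
              return -0.5 * minutes_late - 0.1 * lunch_overrun
          ```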

        • February 3, 2012 at 9:38 pm

          Perhaps I’m just not very imaginative, but in the case of ranking journals, it’s not clear to me what other signals besides those based on citations are feasible to measure. You could survey people for their opinions (e.g. as one does for graduate program rankings), but that would be a hugely expensive undertaking if you wanted to cover all academic disciplines. Another approach might be to look at the institutional affiliations of the authors and tie that to department or university ranks, though that would create very perverse incentives for the journals (sorry, we can’t accept this paper since you’re at Smuckville State; it would drag down our average too much). I guess you could look at total copies of each journal sold (to libraries only, to keep journals from selling cheap copies to individuals to inflate the numbers). Another approach would be to look at paper acceptance rates (though with self-selection on the part of submitters it’s not clear this will track what you want). Also, the last two ideas require the publishers to cough up information that they might well not be willing to part with, and naively neither of them seems as robust as using citations.

  3. Anonymous Reader
    February 3, 2012 at 11:22 am

    “I don’t think there’s any excuse for this model to be opaque.”

    On VAMs for teachers (which I’ve worked with): I’m not so sure, and I think the argument you are responding to is more subtle than you’re giving it credit for. It’s true that there is a lot at stake for students, principals, and teachers, but high stakes by themselves do not, in general, justify transparency.

    Suppose, for example, a school is using a VAM to raise the salaries of exceptional instructors while deselecting the less effective. Over time, such a policy might increase the average quality of the school’s teaching workforce. To do this, the school relies on the fact that a teacher’s value-added is strongly predictive of the human capital that teacher builds in the student. See, for example, the recent work of Chetty et al. on the impact of a “good” or “bad” teacher (as measured by value-added) on future earnings. If a teacher is not informed about the details of the value-added model, he may not change his behavior at all, since behavior change is costly and the benefits to the teacher of any specific change are uncertain. In contrast, if he is well informed, he might know exactly how to teach to the test to improve scores in the short run in a way that undermines long-term outcomes. The quality of the model can deteriorate once it is used to mete out rewards and punishments (as teachers behave strategically), and keeping teachers in the dark about the model could mitigate this. This view is NOT the same as “admitting your model doesn’t measure the right thing.”

    (I’m not sure this outweighs the other issues you’ve raised: for example, that an opaque model might diminish teacher morale or, in the long run, reduce model quality by lowering the amount of information available for people to argue about it. But I still think it’s important to flesh out the other side.)

    • February 3, 2012 at 11:33 am

      I see where you’re coming from, but I’d argue that, by making the stakes high, people *are* already modifying their behavior. Why wouldn’t they?

      By the way, I’ve seen the argument made that the Chetty et al paper was relying on test scores way back when the testing *wasn’t* high stakes, and must therefore be understood as potentially irrelevant in the current climate.

      I guess the meta argument I’d like to make is that the test itself is distorting behavior (for example, we see lots of cheating), and we should acknowledge this as reality; the question of whether the tests are still, overall, a good idea is a serious one and needs to take this distortion into account. If good teachers leave because poor and opaque test scores make their lives feel arbitrary, maybe we should stop using the scores, at least at the individual level.

  4. Anonymous Reader
    February 3, 2012 at 11:57 am

    Your meta argument is a great one. How do we set up incentive schemes to maximize student learning, taking into account both the beneficial and harmful effects of strategic responses to our chosen rewards and penalties? How do we draw capable individuals into the teaching profession? But caring about that question doesn’t directly imply that we need to suddenly make everything transparent, which is why I think it’s unjustified to say there’s absolutely no excuse for these procedures to be secret.

    Maybe what we really need is better-quality tests and better protections against cheating, where there are potentially HUGE gains. Here there really is no excuse! Until recently, for example, free-response questions on a student’s New York Regents exams could be graded by the very teacher who taught that student (see http://www.nytimes.com/2011/10/18/nyregion/regents-set-to-alter-rules-for-grading-state-exams.html). *Shocker*: every year there are tons of students mysteriously clumped right above the pass/fail cutoff, and that kind of clumping is easy to test for, as sketched below.
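
    A minimal sketch of such a test, comparing counts just below and just above the cutoff; the scores here are made up, and a serious analysis would use a proper density-discontinuity test:

    ```python
    import numpy as np

    def clump_ratio(scores, cutoff, width=2):
        """Students just above the cutoff divided by students just below it.
        Absent manipulation, scores near the cutoff should be roughly
        symmetric; a large ratio is a red flag worth a closer look."""
        s = np.asarray(scores)
        above = ((s >= cutoff) & (s < cutoff + width)).sum()
        below = ((s >= cutoff - width) & (s < cutoff)).sum()
        return above / max(below, 1)

    # Hypothetical exam scores with a suspicious pile-up at the cutoff of 65.
    scores = [58, 61, 63, 64, 65, 65, 65, 66, 66, 70, 74, 80]
    print(clump_ratio(scores, cutoff=65))  # well above 1: grading to the line?
    ```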

  5. JSE
    February 3, 2012 at 12:23 pm

    It might be that the model is better at measuring the right thing if it’s closed, and worse at measuring the right thing if it’s open. At least, that’s what people who use the word “gaming” think.

  6. Dan L
    February 3, 2012 at 4:23 pm

    Side note: I agree with you 100% about credit scores. I don’t even understand what it means to “game” your credit score. Wouldn’t that mean behaving in exactly the way that creditors want you to behave? How could that possibly be a bad thing?

  7. February 3, 2012 at 8:05 pm

    I don’t get the criticism of openness. Of course model openness would lead to reduced transaction costs. If people start to exploit the model and it has ill effects, just change the model. This may cause some volatility, but we already live in a high volatility market because of opacity. Do you think Bank of America’s share price keeps changing by 50% because everyone understands their balance sheet?

  8. February 4, 2012 at 7:32 am

    VAM hasn’t taken hold as much as some would like, for one reason: 75% of teachers cannot be measured by any system that could be “gamed” in the first place. None of the following teachers can be measured using the standardized tests that are the basis for VAM: elementary pre-K through grade 3; art; music; PE; special ed; and any middle and high school teachers who teach subjects outside of English and math.

    The metrics for teachers and schools are either murky (like NYC’s school rating system) or misleadingly exact (like any standardized test).

  9. John
    February 4, 2012 at 9:25 am
    • Parmenion
      February 4, 2012 at 10:44 am

      I don’t think this is a strong argument against openness. It is a strong argument against using weak proxies and not updating your model in the face of it not working.

  10. JEHR
    February 4, 2012 at 11:27 am

    If I weren’t married to a husband with a great credit score, I would be at a great disadvantage. I never believed in being in debt and have always saved up money to buy the things I needed or wanted. When I applied for a credit card in order to buy an airline ticket (since you can no longer buy one with cash at the desk), I did not have any credit score, good or bad, because I have never made payments on a debt. In order to get a good credit score, I would have to go into debt, which is against my beliefs. What a conundrum!

  11. lawrence castiglione
    February 4, 2012 at 11:35 am

    Measurements are estimates, imperfect and rarely pure. Models compound the errors their measurements entail, and serve multiple purposes, sometimes antithetical to one another. Such tensions are to be expected.

  12. bob
    February 4, 2012 at 1:55 pm

    Credit scoring is closed because it’s proprietary. Open the model, others can use it, the fees charged for access to the model disappear, stockholders are unhappy.

  13. February 4, 2012 at 3:40 pm

    Excellent post. I have a draft on credit scoring (and the even more eerie credit analytics) you may be interested in. I’ve pasted a bit below. Your points are relevant to personality scores, TSA risk scoring, and many other contexts.

    From Credit Scoring to Credit Analytics

    Access to credit evaluation software will become more important as data analysis becomes more complex. Credit card issuers are now moving away from all-purpose scores toward more multi-dimensional assessments of customers. Higher math can model quantities like “default risk” from cross-cutting variables, promising data-driven decisionmaking that is theoretically more accurate, but also more inscrutable, than a mere three-digit score.

    Credit card purchase records are a data geek’s dream. “Quants” (short for “quantitative analysts”) promiscuously correlate various forms of behavior to uncover hidden relationships. Billing data may identify buying patterns that associate with profitable customers. One company determined that buyers of cheap automotive oil were worse risks than those who paid for a brand-name oil. Drinking beer at a sketchy bar, installing showy “chrome thrusters” on your car, or subscribing to Soldier of Fortune magazine—all might lead to higher interest rates or lower credit limits. One researcher bragged that his firm considers over 300 characteristics to pinpoint delinquency risks.
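
    A minimal sketch of the kind of correlation-mining described above, with entirely made-up behavioral features and outcomes (this assumes scikit-learn, and illustrates the technique, not any issuer’s actual model):

    ```python
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Hypothetical cardholders: binary purchase behaviors and whether
    # each one later defaulted (1 = yes).
    features = ["cheap_oil", "sketchy_bar", "soldier_of_fortune"]
    X = np.array([
        [1, 1, 0],
        [1, 0, 1],
        [0, 0, 0],
        [0, 1, 0],
        [1, 1, 1],
        [0, 0, 1],
    ])
    y = np.array([1, 1, 0, 0, 1, 0])

    model = LogisticRegression().fit(X, y)
    # Each coefficient prices a behavior into the default-risk estimate;
    # these are the "hidden relationships" the quants are hunting for.
    print(dict(zip(features, model.coef_[0].round(2))))
    ```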

    Credit analysts aim to overcome some problems of credit scoring. Think back to the example of the “responsible consumer,” who was chagrined to find that algorithms could mechanistically reduce his score once he reduced his credit limit. Why should he be lumped in with the majority of individuals whose higher debt-to-limit ratios indicate more problems with repaying their debt? Credit analysts agree, arguing that inaccurate or unfair decisions merely reflect incomplete implementation of the scoring concept. More immediate access to larger stores of data about consumers might allow them to identify our hypothetical consumer’s behavior as an indicator of responsibility, rather than desperation. A critical mass of additional variables in a profile (say, “donates over $200 to non-radical political parties,” “pays Parent Teacher Association dues,” “always buys his wife flowers on their anniversary”) could flip the valence of the reduced credit limit entirely. With a more complete profile, the argument goes, credit analysts could model the reduced credit limit as a positive influence on creditworthiness.

    Credit analytics exposes a faultline in contemporary concerns about privacy and reputation, a contrast I deem “anonymity vs. accuratism.” Traditionally, privacy has been interpreted as a right to conceal, to keep others from knowing details about one’s life, opinions, or preferences. But laws like FCRA defend a different value: that of being fairly and accurately assessed. Taken to a logical extreme, an “accuratist” approach to reputation would insist that any threats occasioned by loss of privacy can be defused once a decisionmaker has a fuller picture of the object of surveillance. To paraphrase the old French proverb, to explain all is to forgive all. Or, to put it more darkly: why care about privacy if you’ve got nothing to hide?

    The accuratist mindset has a certain pragmatic appeal, especially when framed as a source of bonuses rather than penalties. Purchases of carbon monoxide detectors or floor protectors might correlate with meticulous attention to personal finances. Perhaps those who care enough to maintain their linoleum should enjoy a boost at the bank.

    However, credit analytics also has a more sinister side. Card issuers can cut the credit lines of customers who attend couples therapy, since a divorce might make each spouse less financially able to pay off debts. As a statistical matter, it may well be the case that those entering therapy are more likely to default than those who have avoided it. But credit card companies’ data cornucopia raises troubling possibilities. It’s one thing for credit analytics to reduce the credit limit of reckless drivers. It’s quite another to compound the misery of a troubled couple by imposing financial burdens that exacerbate their rancor.

    • breads
      February 6, 2012 at 2:06 pm

      “Buyers of cheap automotive oil were worse risks than those who paid for a brand-name oil.” Really?

      Heh. I would say that most name-brand oils are not worth the money, so someone who is very responsible with money will not buy name-brand. That goes not just for motor oil but across the board: store brands and generics are good for most of life’s needs.

      What next? Shopping at thrift stores? I do, even with a 7-figure net worth and a 6-figure income. Come to think of it, that is why I don’t need credit.

  14. Tara
    February 5, 2012 at 11:54 pm

    Those who develop credit scorecards aren’t doing it so that your credit will improve. They’re doing it to make money, and specifically, more money than their competitors. It’s not that they are tracking your terrible secrets in secrecy, it’s that the companies building them want to stay in business.

    Also, while it’s nice to know what you’re being graded on, grading on specific points leads to situations like teachers teaching to the standardised test rather than providing a real education, or a person who follows their job description to the letter and does nothing else. There must be something in between; there are real negative aspects to full transparency in scoring.
