Definitions are basic objects in mathematics. Even so, I’ve never seen the art of definition explicitly taught, and I have rarely seen the need for a definition explicitly discussed.
Have you ever noticed how damn hard it is to make a good definition and yet how utterly useful a good definition can be?
The basic definitions inform the research of any field, and a good definition will lead to better theorems than a bad one. If you get them right, if you really nail down the definition, then everything works out much more cleanly than otherwise.
So for example, it doesn’t make sense to work in algebraic geometry without the concepts of affine and projective space, and varieties, and schemes. They are to algebraic geometry like circles and triangles are to elementary geometry. You define your objects, then you see how they act and how they interact.
I saw first hand how a good definition improves clarity of thought back in grad school. I was lucky enough to talk to John Tate (my mathematical hero) about my thesis, and after listening to me go on for some time with a simple object but complicated proofs, he suggested that I add an extra sentence to my basic object, an assumption with a fixed structure.
This gave me a bit more explaining to do up front – but even there added intuition – and greatly simplified the statement and proofs of my theorems. It also improved my talks about my thesis. I could now go in and spend some time motivating the definition, and then state the resulting theorem very cleanly once people were convinced.
Another example from my husband’s grad seminar this semester: he’s starting out with the concept of triangulated categories coming from Verdier’s thesis. One mysterious part of the definition involves the so-called “octahedral axiom,” which mathematicians have been grappling with ever since it was invented. As far as Johan tells it, people struggle with why it’s necessary but not that it’s necessary, or at least something very much like it. What’s amazing is that Verdier managed to get it right when he was so young.
Why? Because definition building is naturally iterative, and it can take years to get it right. It’s not an obvious process. I have no doubt that many arguments were once fought over whether the most basic definitions, although I’m no historian. There’s a whole evolutionary struggle that I can imagine could take place as well – people could make the wrong definition, and the community would not be able to prove good stuff about that, so it would eventually give way to stronger, more robust definitions. Better to start out carefully.
Going back to that. I think it’s strange that the building up of definitions is not explicitly taught. I think it’s a result of the way math is taught as if it’s already known, so the mystery of how people came up with the theorems is almost hidden, never mind the original objects and questions about them. For that matter, it’s not often discussed why we care whether a given theorem is important, just whether it’s true. Somehow the “importance” conversations happen in quiet voices over wine at the seminar dinners.
Personally, I got just as much out of Tate’s help with my thesis as anything else about my thesis. The crystalline focus that he helped me achieve with the correct choice of the “basic object of study” has made me want to do that every single time I embark on a project, in data science or elsewhere.
Yesterday’s New York Times ran a piece by Gina Kolata on randomized experiments in education. Namely, they’ve started to use randomized experiments like they do in medical trials. Here’s what’s going on:
… a little-known office in the Education Department is starting to get some real data, using a method that has transformed medicine: the randomized clinical trial, in which groups of subjects are randomly assigned to get either an experimental therapy, the standard therapy, a placebo or nothing.
They have preliminary results:
The findings could be transformative, researchers say. For example, one conclusion from the new research is that the choice of instructional materials — textbooks, curriculum guides, homework, quizzes — can affect achievement as profoundly as teachers themselves; a poor choice of materials is at least as bad as a terrible teacher, and a good choice can help offset a bad teacher’s deficiencies.
So far, the office — the Institute of Education Sciences — has supported 175 randomized studies. Some have already concluded; among the findings are that one popular math textbook was demonstrably superior to three competitors, and that a highly touted computer-aided math-instruction program had no effect on how much students learned.
Other studies are under way. Cognitive psychology researchers, for instance, are assessing an experimental math curriculum in Tampa, Fla.
If you go to any of the above links, you’ll see that the metric of success is consistently defined as a standardized test score. That’s the only gauge of improvement. So any “progress” that’s made is by definition measured by such a test.
In other words, if we optimize to this system, we will optimize for textbooks which raise standardized test scores. If it doesn’t improve kids’ test scores, it might as well not be in the book. In fact it will probably “waste time” with respect to raising scores, so there will effectively be a penalty for, say, fun puzzles, or understanding why things are true, or learning to write.
Now, if scores are all we cared about, this could and should be considered progress. Certainly Gina Kolata, the NYTimes journalist, didn’t mention that we might not care only about this – she recorded it as unfettered good, as she was expected to by the Education Department, no doubt. But, as a data scientist who gets paid to think about the feedback loops and side effects of choices like “metrics of success,” I have a problem with it.
I don’t have a thing against randomized tests – using them is a good idea, and will maybe even quiet some noise around all the different curriculums, online and in person. I do think, though, that we need to have more ways of evaluating an educational experience than a test score.
After all, if I take a pill once a day to prevent a disease, then what I care about is whether I get the disease, not which pill I took or what color it was. Medicine is a very outcome- focused discipline in a way that education is not. Of course, there are exceptions, say when the treatment has strong and negative side-effects, and the overall effect is net negative. Kind of like when the teacher raises his or her kids’ scores but also causes them to lose interest in learning.
If we go the way of the randomized trial, why not give the students some self-assessments and review capabilities of their text and their teacher (which is not to say teacher evaluations give clean data, because we know from experience they don’t)? Why not ask the students how they liked the book and how much they care about learning? Why not track the students’ attitudes, self-assessment, and goals for a subject for a few years, since we know longer-term effects are sometimes more important that immediate test score changes?
In other words, I’m calling for collecting more and better data beyond one-dimensional test scores. If you think about it, teenagers get treated better by their cell phone companies or Netflix than by their schools.
I know what you’re thinking – that students are all lazy and would all complain about anyone or anything that gave them extra work. My experience is that kids actually aren’t like this, know the difference between rote work and real learning, and love the learning part.
Another complaint I hear coming – long-term studies take too long and are too expensive. But ultimately these things do matter in the long term, and as we’ve seen in medicine, skimping on experiments often leads to bigger and more expensive problems. Plus, we’re not going to improve education overnight.
And by the way, if and/or when we do this, we need to implement strict privacy policies for the students’ answers – you don’t want a 7-year-old’s attitude about math held against him when he of she applies to college.
My buddy Jordan Ellenberg just came out with a fantastic piece in Slate entitled “The Case of the Missing Zeroes: An astonishing act of statistical chutzpah in the Indiana schools’ grade-changing scandal.”
Here are the leading sentences of the piece:
Florida Education Commissioner Tony Bennett resigned Thursday amid claims that, in his former position as superintendent of public instruction in Indiana, he manipulated the state’s system for evaluating school performance. Bennett, a Republican who created an A-to-F grading protocol for Indiana schools as a way to promote educational accountability, is accused of raising the mark for a school operated by a major GOP donor.
Jordan goes on to explain exactly what happened and how that manipulation took place. Turns out it was a pretty outrageous and easy-to-understand lie about missing zeroes which didn’t make any sense. You should read the whole thing, Jordan is a great writer and his fantasy about how he would deal with a student trying the same scam in his calculus class is perfect.
A few comments to make about this story overall.
- First of all, it’s another case of a mathematical model being manipulated for political reasons. It just happens to be a really simple mathematical model in this case, namely a weighted average of scores.
- In other words, the lesson learned for corrupt politicians in the future may well to be sure the formulae are more complicated and thus easier to game.
- Or in other words, let’s think about other examples of this kind of manipulation, where people in power manipulate scores after the fact for their buddies. Where might it be happening now? Look no further than the Value-Added Model for teachers and schools, which literally nobody understands or could prove is being manipulated in any given instance.
- Taking a step further back, let’s remind ourselves that educational accountability models in general are extremely ripe for gaming and manipulation due to their high stakes nature. And the question of who gets the best opportunity to manipulate their scores is, as shown in this example of the GOP-donor-connected school, often a question of who has the best connections.
- In other words, I wonder how much the system can be trusted to give us a good signal on how well schools actually teach (at least how well they teach to the test).
- And if we want that signal to be clear, maybe we should take away the high stakes and literally measure it, with no consequences. Then, instead of punishing schools with bad scores, we could see how they need help.
- The conversation doesn’t profit from our continued crazy high expectations and fundamental belief in the existence of a silver bullet, the latest one being the Kipp Charter Schools – read this reality check if you’re wondering what I’m talking about (hat tip Jordan Ellenberg).
- As any statistician could tell you, any time you have an “educational experiment” involving highly motivated students, parents, and teachers, it will seem like a success. That’s called selection bias. The proof of the pudding lies in the scaling up of the method.
- We need to think longer term and consider how we’re treating good teachers and school administration who have to live under arbitrary and unfair systems. They might just leave.
Have you read this recent article in Slate about they canceled online courses at San Jose State University after more than half the students failed? The failure rate ranged from 56 to 76 percent for five basic undergrad classes with a student enrollment limit of 100 people.
Personally, I’m impressed that so many people passed them considering how light-weight the connection is in such course experiences. Maybe it’s because they weren’t free – they cost $150.
It all depends on what you were expecting, I guess. It begs the question of what college is for anyway.
I was talking to a business guy about the MOOC potential for disruption, and he mentioned that, as a Yale undergrad himself, he never learned a thing in classes, that in fact he skipped most of his classes to hang out with his buddies. He somehow thought MOOCs would be a fine replacement for that experience. However, when I asked him whether he still knew any of his buddies from college, he acknowledged that he does business with them all the time.
Personally, this confirms my theory that education is more about making connection than education per se, and although I learned a lot of math in college, I also made a friend who helped me get into grad school and even introduced me to my thesis advisor.
This is a guest post by Eugene Stern.
Now that I have kids in school, I’ve become a lot more familiar with high-stakes testing, which is the practice of administering standardized tests with major consequences for students who take them (you have to pass to graduate), their teachers (who are often evaluated based on standarized test results), and their school districts (state funding depends on test results). To my great chagrin, New Jersey, where I live, is in the process of putting such a teacher evaluation system in place (for a lot more detail and criticism, see here).
The excellent John Ewing pointed me to a pretty comprehensive survey of standardized testing called “Measuring Up,” by Harvard Ed School prof Daniel Koretz, who teaches a course there about this stuff. If you have any interest in the subject, the book is very much worth your time. But in case you don’t get to it, or just to whet your appetite, here are my top 10 takeaways:
Believe it or not, most of the people who write standardized tests aren’t idiots. Building effective tests is a difficult measurement problem! Koretz makes an analogy to political polling, which is a good reminder that a test result is really a sample from a distribution (if you take multiple versions of a test designed to measure the same thing, you won’t do exactly the same each time), and not an absolute measure of what someone knows. It’s also a good reminder that the way questions are phrased can matter a great deal.
The reliability of a test is inversely related to the standard deviation of this distribution: a test is reliable if your score on it wouldn’t vary very much from one instance to the next. That’s a function of both the test itself and the circumstances under which people take it. More reliability is better, but the big trade-off is that increasing the sophistication of the test tends to decrease reliability. For example, tests with free form answers can test for a broader range of skills than multiple choice, but they introduce variability across graders, and even the same person may grade the same test differently before and after lunch. More sophisticated tasks also take longer to do (imagine a lab experiment as part of a test), which means fewer questions on the test and a smaller cross-section of topics being sampled, again meaning more noise and less reliability.
A complementary issue is bias, which is roughly about people doing better or worse on a test for systematic reasons outside the domain being tested. Again, there are trade-offs: the more sophisticated the test, the more extraneous skills beyond those being tested it may be bringing in. One common way to weed out such questions is to look at how people who score the same on the overall test do on each particular question: if you get variability you didn’t expect, that may be a sign of bias. It’s harder to do this for more sophisticated tests, where each question is a bigger chunk of the overall test. It’s also harder if the bias is systematic across the test.
Beyond the (theoretical) distribution from which a single student’s score is a sample, there’s also the (likely more familiar) distribution of scores across students. This depends both on the test and on the population taking it. For example, for many years, students on the eastern side of the US were more likely to take the SAT than those in the west, where only students applying to very selective eastern colleges took the test. Consequently, the score distributions were very different in the east and the west (and average scores tended to be higher in the west), but this didn’t mean that there was bias or that schools in the west were better.
The shape of the score distribution across students carries important information about the test. If a test is relatively easy for the students taking it, scores will be clustered to the right of the distribution, while if it’s hard, scores will be clustered to the left. This matters when you’re interpreting results: the first test is worse at discriminating among stronger students and better at discriminating among weaker ones, while the second is the reverse.
The score distribution across students is an important tool in communicating results (you may not know right away what a score of 600 on a particular test means, but if you hear it’s one standard deviation above a mean of 500, that’s a decent start). It’s also important for calibrating tests so that the results are comparable from year to year. In general, you want a test to have similar means and variances from one year to the next, but this raises the question of how to handle year-to-year improvement. This is particularly significant when educational goals are expressed in terms of raising standardized test scores.
If you think in terms of the statistics of test score distributions, you realize that many of those goals of raising scores quickly are deluded. Koretz has a good phrase for this: the myth of the vanishing variance. The key observation is that test score distributions are very wide, on all tests, everywhere, including countries that we think have much better education systems than we do. The goals we set for student score improvement (typically, a high fraction of all students taking a test several years from now are supposed to score above some threshold) imply a great deal of compression at the lower end of this distribution – compression that has never been seen in any country, anywhere. It sounds good to say that every kid who takes a certain test in four years will score as proficient, but that corresponds to a score distribution with much less variance than you’ll ever see. Maybe we should stop lying to ourselves?
Koretz is highly critical of the recent trend to report test results in terms of standards (e.g., how many students score as “proficient”) instead of comparisons (e.g., your score is in the top 20% of all students who took the test). Standards and standard-based reporting are popular because it’s believed that American students’ performance as a group is inadequate. The idea is that being near the top doesn’t mean much if the comparison group is weak, so instead we should focus on making sure every student meets an absolute standard needed for success in life. There are three (at least) problems with this. First, how do you set a standard – i.e., what does proficient mean, anyway? Koretz gives enough detail here to make it clear how arbitrary the standards are. Second, you lose information: in the US, standards are typically expressed in terms of just four bins (advanced, proficient, partially proficient, basic), and variation inside the bins is ignored. Third, even standards-based reporting tends to slide back into comparisons: since we don’t know exactly what proficient means, we’re happiest when our school, or district, or state places ahead of others in the fraction of students classified as proficient.
Koretz’s other big theme is score inflation for high-stakes tests: if everyone is evaluated based on test scores, everyone has an incentive to get those scores up, whether or not that actually has much correlation with learning. If you remember anything from the book or from this post, remember this phrase: sawtooth pattern. The idea is that when a new high-stakes standardized test appears, average scores start at some base level, go up quickly as people figure out how to game the test, then plateau. If the test is replaced with another, the same thing happens: base, rapid growth, plateau. Repeat ad infinitum. Koretz and his collaborators did a nice experiment in which they went back to a school district in which one high-stakes test had been replaced with another and administered the first test several years later. Now that teachers weren’t teaching to the first test, scores on it reverted back to the original base level. Moral: score inflation is real, pervasive, and unavoidable, unless we bite the bullet and do away with high-stakes tests.
While Koretz is sympathetic toward test designers, who live the complexity of standardized testing every day, he is harsh on those who (a) interpret and report on test results and (b) set testing and education policy, without taking that complexity into account. Which, as he makes clear, is pretty much everyone who reports on results and sets policy.
If you think it’s a good idea to make high-stakes decisions about schools and teachers based on standardized test results, Koretz’s book offers several clear warnings.
First, we should expect any high-stakes test to be gamed. Worse yet, the more reliable tests, being more predictable, are probably easier to game (look at the SAT prep industry).
Second, the more (statistically) reliable tests, by their controlled nature, cover only a limited sample of the domain we want students to learn. Tests trying to cover more ground in more depth (“tests worth teaching to,” in the parlance of the last decade) will necessarily have noisier results. This noise is a huge deal when you realize that high-stakes decisions about teachers are made based on just two or three years of test scores.
Third, a test that aims to distinguish “proficiency” will do a worse job of distinguishing students elsewhere in the skills range, and may be largely irrelevant for teachers whose students are far away from the proficiency cut-off. (For a truly distressing example of this, see here.)
With so many obstacles to rating schools and teachers reliably based on standardized test scores, is it any surprise that we see results like this?
My friend Josh Vekhter sent me this blog post written by someone who calls herself celandine13 and tutors students with learning disabilities.
In the post, she reframes the concept of mistake or “being bad at something” as often stemming from some fundamental misunderstanding or poor procedure:
Once you move it to “you’re performing badly because you have the wrong fingerings,” or “you’re performing badly because you don’t understand what a limit is,” it’s no longer a vague personal failing but a causal necessity. Anyone who never understood limits will flunk calculus. It’s not you, it’s the bug.
This also applies to “lazy.” Lazy just means “you’re not meeting your obligations and I don’t know why.” If it turns out that you’ve been missing appointments because you don’t keep a calendar, then you’re not intrinsically “lazy,” you were just executing the wrong procedure. And suddenly you stop wanting to call the person “lazy” when it makes more sense to say they need organizational tools.
And she wants us to stop with the labeling and get on with the understanding of why the mistake was made and addressing that, like she does when she tutors students. She even singles out certain approaches she considers to be flawed from the start:
This is part of why I think tools like Knewton, while they can be more effective than typical classroom instruction, aren’t the whole story. The data they gather (at least so far) is statistical: how many questions did you get right, in which subjects, with what learning curve over time? That’s important. It allows them to do things that classroom teachers can’t always do, like estimate when it’s optimal to review old material to minimize forgetting. But it’s still designed on the error model. It’s not approaching the most important job of teachers, which is to figure out why you’re getting things wrong — what conceptual misunderstanding, or what bad study habit, is behind your problems. (Sometimes that can be a very hard and interesting problem. For example: one teacher over many years figured out that the grammar of Black English was causing her students to make conceptual errors in math.)
On the one hand I like the reframing: it’s always good to see knee-jerk reactions become more contemplative, and it’s always good to see people trying to help rather than trying to blame. In fact, one of my tenets of real life is that mistakes will be made, and it’s not the mistake that we should be anxious about but how we act to fix the mistake that exposes who we are as people.
I would, however, like to take issue with her anti-example in the case of Knewton, which is an online adaptive learning company. Full disclosure: I interviewed with Knewton before I took my current job, and I like the guys who work there. But, I’d add, I like them partly because of the healthy degree of skepticism they take with them to their jobs.
What the blogwriter celandine13 is pointing out, correctly, is that understanding causality is pretty awesome when you can do it. If you can figure out why someone is having trouble learning something, and if you can address that underlying issue, then fixing the consequences of that issue get a ton easier. Agreed, but I have three points to make:
- First, a non-causal data mining engine such as Knewton will also stumble upon a way to fix the underlying problem by dint of having a ton of data and noting that people who failed a calculus test, say, did much better after having limits explained to them in a certain way. This is much like the spellcheck engine of Google works by keeping track of previous spelling errors, and not by mind reading how people think about spelling wrong.
- Second, it’s not always easy to find the underlying cause of bad testing performance, even if you’re looking for it directly. I’m not saying it’s fruitless – tutors I know are incredibly good at that – but there’s room for both “causality detectives” and tons of smart data mining in this field.
- Third, it’s definitely not always easy to address the underlying cause of bad test performance. If you find out that the grammar of Black English affects students’ math test scores, what do you do about it?
Having said all that, I’d like to once more agree with the underlying message that a mistake is a first and foremost a signal rather than a reflection of someone’s internal thought processes. The more we think of mistakes as learning opportunities the faster we learn.
When I worked as a research mathematician, I was always flabbergasted by the speed at which other people would seem to absorb mathematical theory. I had then, and pretty much have now, this inability to believe anything that I can’t prove from first principles, or at least from stuff I already feel completely comfortable with. For me, it’s essentially mathematically unethical to use a result I can’t prove or at least understand locally.
I only recently realized that not everyone feels this way. Duh. People often just assemble accepted facts about a field quickly just to explore the landscape and get the feel for something – it makes complete sense to me now that one can do this and it doesn’t seem at all weird. And it explains what I saw happening in grad school really well too.
Most people just use stuff they “know to be true,” without having themselves gone through the proof. After all, things like Deligne’s work on Weil Conjectures or Gabber’s recent work on finiteness of etale cohomology for pseudo-excellent schemes are really fucking hard, and it’s much more efficient to take their results and use them than it is to go through all the details personally.
After all, I use a microwave every day without knowing how it works, right?
I’m not sure I know where I got the feeling that this was an ethical issue. Probably it happened without intentional thought, when I was learning what a proof is in math camp, and I’d perhaps state a result and someone would say, how do you know that? and I’d feel like an asshole unless I could prove it on the spot.
Anyway, enough about me and my confused definition of mathematical ethics – what I now realize is that, as mathematics is developed more and more, it will become increasingly difficult for a graduate student to learn enough and then prove an original result without taking things on faith more and more. The amount of mathematical development in the past 50 years is just frighteningly enormous, especially in certain fields, and it’s just crazy to imagine someone learning all this stuff in 2 or 3 years before working on a thesis problem.
What I’m saying, in other words, is that my ethical standards are almost provably unworkable in modern mathematical research. Which is not to say that, over time, a person in a given field shouldn’t eventually work out all the details to all the things they’re relying on, but it can’t be linear like I forced myself to work.
And there’s a risk, too: namely, that as people start getting used to assuming hard things work, fewer mistakes will be discovered. It’s a slippery slope.
I’m happy to say that the book I’m writing with Rachel Schutt called Doing Data Science is officially out for early review. That means a few chapters which we’ve deemed “ready” have been sent to some prominent people in the field to see what they think. Thanks, prominent and busy people!
It also means that things are (knock on wood) wrapping up on the editing side. I’m cautiously optimistic that this book will be a valuable resource for people interested in what data scientists do, especially people interested in switching fields. The range of topics is broad, which I guess means that the most obvious complaint about the book will be that we didn’t cover things deeply enough, and perhaps that the level of pre-requisite assumptions is uneven. It’s hard to avoid.
Thanks to my awesome editor Courtney Nash over at O’Reilly for all her help!
And by the way, we have an armadillo on our cover, which is just plain cool:
This is a guest post by Eugene Stern.
A big reason I love this blog is Cathy’s war on crappy models. She has posted multiple times already about the lousy performance of models that rate teachers based on year-to-year changes in student test scores (for example, read about it here). Much of the discussion focuses on the model used in New York City, but such systems have been, or are being, put in place all over the country. I want to let you know about the version now being considered for use across the river, in New Jersey. Once you’ve heard more, I hope you’ll help me try to stop it.
A little background if you haven’t heard about this before. Because it makes no sense to rate teachers based on students’ absolute grades or test scores (not all students start at the same place each year), the models all compare students’ test scores against some baseline. The simplest thing to do is to compare each student’s score on a test given at the end of the school year against their score on a test given at the end of the previous year. Teachers are then rated based on how much their students’ scores improved over the year.
Comparing with the previous year’s score controls for the level at which students start each year, but not for other factors beside the teacher that affect how much they learn. This includes attendance, in-school environment (curriculum, facilities, other students in the class), out-of-school learning (tutoring, enrichment programs, quantity and quality of time spent with parents/caregivers), and potentially much more. Fancier models try to take these into account by comparing each student’s end of year score with a predicted score. The predicted score is based both on the student’s previous score and on factors like those above. Improvement beyond the predicted score is then attributed to the teacher as “value added” (hence the name “value-added models,” or VAM) and turned into a teacher rating in some way, often using percentiles. One such model is used to rate teachers in New York City.
It’s important to understand that there is no single value-added model, rather a family of them, and that the devil is in the details. Two different teacher rating systems, based on two models of the predicted score, may perform very differently – both across the board, and in specific locations. Different factors may be more or less important depending on where you are. For example, income differences may matter more in a district that provides few basic services, so parents have to pay to get extracurriculars for their kids. And of course the test itself matters hugely as well.
Testing the VAM models
Teacher rating models based on standardized tests have been around for 25 years or so, but two things have happened in the last decade:
- Some people started to use the models in formal teacher evaluation, including tenure decisions.
- Some (other) people started to test the models.
This did not happen in the order that one would normally like. Wanting to make “data-driven decisions,” many cities and states decided to start rating teachers based on “data” before collecting any data to validate whether that “data” was any good. This is a bit like building a theoretical model of how cancer cells behave, synthesizing a cancer drug in the lab based on the model, distributing that drug widely without any trials, then waiting around to see how many people die from the side effects.
The full body count isn’t in yet, but the models don’t appear to be doing well so far. To look at some analysis of VAM data in New York City, start here and here. Note: this analysis was not done by the city but by individuals who downloaded the data after the city had to make it available because of disclosure laws.
I’m not aware of any study on the validity of NYC’s VAM ratings done by anyone actually affiliated with the city – if you know of any, please tell me. Again, the people preaching data don’t seem willing to actually use data to evaluate the quality of the systems they’re putting in place.
Assuming you have more respect for data than the mucky-mucks, let’s talk about how well the models actually do. Broadly, two ways a model can fail are being biased and being noisy. The point of the fancier value-added models is to try to eliminate bias by factoring in everything other than the teacher that might affect a student’s test score. The trouble is that any serious attempt to do this introduces a bunch of noise into the model, to the degree that the ratings coming out look almost random.
You’d think that a teacher doesn’t go from awful to great or vice versa in one year, but the NYC VAM ratings show next to no correlation in a teacher’s rating from one year to the next. You’d think that a teacher either teaches math well or doesn’t, but the NYC VAM ratings show next to no correlation in a teacher’s rating teaching a subject to one grade and their rating teaching it to another – in the very same year! (Gary Rubinstein’s blog, linked above, documents these examples, and a number of others.) Again, this is one particular implementation of a general class of models, but using such noisy data to make significant decisions about teachers’ careers seems nuts.
What’s happening in New Jersey
With all this as background, let’s turn to what’s happening in New Jersey.
You may be surprised that the version of the model proposed by Chris Christie‘s administration (the education commissioner is Christie appointee Chris Cerf, who helped put VAM in place in NYC) is about the simplest possible. There is no attempt to factor out bias by trying to model predicted scores, just a straight comparison between this year’s standardized test score and last year’s. For an overview, see this.
In more detail, the model groups together all students with the same score on last year’s test, and represents each student’s progress by their score on this year’s test, viewed as a percentile across this group. That’s it. A fancier version uses percentiles calculated across all students with the same score in each of the last several years. These can’t be calculated explicitly (you may not find enough students that got exactly the same score each the last few years), so they are estimated, using a statistical technique called quantile regression.
By design, both the simple and the fancy version ignore everything about a student except their test scores. As a modeler, or just as a human being, you might find it silly not to distinguish between a fourth grader in a wealthy suburb who scored 600 on a standardized test from a fourth grader in the projects with the same score. At least, I don’t know where to find a modeler who doesn’t find it silly, because nobody has bothered to study the validity of using this model to rate teachers. If I’m wrong, please point me to a study.
Politics and SGP
But here we get into the shell game of politics, where rating teachers based on the model is exactly the proposal that lies at the end of an impressive trail of doubletalk. Follow the bouncing ball.
These models, we are told, differ fundamentally from VAM (which is now seen as somewhat damaged goods politically, I suspect). While VAM tried to isolate teacher contribution, these models do no such thing – they are simply measuring student progress from year to year, which, after all, is what we truly care about. The models have even been rebranded with a new name: student growth percentiles, or SGP. SGP is sold as just describing student progress rather than attributing it to teachers, there can’t be any harm in that, right? – and nothing that needs validation, either. And because SGP is such a clean methodology – if you’re looking for a data-driven model to use for broad “educational assessment,” don’t get yourself into that whole VAM morass, use SGP instead!
Only before you know it, educational assessment turns into, you guessed it, rating teachers. That’s right: because these models aren’t built to rate teachers, they can focus on the things that really matter (student progress), and thus end up being – wait for it – much better for rating teachers! War is peace, friends. Ignorance is strength.
Creators of SGP
You can find a good discussion of SGP’s and their use in evaluation here, and a lot more from the same author, the impressively prolific Bruce Baker, here. Here’s a response from the creators of SGP. They maintain that information about student growth is useful (duh), and agree that differences in SGP’s should not be attributed to teachers (emphasis mine):
Large-scale assessment results are an important piece of evidence but are not sufficient to make causal claims about school or teacher quality.
SGP and teacher evaluations
But guess what?
The New Jersey Board of Ed and state education commissioner Cerf are putting in place a new teacher evaluation code, to be used this coming academic year and beyond. You can find more details here and here.
Summarizing: for math and English teachers in grades 4-8, 30% of their annual evaluation next year would be mandated by the state to come from those very same SGP’s that, according to their creators, are not sufficient to make causal claims about teacher quality. These evaluations are the primary input in tenure decisions, and can also be used to take away tenure from teachers who receive low ratings.
The proposal is not final, but is fairly far along in the regulatory approval process, and would become final in the next several months. In a recent step in the approval process, the weight given to SGP’s in the overall evaluation was reduced by 5%, from 35%. However, the 30% weight applies next year only, and in the future the state could increase the weight to as high as 50%, at its discretion.
Modeler’s Note #1: the precise weight doesn’t really matter. If the SGP scores vary a lot, and the other components don’t vary very much, SGP scores will drive the evaluation no matter what their weight.
Modeler’s Note #2: just reminding you again that this data-driven framework for teacher evaluation is being put in place without any data-driven evaluation of its effectiveness. And that this is a feature, not a bug – SGP has not been tested as an attribution tool because we keep hearing that it’s not meant to be one.
In a slightly ironic twist, commissioner Cerf has responded to criticisms that SGP hasn’t been tested by pointing to a Gates Foundation study of the effectiveness of… value-added models. The study is here. It draws pretty positive conclusions about how well VAM’s work. A number of critics have argued, pretty effectively, that the conclusions are unsupported by the data underlying the study, and that the data actually shows that VAM’s work badly. For a sample, see this. For another example of a VAM-positive study that doesn’t seem to stand up to scrutiny, see this and this.
Modeler’s Role Play #1
Say you were the modeler who had popularized SGP’s. You’ve said that the framework isn’t meant to make causal claims, then you see New Jersey (and other states too, I believe) putting a teaching evaluation model in place that uses SGP to make causal claims, without testing it first in any way. What would you do?
So far, the SGP mavens who told us that “Large-scale assessment results are an important piece of evidence but are not sufficient to make causal claims about school or teacher quality” remain silent about the New Jersey initiative, as far as I know.
Modeler’s Role Play #2
Now you’re you again, and you’ve never heard about SGP’s and New Jersey’s new teacher evaluation code until today. What do you do?
I want you to help me stop this thing. It’s not in place yet, and I hope there’s still time.
I don’t think we can convince the state education department on the merits. They’ve made the call that the new evaluation system is better than the current one or any alternatives they can think of, they’re invested in that decision, and we won’t change their minds directly. But we can make it easier for them to say no than to say yes. They can be influenced – by local school administrators, state politicians, the national education community, activists, you tell me who else. And many of those people will have more open minds. If I tell you, and you tell the right people, and they tell the right people, the chain gets to the decision makers eventually.
I don’t think I could convince Chris Christie, but maybe I could convince Bruce Springsteen if I met him, and maybe Bruce Springsteen could convince Chris Christie.
I thought we could start with a manifesto – a direct statement from the modeling community explaining why this sucks. Directed at people who can influence the politics, and signed by enough experts (let’s get some big names in there) to carry some weight with those influencers.
Can you help? Help write it, sign it, help get other people to sign it, help get it to the right audience. Know someone whose opinion matters in New Jersey? Then let me know, and help spread the word to them. Use Facebook and Twitter if it’ll help. And don’t forget good old email, phone calls, and lunches with friends.
Or, do you have a better idea? Then put it down. Here. The comments section is wide open. Let’s not fall back on criticizing the politicians for being dumb after the fact. Let’s do everything we can to keep them from doing this dumb thing in the first place.
Shame on us if we can’t make this right.
This is a guest post by Kaisa Taipale. Kaisa got a BS at Caltech, a Ph.D. in math at the University of Minnesota, was a post-doc at MSRI, an assistant professor at St. Olaf College 2010-2012, and is currently visiting Cornell, which is where I met here a couple of weeks ago, and where she told me about her cool visualizations of math Ph.D. emigration patterns and convinced her to write a guest post. Here’s Kaisa on a bridge:
Math data and viz
I was inspired by this older post on Mathbabe, about visualizing the arXiv postings of various math departments.
It got me thinking about tons of interesting questions I’ve asked myself and could answer with visualizations: over time, what’s been coolest on the arXiv? are there any topics that are especially attractive to hiring institutions? There’s tons of work to do!
I had to start somewhere though, and as I’m a total newbie when it comes to data analysis, I decided to learn some skills while focusing on a data set that I have easy non-technical access to and look forward to reading every year. I chose the AMS Annual Survey. I also wanted to stick to questions really close to my thoughts over the last two years, namely the academic job search.
I wanted to learn to use two tools, R and Circos. Why Circos? See the visualizations of college major and career path here - it’s pretty! I’ve messed around with a lot of questions, but in this post I’ll look at two and a half.
Where do graduating PhDs from R1 universities end up, in the short term? I started with graduates of public R1s, as I got my PhD at one.
The PhD-granting institutions are colored green, while academic institutions granting other degrees are in blue. Purple is for business, industry, government, and research institutions. Red is for non-U.S. employment or people not seeking — except for the bright red, which is still seeking. Yellow rounds things out at unknown. Remember, these figures are for immediate plans after graduation rather than permanent employment.
While I was playing with this data (read “learning how to use the reshape and ggplot2 packages”) I noticed that people from private R1s tend to end up at private R1s more often. So I graphed that too.
Does the professoriate in the audience have any idea if this is self-selection or some sort of preference on the part of employers? Also, what happened between 2001 and 2003? I was still in college, and have no idea what historical events are at play here.
Where mathematicians go
For any given year, we can use a circular graph to show us where people go. This is a more clumped version of the above data from 2010 alone, plotted using Circos. (Supplemental table E.4 from the AMS report online.)
The other question – the question current mathematicians secretly care more about, in a gossipy and potentially catty way – is what fields lead to what fate. We all know algebra and number theory are the purest and most virtuous subjects, and applied math is for people who want to make money or want to make a difference in the world.
[On that note, you might notice that I removed statistics PhDs in the visualization below, and I also removed some of the employment sectors that gained only a few people a year. The stats ribbons are huge and the small sectors are very small, so for looks alone I took them out.]
Higher resolution version available here.
I wish I could animate a series of these to show this view over time as well. Let me know if you know how to do that! Another nice thing I could do would be to set up a webpage in which these visualizations could be explored in a bit more depth. (After finals.)
- I haven’t computed any numbers for you
- the graphs from R show employment in each field by percentage of graduates instead of total number per category;
- it’s hard to show both data over time and all the data one could explore. But it’s a start.
I should finish with a shout-out to Roger Peng and Jeff Leek, though we’ve never met: I took Peng’s Computing for Data Analysis and much of Leek’s Data Analysis on Coursera (though I’m one of those who didn’t finish the class). Their courses and Stack Overflow taught me almost everything I know about R. As I mentioned above, I’m pretty new to this type of analysis.
What questions would you ask? How can I make the above cooler? Did you learn anything?
There’ve been a couple of articles in the past few days about teacher Value-Added Testing that have enraged me.
If you haven’t been paying attention, the Value-Added Model (VAM) is now being used in a majority of the states (source: the Economist):
But it gives out nearly random numbers, as gleaned from looking at the same teachers with two scores (see this previous post). There’s a 24% correlation between the two numbers. Note that some people are awesome with respect to one score and complete shit on the other score:
Final thing you need to know about the model: nobody really understands how it works. It relies on error terms of an error-riddled model. It’s opaque, and no teacher can have their score explained to them in Plain English.
Now, with that background, let’s look into these articles.
First, there’s this New York Times article from yesterday, entitled “Curious Grade for Teachers: Nearly All Pass”. In this article, it describes how teachers are nowadays being judged using a (usually) 50/50 combination of classroom observations and VAM scores. This is different from the past, which was only based on classroom observations.
What they’ve found is that the percentage of teachers found “effective or better” has stayed high in spite of the new system – the numbers are all over the place but typically between 90 and 99 percent of teachers. In other words, the number of teachers that are fingered as truly terrible hasn’t gone up too much. What a fucking disaster, at least according to the NYTimes, which seems to go out of its way to make its readers understand how very much high school teachers suck.
A few things to say about this.
- Given that the VAM is nearly a random number generator, this is good news – it means they are not trusting the VAM scores blindly. Of course, it still doesn’t mean that the right teachers are getting fired, since half of the score is random.
- Another point the article mentions is that failing teachers are leaving before the reports come out. We don’t actually know how many teachers are affected by these scores.
- Anyway, what is the right number of teachers to fire each year, New York Times? And how did you choose that number? Oh wait, you quoted someone from the Brookings Institute: “It would be an unusual profession that at least 5 percent are not deemed ineffective.” Way to explain things so scientifically! It’s refreshing to know exactly how the army of McKinsey alums approach education reform.
- The overall article gives us the impression that if we were really going to do our job and “be tough on bad teachers,” then we’d weight the Value-Added Model way more. But instead we’re being pussies. Wonder what would happen if we weren’t pussies?
The second article explained just that. It also came from the New York Times (h/t Suresh Naidu), and it was a the story of a School Chief in Atlanta who took the VAM scores very very seriously.
What happened next? The teachers cheated wildly, changing the answers on their students’ tests. There was a big cover-up, lots of nasty political pressure, and a lot of good people feeling really bad, blah blah blah. But maybe we can take a step back and think about why this might have happened. Can we do that, New York Times? Maybe it had to do with the $500,000 in “performance bonuses” that the School Chief got for such awesome scores?
Let’s face it, this cheating scandal, and others like it (which may never come to light), was not hard to predict (as I explain in this post). In fact, as a predictive modeler, I’d argue that this cheating problem is the easiest thing to predict about the VAM, considering how it’s being used as an opaque mathematical weapon.
MathBabe recently wrote an article critical of the elitist nature of Ted Talks, which you can read here. Fortunately for her, and for the hoi polloi everywhere clamoring for populist science edutainment, there is an alternative: Nerd Nite. Once a month, in cities all over the globe, nerds herd into a local bar and turn it into a low-brow forum for innovative science ideas. Think Ted Talks on tequila.
Each month, three speakers present talks for 20-30 minutes, followed by questions and answers from the invariably sold-out audience. The monthly forum gives professional and amateur scientists an opportunity to explain their fairly abstruse specialties accessibly to a lay audience – a valuable skill. Since the emphasis is on science entertainment, it also gives the speakers a chance to present their ideas in a more engaging way: in iambic pentameter, in drag with a tuba, in three-part harmony, or via interpretive dance – an invaluable skill. The resulting atmosphere is informal, delightfully debauched, and refreshingly pro-science.
Slaking our thirst for both science education and mojitos, Nerd Nite started small but quickly went viral. Nerd Nites are now being held in 50 cities, from San Francisco to Kansas City and Auckland to Liberia. You can find the full listing of cities here; if you don’t see one near you, start one!
Last Wednesday night I was twitterpated to be one of three guest nerds sharing the stage at San Francisco’s Nerd Nite. I put the chic back into geek with a biology talk entitled “Genital Plugs, Projectile Penises, and Gay Butterflies: A Naturalist Explains the Birds and the Bees.”
A video recording of the presentation will be available online soon, but in the meantime, here’s a tantalizing clip from the talk, in which Isabella Rossellini explains the mating habits of the bee. Warning: this is scientifically sexy.
I shared the stage with Chris Anderson, who gave a fascinating talk on how the DIY community is building drones out of legos and open-source software. These DIY drones fly below government regulation and can be used for non-military applications, something we hear far too little of in the daily war digest that passes for news. The other speaker was Mark Rosin of the UK-based Guerrilla Science project. This clever organization reaches out to audiences at non-science venues, such as music concerts, and conducts entertaining presentations that teach core science ideas. As part of his presentation Mark used 250 inflated balloons and a bass amp to demonstrate the physics concept of resonance.
If your curiosity has been piqued and you’d like to check out an upcoming Nerd Nite, consider attending the upcoming Nerdtacular, the first Nerd Nite Global Festival, to be held this August 16-18th in Brooklyn, New York.
The global Nerdtacular: Now that’s an idea worth spreading.
At my new job I’ve been spending my time editing my book with Rachel Schutt (who is joining me at JRL next week! Woohoo!). It’s called Doing Data Science and it’s based on these notes I took when she taught a class on data science at Columbia last semester. Right now I’m working on the alternating least squares chapter, where we learned from Matt Gattis how to build and optimize a recommendation system. A very cool algorithm.
However, to be honest I’ve started to feel very sorry for the one parameter we call It’s also sometimes referred to as “the prior”.
Let me tell you, the world is asking too much from this little guy, and moreover most of the big-data world is too indifferent to its plight. Let me explain.
First, he’s supposed to reflect an actual prior belief – namely, his size is supposed to reflect a mathematical vision of how big we think the coefficients in our solution should be.
In an ideal world, we would think deeply about this question of size before looking at our training data, and think only about the scale of our data (i.e. the input), the scale of the preferences (i.e. the recommendation system output) and the quality and amount of training data we have, and using all of that, we’d figure out our prior belief on the size or at least the scale of our hoped-for solution.
I’m not statistician, but that’s how I imagine I’d spend my days if I were: thinking through this reasoning carefully, and even writing it down carefully, before I ever start my training. It’s a discipline like any other to carefully state your beliefs beforehand so you know you’re not just saying what the data wants to hear.
as convergence insurance
But then there’s the next thing we ask of our parameter namely we assign him the responsibility to make sure our algorithm converges.
Because our algorithm isn’t a closed form solution, but rather we are discovering coefficients of two separate matrices and , fixing one while we tweak the other, then switching. The algorithm stops when, after a full cycle of fixing and tweaking, none of the coefficients have moved by more than some pre-ordained
The fact that this algorithm will in fact stop is not obvious, and in fact it isn’t always true.
It is (mostly*) true, however, if our little is large enough, which is due to the fact that our above-mentioned imposed belief of size translates into a penalty term, which we minimize along with the actual error term. This little miracle of translation is explained in this post.
And people say that all the time. When you say, “hey what if that algorithm doesn’t converge?” They say, “oh if is big enough it always does.”
But that’s kind of like worrying about your teenage daughter getting pregnant so you lock her up in her room all the time. You’ve solved the immediate problem by sacrificing an even bigger goal.
Because let’s face it, if the prior is too big, then we are sacrificing our actual solution for the sake of conveniently small coefficients and convergence. In the asymptotic limit, which I love thinking about, our coefficients all go to zero and we get nothing at all. Our teenage daughter has run away from home with her do-nothing boyfriend.
By the way, there’s a discipline here too, and I’d suggest that if the algorithm doesn’t converge you might also want to consider reducing your number of latent variables rather than increasing your since you could be asking too much from your training data. It just might not be able to distinguish that many important latent characteristics.
as tuning parameter
Finally, we have one more job for our little , we’re not done with him yet. Actually for some people this is his only real job, because in practice this is how he’s treated. Namely, we optimize him so that our results look good under whatever metric we decide to care about (but it’s probably the mean squared error of preference prediction on a test set (hopefully on a test set!)).
In other words, in reality most of the above nonsense about is completely ignored.
This is one example among many where having the ability to push a button that makes something hard seem really easy might be doing more harm than good. In this case the button says “optimize with respect to “, but there are other buttons that worry me just as much, and moreover there are lots of buttons being built right now that are even more dangerous and allow the users to be even more big-data-blithe.
I’ve said it before and I’ll say it again: you do need to know about inverting a matrix, and other math too, if you want to be a good data scientist.
* There’s a change-of-basis ambiguity that’s tough to get rid of here, since you only choose the number of latent variables, not their order. This doesn’t change the overall penalty term, so you can minimize that with large enough but if you’re incredibly unlucky I can imagine you might bounce between different solutions that differ by a base change. In this case your steps should get smaller, i.e. the amount you modify your matrix each time you go through the algorithm. This is only a theoretical problem by the way but I’m a nerd.
If this New York Times editorial is correct, and it certainly passes the smell test, students are not well-served by online courses but are by so-called “hybrid” courses, where there’s a bit of online stuff and also a bit of one-on-one time. From the editorial:
The research has shown over and over again that community college students who enroll in online courses are significantly more likely to fail or withdraw than those in traditional classes, which means that they spend hard-earned tuition dollars and get nothing in return. Worse still, low-performing students who may be just barely hanging on in traditional classes tend to fall even further behind in online courses.
This is important news for math departments, at least in the medium term (i.e. until machine learners figure out how to successfully simulate one-on-one interactions), because it means they won’t be replacing calculus class with a computer. And as every mathematician should know, calculus is the bread and butter of math departments.
I don’t agree with everything she always says, but I agree with everything Izabella Laba says in this post called Gender Bias 101 For Mathematicians (hat tip Jordan Ellenberg). And I’m kind of jealous she put it together in such a fantastic no-bullshit way.
Namely, she debunks a bunch of myths of gender bias. Here’s my summary, but you should read the whole thing:
- Myth: Sexism in math is perpetrated mainly by a bunch of enormously sexist old guys. Izabella: Nope, it’s everyone, and there’s lots of evidence for that.
- Myth: The way to combat sexism is to find those guys and isolate them. Izabella: Nope, that won’t work, since it’s everyone.
- Myth: If it’s really everyone, it’s too hard to solve. Izabella: Not necessarily, and hey you are still trying to solve the Riemann Hypothesis even though that’s hard (my favorite argument).
- Myth: We should continue to debate about its existence rather than solution. Izabella: We are beyond that, it’s a waste of time, and I’m not going to waste my time anymore.
- Myth: Izabella, you are only writing this to be reassured. Izabella: Don’t patronize me.
Here’s what I’d add. I’ve been arguing for a long time that gender bias against girls in math starts young and starts at the cultural level. It has to do with expectations of oneself just as much as a bunch of nasty old men (by the way, the above is not to say there aren’t nasty old men (and nasty old women!), just that it’s not only about them).
My argument has been that the cultural differences are larger than the talent differences, something Larry Summers strangely dismissed without actually investigating in his famous speech.
And I think I’ve found the smoking gun for my side of this argument, in the form of an interactive New York Times graphic from last week’s Science section which I’ve screenshot here:
What this shows is that 15-year-old girls out-perform 15-year-old boys in certain countries and under-perform them in others. Those countries where they outperform boys is not random and has everything to do with cultural expectations and opportunities for girls in those countries and is explained to some extent by stereotype threat. Go read the article, it’s fascinating.
I’ll say again what I said already at the end of this post: the great news is that it is possible to address stereotype threat directly, which won’t solve everything but will go a long way.
You do it by emphasizing that mathematical talent is not inherent, nor fixed at birth, and that you can cultivate it and grow it over time and through hard work. I make this speech whenever I can to young people. Spread the word!
Last night I was waiting for a bus to go hang with my Athena Mastermind group, which consists of a bunch of very cool Barnard student entrepreneurs and their would-be role models (I say would-be because, although we role models are also very cool, I often think the students are role modeling for us).
As I was waiting at the bus stop, I overheard two women talking about the new Applied Data Science class that just started at Columbia, which is being taught by Ian Langmore, Daniel Krasner and Chang She. I knew about this class because Ian came to advertise it last semester in Rachel Schutt’s Intro to Data Science class which I blogged. One of the women at the bus stop had been in Rachel’s class and the other is in Ian’s.
Turns out I just love overhearing nerd girls talking data science at the bus stop. Don’t you??
And to top off the nerd girl experience, I’m on my way today to Nebraska to give a talk to a bunch of undergraduate women in math about what they can do with math outside of academia. I’m planning it to be an informative talk, but that’s really just cover to its real goal, which is to give a pep talk.
My experience talking to young women in math, at least when they are grad students, is that they respond viscerally to encouragement, even if it’s vague. I can actually see their egos inflate in the audience as I speak, and that’s a good thing, that’s why I’m there.
As a community, I’ve realized, nerd girls going through grad school are virtually starved for positive feedback, and so my job is pretty clear cut: I’m going to tell them how awesome they are and answer their questions about what it’s like in the “real world” and then go back to telling them how awesome they are.
By the end they sit a bit straighter and smile a bit more after I’m done, after I’ve told them, or reminded them at least, how much power they have as nerd girls – how many options they have, and how they don’t have to be risk-averse, and how they never need to apologize.
Tomorrow my audience is undergraduates, which is a bit trickier, since as an undergrad you still get consistent feedback in the form of grades. So I will tailor my information as well as my encouragement a bit, and try not to make grad school sound too scary, because I do think that getting a Ph.D. is still a huge deal. Comment below if you have suggestions for my talk, please!
I had a great time giving my “Weapons of Math Destruction” talk in San Diego, and the audience was fantastic and thoughtful.
One question that someone asked was whether the US News & World Reports college ranking model should be forced to be open sourced – wouldn’t that just cause colleges to game the model?
First of all, colleges are already widely gaming the model and have been for some time. And that gaming is a distraction and has been heading colleges in directions away from good instruction, which is a shame.
And if you suggest that they change the model all the time to prevent this, then you’ve got an internal model of this model that needs adjustment. They might be tinkering at the edges but overall it’s quite clear what’s going into the model: namely, graduation rates, SAT scores, number of Ph.D’s on staff, and so on. The exact percentages change over time but not by much.
The impact that this model has had on education and how universities apportion resources has been profound. Academic papers have been written on the law school version of this story.
Moreover, the tactics that US News & World Reports uses to enforce their dominance of the market are bullying, as you can learn from the President of Reed College, which refuses to be involved.
Back to the question. Just as I realize that opening up all data is not reasonable or desirable, because first of all there are serious privacy issues but second of all certain groups have natural advantages to openly shared resources, it is also true that opening up all models is similarly problematic.
However, certain data should surely be open: for example, the laws of our country, that we are all responsible to know, should be freely available to us (something that Aaron Swartz understood and worked towards). How can we be held responsible for laws we can’t read?
Similarly, public-facing models, such as credit scoring models and teacher value-added models, should absolutely be open and accessible to the public. If I’m being judged and measured and held accountable by some model in my daily life as a citizen, that has real impact on how my future will unfold, then I should know how that process works.
And if you complain about the potential gaming of those public-facing models, I’d answer: if they are gameable then they shouldn’t be used, considering the impact they have on so many people’s lives. Because a gameable model is a weak model, with proxies that fail.
Another way to say this is we should want someone to “game” the credit score model if it means they pay their bills on time every month (I wrote about this here).
Back to the US News & World Report model. Is it public facing? I’m no lawyer but I think a case can be made that it is, and that the public’s trust in this model makes it a very important model indeed. Evidence can be gathered by measuring the extent to which colleges game the model, which they only do because the public cares so much about the rankings.
Even so, what difference would that make, to open it up?
In an ideal world, where the public is somewhat savvy about what models can and cannot do, opening up the US News & World Reports college ranking model would result in people losing faith in it. They’d realize that it’s no more valuable than an opinion from a highly vocal uncle of theirs who is obsessed with certain metrics and blind to individual eccentricities and curriculums that may be a perfect match for a non-conformist student. It’s only one opinion among many, and not to be religiously believed.
But this isn’t an ideal world, and we have a lot of work to do to get people to understand models as opinions in this sense, and to get people to stop trusting them just because they’re mathematically presented.
I’m giving a talk at the Joint Mathematics Meeting on Thursday (it’s a 30 minute talk that starts at 11:20am, in Room 2 of the Upper Level of the San Diego Conference Center, I hope you come!).
Thinking about that talk brought something up for me that I think I want to address before the next talk. Namely, at the beginning of the talk I was explaining the title, “How Mathematics is Used Outside of Academia,” and I mentioned that most mathematicians that leave academia end up doing modeling.
I can’t remember the exact exchange, but I referred to myself at some point in this discussion as a mathematician outside of academia, at which point someone in the audience expressed incredulity:
him: Really? Are you still a mathematician? Do you prove theorems?
me: No, I don’t prove theorems any longer, now that I am a modeler… (confused look)
At the moment I didn’t have a good response to this, because he was using a different definition of “mathematician” than I was. For some reason he thought a mathematician must prove theorems.
I don’t think so. I had a conversation about this after my talk with Bob Beals, who was in the audience and who taught many years ago at the math summer program I did last summer. After getting his Ph.D. in math, Bob worked for the spooks, and now he works for RenTech. So he knows a lot about doing math outside academia too, and I liked his perspective on this question.
Namely, he wanted to look at the question through the lens of “grunt work”, which is to say all of the actual work that goes into a “result.”
As a mathematician, of course, you don’t simply sit around all day proving theorems. Actually you spend most of your time working through examples to get a feel for the terrain, and thinking up simple ways to do what seems like hard things, and trying out ideas that fail, and going down paths that are dry. If you’re lucky, then at the end of a long journey like this, you will have a theorem.
The same basic thing happens in modeling. You spend lots of time with the data, getting to know it, and then trying out certain approaches, which sometimes, or often, end up giving you nothing interesting, and half the time you realize you were expecting the wrong thing so you have to change it entirely. In the end you may end up with a model which is useful. If you’re lucky.
There’s a lot of grunt work in both endeavors, and there’s a lot of hard thinking along the way, lots of ways for you to fool yourself that you’ve got something when you haven’t. Perhaps in modeling it’s easier to lie, which is a big difference indeed. But if you’re an honest modeler then I claim the difference in the process of getting an interesting and important result is not that different.
And, I claim, I am still being a mathematician while I’m doing it.
I lied yesterday, as a friend at my Occupy meeting pointed out to me last night.
I made it seem like I look into every model before trusting it, and of course that’s not true. I eat food grown and prepared by other people daily. I go on airplanes and buses all the time, trusting that they will work and that they will be driven safely. I still have my money in a bank, and I also hire an accountant and sign my tax forms without reading them. So I’m a hypocrite, big-time.
There’s another thing I should clear up: I’m not claiming I understand everything about climate research just because I talked to an expert for 2 or 3 hours. I am certainly not an expert, nor am I planning to become one. Even so, I did learn a lot, and the research I undertook was incredibly useful to me.
So, for example, my father is a climate change denier, and I have heard him give a list of scientific facts to argue against climate change. I asked my expert to counter-argue these points, and he did so. I also asked him to explain the underlying model at a high level, which he did.
My conclusion wasn’t that I’ve looked carefully into the model and it’s right, because that’s not possible in such a short time. My conclusion was that this guy is trustworthy and uses logical argument, which he’s happy to share with interested people, and moreover he manages to defend against deniers without being intellectually defensive. In the end, I’m trusting him, an expert.
On the other hand, if I met another person with a totally different conclusion, who also impressed me as intellectually honest and curious, then I’d definitely listen to that guy too, and I’d be willing to change my mind.
So I do imbue models and theories with a limited amount of trust depending on how much sense they makes to me. I think that’s reasonable, and it’s in line with my advocacy of scientific interpreters. Obviously not all scientific interpreters would be telling the same story, but that’s not important – in fact it’s vital that they don’t, because it is a privilege to be allowed to listen to the different sides and be engaged in the debate.
If I sat down with an expert for a whole day, like my friend Jordan suggests, to determine if they were “right” on an issue where there’s argument among experts, then I’d fail, but even understanding what they were arguing about would be worthwhile and educational.
Let me say this another way: experts argue about what they don’t agree on, of course, since it would be silly for them to talk about what they do agree on. But it’s their commonality that we, the laypeople, are missing. And that commonality is often so well understood that we could understand it rather quickly if it was willingly explained to us. That would be a huge step.
So I wasn’t lying after all, if I am allowed to define the “it” that I did get at in the two hours with an expert. When I say I understood it, I didn’t mean everything, I meant a much larger chunk of the approach and method than I’d had before, and enough to evoke (limited) trust.
Something I haven’t addressed, which I need to think about more (please help!), is the question of what subjects require active skepticism. On of my commenters, Paul Stevens, brought this up:
… For me, lay people means John Q Public – public opinion because public opinion can shape policy. In practice, this only matters for a select few issues, such as climate change or science education. There is no impact to a lay person not understanding / believing in the Higgs particle for example.
Stephanie asks three important questions about trusting experts, which I paraphrase here:
- What does it take to look into a model yourself? How deeply must you probe?
- How do you avoid being manipulated when you do so?
- Why should we bother since stuff is so hard and we each have a limited amount of time?
I must confess I find the first two questions really interesting and I want to think about them, but I have a very little patience with the last question.
- I’ve seen too many people (individual modelers) intentionally deflect investigations into models by setting them up as so hard that it’s not worth it (or at least it seems not worth it). They use buzz words and make it seem like there’s a magical layer of their model which makes it too difficult for mere mortals. But my experience (as an arrogant, provocative, and relentless questioner) is that I can always understand a given model if I’m talking to someone who really understands it and actually wants to communicate it.
- It smacks of an excuse rather than a reason. If it’s our responsibility to understand something, then by golly we should do it, even if it’s hard.
- Too many things are left up to people whose intentions are not reasonable using this “too hard” argument, and it gives those people reason to make entire systems seem too difficult to penetrate. For a great example, see the financial system, which is consistently too complicated for regulators to properly regulate.
I’m sure I seem unbelievably cynical here, but that’s where I got by working in finance, where I saw first-hand how manipulative and manipulated mathematical modeling can become. And there’s no reason at all such machinations wouldn’t translate to the world of big data or climate modeling.
Speaking of climate modeling: first, it annoys me that people are using my “distrust the experts” line to be cast doubt on climate modelers.
People: I’m not asking you to simply be skeptical, I’m saying you should look into the models yourself! It’s the difference between sitting on a couch and pointing at a football game on TV and complaining about a missed play and getting on the football field yourself and trying to figure out how to throw the ball. The first is entertainment but not valuable to anyone but yourself. You are only adding to the discussion if you invest actual thoughtful work into the matter.
To that end, I invited an expert climate researcher to my house and asked him to explain the climate models to me and my husband, and although I’m not particularly skeptical of climate change research (more on that below when I compare incentives of the two sides), I asked obnoxious, relentless questions about the model until I was satisfied. And now I am satisfied. I am considering writing it up as a post.
As an aside, if climate researchers are annoyed by the skepticism, I can understand that, since football fans are an obnoxious group, but they should not get annoyed by people who want to actually do the work to understand the underlying models.
Another thing about climate research. People keep talking about incentives, and yes I agree wholeheartedly that we should follow the incentives to understand where manipulation might be taking place. But when I followed the incentives with respect to climate modeling, they bring me straight to climate change deniers, not to researchers.
Do we really think these scientists working with their research grants have more at stake than multi-billion dollar international companies who are trying to ignore the effect of their polluting factories on the environment? People, please. The bulk of the incentives are definitely with the business owners. Which is not to say there are no incentives on the other side, since everyone always wants to feel like their research is meaningful, but let’s get real.
I like this idea Stephanie comes up with:
Some sociologists of science suggest that translational “experts”–that is, “experts” who aren’t necessarily producing new information and research, but instead are “expert” enough to communicate stuff to those not trained in the area–can help bridge this divide without requiring everyone to become “experts” themselves. But that can also raise the question of whether these translational experts have hidden agendas in some way. Moreover, one can also raise questions of whether a partial understanding of the model might in some instances be more misleading than not looking into the model at all–examples of that could be the various challenges to evolution based on fairly minor examples that when fully contextualized seem minor but may pop out to someone who is doing a less systematic inquiry.
First, I attempt to make my blog something like a platform for this, and I also do my best to make my agenda not at all hidden so people don’t have to worry about that.
This raises a few issues for me:
- Right now we depend mostly on press to do our translations, but they aren’t typically trained as scientists. Does that make them more prone to being manipulated? I think it does.
- How do we encourage more translational expertise to emerge from actual experts? Currently, in academia, the translation to the general public of one’s research is not at all encouraged or rewarded, and outside academia even less so.
- Like Stephanie, I worry about hidden agendas and partial understandings, but I honestly think they are secondary to getting a robust system of translation started to begin with, which would hopefully in turn engage the general public with the scientific method and current scientific knowledge. In other words, the good outweighs the bad here.