## Measuring Up by Daniel Koretz

July 9, 2013

This is a guest post by Eugene Stern.

Now that I have kids in school, I’ve become a lot more familiar with high-stakes testing, which is the practice of administering standardized tests with major consequences for students who take them (you have to pass to graduate), their teachers (who are often evaluated based on standardized test results), and their school districts (state funding depends on test results). To my great chagrin, New Jersey, where I live, is in the process of putting such a teacher evaluation system in place (for a lot more detail and criticism, see here).

The excellent John Ewing pointed me to a pretty comprehensive survey of standardized testing called “Measuring Up,” by Harvard Ed School prof Daniel Koretz, who teaches a course there about this stuff. If you have any interest in the subject, the book is very much worth your time. But in case you don’t get to it, or just to whet your appetite, here are my top 10 takeaways:

1. Believe it or not, most of the people who write standardized tests aren’t idiots. Building effective tests is a difficult measurement problem! Koretz makes an analogy to political polling, which is a good reminder that a test result is really a sample from a distribution (if you take multiple versions of a test designed to measure the same thing, you won’t do exactly the same each time), and not an absolute measure of what someone knows. It’s also a good reminder that the way questions are phrased can matter a great deal.

2. The reliability of a test is inversely related to the standard deviation of this distribution: a test is reliable if your score on it wouldn’t vary very much from one instance to the next. That’s a function of both the test itself and the circumstances under which people take it. More reliability is better, but the big trade-off is that increasing the sophistication of the test tends to decrease reliability. For example, tests with free form answers can test for a broader range of skills than multiple choice, but they introduce variability across graders, and even the same person may grade the same test differently before and after lunch. More sophisticated tasks also take longer to do (imagine a lab experiment as part of a test), which means fewer questions on the test and a smaller cross-section of topics being sampled, again meaning more noise and less reliability.

3. A complementary issue is bias, which is roughly about people doing better or worse on a test for systematic reasons outside the domain being tested. Again, there are trade-offs: the more sophisticated the test, the more extraneous skills beyond those being tested it may be bringing in. One common way to weed out such questions is to look at how people who score the same on the overall test do on each particular question: if you get variability you didn’t expect, that may be a sign of bias. It’s harder to do this for more sophisticated tests, where each question is a bigger chunk of the overall test. It’s also harder if the bias is systematic across the test.

4. Beyond the (theoretical) distribution from which a single student’s score is a sample, there’s also the (likely more familiar) distribution of scores across students. This depends both on the test and on the population taking it. For example, for many years, students on the eastern side of the US were more likely to take the SAT than those in the west, where only students applying to very selective eastern colleges took the test. Consequently, the score distributions were very different in the east and the west (and average scores tended to be higher in the west), but this didn’t mean that there was bias or that schools in the west were better.

5. The shape of the score distribution across students carries important information about the test. If a test is relatively easy for the students taking it, scores will be clustered to the right of the distribution, while if it’s hard, scores will be clustered to the left. This matters when you’re interpreting results: the first test is worse at discriminating among stronger students and better at discriminating among weaker ones, while the second is the reverse.

6. The score distribution across students is an important tool in communicating results (you may not know right away what a score of 600 on a particular test means, but if you hear it’s one standard deviation above a mean of 500, that’s a decent start). It’s also important for calibrating tests so that the results are comparable from year to year. In general, you want a test to have similar means and variances from one year to the next, but this raises the question of how to handle year-to-year improvement. This is particularly significant when educational goals are expressed in terms of raising standardized test scores.

7. If you think in terms of the statistics of test score distributions, you realize that many of those goals of raising scores quickly are deluded. Koretz has a good phrase for this: the myth of the vanishing variance. The key observation is that test score distributions are very wide, on all tests, everywhere, including countries that we think have much better education systems than we do. The goals we set for student score improvement (typically, a high fraction of all students taking a test several years from now are supposed to score above some threshold) imply a great deal of compression at the lower end of this distribution – compression that has never been seen in any country, anywhere. It sounds good to say that every kid who takes a certain test in four years will score as proficient, but that corresponds to a score distribution with much less variance than you’ll ever see. Maybe we should stop lying to ourselves?

8. Koretz is highly critical of the recent trend to report test results in terms of standards (e.g., how many students score as “proficient”) instead of comparisons (e.g., your score is in the top 20% of all students who took the test). Standards and standards-based reporting are popular because it’s believed that American students’ performance as a group is inadequate. The idea is that being near the top doesn’t mean much if the comparison group is weak, so instead we should focus on making sure every student meets an absolute standard needed for success in life. There are three (at least) problems with this. First, how do you set a standard – i.e., what does proficient mean, anyway? Koretz gives enough detail here to make it clear how arbitrary the standards are. Second, you lose information: in the US, standards are typically expressed in terms of just four bins (advanced, proficient, partially proficient, basic), and variation inside the bins is ignored. Third, even standards-based reporting tends to slide back into comparisons: since we don’t know exactly what proficient means, we’re happiest when our school, or district, or state places ahead of others in the fraction of students classified as proficient.

9. Koretz’s other big theme is score inflation for high-stakes tests: if everyone is evaluated based on test scores, everyone has an incentive to get those scores up, whether or not that actually has much correlation with learning. If you remember anything from the book or from this post, remember this phrase: sawtooth pattern. The idea is that when a new high-stakes standardized test appears, average scores start at some base level, go up quickly as people figure out how to game the test, then plateau. If the test is replaced with another, the same thing happens: base, rapid growth, plateau. Repeat ad infinitum. Koretz and his collaborators did a nice experiment in which they went back to a school district in which one high-stakes test had been replaced with another and administered the first test several years later. Now that teachers weren’t teaching to the first test, scores on it reverted back to the original base level. Moral: score inflation is real, pervasive, and unavoidable, unless we bite the bullet and do away with high-stakes tests.

10. While Koretz is sympathetic toward test designers, who live the complexity of standardized testing every day, he is harsh on those who (a) interpret and report on test results and (b) set testing and education policy, without taking that complexity into account. Which, as he makes clear, is pretty much everyone who reports on results and sets policy.
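The arithmetic behind point 7 can be sketched in a few lines of Python. This is only an illustration under assumptions that are mine, not Koretz’s: scores are treated as normally distributed, the mean of 500 and standard deviation of 100 are borrowed from the example in point 6, and the proficiency cutoff of 500 and the 95% target are hypothetical. The function names are made up for this sketch.

```python
from statistics import NormalDist


def fraction_above(cutoff, mean, sd):
    """Fraction of a normal(mean, sd) score distribution above the cutoff."""
    return 1.0 - NormalDist(mean, sd).cdf(cutoff)


def sd_needed(cutoff, mean, target_fraction):
    """Standard deviation at which target_fraction of students exceed the cutoff.

    Only meaningful when the mean is above the cutoff and the target is
    above 50%; otherwise no amount of compression reaches the target.
    """
    if mean <= cutoff or target_fraction <= 0.5:
        raise ValueError("need mean > cutoff and target_fraction > 0.5")
    z = NormalDist().inv_cdf(1.0 - target_fraction)  # negative z-score
    return (cutoff - mean) / z


def mean_needed(cutoff, sd, target_fraction):
    """Mean at which target_fraction of students exceed the cutoff, sd fixed."""
    z = NormalDist().inv_cdf(1.0 - target_fraction)
    return cutoff - z * sd


if __name__ == "__main__":
    # Hypothetical starting point: mean 500, sd 100, cutoff 500.
    print("proficient now:", fraction_above(500, 500, 100))
    # Suppose the mean rises by half a standard deviation, to 550.
    print("sd needed for 95%:", sd_needed(500, 550, 0.95))
    # Alternatively, hold the spread fixed and ask how far the mean must move.
    print("mean needed for 95%:", mean_needed(500, 100, 0.95))
```

Under these assumptions, even after the average rises by half a standard deviation, getting 95% of students over the cutoff requires the spread to shrink to less than a third of its starting value. With the spread unchanged, the mean would have to rise by more than one and a half standard deviations.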

Final thoughts

If you think it’s a good idea to make high-stakes decisions about schools and teachers based on standardized test results, Koretz’s book offers several clear warnings.

First, we should expect any high-stakes test to be gamed. Worse yet, the more reliable tests, being more predictable, are probably easier to game (look at the SAT prep industry).

Second, the more (statistically) reliable tests, by their controlled nature, cover only a limited sample of the domain we want students to learn. Tests trying to cover more ground in more depth (“tests worth teaching to,” in the parlance of the last decade) will necessarily have noisier results. This noise is a huge deal when you realize that high-stakes decisions about teachers are made based on just two or three years of test scores.
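A rough way to see why fewer, bigger tasks mean noisier scores is to treat each question as an independent trial that a given student answers correctly with the same probability. That is a simplifying assumption of mine, not a model from the book, and real tests are more complicated. The sketch below also assumes that measurement errors in different years are independent, so averaging several years shrinks the noise by the square root of the number of years. Both function names are hypothetical.

```python
import math


def score_standard_error(p_correct, n_questions):
    """Standard error of a percent-correct score (as a fraction of 1).

    Assumes n_questions independent questions, each answered correctly
    with probability p_correct (a binomial model).
    """
    return math.sqrt(p_correct * (1.0 - p_correct) / n_questions)


def averaged_standard_error(single_year_se, n_years):
    """Standard error of the average of n_years independent yearly scores."""
    return single_year_se / math.sqrt(n_years)


if __name__ == "__main__":
    # Hypothetical student with a 70% chance on each question.
    for n in (10, 40, 160):
        print(n, "questions:", score_standard_error(0.7, n))
    # Averaging three years of a 10-question test.
    print("3-year average:",
          averaged_standard_error(score_standard_error(0.7, 10), 3))
```

In this toy model, cutting a test from 40 questions to 10 doubles the standard error of the score, from roughly 7 to roughly 14 percentage points, and averaging three years of results only shrinks the noise by a factor of about 1.7.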

Third, a test that aims to distinguish “proficiency” will do a worse job of distinguishing students elsewhere in the skills range, and may be largely irrelevant for teachers whose students are far away from the proficiency cut-off. (For a truly distressing example of this, see here.)

With so many obstacles to rating schools and teachers reliably based on standardized test scores, is it any surprise that we see results like this?

1. July 9, 2013 at 8:42 am

This looks like a great book. And your comment:

Koretz makes an analogy to political polling, which is a good reminder that a test result is really a sample from a distribution (if you take multiple versions of a test designed to measure the same thing, you won’t do exactly the same each time), and not an absolute measure of what someone knows.

was especially gratifying, as I’ve tried making a very similar point here and elsewhere: Tests really are “measurements” as the term is used in the sciences and in common parlance. As you note, an educational test provides a score, contingent on the time and place taken, that is at best correlated with a certain population’s competence in a subject. Looked at that way, which I think is far more realistic than thinking of a test as a “measuring device”, the importance of Koretz’s remarks about polling and sampling becomes very clear: Testing is a tool for evaluating a student’s understanding of a subject relative to the other test takers and is highly contingent on factors outside of the control of the test administrator; it’s at best one piece of evidence in evaluating teacher and student performance, and a highly subjective piece at that. The idea that tests are somehow objective because they are “mathematical” or “statistical” is just a fantasy (or a lie).

• July 9, 2013 at 8:43 am

Oops! When I wrote “Tests really are ‘measurements’ as the term is used in the sciences and in common parlance,” I meant to write:

Tests really are not “measurements” … .

• July 9, 2013 at 12:47 pm

Of course tests are “measurements.” Perhaps some think they aren’t measurements because they may not have the precision of some scientific measurements, but they are measurements nonetheless. Even the most precise scientific measurements have an error component (i.e., are not 100% precise). Finally, the notion that “testing is a tool for evaluating a student’s understanding of a subject relative to other test takers…” is only true if the “evaluation” of student performance (e.g., pass/fail) is norm-referenced. Many achievement tests are evaluated via an absolute standard or a criterion-referenced standard. In these cases each student’s performance is evaluated without reference to the performance of other students.

• July 9, 2013 at 1:56 pm

Oops, when I wrote “Of course tests are measurements” I should have written “Of course test SCORES are measurements.”

• July 10, 2013 at 3:48 pm

Jerry,

NO, test scores are not “measurements” by any stretch of the definition of that word. The proponents contend that they are measurements, want them to be measurements and do their best to obfuscate the fact that they aren’t measurements. As Wilson states, any conclusions (test scores) are “vain and illusory”. See below and read Wilson’s work to understand the invalidities involved in the whole educational standards, standardized testing and the “grading” of students.

2. July 9, 2013 at 11:34 am

This is a really informative post — thank you!

3. July 9, 2013 at 2:26 pm

There is a lot here — including the use of Item Response theory, raw scores, and the problems using nominal variables; and the epistemological problem of having to “measure” something that is only constructed when interacting with the instrument; and the whole hermeneutic problem revolving around the results.

All this goes far beyond standardized distributions.

• July 9, 2013 at 2:31 pm

http://en.wikipedia.org/wiki/Item_response_theory

It is based on the application of related mathematical models to testing data. Because it is generally regarded as superior to classical test theory, it is the preferred method for developing scales, especially when optimal decisions are demanded, as in so-called high-stakes tests e.g. the Graduate Record Examination (GRE) and Graduate Management Admission Test (GMAT).

NB: Mathematical modeling, again, is central, and drives scaling.

4. July 9, 2013 at 2:53 pm

For me, the most important, headline-grabbing points are 6. and 7. The test is being designed to produce a distribution of scores that don’t change from year to year– if schools do a great job of educating everyone across the board, then gradually the test will be made harder (!) and the same proportion of students will test in the bottom quartile!

It’s so obvious, but I never heard it said so succinctly before. It’s a zero sum game. If scores are re-scaled to correlate to the middle two, the bottom one, and the top quartiles, then if everyone does better (or worse), then the score distributions don’t change. The only way to pull up the bottom is to pull down the top, and get more bunching in the middle. I often have quipped that “no child left behind” should also be renamed “no child let ahead” (not original to me– I heard it from a friend) but I never realized before how strongly this is baked into the numbers before this post!

An interesting question is how current populations would do on old standardized tests. Having looked at SAT test samples from 30-40 years before I personally took them once, my guess is not very well. But this measures the drift in what a person is now expected to know versus then, something the standardized test is carefully designed to correct for as it is updated each year!

Terrific post– thank you.

5. July 9, 2013 at 8:55 pm

Okay, folks, let’s get down to the basics here. All the mathematical machinations that standardized test score promoters do to “validify” the process are based on one basic false assumption, that the teaching and learning process (a quality of human interaction) is amenable to being quantified. It is not, as one cannot logically quantify a quality. Noel Wilson, former test maker for the State of New South Wales in Australia, has shown the many epistemological and ontological errors that occur in the processes of developing educational standards, the making and giving of standardized tests and the dissemination of the results that render these egregious educational malpractices completely invalid.

Read and understand Noel Wilson’s “Educational Standards and the Problem of Error” found at: http://epaa.asu.edu/ojs/article/view/577/700 . If you can refute what he has shown please let me know as I have yet to see any rebuttal/refutation of this.

Brief outline of Wilson’s “Educational Standards and the Problem of Error” and some comments of mine. (updated 6/24/13 per Wilson email)

1. A quality cannot be quantified. Quantity is a sub-category of quality. It is illogical to judge/assess a whole category by only a part (sub-category) of the whole. The assessment is, by definition, lacking in the sense that “assessments are always of multidimensional qualities. To quantify them as one dimensional quantities (numbers or grades) is to perpetuate a fundamental logical error” (per Wilson). The teaching and learning process falls in the logical realm of aesthetics/qualities of human interactions. In attempting to quantify educational standards and standardized testing we are lacking much information about said interactions.

2. A major epistemological mistake is that we attach, with great importance, the “score” of the student, not only onto the student but also, by extension, the teacher, school and district. Any description of a testing event is only a description of an interaction, that of the student and the testing device at a given time and place. The only correct logical thing that we can attempt to do is to describe that interaction (how accurately or not is a whole other story). That description cannot, by logical thought, be “assigned/attached” to the student as it cannot be a description of the student but the interaction. And this error is probably one of the most egregious “errors” that occur with standardized testing (and even the “grading” of students by a teacher).

3. Wilson identifies four “frames of reference”, each with distinct assumptions (epistemological basis) about the assessment process from which the “assessor” views the interactions of the teaching and learning process: the Judge (think college professor who “knows” the students’ capabilities and grades them accordingly), the General Frame (think standardized testing that claims to have a “scientific” basis), the Specific Frame (think of learning by objective, like computer-based learning, getting a correct answer before moving on to the next screen), and the Responsive Frame (think of an apprenticeship in a trade or a medical residency program where the learner interacts with the “teacher” with constant feedback). Each category has its own sources of error and more error in the process is caused when the assessor confuses and conflates the categories.

4. Wilson elucidates the notion of “error”: “Error is predicated on a notion of perfection; to allocate error is to imply what is without error; to know error it is necessary to determine what is true. And what is true is determined by what we define as true, theoretically by the assumptions of our epistemology, practically by the events and non-events, the discourses and silences, the world of surfaces and their interactions and interpretations; in short, the practices that permeate the field. . . Error is the uncertainty dimension of the statement; error is the band within which chaos reigns, in which anything can happen. Error comprises all of those eventful circumstances which make the assessment statement less than perfectly precise, the measure less than perfectly accurate, the rank order less than perfectly stable, the standard and its measurement less than absolute, and the communication of its truth less than impeccable.”

In other words, all the errors involved in the process render any conclusions invalid.

5. The test makers/psychometricians, through all sorts of mathematical machinations, attempt to “prove” that these tests (based on standards) are valid, that is, errorless or supposedly at least with minimal error [they aren’t]. Wilson turns the concept of validity on its head and focuses on just how invalid the machinations and the test and results are. He is an advocate for the test taker not the test maker. In doing so he identifies thirteen sources of “error”, any one of which renders the test making/giving/disseminating of results invalid. A basic logical premise is that once something is shown to be invalid it is just that, invalid, and no amount of “fudging” by the psychometricians/test makers can alleviate that invalidity.

6. Having shown the invalidity, and therefore the unreliability, of the whole process, Wilson concludes, rightly so, that any result/information gleaned from the process is “vain and illusory”. In other words, start with an invalidity, end with an invalidity (except by sheer chance every once in a while, like a blind and anosmic squirrel who finds the occasional acorn, a result may be “true”) or, to put it in more mundane terms, crap in, shit out.

7. And so what does this all mean? I’ll let Wilson have the second to last word: “So what does a test measure in our world? It measures what the person with the power to pay for the test says it measures. And the person who sets the test will name the test what the person who pays for the test wants the test to be named.”

In other words it measures “’something’ and we can specify some of the ‘errors’ in that ‘something’ but still don’t know [precisely] what the ‘something’ is.” The whole process harms many students as the social rewards for some are not available to others who “don’t make the grade (sic)”. Should American public education have the function of sorting and separating students so that some may receive greater benefits than others, especially considering that the sorting and separating devices, educational standards and standardized testing, are so flawed not only in concept but in execution?

One final note with Wilson channeling Foucault and his concept of subjectivization:

In other words students “internalize” what those “marks” (grades/test scores) mean, and since the vast majority of the students have not developed the mental skills to counteract what the “authorities” say, they accept as “natural and normal” that “story/description” of them. Although paradoxical in a sense, the “I’m an “A” student” is almost as harmful as “I’m an ‘F’ student” in hindering students becoming independent, critical and free thinkers. And having independent, critical and free thinkers is a threat to the current socio-economic structure of society.

• July 10, 2013 at 2:27 pm

Yes — to all this. Assessments tell us more about the assessors than the assessees!

From earlier: “The test is being designed to produce a distribution of scores…” This is important, because nominal variables are not like height, weight, etc., although the Galtonian eugenists have been successful in convincing everyone that “tests” measure just like a ruler or a weighing scale. They made it real, but only in the sense of the Thomas Theorem, “If men [and women, presumably] define situations as real, they are real in their consequences.” 1928.

• July 10, 2013 at 3:43 pm

Higby,

Where might I find your quote?

Thanks,
Duane
