Student evaluations: very noisy data


September 4, 2014 Cathy O'Neil, mathbabe

I’ve been sent this recent New York Times article by a few people (thanks!). It’s called Grading Teachers, With Data From Class, and it’s about how standardized tests are showing themselves to be inadequate to evaluate teachers, so a Silicon Valley-backed education startup called Panorama is stepping into the mix with a data collection process focused on student evaluations.

Putting aside for now how much this is a play for collecting information about the students themselves, I have a few words to say about the signal which one gets from student evaluations. It’s noisy.

So, for example, I was a calculus teacher at Barnard, teaching students from all over the Columbia University community (so, not just women). I taught the same class two semesters in a row: first in Fall, then in Spring.

Here’s something I noticed. The students in the Fall were young (mostly first semester frosh), eager, smart, and hard-working. They loved me and gave me high marks on all categories, except of course for the few students who just hated math, who would typically give themselves away by saying “I hate math and this class is no different.”

The students in the Spring were older, less eager, probably just as smart, but less hard-working. They didn’t like me or the class. In particular, they didn’t like how I expected them to work hard and challenge themselves. The evaluations came back consistently less excited, with many more people who hated math.

I figured out that many of the students had been avoiding this class and were taking it only to satisfy a requirement. They didn’t want to be there, and it showed. The result was that, although my teaching didn’t change much between the two semesters, my evaluations changed considerably.

Was there some way I could have gotten better evaluations from that second group? Absolutely. I could have made the class easier. That class wanted calculus to be cookie-cutter, and didn’t particularly care about the underlying concepts and didn’t want to challenge themselves. The first class, by contrast, had loved those things.

My conclusion is that, once we add “get good student evaluations” to the mix of requirements for our country’s teachers, we are asking for them to conform to their students’ wishes, which aren’t always good. Many of the students in this country don’t like doing homework (in fact most!). Only some of them like to be challenged to think outside their comfort zone. We think teachers should do those things, but by asking them to get good student evaluations we might be preventing them from doing those things. A bad feedback loop would result.

I’m not saying teachers shouldn’t look at student evaluations; far from it, I always did and I found them useful and illuminating, but the data was very noisy. I’d love to see teachers be allowed to see these evaluations without there being punitive consequences.

Categories: education, feedback loop, math education, modeling, news, statistics

Comments (26)

poppopk

September 4, 2014 at 7:37 am

Your experience fits well with what I saw in 30 years of teaching. Probably a universal phenomenon.

Min

September 4, 2014 at 8:39 am

About students and homework: I took a “modern algebra” course in which the professor would come in, start filling the black boards with proofs (and erasing after himself), and dart out at the end of class. The text for the class was Artin’s “Galois Theory”, to which he never referred, and which nobody managed to get very far into. Finally we surrounded him after class at mid-term and demanded homework.

Christina

September 4, 2014 at 9:05 am

I agree that student evals are noisy for the reasons you mention. They lead to a consumerist mentality on the part of many students and, in more than one instance in my career as a student, led to embarrassing and shameless ‘grade-grubbing’ on the part of pre-tenure profs. However, some feedback mechanism for poor teaching needs to be in place. Good teachers should be recognized, as should poor teachers, in some way; otherwise that entire portion of the organization, the student learning community, gets ignored and undervalued.

- Cathy O'Neil, mathbabe
  
  September 4, 2014 at 9:52 am
  
  Be careful! You are uncomfortably close to the argument that says, “OK yes this is a bad data gathering mechanism but we need a good data-gathering mechanism so let’s take this bad one.”
  
  We need to find it in ourselves to have high standards for good data, and not just be desperate and cling to whatever we can get.
  
  In my opinion we need to ask principals for their opinion of students. If we don’t trust our principals, we need to find better principals.
  
  - cat
    
    September 4, 2014 at 10:14 am
    
    “In my opinion we need to ask principals for their opinion of students.”
    
    I hope you mean teachers, not students. I’m not sure a principal has enough contact with their students to form a useful opinion about them.
    
    - Cathy O'Neil, mathbabe
      
      September 4, 2014 at 10:14 am
      
      oops! sorry yes I meant teachers
      
wgersen

September 4, 2014 at 10:07 am

Student surveys are a necessary but not sufficient means of evaluating teachers. Technology is no substitute for face-to-face contact. Read more about how surveys can fit into teacher evaluation here: http://waynegersen.com/2014/09/04/grading-teachers/

cat

September 4, 2014 at 10:22 am

Based on what I’ve seen adjunct professors say about student evaluations, the evaluations push them toward making the class easier, since an easy class means good reviews and good reviews mean they don’t run the risk of being homeless.

It boggles my mind that anyone thinks it’s a good idea to ask the opinion of someone (a student) who has zero knowledge of the subject (teaching) they are reviewing. When I also consider that these are usually people society has deemed too irresponsible to do many dangerous things, I’m extra boggled. It seems irrational, so I have to assume they have ulterior motives or are extra dumb.

Guest2

September 4, 2014 at 11:06 am

Student evaluations (from when I taught) were decidedly bimodal and, as best I could make out, depended on initial conditions going into the courses. No surprise there. This is what you are describing.

The big question, then, is how do you compensate for initial conditions in order to effect desired outcomes. I don’t think anyone has an answer for this.

vonjd

September 4, 2014 at 11:17 am

I am a professor too and can only say that these evaluations often tell more about the evaluators than about the evaluated 😉

Arturo Magidin

September 4, 2014 at 12:18 pm

I’ve seen the same phenomenon (with pretty much the same class in similar circumstances). Over the years, I’ve found that only a handful of student evaluations are useful, and they are the ones written by thoughtful students who actually have some constructive remarks to make (example: I give weekly 5-minute quizzes on the homework the students received back, to make sure they go over it and correct mistakes; I used to give it at the end of the class and let students leave when they finished, and several students pointed out that I would often end up keeping them late because I was finishing an example and started the quiz late, so it would work better at the beginning of class when I could control the timing better. They were right, and that’s how I do it now). But when students try to “grade” the teacher, they are often doing so from a position of lack of knowledge (can they really tell if the concepts you tried to drill into them are important or not? Yet I will often get “spends too much time on things that don’t matter, such as…” followed by some major concept) or, as you point out, colored by their attitude. My university tries to take teaching evaluation seriously, but in the end they turn into “If you get good evaluations but your classes are hard and you don’t seem to have grade inflation, then you’re a good teacher; otherwise, the evaluations are just noise.” Make them too big a component of promotion, tenure, raises, etc., and the pressure to conform your class to the students’ desires becomes too big. And let’s face it: the students are in the class *because they don’t know the material*, so they are not in the best position to decide which of their desires are good or bad in the long term (and even if they were, they don’t always choose what is good in the long term…)

- Min
  
  September 4, 2014 at 4:38 pm
  
  Martin Buber said that the job of a teacher is to build a bridge to the student. The student certainly has a good idea about how well the teacher has done that.
  
  - Arturo Magidin
    
    September 4, 2014 at 5:21 pm
    
    There are many different “bridges” that can be built. An imaginary professor who takes all his students out for pizza every week and discusses pop culture and becomes the student’s friend has built a (social) bridge to the student, and the student will know that, but that does not necessarily mean the professor taught the student the material he needed to learn in that class. “Building a bridge to the student” is not the *purpose* of the teacher; it’s *the method whereby* the teacher will attempt to achieve the purpose (which is to get the student to learn what the student needs to learn).
    
    - Min
      
      September 4, 2014 at 9:12 pm
      
      Buber went on to say that the job of the student was to cross over the bridge. Obviously a merely social bridge was not what he had in mind.
      
      That is not to deny that building a bridge is a social act. At the same time, for the teacher to become the student’s friend is usually a mistake.
      
Aaron

September 4, 2014 at 12:23 pm

All of us who teach know very well how noisy student evaluations are, and that evaluations are highly correlated with grades, and that they are often as much about the likeability of the instructor as they are about high quality teaching, and that this is partly inevitable because a student often can’t really judge the effectiveness of a teacher until long after they’ve filled out their evaluation forms.

But they aren’t totally meaningless either, and I would argue that in the spring semester class where the students had a bad attitude and gave you bad evaluations, you probably actually were a worse teacher than in the fall semester when everyone loved you. Which doesn’t mean it was your fault and I certainly am not saying you should be punished for it. But it’s not total noise, is what I’m saying.

- Ursula
  
  September 4, 2014 at 1:20 pm
  
  The US Air Force Academy did a study which showed student evaluations were NEGATIVELY correlated with performance in subsequent courses: the more demanding professors got lower student evaluations, but their students did better later on.
  
  Click to access profqual2.pdf
  
  - Min
    
    September 4, 2014 at 4:39 pm
    
    Far out! Bad evaluations are good, and vice versa.
    
- Cathy O'Neil, mathbabe
  
  September 4, 2014 at 2:26 pm
  
  True, in that I could not summon up a lively and engaged debate about the subject, due to the reluctant audience. And since we didn’t have that, it was harder for me to suss out what the students were struggling with. In other words, good teaching requires building a feedback loop with the students.
  
Mike

September 4, 2014 at 1:21 pm

Noisy data, sure, but there are filter functions to pull the signal out. Normalization (across students in a semester, across all subjects a student evaluated, class-wide, against previous years, etc.) or statistics (Yah!), IFF used appropriately, can remove noise/bias.
Alternatively, you can have students note who their best and worst teachers were each semester, and I’ll bet you will see correlations that could be used as a basis for removal or promotion. That would leave the grey middle out.
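
[Editor's sketch] One of the normalizations mentioned above, "across all subjects a student evaluated," could look roughly like the following. This is only an illustration under assumptions: the function name `normalize_by_student`, the `(student, teacher, score)` tuple layout, and the sample ratings are all hypothetical and do not come from the article or from Panorama's surveys. Each score is converted to a z-score relative to that student's own ratings, so a habitually generous or harsh rater no longer shifts a teacher's average.

```python
from collections import defaultdict
from statistics import mean, pstdev


def normalize_by_student(ratings):
    """Average per-student z-scores for each teacher.

    ratings: iterable of (student, teacher, score) tuples (hypothetical layout).
    Returns a dict mapping teacher -> mean z-score.
    """
    ratings = list(ratings)

    # Collect every score each student handed out.
    by_student = defaultdict(list)
    for student, _teacher, score in ratings:
        by_student[student].append(score)

    # Re-express each score relative to that student's own mean and spread.
    by_teacher = defaultdict(list)
    for student, teacher, score in ratings:
        scores = by_student[student]
        spread = pstdev(scores)
        if spread == 0:
            # A student who gave every teacher the same score
            # provides no relative information, so skip them.
            continue
        by_teacher[teacher].append((score - mean(scores)) / spread)

    return {teacher: mean(zs) for teacher, zs in by_teacher.items()}


# Made-up data: "ann" rates generously, "bob" rates harshly.
example = [
    ("ann", "T1", 5), ("ann", "T3", 4),
    ("bob", "T2", 3), ("bob", "T3", 1),
]
normalized = normalize_by_student(example)
print(normalized)
```

In this toy data the raw averages put T1 (5.0) well ahead of T2 (3.0), but after normalization they tie, because each was simply the favorite of a different kind of rater. Note what this does not fix: if an entire section is made up of reluctant students, as in the Fall/Spring example in the post, per-student normalization only helps to the extent that those same students also rated other teachers.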

Zathras

September 4, 2014 at 1:57 pm

The Fall/Spring discrepancy for Calc I is indeed universal. When I taught Calc I in both fall and spring, there were indeed some easy population differences to spot.

Fall–1 component
(1) predominantly entering freshmen, meaning that they were able to place directly into Calc I from the placement test

Spring–2 separate components
(1) people who placed into Pre-calc for the fall
(2) people who placed into Calc I in the fall, but failed it or withdrew the first time

Once you see the population differences, the discrepancy in performance and attitude is obvious.

Eugene

September 4, 2014 at 9:45 pm

Aaron makes a good point. It looks like the real story behind your example is something like this: there are two kinds of students, motivated and unmotivated. You do really well teaching motivated students, possibly less well (compared to somebody else who might teach the class) teaching unmotivated students.

Now, can you actually build models that can tease stories like this out of the data? Not clear, but I don’t mind seeing someone try. Surveys are much more open ended than tests, so if there turns out to be important data that you’re not capturing, you can at least try to extend the framework to capture it. At the very least, it’s nice to see someone doing something “data-driven” in education that isn’t “test-driven.”

- Cathy O'Neil, mathbabe
  
  September 5, 2014 at 6:35 am
  
  I don’t like the way that was phrased, Eugene. I mean, yes, I am not as good at teaching unmotivated students who don’t want to be in class as I am at teaching motivated students who are eager to learn, but who is? Even if I grant you that some people are good at motivating students, it’s still easier to teach students who already are motivated. At best you can say some people have a smaller differential between the two groups. But even then, my point stands that evaluations are highly noisy.
  
  Also, it’s not clear to me that they are being data-driven and trying to tease out stories. It’s presented in the article like it’s obvious that teachers should aim to improve their student evaluation scores, which again I would argue leads to dumbing down the curriculum rather than better teaching. That may technically be data-driven but it’s not good.
  
  - Ursula
    
    September 5, 2014 at 7:56 am
    
    Plus, student surveys completely ignore the possibility that the best method for identifying ways teachers could teach better might be to ask the teachers.
    
  - Eugene
    
    September 5, 2014 at 9:29 am
    
    OK, better phrasing then. Let’s say, for argument’s sake, that good teaching means maximizing students’ learning. Better definitions welcome. It looks like we agree that different students have different learning capacities. Let’s oversimplify and say there are two cohorts of students, M and UnM, with high and low learning capacities. Of course students in M will learn more than students in UnM, regardless of the teacher or anything else short of a nuclear explosion. But it sounds to me like your data is saying that you’re really good, maybe better than anyone else around, at getting the students in M close to their learning capacity. Somebody else in your department may be better than you at getting the students in UnM close to their learning capacity. (I’ve met people like this, I was always better at teaching M, and I’m always impressed with people who are really good at teaching UnM.)
    
    A principal, or a department head, could act on that in different ways. One way is to decide that you should always teach M, and your colleague always teach UnM. Another is to make your colleague watch you teach M, and make you watch them teach UnM, so that the next time each of you teach the cohort you’re less naturally good at, you get more out of them. A principal might ask how many M students and UnM students there are, and whether their next hire should be an M-maximizer or an UnM-maximizer. (More ominously, they could decide to replace some M-maximizers with UnM-maximizers, or the reverse, if they thought the student distribution warranted it.) More hopefully, they might set up a long-term program, led by the UnM-maximizers, to try to turn UnM-students into M-students (though if that’s successful, the UnM-maximizers need to learn to become M-maximizers too!). In any case, if I were a principal, I’d certainly want to have the information.
    
    And I agree that the data is noisy, and that getting generically good evaluations is the wrong metric for teachers. But hopefully this is just the beginning of the story, not the end, and perhaps people starting to poke around in this space will lead to better surveys, better data, and better metrics. Or if not, then we’ll have a better idea of what the limitations of data in this space really are. Right now, we don’t seem to know much beyond “You can’t learn that much from test scores,” and I’d like to know more than that.
    
    - Cathy O'Neil, mathbabe
      
      September 5, 2014 at 9:31 am
      
      OK I love your new phrasing. It’s a matchmaking process and we haven’t really learned what attributes are important. I get it and I agree!
      
revuluri

September 8, 2014 at 3:22 pm

I’ve been interested by the comments that have been shared so far, but also somewhat perturbed at the readiness to generalize from personal experience and anecdote. Personally, I agree that “big data” methods deserve our careful, critical scrutiny — but “small data” (or “anecdata”) do too.

In any case, I think it’s interesting to look at what actual social scientists have done on using student surveys to evaluate teaching. For example, Ron Ferguson’s Tripod Project (http://tripodproject.org). More recently (and descended in part from Ferguson’s work, and others’) there is the Measures of Effective Teaching Project (http://www.metproject.org). (Disclaimer: The overall research design of MET has always seemed a bit circular to me, but that’s beside the point for this particular issue.)

To get more concrete, you might want to look at the actual prompts on the student surveys. They’re not asking kids whether they “liked the class.” They’re much more concrete, low-inference statements – which not only get around student “likes” but also make things more actionable. Take a look at them (http://www.metproject.org/get_file.php?filename=MET_Project_Secondary_Student_Survey.pdf – or if that doesn’t work, go on the resources page – you can enter a fake name and email).

It would certainly be great if every principal were excellent at recognizing teachers’ strengths and challenges, and skilled at giving effective feedback. It would be even more wonderful if they had the time to do so. But these are just wishes. This work actually has great potential to enlist far more people — not just principals, but also teachers, and students — in thinking about and working to improve teaching.
