Evaluating professor evaluations

September 24, 2012 Cathy O'Neil, mathbabe

I recently read this New York Times “Room for Debate” on professor evaluations. Some reasonably good points were made: students tend to give better ratings to attractive professors and to easy graders, and they generally care more about the short term than the long term.

For these reasons, it was suggested, it would be better and more informative to have anonymous evaluations, or to have students come back after some time to give evaluations, or other interesting ideas like that.

Then there was a crazy crazy man named Jeff Sandefer, co-founder and master teacher at the Acton School of Business in Austin, Texas. He likes to call his students “customers” and here’s how he deals with evaluations:

Acton, the business school that I co-founded, is designed and is led exclusively by successful chief executives. We focus intently on customer feedback. Every week our students rank each course and professor, and the results are made public for all to see. We separate the emotional venting from constructive criticism in the evaluations, and make frequent changes in the program in real time.

We also tie teacher bonuses to the student evaluations and each professor signs an individual learning covenant with each student. We have eliminated grade inflation by using a forced curve for student grades, and students receive their grades before evaluating professors. Not only do we not offer tenure, but our lowest rated teachers are not invited to return.

First of all, I’m not crazy about the idea of weekly rankings and public shaming going on here. And how do you separate emotional venting from constructive criticism anyway? Isn’t the customer always right? Overall the experience of the teachers doesn’t sound good – if I have a choice as a teacher, I teach elsewhere, unless the pay and the students are stellar.

On the other hand, I think it’s interesting that they have a curve for student grades. This does prevent the extra good evaluations coming straight from grade inflation (I’ve seen it, it does happen).

Here’s one thing I didn’t see discussed: the students themselves, and how much they want to be in the class. When I taught first semester calculus at Barnard twice in consecutive semesters, my experience was vastly different in the two classes.

The first time I taught, in the Fall, my students were mostly straight out of high school, bright eyed and bushy tailed, and were happy to be there, and I still keep in touch with some of them. It was a great class, and we all loved each other by the end of it. I got crazy good reviews.

By contrast, the second time I taught the class, which was the next semester, my students were annoyed, bored, and whiny. I had too many students in the class, partly because my reviews were so good. So the class was different on that score, but I don’t think that mattered so much to my teaching.

My theory, which was backed up by all the experienced Profs in the math department, was that I had the students who were avoiding calculus for some reason. And when I thought about it, they weren’t straight out of high school, they were all over the map. They generally were there only because they needed some kind of calculus to fulfill a requirement for their major.

Unsurprisingly, I got mediocre reviews, with some really pretty nasty ones. The nastiest ones, I noticed, all had some giveaway that the writer had a bad attitude, something like, “Cathy never explains anything clearly, and I hate calculus.” My conclusion is that I get great evaluations from students who want to learn calculus and nasty evaluations from students who resent me asking them to really learn calculus.

What should we do about prof evaluations?

The problem with using evaluations to measure professor effectiveness is that you might be a prof that only has ever taught calculus in the Spring, and then you’d be wrongfully punished. That’s where we are now, and people know it, so instead of using them they just mostly ignore them. Of course, the problem with not ever using these evaluations is that they might actually contain good information that you could use to get better at teaching.

We have a lot of data collected on teacher evaluations, so I figure we should be analyzing it to see if there really is a useful signal or not. And we should use domain expertise from experienced professors to see if there are any other effects besides the “Fall/Spring attitude towards math” effect to keep in mind.

It’s obviously idiosyncratic depending on field and even which class it is, e.g. Calc II versus Calc III. If there even is a signal after you extract the various effects and the “attractiveness” effect, I expect it to be very noisy and so I’d hate to see someone’s entire career depend on evaluations, unless there was something really outrageous going on.

In any case it would be fun to do that analysis.
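
A minimal sketch of how that analysis might start, assuming the data can be boiled down to one average score per professor, course, and term. The records, the 1-5 scale, and the function name below are all invented for illustration. The idea is simply to subtract the average for each (course, term) group, so a Spring Calc I score is compared only against other Spring Calc I scores.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (professor, course, term, mean evaluation score, 1-5 scale).
# These numbers are made up for illustration only.
records = [
    ("A", "Calc I", "Fall", 4.6),
    ("B", "Calc I", "Fall", 4.2),
    ("A", "Calc I", "Spring", 3.4),
    ("C", "Calc I", "Spring", 3.0),
]

def adjust_for_course_and_term(rows):
    """Subtract each (course, term) group's average from every score in that group."""
    groups = defaultdict(list)
    for _, course, term, score in rows:
        groups[(course, term)].append(score)
    baselines = {key: mean(scores) for key, scores in groups.items()}
    return [
        (prof, course, term, round(score - baselines[(course, term)], 2))
        for prof, course, term, score in rows
    ]

adjusted = adjust_for_course_and_term(records)
for row in adjusted:
    print(row)
```

In this toy data, professor A's raw score drops from 4.6 in the Fall to 3.4 in the Spring, but the adjusted score is the same in both terms. Whether real evaluations behave this cleanly is exactly what the analysis would have to find out.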

Categories: data science, math, math education, statistics

Comments (19)

Susama Agarwala

September 24, 2012 at 8:43 am

“We also tie teacher bonuses to the student evaluations ”

A friend of mine went to grad school where good TAs, as determined by evaluations, were rewarded by getting a coveted class. I forget the details, but I think you could get this position once or twice in your 4-5 years as a TA. There seemed to be 2 ways in which to be a “good TA.” One was to be a genuinely good teacher, which comes from years of experience and hard work; the other was to give away answers during sections. You see where this leads….

At my grad school, I was told by the person in charge of TA assignments that the grad students assigned to the more difficult teaching loads were those that were perceived to be better TAs, not by the students, but by the professors in charge of the class. I’m not quite sure why this didn’t have an immediate effect on the quality of attention I gave my students and the professors I was TAing for. Perhaps it can be chalked up to the unmeasurable value I put on getting a good letter from said person.

JSE

September 24, 2012 at 8:58 am

“That’s where we are now, and people know it, so instead of using them they just mostly ignore them. Of course, the problem with not ever using these evaluations is that they might actually contain good information that you could use to get better at teaching.”

It’s not all or nothing, and I don’t think people ignore them. I find useful information that aids my teaching in my student evaluations every single semester. They’re kind of like Yelp reviews; too noisy to use at scale for anything with real stakes, but if you actually read them it’s pretty easy to separate out the useful criticism from the “I hate math and I hate you because you made me do math.”

- Cathy O'Neil, mathbabe
  
  September 24, 2012 at 10:21 am
  
  I believe you, Jordan, but on the other hand you’re not typical. I’m tempted to write another post about how people who should read their evaluations tend not to, and people who shouldn’t tend to take them too hard; there are not too many who have the solid ego to happily filter away the abuse, combined with the earnest desire to improve their teaching through the few comments that warrant attention.
  
  In other words, I’d love to see this stuff cleaned up into a form that’s useful to more people. Or what might even be easier and more useful in the short term is if you’re handed the unfiltered evaluations along with statistics on what other professors got for that class for that semester etc. so you’d know that 75% of people hating it is normal.
  
  Cathy
  
  - JSE
    
    September 24, 2012 at 11:00 am
    
    We do get the average eval scores for the course here at Wisconsin, is that not usual?
    
    - Nathan Dunfield
      
      September 24, 2012 at 8:46 pm
      
      Jordan, here at Illinois we only get comparison eval scores on the level of department and university broken down into 3 broad kinds of classes (required/elective/mixed). In particular, you get the same comps if you teach an honors upper-division course or Calculus I.
      
  - Cynicism
    
    September 24, 2012 at 6:42 pm
    
    I do not envy the task of the person who has to clean up that particular batch of data. Case study: what does the word “Confusing” mean on an evaluation?
    
    Does it mean, “I hate math and I hate you?”
    
    Does it mean, “You have an accent/speech impediment and I feel this made the class harder than it otherwise would have been?”
    
    Does it mean, “Your boardwork/notation sucks?”
    
    Does it mean, “You need to stick closer to the book so it’s easier for us to study?”
    
    Because in each case a student may decide to simply write “The professor was confusing.”
    
    Worse still, I had a chance to look at the aggregate scores of numerical questions in a large course in a relatively large department recently. You know, the scores that departments ask for in job applications? There was little correlation between any one factor (faculty rank, time of day, etc.) and high numerical scores on any of the questions. There was however a very high correlation between high scores on one question and high scores on all questions. This is to say that the numerical scores are very much a thumbs-up/thumbs-down vote on whether the class liked the professor overall.
    
Mike Maltz

September 24, 2012 at 11:51 am

I taught a required course (statistics) in a criminal justice curriculum, and my evaluations ranged from average to fantastic. Looking back, I realize that one of the best indicators of my evaluations was whether there were one or two spark plugs in the class, kids who may not have had much experience with numbers, who may not have taken algebra, but who were interested in getting statistical concepts straight. They energized the class (and me), making it a pleasure to teach. Other classes made me feel like Sisyphus.

Michelle

September 24, 2012 at 1:20 pm

Jeff Sandefer said “… our lowest rated teachers are not invited to return.”

No matter how good the faculty is, some teachers will be rated lowest. That’s just how linear rankings work. It says nothing about where they are in relation to some absolute scale of good teaching (whatever the hell that means). Nor does it tell you where the “best rated teachers” are in relation to that scale. It just says who’s better than who (maybe) among the small cast of characters currently on faculty.

This kind of stuff drives me nuts. Someone is always going to be in last place, no matter how good we all are. Why is that a useful piece of information? Last place at the Olympics is still pretty impressive, and first place at a neighborhood bake-off not so much.
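
A toy illustration of the point, assuming (purely for the sake of argument) that teaching quality could be put on an absolute 0-100 scale. The two faculties and their scores are invented.

```python
# Hypothetical teaching-quality scores on an absolute 0-100 scale (invented numbers).
strong_faculty = {"P1": 95, "P2": 92, "P3": 90}
weak_faculty = {"Q1": 60, "Q2": 55, "Q3": 40}

def lowest_rated(scores):
    """Return the name with the lowest score; any ranking always produces one."""
    return min(scores, key=scores.get)

def highest_rated(scores):
    """Return the name with the highest score."""
    return max(scores, key=scores.get)

worst_of_strong = lowest_rated(strong_faculty)
best_of_weak = highest_rated(weak_faculty)

# The "lowest rated" teacher at the strong school still outscores
# the "best rated" teacher at the weak one.
print(worst_of_strong, strong_faculty[worst_of_strong])
print(best_of_weak, weak_faculty[best_of_weak])
```

A rule like “lowest rated teachers are not invited to return” only ever sees the ordering within one faculty, never the numbers themselves.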

- Michelle
  
  September 24, 2012 at 2:00 pm
  
  The forced bell curve on students is a crap idea for similar reasons…
  
  Think about your two calculus classes. Surely if you had been forced to put both of those on a bell curve, some C or C+ students in the fall would have been A students if they had only waited a semester to take the course.
  
  My exam scores more often fall on a reverse bell curve than an actual bell curve. Students either get it & do really well or don’t & do really poorly. I have very few “in the middle” for most of my classes. How do you force that onto a bell curve?
  
  But even more than that, I don’t like for my students to be in competition with each other or even to have that perception. Their job is to learn, and to help each other learn. I tell them (truthfully) that I will happily give everyone an A if they all earn an A.
  
  And bringing it back to the evaluations, I’ve got one that says, “She doesn’t want anyone to fail.” Isn’t that a good thing in a teacher? Not according to Sandefer, who would much prefer “she wants exactly 10% of the class to fail” or whatever.
  
  - Michelle
    
    September 24, 2012 at 2:26 pm
    
    I realized I should add… this is not the same as “everyone does get an A” and my evaluation does not say “no one fails.” I am actually known to be a pretty tough grader. “Tuff but fare” according to one of my student evaluations. Ugh. So glad I don’t teach English.
    
    Anyway, A’s are rare in my class, and F’s are rare but do happen each semester. But my students still know that I *want* them to succeed, and that in theory everyone *can* get an A. There’s not some arbitrary rule holding them back… it’s up to them and how well they do in the class. It has nothing to do with whether the students on either side of them do “better” by a few points.
    
jim

September 24, 2012 at 2:07 pm

The difficulty would be getting hold of a good dataset. Can you anonymize the evaluations of upper-level classes and still have enough information about each class to sort out content effects?

Greg Taylor

September 24, 2012 at 4:02 pm

Analyzing student teaching evaluations is difficult – at best you might be able to identify some who are more likely to be bad teachers than others. There are many factors that influence student motivation. Sections offered at times of the day preferred by good students tend to be much more highly motivated than others. Traditional day students forced into night sections or early morning sections because they are less motivated and registered late will affect evaluations.

In my mind, the biggest problem evaluating teaching is defining and measuring what a teacher should be doing. Student learning outcomes? Nurturing? Attracting qualified students into the major? Keeping students happy? Long-term impact on lives? Tough love? Once you develop a measure and rate professors by that measure, you destroy the diversity that allows for a variety of needs to be met.

Everyone can’t play point guard on the basketball team. A player’s contribution can’t be reduced to the points scored. Look at the role a professor plays on the team and whether the team is meeting student needs.

Without some way to measure good teaching, there’s not much hope for analyzing student evaluations.

Exposing students to a wide variety of teaching philosophies and techniques is one way a school can make an impact on a variety of student needs. A student who needs a nurturer or tough-love might get it from someone less focused on learning outcomes or preparing students for future coursework in the major.

Efforts to rank professors/teachers by learning outcomes or evaluations or some other criteria will inevitably homogenize the faculty leaving many student needs unmet. The Jack Welch / GE employee retention policy used by Acton drives fear and conformity into the organization – the opposite of what you’d like from an educational institution.

tagoutit

September 24, 2012 at 4:29 pm

The idea of a forced curve seems particularly constricting. In effect it also tells students: I can only give out 1 or 2 A’s, so I can’t give you one, even though you deserve it. You discourage students from trying too hard. They will ascertain their standing and invest just that amount of effort; why bother otherwise? It is a false setup. How would you like it if the administration told you, ‘Well, we give vacation days only to the top-performing professors, sorry’? Or tie it to the teachers’ salary, which is as close to their hearts as a grade is to a student. Do you think it is fair for a University to say ‘we give salaries according to a forced curve’? That is nonsense, an arbitrary and thoughtless rule for lazy staff. It is not the way the rest of the world works, and it penalizes your best students.

Nathan Dunfield

September 24, 2012 at 9:09 pm

There’s definitely signal in student evals, though whether it’s really “quality teaching” is a different discussion. While I’ve had experiences like you where I repeated the same course with wildly different results, this is (on average) quite rare. The education folks have done big studies definitively showing that student evaluations for the same professor in the same course (including Fall vs. Spring) are *very* highly correlated. Unlike asking colleagues to evaluate the teaching skills of their fellow instructors (which varies wildly depending on who is doing the evaluation), student evaluations are highly repeatable as a system of measurement.
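
As a rough sketch of what “highly repeatable” would mean in numbers: take each instructor's average score for the same course in two different semesters and compute the Pearson correlation between the two lists. The scores below are invented, not taken from any study, and `pearson` is just a hand-rolled helper.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical average scores for five instructors teaching the same course twice.
fall = [4.5, 3.2, 4.0, 2.8, 3.6]
spring = [4.3, 3.0, 4.1, 2.9, 3.3]

r = pearson(fall, spring)
print(round(r, 2))
```

A value of r close to 1 says the two semesters order the instructors in nearly the same way. It says nothing about whether that ordering reflects quality teaching.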

In the context of math departments, some time ago I did a quick analysis of about 500 courses taught by about 50 instructors, and once you grouped courses by type (large vs small, grad vs undergrad) and normalized scores by type averages, the scores for individual instructors were remarkably uniform…

Manoel Galdino

September 25, 2012 at 12:42 pm

Maybe it’s worthwhile to look at what Andrew Gelman wrote at his blog about grade inflation.
http://andrewgelman.com/2012/09/grade-inflation-why-werent-the-instructors-all-giving-all-as-already/

Zathras

September 25, 2012 at 2:20 pm

I did an analysis of teaching evaluations for my Big 10 school for those professors (in math) that had a minimum of 12 course-semesters over a 4-year period. The end result was not at all noisy. There were 3 distinct populations of professors: one group that overall got very good evaluations, one group whose evaluations were consistently poor, and a third group whose evaluations were all over, but the mean of which fell in between the two other groups. The distribution among the 3 groups was about 15% good, 60% mixed, and 25% bad.
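
A minimal sketch of how one might sort professors into three such groups, using the mean and spread of each professor's scores. This is not the method used in the analysis described above; the thresholds, the 1-5 scale, and the sample scores are all assumptions made up for illustration.

```python
from statistics import mean, pstdev

def classify(scores, good=4.0, poor=3.0, max_spread=0.4):
    """Label a professor from a list of per-course average scores.

    The thresholds are arbitrary placeholders. Anything that is not steadily
    high or steadily low, including steady middling scores, lands in "mixed".
    """
    avg, spread = mean(scores), pstdev(scores)
    if spread <= max_spread and avg >= good:
        return "consistently good"
    if spread <= max_spread and avg <= poor:
        return "consistently poor"
    return "mixed"

# Hypothetical per-course averages (far fewer than 12 course-semesters, for brevity).
professors = {
    "X": [4.5, 4.4, 4.6, 4.3],
    "Y": [2.5, 2.7, 2.4, 2.6],
    "Z": [4.6, 2.4, 3.9, 2.9],
}

labels = {name: classify(scores) for name, scores in professors.items()}
print(labels)
```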

Some courses intrinsically had lower reviews than others. Senior level undergrad analysis in particular had consistently terrible reviews, below the mean of everyone who taught it. The fall/spring Calc I dichotomy was on full display, with Calc II having it in the opposite direction.

jmm

September 25, 2012 at 4:28 pm

There’s a nice paper about the Air Force Academy data set here:

Click to access profqual2.pdf

Dan L

September 27, 2012 at 7:44 pm

Forced curves are probably the only reliable way to stop grade inflation, but I think it’s a cure that’s worse than the disease. Not only are there the problems described by others above, but students would choose courses based on how smart (or not smart) their classmates are likely to be. It would be a race to the bottom for course selection. No one would want to take difficult courses, because those classes are usually filled with good students. Forced curves work best for large required courses and pretty much nowhere else. However, there probably exists a good system of flexibly enforced curving.

Tangent: This Acton place sounds fishy. Supposedly, all of the teachers are successful businessmen. If that’s true, then presumably they aren’t teaching there for the money. But these teachers are supposed to be motivated by bonuses? If they’re such great businessmen, I would think that building businesses would be a more efficient way for them to earn money than honing their teaching craft.

isomorphismes

November 16, 2012 at 12:38 pm

You can also just ask different questions. Think about the difference between five-star rating “good/bad” versus a written answer. Are people going to spend paragraphs making up a fallacious story to explain what their ugly professor “does wrong”? I don’t think so.

You could also evaluate professors more than once during the semester, and in subsequent evaluations ask if they’d improved on X that you complained about before.
