At the end of each semester, college students have a chance to give feedback to their professors in the form of student evaluations, essentially “rating” them. On the whole, professors are not comfortable with this, which has led to a great deal of research on the topic. Critics of student evaluations point out that they do not measure the actual teaching effectiveness of a professor, only perceived teaching effectiveness. They also readily note that the correlation between student evaluations and future student performance is weak. There are other fundamental concerns: Arguably, student evaluations come at the worst possible time, at the end of the semester, when it is too late for the professor to change anything about the course and usually too early for the students to fully appreciate the utility of what they learned in the class. Finally, there are fundamental conflicts of interest involved: professors who grade more harshly tend to receive lower ratings, students have a clear interest in their grades, and professors depend on good ratings for promotion, tenure, and so on. We’re both researchers who have taught college students, so we’re quite familiar with both the process and the response from our colleagues.
The best way to implement student evaluations, and how the process can be improved, is debatable. But few people dispute that, given the ever-increasing cost of higher education, which effectively turns students into customers, students should have a voice to point out when teaching standards are not met. In an era when teaching quality matters less and less for the professional standing of professors relative to grants, publications, and administrative service, it might be inconsiderate if not reckless to eliminate evaluations altogether.
This consensus has increasingly come under attack on a new front: by those who claim that student evaluations are fundamentally flawed because they are inherently biased by sexism and, as such, should be banned. We do not share this perspective, mostly because of methodological concerns about the research undergirding that claim. We understand that concerns about sexism pervade society and are difficult to assess in general, but we do not believe the existing literature on this topic provides a sufficient evidentiary basis for such claims. The available research is complicated and varies in quality as well as in findings: some studies report a bias against women, but some note that female instructors receive slightly better ratings.
For instance, a recent study purported to show that if students believe their instructor is female, regardless of the instructor’s actual gender, they rate the instructor statistically significantly lower in terms of fairness, praise, promptness, and overall rating. But this study was severely underpowered, with a sample size of 20 per group. Moreover, the authors deployed many statistical tests on this sample, which has to be taken into account in the analysis, specifically by lowering the criterion for statistical significance. This was not done here; had it been, the results would no longer be significant, which suggests that what the authors report are likely false positives, not real effects.
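To see why this matters, here is a minimal simulation of what happens when many tests are run on small samples drawn from the same underlying distribution, so that no true effect exists. The number of tests, the group size, and the rating parameters are illustrative assumptions, not the study’s actual design:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_tests, n_per_group = 12, 20        # assumed numbers, for illustration only
alpha = 0.05
alpha_bonferroni = alpha / n_tests   # Bonferroni-corrected threshold

# Draw both "genders" from the SAME distribution: any significant result
# below is a false positive by construction.
p_values = []
for _ in range(n_tests):
    a = rng.normal(loc=4.0, scale=0.8, size=n_per_group)
    b = rng.normal(loc=4.0, scale=0.8, size=n_per_group)
    p_values.append(stats.ttest_ind(a, b).pvalue)

print(sum(p < alpha for p in p_values),
      "of", n_tests, "tests cross 0.05 uncorrected")
print(sum(p < alpha_bonferroni for p in p_values),
      "cross the Bonferroni-corrected threshold")
```

By construction there is no real effect here, so every test that clears the uncorrected threshold is a false positive; run enough tests and such results become the expectation rather than the exception, which is exactly what the corrected threshold guards against.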
In addition, the results of that study were based on an online course. How relevant could any attribute of the professor have been, given that interactions in online courses are minimal? In other words, we are concerned that such an online course, with its different student-professor interactions, is not representative of the in-person courses that still make up the bulk of college teaching. This is particularly problematic because it has been shown that people rely on stereotypes more when there is more uncertainty. If there is less uncertainty, as in an in-class setting, the effects of stereotypes could be expected to be much diminished. Indeed, there is usually no significant gender-based difference in student evaluations in in-class (not online) settings.
Most recently, another paper made a radical case against student evaluations on the grounds that they are gender-biased. One of the authors expanded on this case in Slate, arguing that because evaluations are biased against women, using them to assess professors is a violation of anti-discrimination laws and should not be done. Again, we do not believe that the available empirical evidence warrants such drastic policy suggestions. In the study conducted by the Slate writer and her co-authors, they compared student evaluations of a single pair of instructors, one male and one female, in an online course and found that the male professor received significantly higher evaluations than the female one. But we’re afraid nothing can be definitively concluded from a comparison based on a sample size of one pair. This is literally an anecdote: presumably there are a near-infinite number of differences between the female and the male instructor, gender being only one of many possible bases for comparison. How do the authors know that it was gender the students focused on and not, say, teaching style? People are very different to begin with, so any experiment in psychology needs a large sample size to make sure that the signal one is looking for (in this case, gender differences) is not drowned out by what is pejoratively called noise. (People are complicated, so “pre-existing variability” might be the more appropriate term.)
However, we realize that we won’t convince anyone who doesn’t already agree with us on the basis of these arguments alone, so we gathered data from RateMyProfessors.com, a popular website where students can rate their professors. Specifically, we scraped the reviews from 1 million faculty profiles and classified the professors as male or female wherever an assignment could be made unambiguously. We used these reviews as a stand-in for teaching evaluations because evaluations are not publicly available, but RMP profiles are. This is an acceptable proxy because the correlation between RMP ratings and official evaluations is surprisingly, and sufficiently, high.
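For the curious, the classification logic can be sketched in a few lines. This is a simplified illustration of one way to make an unambiguous assignment from review text, based on pronoun cues; the exact rules and data format here are illustrative, not a verbatim description of our pipeline:

```python
import re

# Simplified sketch: assign gender from pronoun usage in a profile's reviews.
MALE = re.compile(r"\b(?:he|him|his)\b", re.IGNORECASE)
FEMALE = re.compile(r"\b(?:she|her|hers)\b", re.IGNORECASE)

def classify_gender(reviews):
    """Return 'male' or 'female' only when every pronoun cue agrees;
    ambiguous profiles return None and are excluded from the analysis."""
    text = " ".join(reviews)
    m = len(MALE.findall(text))
    f = len(FEMALE.findall(text))
    if m > 0 and f == 0:
        return "male"
    if f > 0 and m == 0:
        return "female"
    return None

print(classify_gender(["He explains everything clearly.",
                       "His exams are tough but fair."]))     # -> male
print(classify_gender(["She is great.", "He said so too."]))  # -> None
```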
If female professors suffered from lower ratings based on their gender, as has been suggested, the results of our analysis would look something like this:

Basically, both men and women can be expected to vary in their teaching ability and in the ratings they consequently receive from students, but the entire male distribution would be shifted to the right, either because gender stereotypes confer an unearned ratings boost on men by virtue of being male, or because female ratings are artificially lowered relative to male ones, due to sexism. The larger this effect, the more the average ratings of the two groups will differ. Of course, there would be some overlap: the best female professors would still receive higher ratings than the worst male professors. This overlap is rendered in purple.
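To make this concrete, here is a short sketch that draws the hypothesized pattern. The means, spreads, and rating scale are made-up parameters chosen purely for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

# Two hypothetical rating distributions on a 1-5 scale; the male curve is
# shifted to the right by an assumed bias (all numbers are made up).
x = np.linspace(1, 5, 500)

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

female = normal_pdf(x, mu=3.4, sigma=0.6)
male = normal_pdf(x, mu=3.8, sigma=0.6)   # hypothesized rightward shift

plt.plot(x, female, label="female professors")
plt.plot(x, male, label="male professors")
# The region where the curves overlap, rendered in purple as in the figure:
plt.fill_between(x, np.minimum(female, male), color="purple", alpha=0.4,
                 label="overlap")
plt.xlabel("average student rating")
plt.ylabel("density")
plt.legend()
plt.show()
```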
Here’s what we found:

You can see three important points in this figure. First, the two distributions are mostly overlapping, so gender and stereotype effects are extremely subtle at most.

Second, there is no meaningful difference between the group means. The average of the male group is ever so slightly higher, but not by much. Here’s an analogy to convey the size of the effect we see here: The average U.S. adult takes 5,117 steps a day. If there were a performance-enhancing drug that could increase the number of steps someone takes in the same way gender affects student ratings, people taking that drug would get about 168 additional steps per day. Given that 5,117 steps is about 2.42 miles, the extra boost is about 0.08 miles, roughly 140 yards, an additional walk to the mailbox. In other words, it’s an extremely small effect. And that presumes there are no gender-related confounds, which is a big presumption. Because the gender distribution of professors has changed over the years, male professors have, on average, more experience than female ones. Professors with more experience (which in our data set corresponds to more ratings) get better ratings. If we statistically account for this, even this subtle difference disappears. So we predict that as faculty gender ratios equilibrate in the future and male and female professors gain experience at an equal rate, this subtle mean difference will go away entirely.

Finally, and most interestingly, women are overrepresented in the tails of the distribution. In other words, there are relatively more women among the professors who are rated as truly amazing and among those who are perceived as absolutely terrible, with an overrepresentation of male professors sandwiched in between.
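The arithmetic behind the step analogy is simple enough to check in a few lines; the step and mileage figures are the ones quoted above:

```python
# Converting the gender effect, expressed in extra steps, into distance.
steps_per_day = 5117      # average U.S. adult
miles_per_day = 2.42      # the same distance in miles
extra_steps = 168         # the gender effect, expressed in steps

steps_per_mile = steps_per_day / miles_per_day   # ~2,115 steps per mile
extra_miles = extra_steps / steps_per_mile       # ~0.08 miles
extra_yards = extra_miles * 1760                 # ~140 yards

print(f"{extra_miles:.2f} miles, about {extra_yards:.0f} yards")
```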
We want to emphasize how surprised and shocked we were when we saw this overrepresentation of women in both tails. That’s because in virtually every other domain anyone has looked at (income, general life outcomes such as holding a powerful position vs. being homeless or incarcerated, and even number of offspring), it is men who are overrepresented in the tails of the distribution, with women overrepresented in the middle. These discrepancies are often attributed to men’s increased propensity to take risks. This makes intuitive sense: Risk-taking amounts to gambling, and in an uncertain world, gambles can be lost. Some risk-takers will come out on top, whereas others will lose and end up at the bottom of the distribution (or possibly die). The theory of why men take more risks on average is undergirded by evolutionary reasoning: No social group can afford to lose most of its women, whereas men are largely disposable and have been treated as such throughout history. Even today, men die several years earlier than women in most countries.
Anyway, all of this is to say we were quite surprised. But back to the case at hand: student evaluations don’t deal with actual life outcomes, and few teachers take much risk in the classroom in the first place. Instead, our data reflect students’ perception of teaching effectiveness. So here is what we think might explain the effect we found, a theory we call the Divergent Interpretation Due to Expectations model: Professorships are high-status leadership positions, and because women have entered the profession in large numbers only relatively recently, it is reasonable to posit that stereotypes against them manifest as lowered expectations overall. The rest is basic psychology. If a female professor is objectively good, beating the low expectations, she will be perceived as amazing; people will think that she overcame all the gender-based obstacles put in her path and came out ahead anyway, and she will end up in the far-right tail of the distribution. If a female professor is objectively bad, underperforming even the low expectations, she will be perceived as truly awful; students might suspect that she was hired just because she is a woman, and she will end up in the far-left tail of the distribution.
“Bad” female professors could end up feeling that they are discriminated against, and that might be understandable, because our results suggest this group is experiencing an effect that might be attributable to gender, albeit an extremely small one. (These women might also realize that they are not great, but not quite as bad as all the hostility they receive suggests.) In contrast, the “amazing” women in the right tail will simply attribute their success to themselves, not to stereotypes, as suggested by research on the self-serving bias in the attribution of causality. This also meshes with the often-expressed perception of minorities that they have to work harder or be better than average in order to be perceived as good.
We want to emphasize that this model is still very preliminary. To validate it, we would need to show that a similar effect exists for other minority groups, e.g., with respect to race or other characteristics, but RateMyProfessors does not provide race-related information about professors. Another avenue to validate this model would be cross-cultural research.
Our data set had another characteristic that we decided to assess. In addition to answering basic questions about the level of difficulty and whether they’d take another class with the professor, students are invited to assess the “hotness” of a professor, with the options of “yeah” and “um, no.” It’s worth noting that this is not a required evaluation criterion.

If a professor has more positive hotness ratings than negative ones, they receive a “pepper” on the site.
Looking at the distributions in Figure 3, it is clear that having a pepper (perceived hotness) correlates strongly with positive ratings:

As this distribution shows, it is not impossible for a professor without the pepper to get a terrific rating, but it is much less likely. Moreover, professors with a pepper basically don’t get terrible ratings: 85 out of 100 professors with a rating of 4.9 have the pepper, whereas only 2 in 100 with a rating of 2.1 do. This difference is striking. Again, if it were the performance-enhancing drug from before, and if it were as effective at increasing steps as peppers are at increasing ratings, it would now add about 3,884 extra steps per day, bringing the daily average of those who take it to more than 9,000 steps. In other words, it would be far more effective than any known public health intervention aimed at increasing physical activity. That is a dramatic effect, and in our sample, it is as strong in men as it is in women.
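The same quick arithmetic as before puts this effect in perspective; the numbers are the ones quoted in the text:

```python
# The "pepper" effect, expressed in the same step currency as before.
baseline_steps = 5117
pepper_boost = 3884
print(baseline_steps + pepper_boost)   # 9001 -> "more than 9,000 steps"

# Pepper prevalence at the two rating extremes quoted above:
p_at_4_9 = 85 / 100
p_at_2_1 = 2 / 100
print(p_at_4_9 / p_at_2_1)   # a 4.9-rated professor is ~42x more likely
                             # to have a pepper than a 2.1-rated one
```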
So is the most important factor in student evaluations the professor’s physical appearance? We refuse to believe that people, even young students, are actually that shallow. As this is purely correlational, we can’t discern whether attractive professors get good ratings or whether professors who get good ratings come to be perceived as attractive. There are other possibilities, too: Maybe professors who are good at their job are more confident than others and are thus perceived as more attractive. Maybe students “award” a pepper as a reward for a job well done. We can’t distinguish between these possibilities here, but we don’t have to. We just wanted to illustrate what a strong effect looks like, and the effect of perceived physical attractiveness on student ratings is strong, whereas gender effects are not.

There is one final strong effect to consider, and that is the effect of perceived difficulty on perceived quality. As you can see in Figure 4 (here, we show only profiles with more than 50 ratings to minimize visual clutter), this effect is also strong: there is a large negative correlation, per professor, between difficulty and rating. In other words, professors who are perceived as difficult are perceived as low-quality, whereas those who are perceived as easy receive high ratings. It looks like students don’t take too kindly to professors who are too demanding (“difficult”). However, and this is what is important for the purpose of this discussion, there is absolutely no evidence that students do this differently for men and women. As you can see from the figure, women and men are completely interspersed. There are no clear gendered trends: women are not perceived as more difficult on the whole, nor are they penalized more for being difficult. The correlations are the same in both subgroups.
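For the record, the subgroup comparison behind that last claim is straightforward to compute. Here is a minimal sketch, assuming a scraped table with one row per professor; the file name and column names are illustrative placeholders, not RateMyProfessors’ actual schema:

```python
import pandas as pd

# Load the scraped profiles (hypothetical file and column names).
df = pd.read_csv("rmp_profiles.csv")

# Mirror the figure: keep only profiles with more than 50 ratings.
df = df[df["n_ratings"] > 50]

# Pearson correlation between perceived difficulty and overall rating,
# computed separately within each gender subgroup.
for gender, subgroup in df.groupby("gender"):
    r = subgroup["difficulty"].corr(subgroup["rating"])
    print(f"{gender}: r = {r:.2f} (n = {len(subgroup)})")
```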
As instructors, and instructors who teach subjects that students tend to perceive as difficult and demanding at that, we can sometimes empathize with the desire to get rid of student evaluations altogether. In our experience, students do not hold back in evaluations when they feel wronged, and often, for a student, being “wronged” means “receiving a bad grade.” This can certainly seem unjust to the instructor if that grade was well-deserved. So there is no question that the evaluation process itself can be improved. Maybe the very fact that sites like RateMyProfessors exist in the first place suggests that even students perceive the existing formal evaluation process as inadequate. But the available evidence does not suggest we should ban evaluations altogether, let alone on grounds that lack empirical support. Though maybe we should have a serious talk about the pepper.
We are also engaged in research on how individuals experience the world. If you want to help us with that, take this brief survey.