Two days ago I posted an item, Questioning Student Evaluations, which reported on two studies that call into doubt the utility of student evaluations of teaching as a mechanism for determining teacher effectiveness. In response, AAUP President Rudy Fichtenbaum, an economist, shared the following paper, adapted from an FAQ he prepared years ago for faculty at Wright State University explaining the use of numerical evaluation in promotion and tenure documents.
Explaining Why Numerical Scores on Student Evaluations of Teaching Are Inherently Flawed
By Rudy Fichtenbaum
Why doesn’t it make sense to compare average scores on teaching evaluations? To understand why, let’s start with the four types of measurement scales.
The first is a nominal scale, in which data are simply categorized by assigning them a number. The term nominal implies that we are just giving names to different categories. For example, women are 1 and men are 2, or we define four regions of the country: 1 is Northeast, 2 is Midwest, 3 is Mountain, and 4 is West. In both of these cases we could easily reassign the numbers, e.g., women are 2 and men are 1. Taking the average of things that are categorized on a nominal scale is meaningless.
The second type of scale is an ordinal scale. Ordinal scales allow for rankings, but the differences between numbers carry no meaning. A Likert scale measuring satisfaction on a scale of 1 to 5 is an ordinal scale. When a student says they strongly agree we assign a 5, when a student says they agree we assign a 4, and when a student is neutral we assign a 3. The difference between a 5 and a 4 is not necessarily the same as the difference between a 4 and a 3. The numbers merely reflect an ordering, not an intensity of preference. Just as it would not make sense to take the average of males and females, it does not make sense to take the average of strongly agree, agree, neutral, disagree, and strongly disagree.
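One way to see the problem is to relabel the categories with a different set of increasing numbers. The short Python sketch below does that. The two instructors, their responses, and the alternative coding are all invented for illustration and do not come from any real evaluation data.

```python
from statistics import mean, median

# Hypothetical responses on a 5-point agreement scale (illustrative only).
instructor_a = [4, 4, 4, 4, 4]
instructor_b = [5, 5, 5, 2, 2]

# An alternative coding that keeps the categories in the same order.
# Nothing about an ordinal scale requires "strongly agree" to be 5 rather than 10.
recode = {1: 1, 2: 2, 3: 3, 4: 4, 5: 10}

def recoded(responses):
    """Apply the alternative, order-preserving coding to a list of responses."""
    return [recode[r] for r in responses]

# Means under the usual 1-5 coding, then under the alternative coding.
print("means (1-5 coding):   ", mean(instructor_a), mean(instructor_b))
print("means (alt. coding):  ", mean(recoded(instructor_a)), mean(recoded(instructor_b)))

# Medians under both codings.
print("medians (1-5 coding): ", median(instructor_a), median(instructor_b))
print("medians (alt. coding):", median(recoded(instructor_a)), median(recoded(instructor_b)))
```

Under the usual coding, instructor A has the higher mean (4 versus 3.8). Under the alternative coding, which ranks every category in exactly the same order, instructor B has the higher mean (6.8 versus 4). The median, by contrast, lands on the same category for each instructor under both codings, because it depends only on the ordering.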
The third type of scale is an interval scale. An interval scale means that data is ordered and has a constant scale but no natural zero. An example is temperature. 30° is hotter than 20°, which in turn is hotter than 10°. The difference between 30° and 20° is the same as the difference between 20° and 10°. But 20° is not twice as hot as 10°, because the zero point of the scale is arbitrary. Another example of interval data is time of day. 3pm is later than 2pm, which in turn is later than 1pm. The difference between 1pm and 2pm is 1 hour, as is the difference between 2pm and 3pm. But it does not make sense to say that 3pm is 3 times 1pm.
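The temperature example can be checked by changing units. The sketch below assumes the temperatures above are in degrees Celsius (the text does not say which unit is meant) and converts them to Fahrenheit. Equal differences stay equal after the conversion, but the ratio between two temperatures changes, which is why "twice as hot" has no fixed meaning on an interval scale.

```python
def celsius_to_fahrenheit(c):
    """Standard Celsius-to-Fahrenheit conversion."""
    return c * 9 / 5 + 32

# Assumption: the 10, 20, and 30 degree readings are in Celsius.
temps_c = [10, 20, 30]
temps_f = [celsius_to_fahrenheit(c) for c in temps_c]

# Differences between neighboring readings in each unit.
gaps_c = [temps_c[1] - temps_c[0], temps_c[2] - temps_c[1]]
gaps_f = [temps_f[1] - temps_f[0], temps_f[2] - temps_f[1]]

# Ratio of the 20 degree reading to the 10 degree reading in each unit.
ratio_c = temps_c[1] / temps_c[0]
ratio_f = temps_f[1] / temps_f[0]

print("Fahrenheit readings:", temps_f)
print("gaps in Celsius:    ", gaps_c)
print("gaps in Fahrenheit: ", gaps_f)
print("ratio in Celsius:   ", ratio_c)
print("ratio in Fahrenheit:", ratio_f)
```

The two gaps are equal to each other in both units, so differences are meaningful. The ratio is 2 in Celsius but about 1.36 in Fahrenheit, so it is an artifact of where each scale puts its zero.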
The fourth type of scale is a ratio scale. A ratio scale is ordered, has a constant scale, and has a natural zero. Age is measured on a ratio scale. 40 is older than 30, which in turn is older than 20. The difference between being 40 and 30 is the same as the difference between 30 and 20. Finally, someone who is 40 years old is twice as old as someone who is 20 years old. Income is also measured on a ratio scale.
In order for an average to be meaningful, data must be measurable on an interval scale or a ratio scale. Hence it makes sense to talk about average temperature, e.g., global warming, and it may make sense to talk about average income, e.g., workers at GM earn an average of $35 per hour while workers at Honda earn an average of $33 per hour. However, in some cases you can take an average but it may not be particularly meaningful, e.g., if your neighbor has $2 million and you have $0, then on average you are both millionaires.
Returning to student evaluations, it is simply invalid to conclude that someone who has a 4 on his or her teaching evaluation is twice as good a teacher as someone who has a 2. Therefore, trying to use average scores on teaching evaluations to assign merit increases is totally without merit (pun intended). This measurement problem is similar to knowing the order in which people finish a race but not knowing their exact times. The person who comes in second may be only a fraction of a second behind the person who finishes first, but the third-place finisher may be five minutes behind the second-place finisher. The rankings are 1, 2, and 3, but just knowing the rankings does not allow us to make a judgment about whether the race was close.
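The race analogy can be made concrete with a few lines of Python. The finishing times below are made up for illustration: in the first race the runner-up is half a second back and third place is five minutes behind that, while in the second race the runners are evenly spaced. The helper `ranks` is a hypothetical name introduced only for this sketch.

```python
def ranks(times):
    """Return each runner's finishing place (1 = fastest), in input order."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    places = [0] * len(times)
    for place, runner in enumerate(order, start=1):
        places[runner] = place
    return places

# Hypothetical finishing times in seconds (illustrative only).
close_then_far = [600.0, 600.5, 900.5]
evenly_spaced = [600.0, 700.0, 800.0]

print(ranks(close_then_far))
print(ranks(evenly_spaced))
```

Both races produce the same rankings of 1, 2, and 3, even though the gaps between runners are completely different. Once the times are reduced to ranks, that information cannot be recovered.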