July-August 2000

Flunking the Test: The Dismal Record of Student Evaluations

Though most schools use them, numerical evaluations of faculty members get bad grades. They aren’t accurate and they’re dumbing down undergraduate education.

Back in the 1960s, when it was gaining a foothold on campus, the evaluation of teaching was infrequent, informal, and unthreatening. Faculty members could design their own evaluations, use them voluntarily, and keep the results to themselves. The goal was to improve teaching, not to measure it for personnel decisions.

Today, of course, the situation is drastically different. Now almost 90 percent of U.S. campuses require evaluation of teaching, according to a 1993 survey by Peter Seldin, a professor of management at Pace University. The results of these evaluations are routinely, and sometimes thoughtlessly, factored into decisions about faculty retention, tenure, promotion, and merit pay. Although questionnaires and the weight given them differ from department to department and college to college, no other method of evaluating teaching comes close to matching the popularity and importance of the numerical survey. Seldin reported that half of the colleges he surveyed relied on student evaluations alone to assess "teaching" (defined, apparently, as merely classroom instruction).

Early on, faculty Cassandras warned that the bureaucratization of evaluation would have deleterious effects on classroom teaching. Forty years later, with some of these effects quite prominent, many faculty members believe that this method of "evaluating" teaching needs to be reconsidered, if not abandoned. I want to restate some of the main grounds for their concern, survey possible alternatives to the present system, and suggest some things that faculty can do to confront this powerful threat to academic standards.

Consumer Satisfaction

The numbers on bubble sheets (and sometimes other forms) are said—and widely believed—to measure the quality of classroom instruction. The problem, however, is that no scholarly consensus exists about what "good" or "effective" teaching is. As Richard Meeth, writing as the director of Change magazine’s Project on Undergraduate Teaching, explained years ago:

[T]he evaluation of teaching presumes consensus among educators about what constitutes effective teaching. But educators don’t know what makes up effective teaching; they don’t have a good research base, don’t agree on the validity of what research they do have, don’t believe the evidence that is presented in that research, and don’t act on it in a broad, systematic way throughout higher education. More than eighty centers have been developed in colleges and universities in the last decade that focus on improving teaching. All have in common a lack of consensus about what is most important to improve.

The conceptual muddle over how to define "good" teaching helps explain the enormous and seemingly unbridgeable differences of opinion about the validity, the legitimate uses, and the dangers of these forms. How can any form gauge the "effectiveness" of classroom instruction, when we have no cogent definition of what is being measured?

What numerical forms apparently measure is the degree to which students are happy or satisfied with the instructor (personality), the course (requirements), and the outcome (grade). Yet few on campus—and certainly no administrators—ever refer to these forms as "consumer satisfaction surveys," even though that is a more honest label for them.

Invalid Measures

Another problem with the current evaluation regime is that most forms used to evaluate "teaching" have not been validated; that is, they have not been proven to measure what they say they measure—effective teaching (however defined). Validating a form is very expensive and therefore rarely done. At Montana State University, for example, the most commonly used form—the same one used by my department to reward and punish me and my colleagues—has not been validated, and I would bet that the form used to measure your teaching hasn’t been either.

Is this a problem? It certainly is. When Harry T. Tagamori and Laurence A. Bishop, specialists in assessing teaching and faculty, examined a random sample of the hundreds of forms now in use, they found that more than 90 percent contained questions or items "that were ambiguous, unclear, or vague; 76 percent contained subjectively stated items, and over 90 percent contained evaluation items that did not correlate with classroom teaching behavior." Many faculty members, the researchers reported, were being evaluated by "questionable, even invalid, instruments" that likely yielded "unfair assessments of their teaching performance." They concluded that "unfairness is not acceptable at a time of controversy regarding tenure, promotion, and retention decisions." Similarly, Michael Scriven, who founded the American Evaluation Association, asserted in 1993 that "based on examination of hundreds of forms that are or have been used for personnel decisions, . . . not more than one or two could stand up in a serious hearing."

Most numerical forms are not only invalid, but also unreliable and inaccurate. Studies have shown that students are not the reliable witnesses of objective classroom behavior that campus folklore holds them to be. In a study by David V. Reynolds, a psychology professor at the University of Windsor, a thousand students in an introductory psychology course rated a movie they had not seen as better than a lecture they had not heard. Both the movie and the lecture had been scheduled but were canceled; they nonetheless remained on the end-of-term course evaluation form. Moreover, a majority of students also rated both of these phantom experiences as better than several lectures and films they actually had heard and seen.

More recently, a telling experiment by Larry Stanfel, a professor of management at the University of Alabama, found that most of the students in a sample group—when making out evaluations—contradicted what they had said earlier in the semester. All the students in the three classes Stanfel surveyed proved, by getting a perfect score on a test, that they perfectly understood the instructor’s course objectives, grading procedures, and practice of returning tests promptly. So, at the end of the semester, the students should have circled "strongly agree" on the evaluation items asking whether they understood these things.

Yet 46 percent strongly disagreed that the objectives had been made clear, 40 percent strongly disagreed that the grading procedures had been made clear, and only 3 percent "strongly agreed" that the tests were handed back promptly—even though every test was handed back at the next class meeting, the most timely response possible. This experiment, its author concluded, "established beyond any doubt that . . . the student responses to the controlled questions were so incorrect as to fail even to be remotely related to actual circumstances." And since "three-sevenths of the total result" were proven to be "patently incorrect," he argued, one can reasonably conclude that all the data on the form were just as "erroneous." It would be "demonstrably wrong," Stanfel insisted, to use these numbers to assess teaching performance.

Dumbed-Down Education

For me, the key indictment against using numerical forms to reward and punish the classroom behavior of instructors is that they encourage instructors to dumb down their teaching. The dynamic is simple and widely understood, if seldom acknowledged. College instructors, as incentive-driven as everybody else outside a Trappist monastery, are economic beings who calculate their self-interest when making decisions affecting their income. If it takes consistently high evaluation scores to get raises, tenure, promotion, and other perks, many instructors—consciously or unconsciously—will do what it takes to get those scores.

That means that they will give students what they want—and many want lighter workloads, easier tests, and higher grades. As Richard Renner, professor of education at the University of Florida, wrote in Phi Delta Kappan in 1981, "When students have input, they tend to avoid defining the good teacher as one whose assignments are difficult, whose examinations are demanding, and whose standards of performance are high." Since 1981, the pressures and demands for lower standards and lighter workloads have intensified, as journalist and educator Peter Sacks makes painfully clear in his 1996 book, Generation X Goes to College. Some students still want challenging courses and demanding instructors, but their numbers are probably shrinking relative to the number of those who are "disengaged" from academic pursuits and values.

A colleague recently told me that, after having spent an evening grading exams, he had thought, "Why should I give anybody a C when it’s entirely within my interest to give them a B?" He said he thinks this every time he gives out a grade. If even Mark Edmundson, a full professor at the University of Virginia with a six-figure salary, admitted in the September 1997 issue of Harper’s to complying with student demands for "comfortable, less challenging" classes, what sort of heroic resistance can be expected from those trying to reach $50,000 by retirement?

What’s intriguing is that almost no research has examined how the administrative use of numerical forms affects classroom standards. The results would probably embarrass everyone on campus—judging from the findings of the one study I’ve managed to locate on this issue. Back in 1980, James J. Ryan, professor of psychology at the University of Wisconsin, analyzed what happened to classroom practices after an administration imposed numerical evaluation forms on all faculty members on one campus. He found that 22 percent of the institution’s instructors admitted reducing the amount of material covered (7 percent increased it), and almost 40 percent acknowledged making courses and exams easier (9 percent said they made them tougher). Ryan reasonably concluded that the administrative use of numerical evaluations had "more adverse than positive effects on faculty instructional behavior."

Think for a moment: these percentages reflect only instructors who were aware of what they were doing, and who admitted it to themselves and to the interviewer. How many others—by using multiple-choice tests instead of essays, by choosing fewer or less difficult texts, by assigning fewer papers, by not requiring attendance, by having a lenient grading system—have done the same thing without acknowledging it even to themselves? If the rather conservative percentages from this study held true for the almost 1 million instructors in higher education, it would mean that nearly 300,000 of those who teach the future leaders of American society have consciously lowered their standards and requirements. No wonder a 1987 Carnegie Foundation study found that 67 percent of professors reported a widespread lowering of standards in American higher education.

No Help for Teachers

Given the invalidity, unreliability, and pernicious effects of the evaluation regime, it is hardly surprising that classroom instruction has not "improved" enough—despite almost thirty years of ever-greater scrutiny and pressure—to warrant the slightest relaxation of surveillance. How could it improve with a method of scrutiny that corrupts it?

What is really sad about the situation is that we have less noxious ways to make sure that college instructors do the right thing—meet their classes, intelligently design their courses, hand out syllabi, not trade grades for sex or cash, not call students names, hold office hours, and so on. Narrative evaluations, self-evaluations, peer visitation and review, and intensive focus-group interviews are more than adequate for monitoring classroom instruction.

If the goal actually were to improve instruction, several steps would do the trick: have smaller classes, periodically solicit written comments from students, offer fewer lectures and more group discussions, create a professional development program (including seminars, a resource center, and mentoring), and develop a university system that rewards classroom rigor at each stage of an instructor’s career (complete with cash awards, grants, and sabbaticals to study and improve pedagogy).

Given that we have these more benign and effective alternatives, why does such a fraudulent and perverse method of evaluating instruction have an apparent stranglehold on higher education? Because the current regime serves the interests of campus "stakeholders." Students want numerical forms because these forms give them power to pressure instructors to keep requirements, demands, and standards at "comfortable" levels. I once overheard a student advise some friends to take courses from teaching assistants because TAs have to give out lots of high grades to keep their jobs. As Arthur Levine, president of Columbia Teachers College, reported in his 1998 book, When Hope and Fear Collide, about 60 percent of students hold jobs while taking classes, with 24 percent working full time; 72 percent come to college to learn how to make money. Students with these traits don’t appreciate faculty members who make their quick progress to material well-being arduous and painful. Numerical forms allow disgruntled students to settle scores with instructors who don’t provide education lite.

Then there are the administrators—CEOs in the business of selling a valuable commodity to as many customers as the store can hold. They like numerical forms because—no matter how bogus—these forms provide a cheap way to convince taxpayers and politicians that their institution rewards classroom instruction to the second decimal. And, of course, they like these forms for the same reason students do—as a device for pressuring instructors to please, or at least not upset, the valued customers. How else to explain the fact that highly numerate administrators—many from the social and hard sciences—treat evaluation scores as meaningful when they must know they are as fraudulent as most grades? Is this a calumny? Well, how many administrators on your campus have publicly demanded that evaluation forms contain items that would reward instructors who have high standards and challenging workloads?

Why do so many instructors accept the evaluation regime? There are several reasons. Most professional educators, I believe, know little about evaluation even though it affects their careers. Their ignorance gives rise to some rather naïve and self-protective views on the subject. For example, some defend the practice because it seems "fair" that if faculty can grade students, then students should be able to "grade" faculty (a false equivalence, of course). Others accept the practice because it is so widespread, hence "legitimate" (case closed). Some have simply learned how to survive within the prevailing evaluation process, engaging in "influence tactics" (such as flattering students, throwing pizza parties before evaluations are handed out, and the like) that keep evaluation scores high enough that one’s career is not endangered.

Others are quite happy to dumb down their classrooms, because less work for students usually means less work for instructors as well. And then there are those who, with a clear conscience, do whatever it takes to get high scores because high scores functionally define "excellent" teaching. They are excellent teachers because they get high scores. Any questioning of the validity of these scores is therefore an implied attack on their self-image and reputation, and very unwelcome.

Fight to Restore Sanity

Yet some faculty perceive the pernicious effects of these forms and want to end them. The first thing to do is to challenge the taken-for-granted status of these forms and their use in administrative decisions. Concerned faculty members could form, for example, an ad hoc departmental committee that researches the issue, examines the form used in the department, and analyzes how colleagues and department administrators interpret the data. How did the forms come to be used? Have they been validated? Do colleagues who use the data to make qualitative distinctions about instructional abilities possess the expertise to do so? Once the ad hoc committee has done its homework, faculty should openly discuss the issue at department meetings and retreats.

But other faculty organizations on campus should also get involved. Faculty councils or senates can encourage the discussion of this issue at the department level and also establish their own standing committee on faculty assessment. The committee can collect data on how each department assesses teaching, monitor how evaluation forms are used, and record all controversies arising from their use. How many forms are used? How and when was the policy governing the use of these forms established? Have any of the forms been validated?

Need for Research

Each campus should tackle the problem locally, but use of numerical forms to reward and punish faculty should also be discussed at the national level. Think tanks and university research centers can study the issue and publish reports. And an update of Ryan’s study is sorely needed.

To get this item on our national educational agenda, faculty members should urge disciplinary and scholarly organizations to examine it at regional and national meetings. It is high time that the use of these forms be discussed and debated, not just passively accepted. Yet over the past several years, I have not found one conference advertised in the Chronicle of Higher Education devoted to academic standards or the culture of faculty assessment. Perhaps the AAUP or some other organization concerned with faculty well-being and the integrity of higher education can sponsor a conference examining how the evaluation regime affects classroom standards and faculty morale.

Although these discussions would probably focus on the negative effects of numerical evaluations, they could also help faculty members better understand the teaching-learning transaction and some of the many practices studies have associated with "effective" teaching in this day and age. Discussions could also identify and assess alternative and less noxious ways to evaluate classroom teaching and learning for administrative purposes.

We have no reason to keep the negative effects of these forms from the general public. Indeed, most nonacademics are flabbergasted to learn that the university actually allows their little Tysons and Tiffanys to judge the competence of professional educators and in effect to set classroom standards. Many taxpayers, employers, and parents are potential allies in the struggle to restore sanity to the evaluation process.

Eventually, the incompetent use of bogus "teaching" evaluation forms to reward and punish instructional behavior may have to be challenged in court. Robert Lechtreck, an expert on the legal ramifications of faculty assessment, wrote in 1990 that "in the past few decades, courts have struck down numerous tests used for hiring and/or promotions on the grounds that the tests were discriminatory or allowed the evaluator to discriminate. The question, ‘How would you rate the teaching ability of this instructor?’ is wide open to abuse." Perry A. Zirkel, professor of education and law at Lehigh University, believes that courts will not uphold evaluations that are based on subjective criteria or data.

If the evaluations cannot be shown to measure what they claim to measure (namely, teaching quality or ability), then faculty members who have been denied retention, tenure, promotion, or perhaps even merit pay because of the data on these forms stand a good chance of prevailing in the courts. Professional and scholarly organizations can play an invaluable role in mounting a legal challenge by providing research on the issue and by finding a public-interest law firm willing to go to court in defense of academic freedom and standards.

A successful class-action lawsuit on behalf of those denied tenure within a prominent state university system would surely deal a blow to the apparent inevitability and invincibility of what has become yet another tyrannical machine of higher education. Once the administrative use of numerical forms is legally discredited in a couple of large states, hitherto cowed and resigned instructors across the country may clamor for an end to it.

I urge faculty members collectively and publicly to challenge the irresponsible, humiliating, and destructive ways in which their classroom teaching is rewarded and punished by administrators. A critical mass of faculty up to the task can liberate almost a million professional educators from the wrong-headed evaluation regime now corrupting many classrooms and consciences.

Paul Trout is professor of English at Montana State University.