Making the Grade


Back in the early 1950’s when Ellis B. Page was a high school English teacher, he spent a lot of time grading essays. “One of the things I learned was that my weekends looked different from the math teachers’ because I had a box of papers to grade and they didn’t,” recalls Page, who is now a professor of educational psychology and research at Duke University.

Imagine, he might have thought at the time, how much easier it would be if a computer could grade students’ essays for teachers. Nearly four decades later, Page has done more than just imagine that possibility. He has made it real.

Page’s computer-grading system is known as Project Essay Grade, or PEG. He has been testing it over the past three years with good results. The computer, he and his colleagues have found, is better at predicting essay scores than are individual human judges or small groups of judges. And Page has been trumpeting that success of late at conferences and in journal articles.

That a computer could grade something as subjective as writing is bound to generate some skepticism and controversy among English educators. But Page looks at it this way: “What do we know about human judgments?”

“As formal measurements, they’re notoriously unstable,” Page says. On the other hand, he asserts, computers can introduce some objectivity into the essay-grading process, and they can get the job done faster and cheaper than human judges can.

All of those are pluses for large-scale assessment programs. But Page also wants to see PEG used in classrooms, where he hopes it might free teachers to give their students more writing assignments.

Recalling his own weekends chained to a desk, he says, “This would be a godsend to teachers.”

Right Time and Place

This is not the first time the researcher has unveiled his computer-grading system. Three decades ago, he devised a computer program that could do as well as a single human judge at grading essays. But Page, who had just become president of the American Educational Research Association at the time, put the project aside.

“The world was not yet ready for it,” he explains. “The technology was not there.”

The personal computer had not yet been invented, and students typically wrote essays in longhand. Now, however, computers are part of the landscape in schools. National surveys suggest that schools have an average of one computer for every 10 students.

Moreover, educators have since become increasingly fed up with multiple-choice tests and are moving toward assessments that include more essay writing as a way to get a truer measure of what students can do with what they know. Both the S.A.T.-II, formerly known as the achievement tests, and the Graduate Record Examination, for example, now include--or will soon include--essay writing.

Such large-scale assessment programs typically use two human judges to rate essays. But, as Page and Nancy S. Peterson point out in the March issue of Phi Delta Kappan, two judges tend to agree weakly with one another, correlating at about 0.50 or 0.60. A correlation of 1.0 would denote perfect agreement.

That means, they write, that “on a 5-point scale, if Judge A gives a paper a 5, Judge B will often give it a 4; if Judge B gives an essay a 1, Judge A will often give it a 2.” The reliability of scores goes up as more judges are used, but using more judges is often prohibitively expensive for testing companies.
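The claim that reliability rises with the number of judges can be illustrated with the Spearman-Brown formula, a standard result in test theory. The article does not say which calculation Page and Peterson relied on, so the sketch below is only an assumption about how such a figure might be estimated; the function name is invented for illustration.

```python
def pooled_reliability(pair_correlation: float, n_judges: int) -> float:
    """Spearman-Brown estimate of the reliability of an average of n_judges
    ratings, given the correlation between any two individual judges.

    Assumes the judges are interchangeable (equally reliable), which is a
    simplification and not something the article states.
    """
    if n_judges < 1:
        raise ValueError("n_judges must be at least 1")
    return n_judges * pair_correlation / (1 + (n_judges - 1) * pair_correlation)


if __name__ == "__main__":
    # 0.50 is the lower inter-judge correlation quoted in the article.
    for n in (1, 2, 3, 6):
        print(n, "judges:", round(pooled_reliability(0.50, n), 3))
```

Under that assumption, a pair of judges who correlate at 0.50 would yield a pooled reliability of about 0.67, and six judges about 0.86, which is why adding judges helps but quickly becomes expensive.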

To determine how PEG would do in comparison, Page and his colleagues worked with the Educational Testing Service, the Princeton, N.J.-based assessment company, on a blind test using essays written for the Praxis Series: Assessments for Beginning Teachers, a program that is used to license teachers in 33 states.

The company provided the researchers with 1,314 essays that had been rated by at least two human judges. Of these, 1,014 were used to fine-tune the computer program.

The software Page developed cannot “read” essays in the same way that an English teacher would. But it can extract from the graded essays the measurable characteristics that correlate closely with the things that human raters look for, such as diction, fluency, grammar, and creativity. For example, raters tend to give longer essays higher scores. These variables are called “proxes” because they approximate the true or “intrinsic” values of the essay, which Page calls “trins.”

The goal, Page says, is to measure the “trins,” but “the ‘proxes’ do a good job and we’re getting closer to the ‘trins’ the whole time.”
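The idea of a "prox" can be sketched in a few lines. The example below is not PEG: Page has not disclosed his variables, and the real system presumably combines many of them. It uses only essay length, the one measurable characteristic the article names, and fits a straight line to already-graded essays by ordinary least squares. The function names and the toy numbers are made up for illustration.

```python
def word_count(essay: str) -> int:
    """One hypothetical 'prox': the number of whitespace-separated words."""
    return len(essay.split())


def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_score(essay: str, slope: float, intercept: float) -> float:
    """Predict a human rating from the essay's length alone."""
    return intercept + slope * word_count(essay)


if __name__ == "__main__":
    # Invented toy data: word counts of graded essays and their human scores.
    lengths = [120, 200, 310, 400]
    scores = [2.0, 3.0, 4.0, 5.0]
    slope, intercept = fit_line(lengths, scores)
    new_essay = "word " * 250  # stand-in for an ungraded 250-word essay
    print(round(predict_score(new_essay, slope, intercept), 2))
```

A single prox like this is exactly what worries the critics quoted later in the article: it rewards length, not quality.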

The researchers then tested the software on the remaining 300 test essays. The testing company had collected four more ratings by human judges for each of those essays, for a total of six. And these scores were not disclosed to the researchers until the experiment ended.

Their task was to determine how well PEG could predict the average scores of the pool of six human judges who had already evaluated those essays. The computer did predict well, correlating at about 0.88. Moreover, it predicted better than pools of two or three judges each. (The idea was to surpass only two judges combined, since that is what most testing programs use.)
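The comparison described above comes down to a correlation between two lists of numbers: the computer's predicted scores and the average of the human judges' ratings for the same essays. A minimal sketch follows, assuming the ordinary Pearson correlation (the article does not name the statistic); the helper names and the ratings are invented for illustration and are not the study's data.

```python
from math import sqrt


def pool_means(ratings_per_essay):
    """Average the judges' ratings for each essay."""
    return [sum(ratings) / len(ratings) for ratings in ratings_per_essay]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


if __name__ == "__main__":
    # Invented toy ratings: six judges for each of four essays.
    human = [
        [2, 3, 2, 2, 3, 2],
        [4, 4, 5, 4, 3, 4],
        [1, 2, 1, 1, 2, 1],
        [5, 4, 5, 5, 4, 5],
    ]
    machine = [2.6, 3.9, 1.8, 4.4]  # hypothetical computer predictions
    print(round(pearson(machine, pool_means(human)), 3))
```

The same function could be pointed at a pool of two or three judges in place of the machine's scores, which is the comparison the researchers report.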

The Real Thing

Timothy Z. Keith, a psychology professor at Alfred University in Alfred, N.Y., took Page’s results a step further. He used the same data to calculate how well PEG would do at predicting the “true” essay score--a statistical measure--rather than just the average of six human judge ratings.

“When you have a group of scores, you can estimate the extent to which the scores correlate with the ‘true’ score,” he says. The computer, he adds, correlated slightly better with the true score than the individual human judges did.

“That’s telling you the computer is doing a better job at getting at the essence of the essay,” he says.
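The article does not say how Keith arrived at his "true" score estimates. One textbook route, offered here only as an assumption, comes from classical test theory: if judges are treated as interchangeable, the correlation between one judge's rating and the unobservable true score is the square root of the correlation between two judges. The function name is invented for illustration.

```python
from math import sqrt


def true_score_correlation(reliability: float) -> float:
    """Classical test theory: the correlation between an observed score and
    the true score equals the square root of the score's reliability.

    For a single judge, the reliability is taken to be the correlation
    between two judges, assuming they are interchangeable.
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return sqrt(reliability)


if __name__ == "__main__":
    # Inter-judge correlations of 0.50 and 0.60 are the figures quoted earlier.
    for r in (0.50, 0.60):
        print(r, "->", round(true_score_correlation(r), 3))
```

On that reasoning, a judge who correlates at 0.50 with a colleague would correlate at roughly 0.71 with the true score, which gives a sense of the bar the computer had to clear.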

Page’s experiment also found that the computer program was not just good at giving an overall score. It could also give subscores for the content, organization, style, mechanics, and creativity of the essays so that students would know exactly what their strengths and weaknesses were.

What the researchers needed to know next was how students would take to writing on computers. To find that out, they approached Dale Truman, a high school English teacher who had studied with Page at Duke.

“I was skeptical,” says Truman, who teaches at Windham High School in eastern Connecticut. The teacher says he’s watched Stanley Kubrick’s film “2001: A Space Odyssey” four times, “and I know you can’t trust computers. But the more I thought about it, the more I liked it.”

Truman was surprised to find that almost all of the 122 students at his school who tried the computer-grading system had no trouble doing their essays on the computer. One student did, however, mistakenly retype the directions she saw on her computer screen.

The teachers of the eight classes that took part also exchanged and graded the essays themselves and found, once again, that the computer came closer to the teachers’ average scores than did any of the individual teachers themselves.

Plus, Truman says, “it was much faster.”

Truman would like to see PEG used as a way to track the writing progress of entire grades of students and to diagnose how well students might do on the state’s standardized assessments, which include essay writing.

“Ninety percent of our time we spend telling kids they made the same mistakes over and over again,” he says. “If a computer can do that, we’ll be happy to work on the other stuff.”

Rabid Aardvarks

One student did, however, attempt to fool the computer. Nicolas W. Wright, who is now a freshman at Colby College in Waterville, Me., says he got a high score for an essay that was grammatically correct but nonsensical.

“My essay logically proved that a recreation club outdoors would be too dangerous for children because of rabid aardvarks, sharks in the swimming pools ... and the threat of tsunamis in the swimming pools,” he says. “Any human grader would have realized the gibberish, but the computer could not possibly find any errors with the essay.”

But in a real testing situation, the researchers say, most of the essays would be done in good faith because few students would want to chance failing. Also, they say, crank essays could be “flagged” in some way by the computer for a warm body to review.

Even so, Page recommends that, for now, at least one human judge look at every computer-graded essay.

But Wright’s experience raises other unanswered questions for the researchers. How would the computer grade such writers as William Faulkner, who used run-on sentences, or Ernest Hemingway, who sometimes wrote sentences without subjects? No one knows for sure yet.

Some educators also worry about the message that PEG gives to both teachers and students. If PEG is basing its evaluations on indirect variables like length, would teachers start coaching students to write long rather than to write well?

“In the early days of the S.A.T., researchers hit upon a bunch of items that predicted college success better than anything else--things like contemporary world knowledge,” says Carl Bereiter, a researcher at the Center for Applied Cognitive Studies at the Ontario Institute for Studies in Education. “But they never used them because colleges and universities didn’t like the way it looked.” If students could glean contemporary world knowledge from newspapers, for example, why should they study?

Page and his colleagues respond by pointing out that teachers will not know what indirect variables the computer measures. They don’t plan on disclosing them.

“If you were to ask me what tips Princeton Review could give to students, it would be things like ‘stay on the subject,’ ‘address the issue’--the sort of things we tell people to do to write well,” Page says.

Sarah W. Freedman, the director of the Center for the Study of Writing and Literacy at the University of California at Berkeley, raises another concern. She points out that essay grading as it is practiced now in assessment programs is good professional development for teachers.

“Teachers benefit a great deal by getting together and talking about standards and practice,” she says. “I don’t think it would be a good idea to talk about getting rid of that even if [the alternative is] more cost-effective.”

Despite such skepticism, Page says he is getting a good reception from the English educators with whom he talks. He and a partner have formed a small company called TruJudge Inc. to market PEG. He plans to take off a year from teaching to nurture it.

“The time has come,” says Alfred University’s Keith, “for some real-world applications of PEG.”

Further information on this topic is available from:

Keith, T.Z. (1995, April). The computer, human judges, and construct validity. Paper presented at the American Educational Research Association meeting, San Francisco, Calif.

Page, E.B., & Peterson, N.S. (1995). “The computer moves into essay grading: Updating the ancient test.” Phi Delta Kappan. 76(7), 561-565.

Page, E.B. (1994). “Computer grading of student prose, using modern concepts and software.” Journal of Mathematical and Statistical Psychology. 62, 127-42.

Debra Viadero

Assistant Managing Editor, Education Week

Debra Viadero was an assistant managing editor for Education Week.

A version of this article appeared in the May 31, 1995 edition of Education Week as Making the Grade