When it comes to grading student writing, a new computer program can consistently outperform humans
Back in the early 1950s when Ellis Page was a high school English teacher, he spent a lot of time grading essays. “One of the things I learned was that my weekends looked different from the math teachers’ because I had a box of papers to grade, and they didn’t,” recalls Page, who is now a professor of educational psychology and research at Duke University.
Countless English teachers like Page have probably imagined how much easier their lives would be if a machine could grade their student essays. Now, nearly four decades after he entered the classroom, Page has turned that dream into a reality.
For the past three years, Page has been testing his grading machine—a computer program he calls Project Essay Grade, or PEG for short—with good results. The program, it turns out, can score an essay typed on a computer with more reliability than a handful of human judges, something Page has been trumpeting at conferences and in journal articles.
Page’s claim that a computer can grade something as subjective as student writing as well as, if not better than, humans is bound to generate skepticism and controversy among English teachers. But Page stands firm. “What do we know about human judgments?” he says. “As formal measurements, they’re notoriously unstable.” Computers, on the other hand, offer an objective assessment. What’s more, they can get the job done faster and cheaper than humans.
All of these factors are big pluses for large-scale assessments, such as standardized tests. But Page thinks PEG also has a place in the classroom; it could free teachers from mounds of paperwork, enabling them to give their students more writing assignments. Recalling his own weekends chained to a desk, he says, “This would be a godsend to teachers.”
Three decades ago, Page devised a similar grading program that was as consistent as a single human judge at grading essays. But Page, who had just become president of the American Educational Research Association, put the project aside. The personal computer had not yet been invented, and students typically wrote essays in longhand. “The world was not yet ready for it,” he explains. “The technology was not there.”
Now, however, computers are a common part of the school landscape. Moreover, educators, fed up with multiple-choice testing, are moving toward assessments that include more essay writing. Both the SAT-II and the Graduate Record Examination, for example, now—or soon will—include essay writing.
Such large-scale assessment programs typically use two human judges to rate each essay. But, as Page and researcher Nancy Peterson point out in the March issue of Phi Delta Kappan, two judges do not usually agree with one another, correlating at about 0.50 or 0.60. A correlation of 1.0 would denote perfect agreement. “On a 5-point scale,” they write, “if Judge A gives a paper a 5, Judge B will often give it a 4; if Judge B gives an essay a 1, Judge A will often give it a 2.” The reliability of scores goes up as more judges are used, but using more judges is often prohibitively expensive for testing companies.
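The arithmetic behind that reliability claim can be sketched with the Spearman-Brown prophecy formula from classical test theory. The 0.55 inter-judge correlation below is an assumed figure inside the 0.50-to-0.60 range Page and Peterson report, not a number from their study:

```python
# Spearman-Brown prophecy formula: reliability of the average of k judges,
# given a typical correlation r between any single pair of judges.
def pooled_reliability(r: float, k: int) -> float:
    return k * r / (1 + (k - 1) * r)

for k in (1, 2, 3, 6):
    print(k, round(pooled_reliability(0.55, k), 2))
# prints: 1 0.55 / 2 0.71 / 3 0.79 / 6 0.88
```

The diminishing returns are visible in the numbers: going from one judge to two buys far more reliability than going from three to six, which is why testing programs settle on two judges despite the cost.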
To find out how PEG would stand up next to humans, Page and his colleagues worked with the Educational Testing Service, using essays written for the Praxis Series: Assessments for Beginning Teachers, an examination 33 states use to license teachers. The company provided the researchers with 1,314 essays that had been rated by at least two human judges. Of these, 1,014 were used to fine-tune the computer program.
The software Page developed cannot “read” an essay the same way an English teacher does. But it can extract measurable characteristics that correlate closely with the things that human raters look for, such as diction, fluency, grammar, and creativity. Research shows, for example, that raters tend to give longer essays higher scores. Such variables are called “proxes” because they “approximate” the true or “intrinsic” value of the essay, which Page calls “trins.” The goal is to measure the trins, but, Page says, “the proxes do a good job, and we’re getting closer to the trins the whole time.”
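As an illustration of the prox idea only (PEG's actual variables are many and undisclosed), a minimal grader can regress human scores on a single surface feature such as essay length. The feature choice and training numbers here are invented for the sketch:

```python
# Illustrative "prox" grader: predict a human score from one surface
# feature, essay length, the classic proxy the research cites.
def word_count(essay: str) -> int:
    return len(essay.split())

def fit_simple_regression(xs, ys):
    # Ordinary least squares for y = a + b*x.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

# Hypothetical training data: (word count, average human score on a 5-point scale).
counts = [120, 180, 240, 300, 360]
scores = [2.0, 2.5, 3.5, 4.0, 4.5]
a, b = fit_simple_regression(counts, scores)

def predict(essay: str) -> float:
    return a + b * word_count(essay)
```

A real system would combine dozens of such proxes, but even this toy version shows the logic: the machine never "reads" the essay, it only learns how surface measurements track the scores humans assign.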
The researchers then tested the software on the remaining 300 ETS essays. The testing company collected four more ratings by human judges for each of those essays, for a total of six. These scores were not disclosed to the researchers until the experiment ended.
What Page wanted to determine was how well PEG could predict the average scores of the pool of six human judges who had already evaluated each essay. The computer did a good job, correlating at about 0.88. But more important, it predicted the scores better than pools of two or three judges. The goal was to surpass two judges combined, since that is what most testing programs use.
Timothy Keith, a psychology professor at Alfred (N.Y.) University, took Page’s results a step further. He used the same data to calculate how well PEG could predict an essay’s “true” score—a kind of statistical measurement rather than an average human judgment. “When you have a group of scores, you can estimate the extent to which the scores correlate with the ‘true’ score,” he says. The computer correlated slightly better with the true score than the individual human judges did. “That’s telling you the computer is doing a better job at getting at the essence of the essay,” he says.
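Keith's true-score comparison can be reconstructed, under standard classical-test-theory assumptions, from the figures already quoted. The 0.55 single-pair inter-judge correlation is again an assumption within the reported 0.50-to-0.60 range, not a number taken from his analysis:

```python
import math

r_judges = 0.55    # assumed correlation between any two judges (article: 0.50-0.60)
r_peg_avg = 0.88   # PEG's reported correlation with the six-judge average

# Spearman-Brown: reliability of the six-judge average score.
rel_avg6 = 6 * r_judges / (1 + 5 * r_judges)

# Correction for attenuation: estimated correlation with the
# unobservable "true" score.
r_judge_true = math.sqrt(r_judges)            # a single human judge
r_peg_true = r_peg_avg / math.sqrt(rel_avg6)  # the computer

print(round(r_judge_true, 2), round(r_peg_true, 2))  # prints: 0.74 0.94
```

Under these assumptions a lone judge correlates with the true score at roughly 0.74 while the computer reaches roughly 0.94, which is consistent with Keith's conclusion that PEG beats any individual human rater.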
Page’s experiment found that the computer program was not only good at giving an overall score but also subscores on content, organization, style, mechanics, and creativity. Such subscores, Page says, could show students where their strengths and weaknesses are.
What the researchers needed to know next was how well students would take to writing essays on computer rather than paper. They approached Dale Truman, a high school English teacher who had studied with Page at Duke. “I was skeptical,” admits Truman, who teaches at Windham High School in eastern Connecticut. A fan of Stanley Kubrick’s movie 2001: A Space Odyssey, with its moody computer HAL, Truman says, “I know you can’t trust computers. But the more I thought about it, the more I liked it.”
Truman was surprised to find that almost all of the 122 students who tried the computer-grading system—spread across eight classes—had no trouble writing on the computer. The teachers of those eight classes exchanged and graded the essays themselves. But once again, the computer came closer to the teachers’ average scores than did any individual teacher. Plus, Truman says, the computer “was much faster.”
Truman would like to see PEG used as a way to track the writing progress of entire grades of students and to diagnose how well students might do on the state’s standardized assessments, which now include essay writing. “Ninety percent of our time we spend telling kids they made the same mistakes over and over again,” he says. “If a computer can do that, we’ll be happy to work on the other stuff.”
One student, however, did manage to fool the computer. Nicolas Wright, who is now a freshman at Colby College in Waterville, Me., says he got a high score for an essay that was grammatically correct but nonsensical. “My essay logically proved that a recreation club outdoors would be too dangerous for children because of rabid aardvarks, sharks in the swimming pools, and the threat of tsunamis,” he says. “Any human grader would have realized the gibberish, but the computer could not possibly find any errors with the essay.”
In a real testing situation, the researchers say, most essays would be done in good faith because few students would want to chance failing. Also, they say, crank essays could be “flagged” in some way by the computer for a warm body to review. Still, Page recommends that at least one human judge look at every computer-graded essay, at least for now.
Some educators worry about the message that PEG sends both teachers and students. If PEG is basing its evaluations on indirect variables, such as length, wouldn’t teachers start coaching students to write longer rather than better?
“In the early days of the SAT, researchers hit upon a bunch of items that predicted college success better than anything else—things like contemporary world knowledge,” says Carl Bereiter, a researcher at the Center for Applied Cognitive Studies at the Ontario Institute for Studies in Education. “But they never used them because colleges and universities didn’t like the way it looked.” If students could glean contemporary world knowledge from newspapers, for example, why should they study?
Page and his colleagues point out that teachers will not know what indirect variables the computer measures because the researchers don’t plan to disclose them. “If you were to ask me what tips Princeton Review could give to students,” Page says, “it would be things like ‘stay on the subject,’ ‘address the issue,’ the sorts of things we tell people to do to write well.”
Sarah Freedman, director of the Center for the Study of Writing and Literacy at the University of California at Berkeley, raises another concern. She points out that essay grading as it is practiced now is good professional development for teachers. “Teachers benefit a great deal by getting together and talking about standards and practice,” she says. “I don’t think it would be a good idea to talk about getting rid of that even if the alternative is more cost-effective.”
Despite such skepticism, Page says he is getting a good reception from English educators. He and a partner have formed a small company called TruJudge Inc. to market PEG, and he plans to take off a year from teaching to nurture it. As Timothy Keith of Alfred University puts it, “The time has come for some real-world applications of PEG.”
Vol. 07, Issue 01, Pages 18-20