Les Perelman, a researcher at the Massachusetts Institute of Technology, has been doing some interesting experiments to test the capacity of computerized grading systems to accurately judge the quality of written work. In a recent interview with the Canadian Broadcasting Corporation’s Carol Off, he shared what he has learned.
As our government agencies and various reform efforts seek to shift high-stakes testing away from multiple-choice questions, there is growing interest in computer programs that can read and score student essays. But questions persist, given the limitations of the algorithms these programs use. So Mr. Perelman devised an experiment. He created something he calls the Basic Automatic BS Essay Language Generator, BABEL for short. During his interview with Carol Off, Perelman fed his machine a topic she suggested, “Fair Elections Act.”
Here is what the BABEL machine provided in response:
Fun fair for adherents and presumably will never be altruistic in the extent to which we purloin the analysis. Fair is the most fundamental postulate of humankind. Whiner to act in the study of semiotics in addition to the search for reality. Act is intrepidly and clandestinely axiomatic by most of the scenarios. As I have learned in my semiotics class, act is the most fundamental exposition of humanity.
Mr. Perelman then submits this essay for grading. The result: a score of 5.4 out of 6, placing the essay in the 90th percentile.
Perelman explains his purpose:
I did this as an experiment to show that what these computers are grading does not have anything to do with human communication. If you think about writing or any kind of human communication as the transfer of thoughts from one mind to another mind, then if the machine takes something that anyone would say is complete incoherent nonsense, and scores it highly, and we know that it's not, then we know that it's not grading human communication.
Two years ago there was some excitement about computer-scored essays, when a demonstration showed that computers could yield results that aligned reasonably well with the scores given by human scorers.
This article in Education Week reported:
"The demonstration showed conclusively that automated essay-scoring systems are fast, accurate, and cost-effective," said Tom Vander Ark, the chief executive officer of Open Education Solutions, and a co-director of the study, in a press release. (Vander Ark is also a former top education official at the Bill & Melinda Gates Foundation.)
The study compared computer grading to human scoring, but the humans doing the scoring were not, in fact, teachers. They were mostly temporary employees paid low wages, working in mass scoring facilities, as former testing company employee Todd Farley pointed out. And under these conditions, human scoring is nothing to brag about.
...this study confirms the fact humans don't do that great a job when assessing essays but also wants to celebrate the success of automated scoring engines by saying that they do "similar" work, "by and large." Unfortunately, that means the study's final conclusion is really no more than a lame claim that automated scoring engines are able to give scores to student essays that are in the ballpark of the scores human readers give, even though those human scores are probably only in the ballpark of what the student writers really deserve.
There is great urgency behind the search for this magic combination: test questions that can prompt student essays which computers can then score fairly accurately. This may even be guiding the design of some of the test questions we are seeing on the latest generation of tests, as was discussed here a few months ago. The urgency comes from widespread dissatisfaction with “bubble tests,” as Secretary of Education Duncan has referred to them.
But there are severe constraints. We cannot afford to pay humans, even low-wage ones working in hot warehouses somewhere, to score millions of essays. And with the Common Core, we want to test even more often. Obviously we cannot trust teachers to score their own students’ work, because we are planning to use these scores to determine bonuses, to evaluate teachers, and even to close down their schools. So there is tremendous pressure to move in the direction of computerized grading of student work.
Unfortunately, the system breaks down when we do what Mr. Perelman has done. He has figured out what the computer’s algorithm actually rewards when it scores student work: obscure vocabulary and length. Throw enough big words into an essay, and write long enough, and you will get a good score. Given the human capacity to do what Mr. Perelman has done with his software, it is likely that once students figure out these algorithms, they can similarly generate essays that are loquacious without being lucid.
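To make that concrete, here is a minimal sketch of the kind of surface-feature scoring Perelman describes: a toy function that rewards only length and uncommon vocabulary. The word list, weights, and 6-point scale below are invented for illustration; this is not the code or formula of any real grading engine, only a caricature of the features Perelman says such engines reward.

```python
# Toy illustration only (not any vendor's real algorithm): a naive essay
# "scorer" that rewards the two surface features Perelman says the real
# engines overvalue (length and obscure vocabulary) while measuring
# nothing about meaning. Word list, weights, and the 6-point scale are
# invented for this sketch.

COMMON_WORDS = {
    "the", "a", "an", "is", "are", "of", "to", "in", "and", "that",
    "it", "for", "on", "as", "with", "was", "be", "this", "by", "we",
}

def toy_score(essay: str, max_score: float = 6.0) -> float:
    words = [w.strip(".,;:!?\"'()").lower() for w in essay.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0

    # Feature 1: longer is "better" (caps out around 500 words).
    length_points = min(len(words) / 500.0, 1.0)

    # Feature 2: a high share of long, uncommon words is "better".
    rare = [w for w in words if w not in COMMON_WORDS and len(w) > 7]
    vocab_points = min((len(rare) / len(words)) / 0.2, 1.0)

    # Neither feature asks whether the essay actually says anything.
    return round(max_score * (0.5 * length_points + 0.5 * vocab_points), 1)

if __name__ == "__main__":
    babel_style = ("Fair is the most fundamental postulate of humankind. "
                   "Act is intrepidly and clandestinely axiomatic.") * 40
    print(toy_score(babel_style))               # long + polysyllabic: high score
    print(toy_score("Voting should be fair."))  # short and plain: low score
```

Run on the BABEL-style paragraph, the toy scorer maxes out both features and returns a top mark, while a short, perfectly sensible sentence scores near zero: exactly the failure mode Perelman is demonstrating.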
In closing his interview, Mr. Perelman offered what he called Perelman’s conjecture: “People’s belief in computerized essay marking is proportional to the square of the intellectual distance from people who actually know what they’re talking about.”
That sounds frustratingly similar to much of what passes for education reform these days.
Update: Here is an op-ed authored by Les Perelman that appeared last week.
What do you think? Have you had any experience with computerized grading systems?
Continue the dialogue with Anthony on Twitter.