How do you feel about the prospect of computers grading students’ essays?
That’s a pretty pertinent question right now: the issue is mired in controversy, and two big groups of states are giving serious consideration to using artificial intelligence to score essays on the assessments they’re designing for the common standards.
Excitement and opposition, depending on where you stand on the issue, have surrounded this question for some time. But the debate is getting a bit of a boost lately, with a “Human Reader” petition launched by some folks in the higher education world objecting to computer scoring of student essays.
A Page 1 story in The New York Times today explores the use of artificial intelligence in essay scoring. A study from the University of Akron last year found that computers can do just as good a job of this as people can, and the philanthropic world has been soliciting automated-scoring solutions with cash rewards.
The two assessment consortia, PARCC and Smarter Balanced, which are specifically mentioned in the petition, tell me they’re probing the possibilities of automated scoring in their pilot- and field-testing, and will await feedback from those tests before making a final decision on using those technologies.
Joe Willhoft, the executive director of SBAC, told me in an email that written responses from students participating in the ongoing pilot tests will be hand-scored by the consortium’s contractor, with guidance from SBAC staff. The contractor will then use the scored responses to try to “train” artificial-intelligence software to score the papers.
Scoring, both human and artificial, will focus on three aspects of students’ writing, Willhoft explained: 1) overall organization and style (how well it’s written, whether the sentences are complete and coherent, and whether the voice and style are appropriate); 2) conventions of the language; and 3) students’ use of evidence (whether the essay refers appropriately to the reading materials on which it is based). Based on what is known about computer scoring, he said, Smarter Balanced officials are more confident that it will succeed with conventions, organization, and style than with use of evidence.
They’ll divide the papers into two chunks: a training set and a validity set. Programmers will use the training set to teach the computerized scoring engine to replicate the human scores, and the validity set to see whether the software actually does so on papers it hasn’t seen. With that feedback in hand, SBAC will get its arms around the reliability of computer scoring.
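For readers curious what that train-and-validate step looks like under the hood, here’s a rough sketch. To be clear, this is not SBAC’s engine or its contractor’s software; it’s a generic illustration in Python using the scikit-learn library, with a dozen invented essays and scores standing in for the hand-scored pilot responses, and it scores a single holistic dimension rather than the three Willhoft described. The quadratic weighted kappa reported at the end is one common statistic for how closely machine scores agree with human ones.

```python
# A minimal sketch of the train-and-validate workflow described above.
# This is NOT SBAC's actual scoring engine; the essays and human scores
# below are invented placeholders for the hand-scored pilot responses.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score
from sklearn.pipeline import make_pipeline

essays = [
    "The author supports her claim with two specific examples drawn from the passage.",
    "it was about a dog and other stuff that happened",
    "The essay is organized into clear paragraphs and quotes the article directly.",
    "good story i liked it",
    "The writer restates the prompt and adds one example from the reading.",
    "Sentences are complete and coherent, and evidence from the text backs each point.",
    "no reasons are given and the spelling is bad",
    "The response has a clear opening and closing but cites the passage only once.",
    "Each paragraph builds on the last, with quotations woven in smoothly.",
    "The conclusion is abrupt, though the body paragraphs do reference the article.",
    "i dont know what the story was about",
    "The argument is well organized and grounded in details from the reading.",
]
human_scores = [2, 0, 2, 0, 1, 2, 0, 1, 2, 1, 0, 2]  # 0 = low, 2 = high

# Divide the hand-scored papers into a training set and a validity set.
train_essays, train_scores = essays[:9], human_scores[:9]
valid_essays, valid_scores = essays[9:], human_scores[9:]

# "Train" the scoring engine to replicate the human scores in the training set.
engine = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
engine.fit(train_essays, train_scores)

# Score the validity set and check agreement with the human scores.
machine_scores = engine.predict(valid_essays)
exact = sum(m == h for m, h in zip(machine_scores, valid_scores)) / len(valid_scores)
kappa = cohen_kappa_score(valid_scores, machine_scores, weights="quadratic")
print(f"Exact agreement on validity set: {exact:.0%}")
print(f"Quadratic weighted kappa: {kappa:.2f}")
```

With thousands of papers rather than a dozen, and a far more elaborate engine, that final agreement check is essentially the feedback SBAC says it will weigh before deciding whether to let software score essays for real.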