The Test No One Needs
Only a year after boasting of its creation of a new computer-administered admission test, the Educational Testing Service was forced last December to suspend temporarily its administration of the computerized version of the Graduate Record Examination, the test used for admission to most of the nation's graduate schools. Regrettably, this was not just a timeout in the admissions-testing game. Readers should not for a moment suppose that computer-adaptive testing, or C.A.T., is meant only for college students. Early introduction of C.A.T. at the secondary level is now being planned.
Like many of the technological innovations our society embraces with such enthusiasm, however, computer-adaptive testing has not been adequately tested itself for adverse, unanticipated consequences. In my view, the E.T.S. should abandon, not suspend, this test no one needs.
In computer-adaptive testing, a test-taker, seated before a computer console, is routed by software to harder or easier items depending on whether he or she has answered the previous item(s) correctly or incorrectly. At each step, an estimate of the difficulty of the new item that should be presented is made by the computer, based upon answers given by the test-taker to prior items. Naturally, this requires that the software have information stored about the difficulty of the items, so all of them must have been pretested on other groups of people to obtain their "item difficulty." The test-taker is successively presented with additional items, until a point is reached at which the estimate of the test-taker's ability becomes stabilized. At that point, the test is terminated. Test-takers are spared the effort of responding to questions too easy for them, or too hard for them. They save perhaps an hour of testing time and get their scores right away, which is good but not dramatically good.
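For readers who want to see the routing logic in miniature, here is a rough sketch. It is not the E.T.S.'s actual procedure, which the description above does not spell out. It assumes a simple "staircase" rule: the ability estimate moves up after a right answer and down after a wrong one, by a step that halves each time, and the test ends once the step is small enough to call the estimate stable. The function name, the difficulty scale, and the step sizes are all invented for illustration.

```python
from typing import Callable, List, Tuple


def run_adaptive_test(
    difficulties: List[float],
    answers_correctly: Callable[[float], bool],
    start: float = 0.0,
    first_step: float = 2.0,
    stop_step: float = 0.25,
) -> Tuple[float, List[float]]:
    """Return (ability estimate, difficulties of the items presented).

    Hypothetical staircase rule, not an operational scoring model.
    """
    unused = list(difficulties)      # pretested item difficulties
    presented: List[float] = []
    estimate, step = start, first_step
    while unused and step >= stop_step:
        # Present the unused item whose difficulty is closest to the estimate.
        item = min(unused, key=lambda d: abs(d - estimate))
        unused.remove(item)
        presented.append(item)
        # Route to harder material after a right answer, easier after a wrong one.
        estimate += step if answers_correctly(item) else -step
        step /= 2                    # smaller moves as the estimate settles
    return estimate, presented


# A made-up bank of 25 items with difficulties from -3.0 to 3.0.
bank = [i * 0.25 - 3 for i in range(25)]
# A made-up test-taker who gets every item at or below difficulty 1.3 right.
estimate, presented = run_adaptive_test(bank, lambda d: d <= 1.3)
print(estimate, presented)  # 1.25 [0.0, 2.0, 1.0, 1.5]
```

In this toy run the test-taker sees 4 of the 25 items before the test stops, which is the source of the time savings. Note that the sketch depends entirely on the stored difficulty of every item, which is why each one must be pretested.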
In a C.A.T. center, students are scheduled throughout the day. It is much like going to the barber shop and waiting your turn. The computer is programmed so that it has available a very large number of items. This is necessary to get a reliable estimate of the student's true score, but it is even more important because, if the same items were exposed for use again and again, students--ever industrious--would memorize the questions, return to the testing shop armed with the correct answers, and thereby inflate their scores, destroying the usefulness of the test for admissions. By contrast, the old-fashioned way requires students to answer questions in a printed test booklet, and the same test ordinarily is not administered twice in the same year, making it much harder to use a crib sheet.
Because the E.T.S. was trying to prevent the cost of computer-adaptive testing from going through the roof (a C.A.T. version of the graduate-admissions exam costs roughly double what the paper-and-pencil version costs to administer), it skimped on the size of the item pools, making it easy for students to encounter familiar items. So, in the game of testing, a timeout was called. Too many kids were outsmarting the testers.
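A back-of-the-envelope calculation shows why pool size matters so much. The sketch below assumes, purely for illustration, that each sitting draws its items at random and with equal likelihood from the pool. A real adaptive test does not select items that way, and the pool sizes and test length used here are invented, not E.T.S. figures.

```python
from math import comb


def expected_repeats(pool_size: int, test_length: int) -> float:
    """Average number of items a returning test-taker has already seen."""
    return test_length * test_length / pool_size


def chance_of_any_repeat(pool_size: int, test_length: int) -> float:
    """Probability that a second sitting shares at least one item with the first."""
    if test_length > pool_size:
        raise ValueError("test cannot be longer than the pool")
    if 2 * test_length > pool_size:
        return 1.0  # two sittings cannot avoid overlapping
    # Ways to pick a second test from unseen items, over all ways to pick it.
    no_overlap = comb(pool_size - test_length, test_length) / comb(pool_size, test_length)
    return 1 - no_overlap


# Hypothetical numbers: a 20-item test drawn from a small and a large pool.
for pool in (100, 1000):
    print(pool, expected_repeats(pool, 20), round(chance_of_any_repeat(pool, 20), 2))
```

Under these assumptions, a tenfold larger pool cuts the expected number of familiar items tenfold, from 4 to 0.4 in this example. Because an adaptive test steers test-takers of similar ability toward the same items, the real overlap is likely worse than random drawing suggests.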
At the very least, we ought to extend the Educational Testing Service's "timeout" to afford educators an opportunity for sober reconsideration of computer-adaptive testing. Are the gains--shorter testing time and test-taker convenience--worth the cost, which may reach several hundred dollars per test-taker? Beyond the inflated costs, there are other disadvantages to C.A.T. Let me touch on only a few:
History seems to suggest that those who are continuously engaged in pressing the technological horizon are more likely to reach the far shore. The use of C.A.T. is a detour on that voyage, one that extracts resources from students and preserves the status quo of the multiple-choice mentality. For making true progress in the realm of testing, the E.T.S. got on the wrong boat.
All of this raises questions about the testing service's quality, cost, and managerial expertise. As is the case with a privately owned utility, we all have a stake in how well the E.T.S. performs. No one wants a blackout. But beware; the power grid of standardized, multiple-choice testing is beginning to flicker.
In fact, the frequency with which would-be graduate students were encountering the same items on the computerized G.R.E. was revealed to the E.T.S. by the people at Kaplan Educational Centers, who operate coaching schools for tests like the G.R.E. They tried to alert the test-maker to the problem of students' memorizing oft-repeated items. (As purveyors of services in a secondary testing market, coaching schools have an interest in keeping the multiple-choice industry healthy.) After grudgingly acknowledging that a serious problem existed, and suspending computer-administered versions of the G.R.E., the E.T.S. filed suit against Kaplan on the grounds that the coaching company had no right to send its staff in to take the computer-generated version of the G.R.E. repeatedly. (See Education Week, 1/11/95.) This has reminded some observers of the famous case of last summer, when an elderly New Jersey gentleman killed a rat in his vegetable garden only to be served with a summons for cruelty to animals.
Computer-adaptive testing may well be a transitional form of assessment whose time has come and gone in the flickering of an eyelash. It is transitional because it arose out of a shotgun marriage of multiple-choice questions and the computer. Multiple-choice questions themselves grew to dominate modern testing because of their compatibility with the optical scanner. As we move from a scanner to a computer, the question that begs to be asked is this: What kinds of questions or exercises are most compatible with computer technology? There is an "organic" connection between the medium--scanner or computer--and the form of test questions. In C.A.T. we find a mismatch, and for that reason it is likely to be a transitional phase.
How we go about admitting students to college and to graduate study is important to the health of our nation, whose continued prosperity is intimately linked with the question of how we evaluate human talent. Let us take the suspension of the computerized G.R.E. as an opportunity to widen the circle of those who determine what should be the future shape of the nation's admissions-testing programs beyond the small group of psychometricians huddling over the G.R.E. debacle in Princeton, N.J. Once again, the kids have in their own ironic way given the Educational Testing Service a second chance to do it right. We should demand that they do.