The Test No One Needs
Only a year after boasting of its creation of a new computer-administered admission test, the Educational Testing Service was forced last December to suspend temporarily its administration of the computerized version of the Graduate Record Examination, the test used for admission to most of the nation's graduate schools. Regrettably, this was not just a timeout in the admissions-testing game. Readers should not for a moment suppose that computer-adaptive testing, or C.A.T., is meant only for college students. Early introduction of C.A.T. at the secondary level is now being planned.
Like many of the technological innovations our society embraces with such enthusiasm, however, computer-adaptive testing has not been adequately tested itself for adverse, unanticipated consequences. In my view, the E.T.S. should abandon, not suspend, this test no one needs.
In computer-adaptive testing, a test-taker, seated before a computer console, is routed by software to harder or easier items depending on whether he or she has answered the previous item(s) correctly or incorrectly. At each step, an estimate of the difficulty of the new item that should be presented is made by the computer, based upon answers given by the test-taker to prior items. Naturally, this requires that the software have information stored about the difficulty of the items, so all of them must have been pretested on other groups of people to obtain their "item difficulty." The test-taker is successively presented with additional items, until a point is reached at which the estimate of the test-taker's ability becomes stabilized. At that point, the test is terminated. Test-takers are spared the effort of responding to questions too easy for them, or too hard for them. They save perhaps an hour of testing time and get their scores right away, which is good but not dramatically good.
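For readers who want to see the routing logic just described in concrete form, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the procedure the E.T.S. uses: the function name `run_adaptive_test`, the step-halving update, the `tolerance` stopping rule, and the difficulty scale are all hypothetical, and operational programs generally rely on more elaborate statistical models. The sketch only shows the loop in outline: pick the pretested item closest in difficulty to the current estimate, move the estimate up or down according to the answer, and stop once the estimate has stabilized.

```python
def run_adaptive_test(item_difficulties, answers_correctly,
                      start=0.0, step=2.0, tolerance=0.1):
    """Simplified adaptive-test loop (hypothetical; not the E.T.S. algorithm).

    item_difficulties: dict mapping item id -> pretested difficulty.
    answers_correctly: callable(item_id, difficulty) -> bool, standing in
                       for the test-taker's response.
    """
    remaining = dict(item_difficulties)
    ability = start          # current estimate of the test-taker's ability
    administered = []

    # Stop when the adjustments become small (the estimate has "stabilized")
    # or when the pool is exhausted.
    while remaining and step > tolerance:
        # Choose the unused item whose difficulty is closest to the estimate.
        item = min(remaining, key=lambda i: abs(remaining[i] - ability))
        difficulty = remaining.pop(item)

        # Harder items follow a correct answer, easier ones an incorrect one.
        if answers_correctly(item, difficulty):
            ability += step
        else:
            ability -= step
        step /= 2            # each successive adjustment is half as large
        administered.append(item)

    return ability, administered


# Hypothetical pool of 25 pretested items, difficulties from -3.0 to 3.0.
pool = {f"q{k}": -3.0 + 0.25 * k for k in range(25)}

# A stand-in test-taker who answers correctly whenever the item's
# difficulty does not exceed a "true" ability of 1.2.
estimate, items_used = run_adaptive_test(pool, lambda _item, d: d <= 1.2)
```

In this toy run the test ends after only a handful of the 25 items, which is the source of the time savings, and it also shows why every item in the pool must carry a pretested difficulty before the software can route anyone anywhere.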
In a C.A.T. center, students are scheduled throughout the day. It is much like going to the barber shop and waiting your turn. The computer is programmed so that it has available a very large number of items. This is necessary to get a reliable estimate of the student's true score, but even more important because, if the same items were exposed for use again and again, students--ever industrious--would memorize the questions and return again to the testing shop, armed with the correct answers and thereby inflate their scores, invalidating the usefulness of the test for admissions. By contrast, the old-fashioned way requires students to answer questions in a printed test booklet, and the same test ordinarily is not administered twice in the same year, making it much harder to use a crib sheet.
Because the E.T.S. was trying to prevent the cost of computer-adaptive testing from going through the roof (a C.A.T. version of the graduate-admissions exam costs roughly double what the paper-and-pencil version costs to administer), it skimped on the size of the item pools, making it easy for students to encounter familiar items. So, in the game of testing, a timeout was called. Too many kids were outsmarting the testers.
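The arithmetic behind the small-pool problem can be sketched roughly as follows. The figures are hypothetical, chosen only for illustration, and the calculation assumes items are drawn uniformly at random from the pool, which a real adaptive test does not do; the function name `expected_repeats` is likewise an invention for this sketch. Under that simplifying assumption, the expected number of items a returning test-taker has already seen is the test length squared divided by the pool size.

```python
def expected_repeats(pool_size, test_length):
    """Expected number of items shared by two tests, each drawn uniformly
    at random (without replacement) from the same pool.

    This is the mean of a hypergeometric distribution:
    test_length * test_length / pool_size.
    """
    if not 0 < test_length <= pool_size:
        raise ValueError("test_length must be between 1 and pool_size")
    return test_length * test_length / pool_size


# Hypothetical figures: a 30-item test drawn from a small or a large pool.
small_pool_overlap = expected_repeats(300, 30)
large_pool_overlap = expected_repeats(3000, 30)
```

A tenfold larger pool cuts the expected overlap tenfold, which is why skimping on the pool makes familiar items so much easier to encounter, and why an adequate pool is so costly to build.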
At the very least, we ought to extend the Educational Testing Service's "timeout" to afford educators an opportunity for sober reconsideration of computer-adaptive testing. Are the gains--shorter testing time and test-taker convenience--worth the cost, which may reach several hundred dollars per test-taker? Beyond the inflated costs, there are other disadvantages to C.A.T. Let me touch on only a few:
- The cost to real innovation in testing. In many ways, the great
promise of computers in assessment lies not in C.A.T. but in the
creative use of computers in devising genuinely interactive exercises
that make full use of the information-handling, retrieval, and
dialogue capabilities of the machines. New modes of assessing human
ability have not yet been fully realized but are clearly on the
threshold of development--within a computer environment.
Computer-adaptive testing makes no intelligent, creative advances in
this quest. It merely preserves the same old multiple-choice
questions, using the computer as an elaborate page-turner.
History seems to suggest that those who are continuously engaged in pressing the technological horizon are more likely to reach the far shore. The use of C.A.T. is a detour on that voyage, one that extracts resources from students and preserves the status quo of the multiple-choice mentality. For making true progress in the realm of testing, the E.T.S. got on the wrong boat.
- The demands for enormous (and costly) item pools. The pool of test
items necessary to support a C.A.T.-based testing program is very
large; thousands upon thousands of items are needed. This is a much
larger task than that of producing enough items for two or three
new test forms (fewer than 500 items) each year. The combined
demand for test development necessary to support several testing
programs at the E.T.S. that simultaneously employ computer-adaptive
testing will be overwhelming if it is not managed superbly
(assuming enough people can be found to write the items). Efforts
to replace human item-writers with computer-generated items are in
their infancy at the testing concern; they seem feasible in
mathematics but not in the verbal areas.
All of this raises questions about the testing service's quality, cost, and managerial expertise. As is the case with a privately owned utility, we all have a stake in how well the E.T.S. performs. No one wants a blackout. But beware; the power grid of standardized, multiple-choice testing is beginning to flicker.
In fact, the frequency with which would-be graduate students were encountering the same items on the computerized G.R.E. was revealed to the E.T.S. by the people at Kaplan Educational Centers, who operate coaching schools for tests like the G.R.E. They tried to alert the test-maker to the problem of students' memorizing oft-repeated items. (As purveyors of services in a secondary testing market, it is in the interest of coaching schools to keep the multiple-choice industry healthy.) After grudgingly acknowledging that a serious problem existed, and suspending computer-administered versions of the G.R.E., the E.T.S. subsequently filed suit against Kaplan on the grounds that the coaching company had no right to send its staff in to take the computer-generated version of the G.R.E. repeatedly. (See Education Week, 1/11/95.) This has reminded some observers of the famous case of last summer, when an elderly New Jersey gentleman killed a rat in his vegetable garden only to be served with a summons for cruelty to animals.
- Exorbitant capital demands. Putting in place a large "utility" of computer-equipped testing centers creates a demand for capital that is much greater than needed when using empty classrooms on Saturday mornings to administer paper-and-pencil tests. The burden of recovering these costs will not be inconsiderable, and should be carefully weighed, bearing in mind that a gold-plated monkey wrench is still a monkey wrench, and that it is the student who pays all the bills. Of course, the E.T.S. is hedging its bets by buying stock in the company that operates its computer-adaptive-testing shops on contract, thereby demonstrating how double-dipping on student fees can be honed to a fine point by a nonprofit corporation truly interested in the bottom line.
- Settling for mediocrity. One could imagine a truly innovative G.R.E. consisting of computer-based problems devised to tap into dynamic problem-solving abilities (these might include, for example, diagnosis and inductive reasoning, verbal fluency, and mathematical insight)--abilities that are not handled well in the conventional multiple-choice format. In this way, a broader and more penetrating assessment could be attained than is possible using only multiple-choice questions. At the E.T.S., promising new forms of test questions have lain fallow because they were not easily adaptable to C.A.T. It is no exaggeration to speak of a "multiple-choice culture" there that clings to C.A.T., and saps funds and talent that would be better directed to devising creative new forms of testing.
Computer-adaptive testing may well be a transitional form of assessment whose time has come and gone in the flickering of an eyelash. It is transitional because it arose out of a shotgun marriage of multiple-choice questions and the computer. Multiple-choice questions themselves grew to dominate modern testing because of their compatibility with the optical scanner. As we move from a scanner to a computer, the question that begs to be asked is: What kinds of questions or exercises are most compatible with computer technology? There is an "organic" connection between the medium--scanner or computer--and the form of test questions. In C.A.T. we find a mismatch, and for that reason it is likely to be a transitional phase.
How we go about admitting students to college and to graduate study is important to the health of our nation, whose continued prosperity is intimately linked with the question of how we evaluate human talent. Let us take the suspension of the computerized G.R.E. as an opportunity to widen the circle of those who determine what should be the future shape of the nation's admissions-testing programs beyond the small group of psychometricians huddling over the G.R.E. debacle in Princeton, N.J. Once again, the kids have in their own ironic way given the Educational Testing Service a second chance to do it right. We should demand that they do.
Vol. 14, Issue 29, Pages 22, 24
Published in Print: April 12, 1995, as The Test No One Needs