A Question of Direction
‘Adaptive’ testing puts federal officials and experts at odds
Computer adaptive testing is used to test recruits to the U.S. military, to license nurses and computer technicians, for entrance tests to graduate school, and for a popular placement test used by community colleges—but, outside a handful of K-12 schools, not for academic testing.
Most notably, computer adaptive testing has been left out of nearly all the large-scale testing programs that states are ramping up to meet the requirements of the federal “No Child Left Behind Act” of 2001.
A prime reason: The U.S. Department of Education interprets the law’s test-driven accountability rules as excluding so-called “out-of-level” testing. Federal officials have said the adaptive tests are not “grade-level tests,” a requirement of the law.
“Psychometricians regard that decision as humorous,” Robert Dolan, a testing expert at the nonprofit Center for Applied Special Technology in Wakefield, Mass., says of the department’s stance.
Adaptive tests deliver harder or easier items, depending on how well the individual test-taker is doing. They are considered out-of-level because the difficulty range could include skills and content offered in higher and lower grades.
Dolan and other test experts concede states may have reason to say no to computer adaptive testing, because of cost, uneven technology levels in schools, and even educators’ unfamiliarity with the method—but not because of grade-level testing.
“The span of [test item] difficulty from easiest to hardest is entirely under the control of the test developer,” says Tim Davey, the senior research director of the Educational Testing Service, based in Princeton, N.J.
Some experts say adaptive tests give schools a better return on the time and money devoted to testing—including more accurate measurement of the proficiency of students who are above and below average, and speedier access to the test results.
But Education Department officials say their hands are tied. “The regulations are very clear in saying all students have to be held to the same standard as the foundation for school accountability,” says Sue Rigney, an education specialist in the department. “The focus here is very explicitly on the grade level the state has defined.”
Federal officials worry that out-of-level testing might lead to lower expectations for below-average students.
They also note that states are free to use computer-adaptive tests outside the accountability purposes of the No Child Left Behind law, which requires yearly assessments in reading and mathematics of students in grades 3-8.
But the upshot, for now, is that computer adaptive tests are left out of the federal law, along with the public attention and federal money for test development that come with it. And the developers of adaptive tests feel they are missing out on what may be the greatest precollegiate testing boom in history.
’Made Us a Pariah’
“[The Education Department’s] decision made us a pariah,” says Allan L. Olson, the president of the Northwest Evaluation Association, a nonprofit testing organization in Portland, Ore. The group was developing a computer adaptive test for Idaho’s assessment when the department ruled its method out just over a year ago.
Federal officials gave the same message to South Dakota and Oregon. South Dakota subsequently made voluntary its once-required computer adaptive test, and has adopted a conventional paper-and-pencil test for its statewide program. Oregon has postponed for a year the addition of a computer adaptive feature to its online test.
“I think the [department’s] interpretation in the case of South Dakota was based on a sort of misunderstanding of what adaptive testing does,” says Davey of the ETS. He says computer adaptive tests typically span more than a single grade level—a diagnostic benefit—but they don’t have to, and in any case, grade-level information is recorded for each test item.
Researchers express puzzlement because the federal government has been deeply involved in the development of computer adaptive testing, starting with seminal research at the U.S. Office of Naval Research in the 1970s and 1980s. A decade later, Education Department grants paid for new computer adaptive reading tests in foreign languages, and department officials lauded the method’s potential for school-based testing.
David J. Weiss, one of the original leaders of the Navy research, says there is “no reason” why computer adaptive testing is not appropriate for K-12.
Now the director of the psychometric-methods program at the University of Minnesota, Twin Cities, Weiss notes that a study of children who took such tests in Oregon for several years produced “beautiful data” on improvements in math and reading.
Federal officials say they would consider the use of a computer adaptive test if it tested within the grade level.
But other test experts say the federal government is right to be wary of computer adaptive testing.
“The technology is not ready for prime time,” contends Robert A. Schaeffer, the public education director for the National Center for Fair & Open Testing, or FairTest, a Cambridge, Mass.-based advocacy group that opposed the No Child Left Behind Act because of its testing mandates.
He says the computer adaptive version of the Graduate Record Examination launched at ETS testing centers in 1994 was initially flawed because it had a pool of test items that was too small, and there were insufficient facilities for the number of test-takers.
ETS spokesman Tom Ewing acknowledges those problems occurred but says they were quickly resolved through enlarging the pool of questions and improving test scheduling.
But Schaeffer warns that schools could face a rougher transition, considering their budget limitations and the high stakes involved in testing.
W. James Popham, a professor emeritus and educational testing authority at the University of California, Los Angeles, says the theoretical accuracy of computer adaptive testing does not necessarily translate into reality: “Even though [such testing] makes measurement types quite merry, they can play games with numbers and it doesn’t help kids.”
Popham, a former president of the American Educational Research Association, contends that the testing technology is “opaque” to the public and policymakers.
He says federal officials may believe the testing method could introduce loopholes into the education law.
“They fear educational con artists who have historically frustrated congressional attempts to safeguard disadvantaged youngsters,” Popham says, referring to educators who wish to avoid accountability. “The fear is, they’ll pull a fast one and downgrade expectations.”
Zeroing In on Skills
But proponents of adaptive, computer-based testing fear that schools may wait decades for access to a major improvement over conventional, “linear” standardized tests, which present each student with the same set of test items.
The logic of the new tests is that of a coach pitching to a young batter: If the youngster is missing, the coach eases up a little; if not, he increases the challenge. Sooner or later, the coach zeroes in on the batter’s skill level.
Some testing experts argue that the adjustment improves test accuracy.
“In paper-and-pencil tests, items tend to be grouped around average kids. Those in the tails of the distribution—we don’t get as much information about those kids,” says Michael L. Nering, the senior psychometrician at Measured Progress, a testing company in Dover, N.H.
“The great thing about adaptive testing is that it has equal precision,” meaning the results are accurate at all proficiency levels, says Nering, who helped design two state assessments and developed computer adaptive tests for ACT Inc. “No matter what your ability is, whether you’re really smart or not, the test will stop administering items when equal precision is reached.”
By contrast, most of the items on conventional tests—on paper or computer—are aimed at the “average” student in the target population.
“If I’m a very low-performing student, there may be only two or three items on the [conventional] test that are appropriate to my level of performance,” Davey of the ETS says, adding that the same is true for high-performing students.
Inside the IRT
Computer adaptive tests often use the same types of questions as conventional tests, though with adjustments for display on a screen. Other features are distinctive; for one, the order of items is irreversible: students are not allowed to recheck or change answers.
This one-way street is necessary because of the process that takes place after each answer: A central computer recalculates the test-taker’s ability level, then selects the next item, based on the individual’s success to that point.
As the student completes more items, the computer tracks the statistical accuracy of the score until a set accuracy level is reached. Then the test moves to another skill or content area. Reaching that level may require fewer items if the student answers with consistent proficiency—or many more items, if the student answers inconsistently.
“Adaptive testing doesn’t waste the examinees’ time by asking questions that we’re already pretty sure we know how the student is going to answer,” says Davey.
To make the crucial decisions about which items to present, the test is outfitted with an “item response theory” model—essentially its brains and the part of the system that some critics consider opaque.
The IRT model governs the interaction between the test-taker and the test items. It weighs the student’s record of right and wrong answers against several known characteristics of the test items—such as difficulty, the ability to discriminate between higher- and lower-ability students, the degree to which guessing may succeed, and coverage of academic content.
By solving the complex algorithm written into the IRT model, the computer determines which test item should be presented to the student next.
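For readers curious what that selection logic looks like in practice, here is a minimal sketch in Python. It uses the common three-parameter-logistic (3PL) IRT model, which includes the item characteristics the article names: difficulty, discrimination, and a guessing floor. The item bank, parameter values, and function names are all invented for illustration; no vendor's actual engine works exactly this way.

```python
import math

# A tiny hypothetical item bank. Each item carries the three 3PL parameters:
# discrimination (a), difficulty (b), and a guessing floor (c).
ITEMS = [
    {"id": "easy",   "a": 1.0, "b": -1.5, "c": 0.2},
    {"id": "medium", "a": 1.2, "b":  0.0, "c": 0.2},
    {"id": "hard",   "a": 0.9, "b":  1.5, "c": 0.2},
]

def p_correct(theta, item):
    """3PL probability that a student of ability theta answers the item correctly."""
    a, b, c = item["a"], item["b"], item["c"]
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

def item_information(theta, item):
    """Fisher information: how much this item would sharpen the ability estimate."""
    p = p_correct(theta, item)
    a, c = item["a"], item["c"]
    return (a ** 2) * ((1 - p) / p) * ((p - c) / (1 - c)) ** 2

def next_item(theta, administered_ids):
    """Pick the unseen item that is most informative at the current estimate."""
    candidates = [it for it in ITEMS if it["id"] not in administered_ids]
    return max(candidates, key=lambda it: item_information(theta, it))

def standard_error(theta, administered_items):
    """Precision of the running score; testing stops once this falls below a target."""
    total = sum(item_information(theta, it) for it in administered_items)
    return 1 / math.sqrt(total) if total > 0 else float("inf")
```

A student estimated near average ability (theta around 0) is served the medium item first; a struggling student (theta around -2) gets the easy one. After each response, the ability estimate is recomputed, `next_item` is called again, and the loop ends when `standard_error` drops below a preset threshold—the "set accuracy level" described above.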
Test developers concede that IRT models are unfathomable to lay people and even challenge the intellects of experts unfamiliar with a given test.
Schaeffer of FairTest calls the IRT model the “pig in a poke” that makes computer adaptive testing hard for policymakers to accept.
“Who knows what the algorithm is for test delivery?” he asks. “You have to accept the test manufacturer’s claims about whether the test is equivalent for each student.”
Scott Elliot, the chief executive officer of Vantage Learning, a major maker of computer-based tests located in Yardley, Pa., says, “There are many technical nuances under the IRT; some differences [between IRTs] are sort of like religion.”
Davey of the ETS agrees that the IRT resists attempts to explain it, but adds that the apparent simplicity of conventional testing is “based largely on oversimplification of how paper testing typically is done.”
In fact, he says, virtually identical IRT models are used with some conventional state tests to ensure that the same score in different years represents approximately the same proficiency level on the test—a vital issue for accountability.
Breaking With the Past
Because of technology hurdles and spotty acceptance of computer adaptive testing, experts generally predict that the field will struggle for the next five or 10 years, but that schools will eventually turn to the approach.
Davey believes educators will be persuaded by the greater amount of diagnostic information the tests produce from fewer school days spent testing.
That’s not to overlook other formidable problems that computer-based testing poses for schools—notably, the difficulty of providing technology that is reliable and consistent for all students, so the playing field is kept level. The tests must be delivered over a robust infrastructure to avoid processing and communications delays that would leave students waiting for their next test items.
Computer adaptive tests also require larger banks of test items than conventional tests do. Yet the adaptive method gives items a longer useful life because it’s harder for test-takers to predict which items they will encounter.
Finally, adaptive tests are subject to some of the same well-documented problems as other standardized tests, such as cultural biases, says FairTest’s Schaeffer. “Automating test items that are used inappropriately in many ways makes matters worse—you add technical problems and dissemination-of-information problems,” he says.
Referring to the ETS adaptive Graduate Record Examination, he adds, “The GRE, in spite of all the hoopla, is the same lame questions put out using a hidden algorithm, instead of linearly on a sheet of paper.”
Ewing of the ETS counters that its test items are “what the graduate deans have said are the math and verbal skills that they want students to be able to handle.”
Meanwhile, researchers are working on new kinds of adaptations that could be applied in computer adaptive tests—including presenting items using multimedia or computer simulations and catering to an individual’s preferred learning style. Already, some tests present items in different languages.
Those changes highlight another potential pitfall. Today, policymakers insist on having new tests demonstrate “comparability” with old tests, a task that Davey says becomes vastly more difficult as testing methods change.
Benefiting from many promising innovations will require letting go of comparability, Davey maintains.
“It’s like when we moved away from strictly essays and constructed-response items 100 years ago and introduced multiple-choice items,” he says. “For tests based on simulations, there’s no paper-and-pencil equivalent anymore. You have to make a clean break with the past.”
Vol. 22, Issue 35, Pages 17-18, 20-21. Published in print: May 8, 2003, as “A Question of Direction.”