Stanford Report Questions Accuracy of Tests
How often will a student who really belongs at the 50th percentile according to national test norms actually score within 5 percentile points of that ranking on a test?
The answer, a Stanford University statistician says in a new report, is only about 30 percent of the time in mathematics and 42 percent in reading.
For More Information
"How Accurate Are the Star National Percentile Rank Scores for Individual Students?--An Interpretative Guide" is available online at www.cse.ucla.edu/CRESST/Reports/drrguide.html.
David R. Rogosa says his calculations shed new light on traditional, technical methods of describing test accuracy.
"Here's a different look at what we're getting from standardized tests expressed in what I hope are common-sense forms," said Mr. Rogosa, an associate professor in Stanford's school of education. "And the question that I'm putting out is 'Are these numbers good enough?'"
His findings add to a growing debate about the use of tests for important decisions such as student promotion or teacher pay, illustrating the point that even the best tests are not perfect.
Last month, CTB/McGraw-Hill, one of the nation's largest test publishers, apologized for a calibration error that skewed percentile rankings for students taking the popular TerraNova test in at least six states. The most serious consequences occurred in New York City, where officials used the rankings to determine which students should attend summer school and which would be held back a grade. ("Error Affects Test Results in Six States," Sept. 29, 1999.)
Tracking Test 'Uncertainty'
"The lessons that this study brings are very important now in days of attention to accountability systems that have high stakes attached," said Robert J. Mislevy, a distinguished research scientist at the Educational Testing Service in Princeton, N.J. "There's a certain amount of uncertainty inherent even in a well-controlled system."
He added that "even if there had been no equating problems in New York City, some of those kids who had to stay for summer school--if they had taken the test on another day--might have passed it."
Traditionally, experts describe test accuracy in terms of reliability coefficients, which are fractions between 0 and 1, with 1 being perfect accuracy.
For his study, Mr. Rogosa focused on the reading and math portions of the Stanford Achievement Test-9th Edition, which is published by Harcourt Educational Measurement of San Antonio. The reading test has a reliability coefficient of between .94 and .96 for grades 2 through 11.
That indicates a very high probability that the score on the test reflects a student's actual standing. But it is not so high that every student's achievement level will be identified correctly, especially on a test given to millions of students.
The math test's reliability coefficient is .94 or .95 for grades 2 through 8. It drops as low as .87, however, for higher grades.
"When people see a number like .95, they say that's got to be awfully good," Mr. Rogosa observed. "We're better off knowing exactly what we're getting for our money."
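To see how a reliability coefficient near .95 can still leave a true 50th-percentile student outside a 5-point band more often than not, here is a minimal sketch under a textbook model. It assumes normally distributed scores and the classical relation in which error variance equals (1 - reliability) times the observed-score variance. The function name and the model are illustrative assumptions, not necessarily the method used in Mr. Rogosa's report.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # observed scores rescaled to mean 0, SD 1


def chance_within_band(reliability, band_points=5.0):
    """Chance that a true 50th-percentile student scores within
    `band_points` percentile points of the 50th percentile.

    Illustrative model only: normal scores, normal measurement error.
    """
    # Classical test theory: error variance = (1 - reliability) * observed variance.
    error_sd = (1.0 - reliability) ** 0.5
    # Observed score at the upper edge of the percentile band.
    upper_edge = STANDARD_NORMAL.inv_cdf(0.5 + band_points / 100.0)
    # One sitting of the test: centered on the median (0) with SD error_sd.
    one_sitting = NormalDist(mu=0.0, sigma=error_sd)
    return one_sitting.cdf(upper_edge) - one_sitting.cdf(-upper_edge)


for rho in (0.87, 0.94, 0.95, 0.96):
    print(f"reliability {rho:.2f}: {chance_within_band(rho):.0%} within 5 points")
```

Under these assumptions, a reliability of .95 gives a chance a little above 40 percent, in line with the reading figure cited in the report, and the chance falls as reliability drops.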
Longer Tests Needed?
Mr. Rogosa said his findings might also apply to other commonly used tests, most of which have similar reliability coefficients. He chose the reading and math portions of the Stanford-9, he said, because they are longer and, thus, more reliable than test sections covering other subjects, such as social studies or science.
A spokesman for Harcourt did not return repeated phone calls last week.
Mr. Rogosa calculated the standard errors for the tests and then translated the numbers into scenarios intended to make them easier to understand.
What are the chances, he asks in one such example, that two students with identical "real achievement"--a hypothetical gold standard for the tests--will score more than 10 percentile points apart on the same test? For two 9th graders who are really at the 45th percentile, the answer is 57 percent. In 4th-grade reading, the probability is 42 percent.
Such a wide range in scores does not, however, argue for discarding such tests, Mr. Rogosa writes in his report. Longer tests might enhance accuracy, "especially if readers interpret these results to indicate that current tests do not have adequate accuracy."
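The scenario of two students with identical real achievement can be approximated by simulation. The sketch below is a rough illustration, not the report's own calculation. It assumes normal scores and errors, places the shared true score at the 45th percentile of the observed-score scale, and uses hypothetical reliability values of .95 and .90. The function name, trial count, and seed are likewise illustrative choices.

```python
import random
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # observed scores rescaled to mean 0, SD 1


def chance_scores_differ(reliability, true_percentile=45.0, gap_points=10.0,
                         trials=50_000, seed=0):
    """Estimate the chance that two students with the same true score
    land more than `gap_points` percentile points apart."""
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    error_sd = (1.0 - reliability) ** 0.5
    # Assumption: the shared true score sits at this percentile of the observed scale.
    true_score = STANDARD_NORMAL.inv_cdf(true_percentile / 100.0)
    far_apart = 0
    for _ in range(trials):
        # Each student's observed score is the true score plus independent error.
        rank_a = 100.0 * STANDARD_NORMAL.cdf(rng.gauss(true_score, error_sd))
        rank_b = 100.0 * STANDARD_NORMAL.cdf(rng.gauss(true_score, error_sd))
        if abs(rank_a - rank_b) > gap_points:
            far_apart += 1
    return far_apart / trials


for rho in (0.95, 0.90):
    print(f"reliability {rho:.2f}: {chance_scores_differ(rho):.0%} more than 10 points apart")
```

With these assumed inputs the estimates should land near the probabilities quoted above, and the lower reliability gives the larger chance of a 10-point gap.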
Vol. 19, Issue 6, Page 3