Accountability

Stanford Report Questions Accuracy of Tests

By Debra Viadero — October 06, 1999 3 min read

How often will a student who really belongs at the 50th percentile according to national test norms actually score within 5 percentile points of that ranking on a test?

The answer, a Stanford University statistician says in a new report, is only about 30 percent of the time in mathematics and 42 percent in reading.

For More Information

“How Accurate Are the Star National Percentile Rank Scores for Individual Students?--An Interpretative Guide” is available online at www.cse.ucla.edu/CRESST/Reports/drrguide.html.

David R. Rogosa says his calculations shed new light on traditional, technical methods of describing test accuracy.

“Here’s a different look at what we’re getting from standardized tests expressed in what I hope are common-sense forms,” said Mr. Rogosa, an associate professor in Stanford’s school of education. “And the question that I’m putting out is ‘Are these numbers good enough?’”

His findings add to a growing debate about the use of tests for important decisions such as student promotion or teacher pay, illustrating the point that even the best tests are not perfect.

Last month, CTB/McGraw-Hill, one of the nation’s largest test publishers, apologized for a calibration error that skewed percentile rankings for students taking the popular TerraNova test in at least six states. The most serious consequences occurred in New York City, where officials used the rankings to determine which students should attend summer school, and which were held back a grade. (“Error Affects Test Results in Six States,” Sept. 29, 1999.)

Tracking Test ‘Uncertainty’


“The lessons that this study brings are very important now in days of attention to accountability systems that have high stakes attached,” said Robert J. Mislevy, a distinguished research scientist at the Educational Testing Service in Princeton, N.J. “There’s a certain amount of uncertainty inherent even in a well-controlled system.”

He added that, “even if there had been no equating problems in New York City, some of those kids who had to stay for summer school--if they had taken the test on another day--might have passed it.”

Traditionally, experts describe test accuracy in terms of reliability coefficients, which are fractions between 0 and 1, with 1 being perfect accuracy.

For his study, Mr. Rogosa focused on the reading and math portions of the Stanford Achievement Test-9th Edition, which is published by Harcourt Educational Measurement of San Antonio. The reading test has a reliability coefficient of between .94 and .96 for grades 2 through 11.

That indicates a very high probability that the score on the test reflects a student’s actual standing. But it is not so high that every student’s achievement level will be identified correctly, especially on a test given to millions of students.

The math test’s reliability coefficient is .94 or .95 for grades 2 through 8. It drops as low as .87, however, for higher grades.
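A back-of-the-envelope calculation shows how a coefficient near .95 can still leave sizable uncertainty in an individual percentile rank. The sketch below is not Mr. Rogosa's method, which the article does not describe. It assumes a classical test theory model: observed scores in the national norm group are normally distributed with variance 1, a student's observed score is the true score plus normal error, and the error variance equals 1 minus the reliability coefficient. It also assumes a student who "belongs" at a given percentile has a true score equal to the norm-group score at that percentile. The function name `prob_within_band` is hypothetical.

```python
from statistics import NormalDist

# Assumed national norm distribution of observed scores (standard normal).
NORMS = NormalDist()


def prob_within_band(reliability, true_percentile=50.0, band=5.0):
    """Chance an observed percentile rank lands within `band` points of the true one.

    Assumption: error variance = 1 - reliability (classical test theory,
    observed-score variance scaled to 1). This is an illustrative model only.
    """
    if not 0.0 < reliability < 1.0:
        raise ValueError("reliability must be strictly between 0 and 1")
    low_pct = true_percentile - band
    high_pct = true_percentile + band
    if low_pct <= 0.0 or high_pct >= 100.0:
        raise ValueError("band must stay strictly inside 0..100")

    # Convert percentile ranks to scores on the norm-group scale.
    true_score = NORMS.inv_cdf(true_percentile / 100.0)
    low_score = NORMS.inv_cdf(low_pct / 100.0)
    high_score = NORMS.inv_cdf(high_pct / 100.0)

    # Distribution of this one student's observed scores across test days.
    error_sd = (1.0 - reliability) ** 0.5
    observed = NormalDist(mu=true_score, sigma=error_sd)
    return observed.cdf(high_score) - observed.cdf(low_score)


if __name__ == "__main__":
    for rho in (0.95, 0.87):
        print(f"reliability {rho}: {prob_within_band(rho):.2f}")
```

Under these assumptions, a reliability of .95 gives a probability of roughly 0.43 for a student at the 50th percentile, close to the 42 percent the report cites for reading, and a reliability of .87 gives about 0.27.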

“When people see a number like .95, they say that’s got to be awfully good,” Mr. Rogosa observed. “We’re better off knowing exactly what we’re getting for our money.”

Longer Tests Needed?

Mr. Rogosa said his findings might also apply to other commonly used tests, most of which have similar reliability coefficients. He chose the reading and math portions of the Stanford-9, he said, because they are longer and, thus, more reliable than test sections covering other subjects, such as social studies or science.

A spokesman for Harcourt did not return repeated phone calls last week.

Mr. Rogosa calculated the standard errors for the tests and then translated the numbers to scenarios intended to make them easier to understand.

What are the chances, he notes in one such example, that two students with identical “real achievement”--a hypothetical gold standard for the tests--will score more than 10 percentile points apart on the same test? For two 9th graders who are really at the 45th percentile, the answer is 57 percent. In 4th grade reading, the probability is 42 percent.
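The two-student scenario can be approximated in the same spirit. The sketch below is an illustration, not the report's calculation. It assumes normally distributed observed scores with variance 1 in the norm group, independent normal errors for the two students, and an error variance of 1 minus the reliability coefficient. The function name `prob_scores_apart` and the integration settings are hypothetical.

```python
from statistics import NormalDist


def prob_scores_apart(reliability, true_percentile, gap=10.0, steps=4000):
    """Chance two students with the same true score land more than `gap`
    percentile points apart (illustrative model, see assumptions above)."""
    if not 0.0 < reliability < 1.0:
        raise ValueError("reliability must be strictly between 0 and 1")

    norms = NormalDist()  # assumed national distribution of observed scores
    true_score = norms.inv_cdf(true_percentile / 100.0)
    error_sd = (1.0 - reliability) ** 0.5
    # Distribution of either student's observed score.
    observed = NormalDist(mu=true_score, sigma=error_sd)
    gap_frac = gap / 100.0

    # Midpoint-rule integration over the first student's observed score.
    lo = true_score - 8.0 * error_sd
    hi = true_score + 8.0 * error_sd
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x1 = lo + (i + 0.5) * width
        rank1 = norms.cdf(x1)
        # Probability the second student ranks more than `gap` points lower.
        below = 0.0
        if rank1 > gap_frac:
            below = observed.cdf(norms.inv_cdf(rank1 - gap_frac))
        # Probability the second student ranks more than `gap` points higher.
        above = 0.0
        if rank1 + gap_frac < 1.0:
            above = 1.0 - observed.cdf(norms.inv_cdf(rank1 + gap_frac))
        total += observed.pdf(x1) * (below + above) * width
    return total


if __name__ == "__main__":
    print(f"{prob_scores_apart(0.95, 45.0):.2f}")
```

With a reliability of .95 and a true standing at the 45th percentile, this model returns a value a little above 0.4, in the neighborhood of the 42 percent figure cited for 4th grade reading. Lower reliabilities push the probability higher.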

Such a wide range in scores does not, however, argue for discarding such tests, Mr. Rogosa writes in his report. Longer tests might enhance accuracy, “especially if readers interpret these results to indicate that current tests do not have adequate accuracy.”
