Assessment Commentary

In Testing, How Reliable Are Year-to-Year Comparisons?

By Alan Tucker, August 11, 2004
Surprising answers to a question that gets too little attention.

A central tenet of the federal No Child Left Behind Act is that educational improvement at a school can be measured by comparing student scores on standards-based tests from one year to the next. An important question about such a strategy, one that has gotten surprisingly little attention, is this: How accurate are such year-to-year comparisons? The answer is that they are much less accurate than people assume—and in some cases, wildly inaccurate.

Psychometric methods are used to equate proficiency cut-scores from year to year, so that a consistent level of knowledge is required over time, irrespective of a particular year’s test. The margins of error in these equated proficiency cut-scores are impossible to compute, and equating calculations can easily be off by a point or two, and sometimes much more.

Suppose a school is expected to show an annual 10 percent improvement in the proficiency rate (the percentage of students scoring at or above the proficiency cut-score) on an 8th grade math test; say, the proficiency rate of 40 percent last year must rise to at least 44 percent this year. Unfortunately, the proficiency rate rises from 40 percent to only 42 percent, which represents just a 5 percent improvement, a failing performance. Suppose the proficiency cut-score was calculated to be 28 out of 50 on last year’s test and 30 out of 50 on this year’s test. If the proficiency cut-score this year had been set a point lower at 29, the school would have exceeded the 44 percent proficiency target level, because a 1-point drop in the proficiency cut-score on a 50-point test would increase the percentage of proficient students from 42 percent to 45 percent (or higher).
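The arithmetic in this example can be checked with a short sketch. The 100-student cohort below is invented purely for illustration (42 students at or above 30 points, 3 students at exactly 29, the rest lower), and the function names are hypothetical. Only the cut-scores, the 40 percent baseline, and the 10 percent relative target come from the example above.

```python
from fractions import Fraction


def proficiency_rate(scores, cut_score):
    """Share of students scoring at or above the cut-score, as an exact fraction."""
    return Fraction(sum(s >= cut_score for s in scores), len(scores))


def meets_target(last_rate, this_rate, required_gain=Fraction(1, 10)):
    """True if this year's rate is at least last year's rate plus the relative gain."""
    return this_rate >= last_rate * (1 + required_gain)


# Invented cohort on a 50-point test (an assumption, not real data).
scores = [35] * 42 + [29] * 3 + [20] * 55

last_rate = Fraction(40, 100)              # last year's proficiency rate
rate_cut_30 = proficiency_rate(scores, 30)  # cut-score as set this year
rate_cut_29 = proficiency_rate(scores, 29)  # cut-score one point lower

print("cut 30:", float(rate_cut_30), "meets target:", meets_target(last_rate, rate_cut_30))
print("cut 29:", float(rate_cut_29), "meets target:", meets_target(last_rate, rate_cut_29))
```

With this invented cohort, a one-point change in the cut-score moves only three students across the line, yet it flips the school from failing to passing. Exact fractions are used so that the comparison against the 44 percent target is not affected by floating-point rounding.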

At a recent conference at the Mathematical Sciences Research Institute in Berkeley, Calif., I presented an analysis of a state test with a huge error in the proficiency cut-score. Flawed psychometric equating over the past four years on the New York Math A graduation test set the proficiency cut-score about 20 points too high, at 51 out of 85, instead of about 30 out of 85. If the No Child Left Behind law were tracking Math A proficiency rates (graduation tests are not yet mandatory), most New York high schools would probably have been labeled as “needing improvement” on the Math A test. Its high cut-score led to a huge failure rate on the June 2003 Math A test, which in turn led New York to rework all its state math tests. However, many other state tests likely have less drastic problems with their test-equating calculations that could lead some schools to be unfairly labeled as needing improvement under the No Child Left Behind Act.

Most standards-based tests are based on a technical psychometric methodology called item-response theory, or IRT. Item-response theory makes many critical assumptions, both practical (getting all the technical details of test development right) and theoretical (its one-dimensional model for assessing student knowledge). Few states have the resources to implement IRT-based tests with the attention to detail they require. Such tests should be reliable for assessing standard procedural skills, such as solving a quadratic equation. Unfortunately, the more thoughtful, and thus unpredictable, a test, the more likely it is that equating methods will misperform.

A big problem with New York’s Math A tests over time was that teachers’ instruction evolved as students’ skills improved. The Math A equating calculations were missing year-to-year improvements, because they used a dated set of “anchor” questions assessing skills that were no longer emphasized.

Item-response theory assumes that a single “ability value” can be assigned to each student, and that this value accurately predicts, within small bounds, how that student will perform on a future question. Coaching is known to undermine this assumption. On the New York state math test, students’ performance on a question appeared, not surprisingly, to be a function of whether they were drilled on that type of question, as much as of their general mathematical ability.
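To make the single-ability assumption concrete, here is a minimal sketch of the simplest one-dimensional IRT form, the one-parameter logistic (Rasch) model. The article does not say which IRT model New York used, so the choice of model, the function name, and all numeric values here are assumptions for illustration only.

```python
import math


def p_correct(ability, difficulty):
    """Rasch-model probability that a student answers an item correctly.

    Both arguments are on the same logit scale; the probability depends
    only on the difference between them.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# One hypothetical student, three hypothetical items of rising difficulty.
ability = 0.5
for difficulty in (-1.0, 0.5, 2.0):
    print(f"difficulty {difficulty:+.1f}: P(correct) = {p_correct(ability, difficulty):.3f}")
```

In this model a student has one ability value for every item. If drilling raises performance on some question types but not others, the student in effect has a different ability for each type, which a one-dimensional model of this kind cannot represent.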

The use of IRT-based tests for high-stakes, year-to-year comparisons has been controversial in the educational testing community. The New York Math A test crisis resulted in the first well-documented analysis of what can go wrong with year-to-year equating on such tests—and how badly it can go wrong. (For a nontechnical analysis, see the New York State Regents Math A Panel report; for a more technical analysis, see http://www.ams.sunysb.edu/~tucker/MathA.html.)

Tests have a role to play in efforts to improve our schools. But great care is needed in annual comparisons of test performance. Often, the margin of error in setting the proficiency rate is too large to permit any meaningful assessment of how much year-to-year improvement has occurred.

Alan Tucker is a distinguished teaching professor in the department of applied mathematics and statistics at the State University of New York at Stony Brook.

