Assessment Opinion

In Testing, How Reliable Are Year-to-Year Comparisons?

By Alan Tucker, August 11, 2004
Surprising answers to a question that gets too little attention.

A central tenet of the federal No Child Left Behind Act is that educational improvement at a school can be measured by comparing student scores on standards-based tests from one year to the next. An important question about such a strategy, one that has gotten surprisingly little attention, is this: How accurate are such year-to-year comparisons? The answer is that they are much less accurate than people assume—and in some cases, wildly inaccurate.

Psychometric methods are used to equate proficiency cut-scores from year to year, so that a consistent level of knowledge is required over time, irrespective of a particular year’s test. The margins of error in these equated proficiency cut-scores are hard to compute, and equating calculations can easily be off by a point or two, sometimes by much more.

Suppose a school is expected to show an annual 10 percent improvement in its proficiency rate (the percentage of students scoring at or above the proficiency cut-score) on an 8th grade math test. A proficiency rate of 40 percent last year must then rise to at least 44 percent this year. Instead, the rate rises from 40 percent to only 42 percent, which is just a 5 percent improvement and therefore a failing performance. Now suppose the proficiency cut-score was calculated to be 28 out of 50 on last year’s test and 30 out of 50 on this year’s test. Had this year’s cut-score been set one point lower, at 29, the school would have exceeded the 44 percent target, because a 1-point drop in the cut-score on a 50-point test would raise the percentage of proficient students from 42 percent to 45 percent (or higher).
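The arithmetic in this example can be checked with a short Python sketch. The score distribution below is invented purely so that the numbers match the example (42 students at or above 30, three more at exactly 29, out of 100); it is not data from any real test, and the function names are hypothetical.

```python
def proficiency_rate(scores, cut_score):
    """Percentage of students scoring at or above the cut-score."""
    proficient = sum(1 for s in scores if s >= cut_score)
    return 100 * proficient / len(scores)


def relative_improvement(last_rate, this_rate):
    """Year-to-year change as a fraction of last year's rate."""
    return (this_rate - last_rate) / last_rate


# Hypothetical class of 100 students on a 50-point test (assumed data):
# 42 score well above the cut, 3 sit exactly at 29, 55 score below.
scores = [35] * 42 + [29] * 3 + [20] * 55

LAST_YEAR_RATE = 40.0  # percent proficient last year, from the example
TARGET_RATE = 44.0     # 10 percent above last year's rate

for cut in (30, 29):
    rate = proficiency_rate(scores, cut)
    gain = relative_improvement(LAST_YEAR_RATE, rate)
    verdict = "meets" if rate >= TARGET_RATE else "misses"
    print(f"cut-score {cut}: {rate:.0f}% proficient, "
          f"{gain:.1%} improvement, {verdict} target")
```

With these assumed scores, three students out of 100 sitting one point below the cut are enough to turn a failing 5 percent gain into a passing 12.5 percent gain.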

At a recent conference at the Mathematical Sciences Research Institute in Berkeley, Calif., I presented an analysis of a state test with a huge error in the proficiency cut-score. Flawed psychometric equating over the past four years on the New York Math A graduation test set the proficiency cut-score about 20 points too high, at 51 out of 85, instead of about 30 out of 85. If the No Child Left Behind law were tracking Math A proficiency rates (graduation tests are not yet mandatory), most New York high schools would probably have been labeled as “needing improvement” on the Math A test. Its high cut-score led to a huge failure rate on the June 2003 Math A test, which in turn led New York to rework all its state math tests. However, many other state tests likely have less drastic problems with their test-equating calculations that could lead some schools to be unfairly labeled as needing improvement under the No Child Left Behind Act.


Most standards-based tests are built on a technical psychometric methodology called item response theory, or IRT. IRT makes many critical assumptions, both practical (getting all the technical details of test development right) and theoretical (its one-dimensional model for assessing student knowledge). Few states have the resources to implement IRT-based tests with the attention to detail they require. Such tests should be reliable for assessing standard procedural skills, such as solving a quadratic equation. Unfortunately, the more thoughtful, and thus unpredictable, a test is, the more likely it is that equating methods will perform poorly.

A big problem with New York’s Math A tests was that, over time, teachers’ instruction evolved and students’ skills improved along with it. The Math A equating calculations missed these year-to-year improvements because they relied on a dated set of “anchor” questions assessing skills that were no longer emphasized.

Item response theory assumes that a single “ability value” can be assigned to each student, and that this value accurately predicts, within small bounds, how that student will perform on a future question. Coaching is known to undermine this assumption. On the New York state math test, students’ performance on a question appeared, not surprisingly, to depend as much on whether they had been drilled on that type of question as on their general mathematical ability.
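To make the single “ability value” assumption concrete, here is a minimal Python sketch of the simplest IRT form, the one-parameter logistic (Rasch) model. This is an illustration only: the article does not say which IRT model New York used, and the ability and difficulty numbers below are made up.

```python
import math


def p_correct(ability, difficulty):
    """Rasch model: probability of a correct answer depends only on
    the gap between one ability value and one item difficulty."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# Hypothetical student and items (assumed values, not real test data).
student_ability = 0.5
item_difficulties = {"easy": -1.0, "matched": 0.5, "hard": 2.0}

for name, difficulty in item_difficulties.items():
    prob = p_correct(student_ability, difficulty)
    print(f"{name} item: predicted chance of success {prob:.2f}")
```

In this model, a student whose ability equals an item's difficulty has an even chance of answering correctly, and nothing else about the student enters the prediction. There is no term for whether the student was drilled on that type of question, which is why coaching can break the model's predictions.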

The use of IRT-based tests for high-stakes, year-to-year comparisons has been controversial in the educational testing community. The New York Math A test crisis resulted in the first well-documented analysis of what can go wrong with year-to-year equating on such tests—and how badly it can go wrong. (For a nontechnical analysis, see the New York State Regents Math A Panel report; for a more technical analysis, see http://www.ams.sunysb.edu/~tucker/MathA.html.)

Tests have a role to play in efforts to improve our schools. But great care is needed in annual comparisons of test performance. Often, the margin of error in setting the proficiency rate is too large to permit any meaningful assessment of how much year-to-year improvement has occurred.

Alan Tucker is a distinguished teaching professor in the department of applied mathematics and statistics at the State University of New York at Stony Brook.

