Opinion
Assessment Opinion

In Testing, How Reliable Are Year-to-Year Comparisons?

By Alan Tucker — August 11, 2004 4 min read
  • Save to favorites
  • Print
Surprising answers to a question that gets too little attention.

A central tenet of the federal No Child Left Behind Act is that educational improvement at a school can be measured by comparing student scores on standards-based tests from one year to the next. An important question about such a strategy, one that has gotten surprisingly little attention, is this: How accurate are such year-to-year comparisons? The answer is that they are much less accurate than people assume—and in some cases, wildly inaccurate.

Psychometric methods are used to equate proficiency cut-scores from year to year, so that a consistent level of knowledge is required over time, irrespective of a particular year’s test. The margins of error in these equated proficiency cut-scores are impossible to compute, and equating calculations can easily be off by a point or two, and sometimes much more.

Suppose a school is expected to show an annual 10 percent improvement in the proficiency rate (the percentage of students scoring at or above the proficiency cut-score) on an 8th grade math test; say, the proficiency rate of 40 percent last year must rise to at least 44 percent this year. Unfortunately, the proficiency rate rises from 40 percent to only 42 percent, which represents just a 5 percent improvement, a failing performance. Suppose the proficiency cut-score was calculated to be 28 out of 50 on last year’s test and 30 out of 50 on this year’s test. If the proficiency cut-score this year had been set a point lower at 29, the school would have exceeded the 44 percent proficiency target level, because a 1-point drop in the proficiency cut-score on a 50-point test would increase the percentage of proficient students from 42 percent to 45 percent (or higher).

At a recent conference at the Mathematical Sciences Research Institute in Berkeley, Calif., I presented an analysis of a state test with a huge error in the proficiency cut-score. Flawed psychometric equating over the past four years on the New York Math A graduation test set the proficiency cut-score about 20 points too high, at 51 out of 85, instead of about 30 out of 85. If the No Child Left Behind law were tracking Math A proficiency rates (graduation tests are not yet mandatory), most New York high schools would probably have been labeled as “needing improvement” on the Math A test. Its high cut-score led to a huge failure rate on the June 2003 Math A test, which in turn led New York to rework all its state math tests. However, many other state tests likely have less drastic problems with their test-equating calculations that could lead some schools to be unfairly labeled as needing improvement under the No Child Left Behind Act.

Often, the margin of error in setting the proficiency rate is too large to permit any meaningful assessment of improvement.

Most standards-based tests are based on a technical psychometric methodology called Item Response Theory, or IRT. Item-response theory makes many critical assumptions, both of a practical nature—in getting all the technical details of test development right—and of a theoretical nature—in its one-dimensional model for assessing student knowledge. Few states have the resources to implement IRT-based tests with the attention to detail they require. Such tests should be reliable for assessing standard procedural skills, such as solving a quadratic equation. Unfortunately, the more thoughtful, and thus unpredictable, a test, the more likely it is that equating methods will misperform.

A big problem with New York’s Math A tests over time was that teachers’ instruction evolved as students’ skills improved. The Math A equating calculations were missing year-to-year improvements, because they used a dated set of “anchor” questions assessing skills that were no longer emphasized.

Item-response theory assumes that a single “ability value” can be assigned to each student, and that this value accurately predicts, within small bounds, how that student will perform on a future question. Coaching is known to undermine this assumption. On the New York state math test, students’ performance on a question appeared, not surprisingly, to be a function of whether they were drilled on that type of question, as much as of their general mathematical ability.

The use of IRT-based tests for high-stakes, year-to-year comparisons has been controversial in the educational testing community. The New York Math A test crisis resulted in the first well-documented analysis of what can go wrong with year-to-year equating on such tests—and how badly it can go wrong. (For a nontechnical analysis, see the New York State Regents Math A Panel report; for a more technical analysis, see http://www.ams.sunysb.edu/~tucker/MathA.html.)

Tests have a role to play in efforts to improve our schools. But great care is needed in annual comparisons of test performance. Often, the margin of error in setting the proficiency rate is too large to permit any meaningful assessment of how much year-to-year improvement has occurred.

Alan Tucker is a distinguished teaching professor in the department of applied mathematics and statistics at the State University of New York at Stony Brook.

Events

School Climate & Safety K-12 Essentials Forum Strengthen Students’ Connections to School
Join this free event to learn how schools are creating the space for students to form strong bonds with each other and trusted adults.
This content is provided by our sponsor. It is not written by and does not necessarily reflect the views of Education Week's editorial staff.
Sponsor
Reading & Literacy Webinar
Creating Confident Readers: Why Differentiated Instruction is Equitable Instruction
Join us as we break down how differentiated instruction can advance your school’s literacy and equity goals.
Content provided by Lexia Learning
This content is provided by our sponsor. It is not written by and does not necessarily reflect the views of Education Week's editorial staff.
Sponsor
IT Infrastructure & Management Webinar
Future-Proofing Your School's Tech Ecosystem: Strategies for Asset Tracking, Sustainability, and Budget Optimization
Gain actionable insights into effective asset management, budget optimization, and sustainable IT practices.
Content provided by Follett Learning

EdWeek Top School Jobs

Teacher Jobs
Search over ten thousand teaching jobs nationwide — elementary, middle, high school and more.
View Jobs
Principal Jobs
Find hundreds of jobs for principals, assistant principals, and other school leadership roles.
View Jobs
Administrator Jobs
Over a thousand district-level jobs: superintendents, directors, more.
View Jobs
Support Staff Jobs
Search thousands of jobs, from paraprofessionals to counselors and more.
View Jobs

Read Next

Assessment Opinion To Replace Skill Mastery for Seat Time, There Are 3 Requirements
Time for learning and student support take on a whole new meaning in the mastery-based learning model.
4 min read
Image shows a multi-tailed arrow hitting the bullseye of a target.
DigitalVision Vectors/Getty
Assessment More States Could Drop Their High School Exit Exams
There's movement afoot in nearly half the states that still mandate high school exit exams to end the requirement.
4 min read
A student looks at questions during a college test preparation class at Holton Arms School in Bethesda, Md., on Jan. 17, 2016. The SAT exam will move from paper and pencil to a digital format, administrators announced Tuesday, Jan. 25, 2022, saying the shift will boost its relevancy as more colleges make standardized tests optional for admission.
A student looks at questions during a college test preparation class at Holton Arms School in Bethesda, Md., on Jan. 17, 2016. More states are looking to abandon high school exit exams as support for standardized testing cools.
Alex Brandon/AP
Assessment Cardona Says Standardized Tests Haven't Always Met the Mark, Offers New Flexibility
The U.S. Department of Education is seeking to reinvigorate a little-used pilot program to create new types of assessments.
7 min read
Education Secretary Miguel Cardona speaks during an interview with The Associated Press in his office at the Department of Education on Sept. 20, 2023 in Washington.
Education Secretary Miguel Cardona speaks during an interview with The Associated Press in his office at the Department of Education on Sept. 20, 2023 in Washington.
Mark Schiefelbein/AP
Assessment Opinion The 4 Common Myths About Grading Reform, Debunked
Grading reformers and their critics all have the same goal: grades that truly reflect student learning. Here’s how we move forward.
Sarah Ruth Morris & Matt Townsley
5 min read
Venn diagram over a macro shot of A- on white results sheet. Extremely shallow focus. Letter grades are highlighted.
E+/Getty + Vanessa Solis/Education Week