Assessment Opinion

In Testing, How Reliable Are Year-to-Year Comparisons?

By Alan Tucker — August 11, 2004
Surprising answers to a question that gets too little attention.

A central tenet of the federal No Child Left Behind Act is that educational improvement at a school can be measured by comparing student scores on standards-based tests from one year to the next. An important question about such a strategy, one that has gotten surprisingly little attention, is this: How accurate are such year-to-year comparisons? The answer is that they are much less accurate than people assume—and in some cases, wildly inaccurate.

Psychometric methods are used to equate proficiency cut-scores from year to year, so that a consistent level of knowledge is required over time, irrespective of a particular year’s test. The margins of error in these equated cut-scores are impossible to compute, and the equating calculations can easily be off by a point or two, sometimes much more.
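One common equating technique, linear "mean-sigma" equating on a shared set of anchor questions, can be sketched as follows. The anchor statistics below are invented for illustration (real testing programs use more elaborate IRT-based equating), but the sketch shows how small shifts in anchor performance move the equated cut-score:

```python
# Sketch of linear (mean-sigma) equating: last year's cut-score is mapped
# onto this year's score scale using students' performance on a shared set
# of anchor questions. All numbers below are hypothetical.
import statistics

def mean_sigma_equate(cut_old, anchor_old, anchor_new):
    """Carry a cut-score from the old test form to the new form's scale."""
    m_old, s_old = statistics.mean(anchor_old), statistics.pstdev(anchor_old)
    m_new, s_new = statistics.mean(anchor_new), statistics.pstdev(anchor_new)
    return m_new + (s_new / s_old) * (cut_old - m_old)

anchor_old = [20, 24, 28, 32, 36]   # hypothetical anchor scores, year 1
anchor_a   = [21, 25, 29, 33, 37]   # year 2, one plausible cohort
anchor_b   = [22, 26, 30, 34, 38]   # year 2, a slightly stronger cohort

print(mean_sigma_equate(28, anchor_old, anchor_a))  # 29.0
print(mean_sigma_equate(28, anchor_old, anchor_b))  # 30.0
```

In this sketch, a one-point difference in mean anchor performance moves the equated cut-score by a full point on the new test, which is exactly the size of error that can flip a school between passing and failing.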

Suppose a school is expected to show an annual 10 percent improvement in the proficiency rate (the percentage of students scoring at or above the proficiency cut-score) on an 8th grade math test; say, the proficiency rate of 40 percent last year must rise to at least 44 percent this year. Unfortunately, the proficiency rate rises from 40 percent to only 42 percent, which represents just a 5 percent improvement, a failing performance. Suppose the proficiency cut-score was calculated to be 28 out of 50 on last year’s test and 30 out of 50 on this year’s test. If the proficiency cut-score this year had been set a point lower at 29, the school would have exceeded the 44 percent proficiency target level, because a 1-point drop in the proficiency cut-score on a 50-point test would increase the percentage of proficient students from 42 percent to 45 percent (or higher).
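The arithmetic in this example can be checked with a short calculation. The score distribution below is invented to match the article's numbers; the key feature, common in practice, is that many students score just below the cut, so a one-point change in the cut-score moves the proficiency rate by several percentage points:

```python
# Hypothetical distribution of 200 scores on a 50-point test, constructed
# so the rates match the article's example: 42% proficient at a cut of 30,
# 45% proficient if the cut had been set one point lower at 29.

def proficiency_rate(scores, cut):
    """Percentage of students scoring at or above the cut-score."""
    return 100.0 * sum(s >= cut for s in scores) / len(scores)

scores = [20] * 110 + [29] * 6 + [35] * 84   # 200 students in all

print(proficiency_rate(scores, 30))  # 42.0 -- misses the 44% target
print(proficiency_rate(scores, 29))  # 45.0 -- exceeds it
```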

At a recent conference at the Mathematical Sciences Research Institute in Berkeley, Calif., I presented an analysis of a state test with a huge error in the proficiency cut-score. Flawed psychometric equating over the past four years on the New York Math A graduation test set the proficiency cut-score about 20 points too high, at 51 out of 85, instead of about 30 out of 85. If the No Child Left Behind law were tracking Math A proficiency rates (graduation tests are not yet mandatory), most New York high schools would probably have been labeled as “needing improvement” on the Math A test. Its high cut-score led to a huge failure rate on the June 2003 Math A test, which in turn led New York to rework all its state math tests. However, many other state tests likely have less drastic problems with their test-equating calculations that could lead some schools to be unfairly labeled as needing improvement under the No Child Left Behind Act.


Most standards-based tests are built on a technical psychometric methodology called item response theory, or IRT. IRT makes many critical assumptions, both of a practical nature—in getting all the technical details of test development right—and of a theoretical nature—in its one-dimensional model of student knowledge. Few states have the resources to implement IRT-based tests with the attention to detail they require. Such tests should be reliable for assessing standard procedural skills, such as solving a quadratic equation. Unfortunately, the more thoughtful, and thus unpredictable, a test is, the more likely it is that equating methods will misperform.

A big problem with New York’s Math A tests over time was that teachers’ instruction evolved as students’ skills improved. The Math A equating calculations were missing year-to-year improvements, because they used a dated set of “anchor” questions assessing skills that were no longer emphasized.

Item-response theory assumes that a single “ability value” can be assigned to each student, and that this value accurately predicts, within small bounds, how that student will perform on a future question. Coaching is known to undermine this assumption. On the New York state math test, students’ performance on a question appeared, not surprisingly, to be a function of whether they were drilled on that type of question, as much as of their general mathematical ability.
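The assumption can be made concrete with the simplest IRT variant, the one-parameter (Rasch) model. The ability and difficulty values below are purely illustrative; the point is that drilling on a question type acts like lowering that item's difficulty for the drilled student, which breaks the single-ability assumption:

```python
# Minimal sketch of the one-parameter (Rasch) IRT model: a single ability
# value theta is assumed to predict a student's probability of answering
# any item of difficulty b correctly. Values here are illustrative only.
import math

def p_correct(theta, b):
    """Rasch model: probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Same student (theta = 0), but coaching effectively lowers the item's
# difficulty for them, so one ability value no longer tells the story:
uncoached = p_correct(0.0, 1.0)    # about 0.27
coached   = p_correct(0.0, -0.5)   # about 0.62 on a drilled question type
```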

The use of IRT-based tests for high-stakes, year-to-year comparisons has been controversial in the educational testing community. The New York Math A test crisis resulted in the first well-documented analysis of what can go wrong with year-to-year equating on such tests—and how badly it can go wrong. (For a nontechnical analysis, see the New York State Regents Math A Panel report; for a more technical analysis, see http://www.ams.sunysb.edu/~tucker/MathA.html.)

Tests have a role to play in efforts to improve our schools. But great care is needed in annual comparisons of test performance. Often, the margin of error in setting the proficiency rate is too large to permit any meaningful assessment of how much year-to-year improvement has occurred.

Alan Tucker is a distinguished teaching professor in the department of applied mathematics and statistics at the State University of New York at Stony Brook.
