Study Questions Reliability Of Single-Year Test-Score Gains

By Lynn Olson — May 23, 2001 5 min read

More than half the states reward or punish schools based largely on test scores. But a new analysis suggests the methods used to identify good and bad schools are far less reliable than state policymakers may think.

The study, which will be published next year, found that between 50 percent and 80 percent of the improvement in a school’s average test scores from one year to the next was temporary and was caused by fluctuations that had nothing to do with long-term changes in learning or productivity.

“This is a paper that’s well worth going through and understanding,” said David W. Grissmer, a senior management scientist in Washington for the RAND Corp., a Santa Monica, Calif.-based research organization. “The question is, are we picking out lucky schools or good schools, and unlucky schools or bad schools? The answer is, we’re picking out lucky and unlucky schools.”

The paper, written by economists Thomas J. Kane and Douglas O. Staiger, was presented here last week at the annual conference of the Brown Center on Education Policy, a division of the Brookings Institution, a Washington think tank.

The study is based on math- and reading-test scores for nearly 300,000 students in grades 3-5 in North Carolina each year between 1992-93 and 1998-99. The researchers also analyzed school and grade-level data from the index used to rate California’s schools from 1998 through 2000.

But many of the findings apply to test-based accountability systems in other states as well, said Mr. Kane, a fellow at the Hoover Institution at Stanford University and a professor of policy studies at the University of California, Los Angeles. “Unfortunately,” he said, “most of these systems have been set up with very little recognition of the strengths and weaknesses of the measures that they’re based on.”

‘A Bigger Haystack’

Even small fluctuations in a school’s scores can have a large impact on a school’s ranking, Mr. Kane said, simply because schools’ overall test scores don’t differ that much in the first place. Schools differ even less in the rate at which their test scores change or improve over time.

“It’s just harder to discern,” he said. “We’re looking for a smaller needle in a bigger haystack.”

The study examined the amount of built-in volatility, or “noise,” from two sources that could cause a school’s test scores to swing from one year to the next.

The first is the bouncing-around in test scores that occurs based on the particular population of students in a grade in any given year. With the average elementary school enrolling only 68 students per grade level, the variation in test scores stemming from who happens to be in 4th grade in a school in any particular year can be large. The problem is particularly severe for small schools, where the presence of just a few students who score particularly high or low on state tests can skew the average.

Based on their analysis of reading and mathematics scores for 4th graders in North Carolina, the researchers found that between 14 percent and 15 percent of the variation in test scores among schools of median size could be attributed to the particular sample of students tested in a given year.
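The size effect described here follows from a standard statistical fact: the average of n independent scores has a standard deviation of the individual spread divided by the square root of n. The sketch below is a minimal illustration, not the researchers' calculation. It assumes student scores are standardized (standard deviation of 1.0) and uses a hypothetical between-school standard deviation of 0.3 that is not taken from the study; only the figure of 68 students per grade comes from the article.

```python
import math


def standard_error_of_mean(student_sd: float, n_students: int) -> float:
    """Spread in a grade's average score caused only by which students are enrolled."""
    return student_sd / math.sqrt(n_students)


def sampling_share(student_sd: float, n_students: int, between_school_sd: float) -> float:
    """Fraction of the variance in observed school averages due to sampling variation."""
    noise_var = standard_error_of_mean(student_sd, n_students) ** 2
    signal_var = between_school_sd ** 2
    return noise_var / (noise_var + signal_var)


# Assumed inputs: student_sd=1.0 and between_school_sd=0.3 are illustrative only.
for n in (30, 68, 150):
    share = sampling_share(student_sd=1.0, n_students=n, between_school_sd=0.3)
    print(f"{n:>3} students per grade: sampling share = {share:.1%}")
```

Under these assumptions, the share of variation caused by sampling shrinks as the grade gets larger, which is why small schools are the most exposed.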

The second source of volatility in test scores is what the researchers call “other nonpersistent variance,” such as a dog barking in the parking lot on test day or the presence of a particularly disruptive student.

The researchers calculated the amount of volatility in test scores that could be attributed to such one-time events. Then they combined that figure with the volatility credited to changes in the population of students to see how much of the variation in test scores among schools was attributable to such random noise, rather than to real differences in productivity. The researchers also divided the approximately 1,000 North Carolina elementary schools in the analysis into quintiles, based on size.

They found that the share of “nonpersistent” variation was greatest when schools were judged on annual gains or changes in test scores. When they looked at the gains North Carolina students made in combined math and reading scores between 3rd and 4th grade, they found that roughly half the variation among medium-sized schools, and 57 percent of the variation among the smallest schools, stemmed from such built-in volatility in test scores.

When they looked at the change in combined reading and math scores from one year’s 4th graders to the next year’s, more than 70 percent of the variation among schools of any size was attributable to such nonpersistent factors.
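One way to see why year-to-year changes are noisier than score levels: when one year's average is subtracted from the next, the persistent differences between schools largely cancel, while the independent noise from each year adds up. The sketch below illustrates that arithmetic only. The function names and all variance values are hypothetical, chosen for illustration, and it assumes the noise in the two years is independent and of equal size.

```python
def noise_share_of_level(persistent_var: float, noise_var: float) -> float:
    """Share of variance in one year's school averages that is temporary noise."""
    return noise_var / (persistent_var + noise_var)


def noise_share_of_change(true_change_var: float, noise_var: float) -> float:
    """Share of variance in year-to-year changes that is temporary noise.

    Independent noise in two years adds: Var(e2 - e1) = 2 * noise_var.
    """
    change_noise_var = 2.0 * noise_var
    return change_noise_var / (true_change_var + change_noise_var)


# Hypothetical variances (not from the study): schools differ a lot in level,
# but only a little in how much they truly improve from year to year.
level_share = noise_share_of_level(persistent_var=0.09, noise_var=0.02)
change_share = noise_share_of_change(true_change_var=0.02, noise_var=0.02)
print(f"noise share of levels:  {level_share:.0%}")
print(f"noise share of changes: {change_share:.0%}")
```

With these assumed inputs, noise makes up a far larger share of the changes than of the levels, the same pattern the researchers report.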

“In other words,” the researchers write, “if one were to look for signs of improvement by closely tracking changes in mean [test] scores from one year to the next, 50 to 80 percent of what one observed would be temporary, either due to sampling variation or some other nonpersistent cause.”

Avoidance Advice

The researchers emphasize three policies that accountability systems should avoid if they don’t want to misidentify schools:

  • Incentives for schools with test scores at either extreme (rewards for those with very high scores or penalties for those with very low scores) are likely to affect small schools the most and to provide very weak inducements for large schools.

The use of extremes is a particular problem when focusing on changes or gains in performance from year to year, the researchers note, because both measures are so volatile. One solution, they suggest, would be for states to set performance thresholds that lie more in the middle of the test-score distribution. Another would be to separate schools by size category, just as in high school sports, and set different thresholds for each category.

  • Incentive systems that establish separate thresholds for each racial or ethnic subgroup put integrated schools at a disadvantage and could actually encourage districts to segregate students.

The accountability systems in a number of states, including Texas and California, set separate performance or growth expectations for each racial or ethnic subgroup. Congress also is considering requiring separate performance targets for each subgroup under the federal Title I program for disadvantaged students.

  • It isn’t sound policy to identify “best practice” or “fastest improving” schools based on a single year’s change in test scores. Given the huge variability in test scores from one year to the next, the researchers suggest it would make more sense to pool information across years and across schools to identify schools worth emulating.
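The case for pooling across years can be illustrated with a small simulation. This is a sketch under stated assumptions, not the researchers' method: each simulated school has a fixed "true" quality, each year's score adds independent noise, and the question is how closely a single year's score tracks true quality compared with a three-year average. The number of schools, the standard deviations, and the seed are all hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate(n_schools=1000, n_years=3, quality_sd=0.3, noise_sd=0.3, seed=0):
    """Return (single-year, pooled) correlations with true school quality."""
    rng = random.Random(seed)
    true_quality = [rng.gauss(0.0, quality_sd) for _ in range(n_schools)]
    # Each observed score = persistent quality + independent one-year noise.
    yearly = [
        [q + rng.gauss(0.0, noise_sd) for _ in range(n_years)]
        for q in true_quality
    ]
    single_year = [scores[0] for scores in yearly]
    pooled = [sum(scores) / n_years for scores in yearly]
    return pearson(true_quality, single_year), pearson(true_quality, pooled)


single_corr, pooled_corr = simulate()
print(f"single year vs. true quality: {single_corr:.2f}")
print(f"3-year average vs. true quality: {pooled_corr:.2f}")
```

Averaging divides the noise variance by the number of years pooled, so under these assumptions the pooled score is the more reliable guide to which schools are worth emulating.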

Helen F. Ladd, a professor of public-policy studies and economics at Duke University in Durham, complimented the paper as making a “superb” methodological contribution. But she argued that, despite some of the issues the researchers identified for North Carolina and other states, North Carolina has “a very powerful accountability system.”

The main part of the system, she said, focuses on whether schools have met their expected growth targets—based on the past performance of the state as a whole and adjusted for the previous performance of students in the school—and not on rewarding scores at the extremes of the continuum.

A version of this article appeared in the May 23, 2001 edition of Education Week as Study Questions Reliability Of Single-Year Test-Score Gains
