Assessment Opinion

Ensuring Failure

By Walt Haney — July 10, 2002

How effective is the current test-driven accountability movement? To a remarkable extent, the only evidence of success offered by proponents is a rise in scores on the very tests that are being used to mandate change. These results, however, are said to be meaningful as long as the tests in question are good-quality, criterion-referenced exams, the Massachusetts Comprehensive Assessment System exam being a commonly cited example.

My recent research has uncovered two facts about the MCAS that call these claims into question and raise concerns that ought to reverberate across the nation. First, a jump in a school’s average score from one year to the next is unlikely to continue and therefore probably does not signal real improvement. Second, the MCAS is actually designed to produce a certain range of scores—in effect, artificially limiting how well students can do.

If a school reports higher scores this year than last year, that would of course be a cause for celebration, if we had reason to believe that the test was a good measure of the kind of learning regarded as important. But doubts about the value or validity of the exam—or concerns about what had to be sacrificed from the curriculum to boost scores—would raise questions from the outset about whether such a result was indeed good news.

Even putting aside those reservations, however, consider what happens to schools that proudly report better test results. In a high-profile ceremony at the Massachusetts Statehouse in December 1999, five school principals were presented with gifts of $10,000 each for “helping their students make significant gains on the MCAS.” Four of the five were elementary schools, all of which had reported remarkable increases in average 4th grade math scores from 1998 to 1999. Three of those four schools showed declines the following year.

Was this a fluke? When we look at all the Massachusetts elementary schools that showed a gain of at least 10 points from 1999 to 2000, we see that most showed declines in 2001—declines often as large as the gains posted during the previous year. In fact, a comparison of the changes in 4th grade scores for all schools (1998 to 1999 vs. 1999 to 2000) finds a statistically significant negative relationship between the two time periods. A school that did better the first time was more likely than not to do worse the second time, and vice versa.

These results don’t mean that teachers or students became lazy and tried to coast on their success. They mean that there was never really evidence of success at all. Particularly in small schools, as other research has confirmed, changes in score averages from year to year are poor measures of school quality. (“Republicans Reject Programs on Facilities, Class Size,” May 23, 2001.) If fewer than 100 students are tested in each grade, averages may swing widely from year to year simply because of the particular samples of students tested and the vagaries of annual test content and administration.
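The volatility described above can be illustrated with a small simulation. The sketch below is hypothetical: all numbers (school count, cohort size, score scale) are illustrative assumptions, not MCAS data. It gives every school an identical "true" quality, then draws a small student cohort each year; sampling noise alone produces large swings and a negative correlation between successive gains, i.e., regression to the mean.

```python
import random

random.seed(0)

# Hypothetical illustration: every school has the SAME true quality, but the
# observed average each year is the mean of a small sample of students.
# The parameters below are illustrative assumptions, not real MCAS figures.
N_SCHOOLS = 500
N_STUDENTS = 60          # students tested per grade (a small school)
TRUE_MEAN = 230          # assumed scale-score mean
STUDENT_SD = 25          # assumed spread of individual student scores

def observed_average():
    """Average score of one small, randomly drawn cohort."""
    return sum(random.gauss(TRUE_MEAN, STUDENT_SD)
               for _ in range(N_STUDENTS)) / N_STUDENTS

# Three consecutive years for schools of identical true quality.
year1 = [observed_average() for _ in range(N_SCHOOLS)]
year2 = [observed_average() for _ in range(N_SCHOOLS)]
year3 = [observed_average() for _ in range(N_SCHOOLS)]

gain_a = [b - a for a, b in zip(year1, year2)]   # change, year 1 -> 2
gain_b = [c - b for b, c in zip(year2, year3)]   # change, year 2 -> 3

def correlation(xs, ys):
    """Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Pure noise yields a strongly negative correlation between successive
# gains (theoretically -0.5): schools that "gained" tend to "decline" next.
print(round(correlation(gain_a, gain_b), 2))
```

Even with no real change in any school, gains in one period predict declines in the next, which is exactly the pattern observed in the Massachusetts 4th grade comparisons.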


The other major finding from my research is even more unsettling, provided one understands the difference between two kinds of standardized tests. Some tests, those that are called criterion-referenced, measure students against an absolute standard: how much they know and are able to do. In theory, all students taking the test might score very high or very low.

Other tests, including the SAT, the Iowa Test of Basic Skills, and the Stanford Achievement Test, are called norm-referenced, which means they are concerned with ranking students (or schools) against one another. The results are reported in relative terms. To learn that a child scored in the 88th percentile, for example, tells you nothing about how proficient she was, only what proportion of the population she bested. Half of those who take such tests will always score below the median.
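The relative nature of such scores is easy to demonstrate. In this hypothetical sketch (the cohorts and scores are invented for illustration), a student in a higher-achieving group earns the same percentile rank as a weaker student in a lower-achieving group, because the rank describes only position within the group, not absolute proficiency.

```python
def percentile_rank(score, population):
    """Percent of the population scoring strictly below the given score."""
    below = sum(1 for s in population if s < score)
    return 100.0 * below / len(population)

# Two hypothetical cohorts: the second knows more in absolute terms
# (every score is 10 points higher), yet percentile ranks are identical.
cohort_low  = [40, 45, 50, 55, 60, 65, 70, 75, 80, 85]
cohort_high = [s + 10 for s in cohort_low]

print(percentile_rank(60, cohort_low))    # 40.0
print(percentile_rank(70, cohort_high))   # 40.0 -- same rank, higher score
```

A 40th-percentile result thus says nothing about what either student can actually do; it reports only how many peers scored lower.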

What’s more, the questions on norm-referenced tests are selected not for their importance (that is, whether they reflect knowledge students should have), but for their effectiveness in spreading out the scores. Questions that most students answer correctly will be dropped from these exams and replaced with those that only about half the students get right.
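That screening process can be sketched in a few lines. The items, their difficulty values, and the cutoff band below are hypothetical assumptions chosen only to illustrate the logic: items are retained by how well they spread scores (the proportion answering correctly, often called the item's p-value), not by the importance of what they measure.

```python
# Hypothetical norm-referenced item screening.  Each pilot item is tagged
# with the proportion of students who answered it correctly; these values
# and the 0.30-0.70 band are illustrative, not from any actual test manual.
pilot_items = {
    "item_01": 0.92,   # nearly everyone got it right -- spreads scores poorly
    "item_02": 0.48,   # near 50% -- maximally discriminating
    "item_03": 0.71,
    "item_04": 0.55,
    "item_05": 0.33,
    "item_06": 0.88,
}

# Keep only items in the mid-difficulty band; drop the easy (and very
# hard) ones, regardless of how important their content is.
operational = {name: p for name, p in pilot_items.items() if 0.30 <= p <= 0.70}

print(sorted(operational))   # ['item_02', 'item_04', 'item_05']
```

Under a rule like this, any question that most students eventually master is removed, so the test can never show that most students have succeeded.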

The MCAS, like other state tests, is widely assumed—even among its critics—to be a criterion-referenced test. Remarkably, an examination of its technical manuals reveals that this is not so. Questions for the MCAS are selected and rejected on the basis of their usefulness in discriminating among test-takers. For example, pilot test questions answered correctly by a large proportion of students in 1998 were mostly gone from the operational version of the MCAS in 1999.


This is not just a matter of interest to statisticians. As the author Alfie Kohn has pointed out, the question driving norm-referenced tests is not “How well are our students learning?” but “Who’s beating whom?” Moreover, when questions answered correctly by more than 70 percent of students are systematically excluded from the exam, this guarantees continuing failure. Tests like the MCAS are designed so that all students can never succeed. Evidence suggests that other state tests (in Texas, California, and New York, for example) also have been constructed using norm-referenced test-construction procedures.

The lesson from this investigation, which just happened to focus on Massachusetts, is universal: Before newspapers report standardized-test results, before educators concentrate on trying to raise scores, before politicians allow these scores to determine the fate of students and schools, and before parents permit their children to be tested, it ought to be clear just how little a gain in average scores really means—and what the test was really designed to do.

Walt Haney is a professor of education at Boston College. This essay is based on his article “Lakewoebeguaranteed: Misuse of Test Results in Massachusetts,” which appeared in Educational Policy Analysis Archives (http://epaa.asu.edu/epaa/v10n24) in May.

A version of this article appeared in the July 10, 2002 edition of Education Week as Ensuring Failure

