Does high-stakes testing hurt students? Read the early evidence with caution.
According to a recent front-page story in The New York Times, high-stakes testing does more harm than good, increasing the proportion of students who drop out of school, decreasing the proportion who graduate, and diminishing students’ performance on standardized tests of achievement. But before President Bush and his education team abandon their efforts to hold students, teachers, and schools accountable, they should read the actual report on which the news story was based.
The study contained in the report, conducted by Arizona State University researchers Audrey Amrein and David Berliner and paid for by several affiliates of the National Education Association, analyzed trends over time in student achievement and school completion in states that have implemented high-stakes testing and compared these trends to national averages for the same indicators. If student performance following the introduction of the test appeared to decline relative to the national trend over the same time period, the authors concluded that the testing had a negative effect. (“Reports Find Fault With High-Stakes Testing,” Jan. 8, 2003.)
Based on their analyses, the authors compiled a score card that tallied the number of states in which testing had a negative effect, the number of states in which the effect was positive, and the number in which the impact of testing was mixed or unclear. Because the number of states in the negative column exceeded the number in the positive column, they concluded that, on average, high-stakes testing is bad.
The so-called declines the authors used to categorize states into the winning or losing columns are often so small as to be meaningless, however. Consider, for example, the “strong” evidence that the implementation of high-stakes testing in New York had adversely affected school completion. From the sound of it, one would think that thousands of students had dropped out as a result of the testing. In fact, after New York introduced graduation exams in 1994, the state’s dropout rate didn’t increase at all; it remained flat, while the national dropout rate declined by 1 percent over the same period. And what was the staggering drop in New York’s graduation rate following the introduction of the test? The rate declined by three-tenths of 1 percent during a time when graduation rates remained unchanged nationally. Nevertheless, on the basis of this “strong” evidence, New York ends up in the column of states whose students were ostensibly harmed by testing.
By the time one reaches the authors’ summary table, to say nothing of the hyperbolic press release that trumpeted the report, the actual sizes of the effects under discussion have long been forgotten. In other words, the list of states where students were allegedly harmed by testing could include states whose indicators barely changed as well as those where they changed a great deal. In fact, there were many of the former and few of the latter. Indeed, of all the states whose graduation rates declined following the implementation of testing, none saw a decline that differed from the national average by more than 1.6 percent. Moreover, the average relative decline in graduation rates among states whose rates fell was smaller than the average relative increase in graduation rates among states whose rates rose. The data showing changes in achievement-test scores are equally meaningless, with the putative effects of testing usually smaller than the margins of error in the tests.
Social scientists generally are interested not only in the size of an effect, but also in whether the result is statistically significant. In fact, nowhere do the authors of this report say whether the effects they claim to have uncovered are statistically significant, most likely because they are not. (I corresponded with Ms. Amrein and learned that no significance-testing had been done.) This is important, because findings that look impressive are frequently chance occurrences. When a trend being analyzed is brief, it is easy to be fooled into thinking it is meaningful. Suppose, for example, a coin I flipped four times in a row landed on heads each time. Would you be willing to believe that I had discovered a magic coin that always turned up heads, or would you want to see a few more flips? In the analyses presented in this report, not only are the effects often minuscule, but few of the trends the authors describe are long enough to support any reliable conclusions about the impact of testing on anything.
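To put a number on the coin-flip analogy, here is a minimal sketch of the arithmetic; it is my illustration, not anything from the report, and the flip counts are chosen only to make the point.

```python
# Minimal sketch: how likely is an unbroken run of heads from a fair coin?
# Illustrative only; the flip counts below are arbitrary choices.

def prob_all_heads(n_flips: int) -> float:
    """Probability that a fair coin lands heads on every one of n_flips flips."""
    return 0.5 ** n_flips

for n in (4, 10, 20):
    p = prob_all_heads(n)
    print(f"{n:2d} flips, all heads: p = {p:.6f} ({p:.1%} by pure chance)")

# Four heads in a row happens roughly 6% of the time by luck alone --
# hardly evidence of a magic coin. A short pre/post trend in state-level
# data is just as easy to produce by chance.
```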
It is conceivable, of course, that implementing high-stakes testing could influence dropout or graduation rates, although the authors of this report, as well as those who funded it, will have a hard time explaining why, in several states, the trend lines point to declining dropout rates and rising graduation rates after the introduction of testing. (I don’t place much credence in these results, either, because they, too, are unlikely to be statistically significant.) But the authors’ contention that the implementation of high-stakes testing depressed students’ performance on tests like the SAT or ACT is just plain silly. Performance on these tests is strongly linked to students’ socioeconomic status and is marginally, if at all, affected by what takes place in the classroom.
And then, of course, there is what social scientists call the third-variable problem. During the period following the implementation of testing, plenty of other factors change as well, and many of these factors could conceivably influence dropout and graduation rates as well as achievement-test scores. Comparing each state’s trend to the national trend does not solve this problem, because factors that may have changed in a particular state may not have changed in the same way across the nation.
One potentially important factor, for example, is the size of the state’s Hispanic population, because Hispanic youngsters drop out of school at a much higher rate than do other students. The two states where the relative increase in the dropout rate following the introduction of testing appears to be large enough to be worrisome—Nevada and New Mexico—are states with high and rapidly growing Latino populations. In fact, five of the eight states that showed a relative increase in their dropout rates following the introduction of testing are states with large Latino populations that grew dramatically during the time frame examined in the report (the other three are New York, Texas, and Florida). In all likelihood, this change in demographics, and not the implementation of testing, led to higher rates of dropping out and lower test scores.
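The demographic point can be made concrete with a small, purely hypothetical calculation; the group labels and rates below are invented for illustration and are not figures from the report.

```python
# Hypothetical illustration of the third-variable problem: an overall
# dropout rate can rise solely because a higher-dropout group is growing,
# even though no group's own rate changes. All numbers here are invented.

RATE_GROUP_A = 0.05   # assumed dropout rate for one group (held constant)
RATE_GROUP_B = 0.12   # assumed dropout rate for a faster-growing group (held constant)

def overall_dropout_rate(share_group_b: float) -> float:
    """Overall rate as a weighted average of the two groups' constant rates."""
    return (1 - share_group_b) * RATE_GROUP_A + share_group_b * RATE_GROUP_B

# Suppose group B grows from 20% to 35% of students over the period when
# testing was introduced, while both groups' own rates stay fixed.
before = overall_dropout_rate(0.20)
after = overall_dropout_rate(0.35)

print(f"Overall rate before: {before:.1%}")
print(f"Overall rate after:  {after:.1%}")
print(f"Apparent 'effect' of testing: {after - before:+.2%}")

# The overall rate climbs by about a percentage point with no change in any
# student's behavior -- exactly the kind of shift a state-versus-national
# trend comparison would misattribute to testing.
```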
A sensible reading of the evidence to date suggests that high-stakes testing so far has had neither the dramatic beneficial effects hoped for by its proponents nor the catastrophic ones feared by its detractors. But even this conclusion is not cautious enough. It will take many years, perhaps even decades, to assess the impact of such a dramatic change in educational policy and practice on student achievement.
Does high-stakes testing encourage teaching to the test? Probably. But this is not a problem if the tests that teachers are teaching to are measuring things we want our students to learn. As long as this is the case, there is nothing wrong with ensuring that students have mastered what we expect them to know before promoting them to the next grade level. How can anyone oppose that?
Laurence Steinberg is the Distinguished University Professor of psychology at Temple University in Philadelphia.