My friend Mike Petrilli and I just wrapped up the seventh cohort of our Emerging Education Policy Scholars program. The time we spend with these talented young Ph.D. scholars is always filled with talk about how research influences policy and practice, and their frustration that it doesn’t seem to loom larger in the minds of policymakers and system leaders. The experience is energizing, and one from which I inevitably learn a great deal. But I also find myself routinely offering a version of the same meditation on how and when research influences real-world decisions—and why that influence should be much more halting and hesitant than researchers would generally prefer.
For my part, I routinely advise policymakers and practitioners to be real nervous when an academic or expert encourages them to do “what the research shows.” As I observed in Letters to a Young Education Reformer, 20th-century researchers reported that head size was a good measure of intelligence, girls were incapable of doing advanced math, and retardation was rampant among certain ethnic groups. Now, I know what you’re thinking: “That wasn’t real research!” Well, it was conducted by university professors, published in scholarly journals, and discussed in textbooks. Other than the fact that the findings now seem wacky, that sure sounds like real research to me.
Medical researchers, for instance, change their minds on important findings with distressing regularity. Even with their deep pockets and fancy lab equipment, they’ve gone back and forth on things like the dangers of cholesterol, the virtues of flossing, whether babies should sleep on their backs, how much exercise we should get, and the effects of alcohol. Things would be messy if lawmakers or insurers were expected to change policies in response to every new medical study.
In truth, science is frequently a lot less absolute than we imagine. In 2015, an attempt to replicate 97 psychology studies with statistically significant results found that only about one-third of them could be duplicated. More than 90 percent of psychology researchers admit to at least one behavior that might compromise their research, such as stopping data collection early because they liked the results as they were, or not disclosing all of a study’s conditions. And more than 40 percent admit to having sometimes decided whether to exclude data based on what it did to the results.
Rigorous research eventually influences policy and practice, but typically only after a long and gradual accumulation of evidence. Perhaps the most famous example involves the health effects of tobacco, where a cumulative body of research ultimately swayed the public and shaped policy on smoking—in spite of tobacco companies’ frenzied, richly funded efforts to discredit it. The consensus that emerged involved dozens of studies by hundreds of researchers, with consistent findings piling up over decades.
When experts assert that something “works,” that kind of accumulated evidence is hardly ever what they have in mind. Rather, their claims are usually based on a handful of recent studies—or even a single analysis—conducted by a small coterie of researchers. (In education, those researchers are not infrequently also advocates for the programs or policies they’re evaluating.) When someone claims they can prove that extended learning time, school turnarounds, pre-K, or teacher residencies “work,” what they usually mean is that they can point to a couple of studies that show some benefits from carefully executed pilot programs.
The upshot: When pilots suggest that policies or programs “work,” it can mean a lot less than reformers might like. Why might that be?
Think about it this way. The “gold standard” for research in medicine and social science is a randomized controlled trial (RCT). In an RCT, half the participants are randomly selected to receive the treatment—let’s say a drug for high blood pressure. Both the treatment and control groups follow the same diet and health-care plan. The one wrinkle is that the treatment group also receives the new drug. Because the drug is the only difference in care between the two groups, it can be safely credited with any significant difference in outcomes.
RCTs specify the precise treatment, who gets it, and how it is administered. This makes it relatively easy to replicate results. If patients in a successful RCT got a 100-milligram dosage of our blood pressure drug every twelve hours, that’s how doctors should administer it in order to obtain the same results. If doctors gave out twice the recommended dosage, or if patients got it half as often as recommended, you wouldn’t expect the same results. When we say that the drug “works,” we mean that it has specific, predictable effects when used precisely.
At times, that kind of research can translate pretty cleanly to educational practice. If precise, step-by-step interventions are found to build phonemic awareness or accelerate second-language mastery, replication can be straightforward. For such interventions, research really can demonstrate “what works.” And we should pay close attention.
But this also helps illuminate the limits of research when it comes to policy, given all the complexities and moving parts involved in system change. New policies governing things like class size, pre-K, or teacher pay get adopted and implemented by states and systems in lots of different ways. New initiatives are rarely precise imitations of promising pilots, even on those occasions when it’s clear exactly what the initial intervention, dosage, design, and conditions were.
If imitators are imprecise and inconsistent, there’s no reason to expect that results will be consistent. Consider class-size reduction. For decades, advocates of smaller class sizes have pointed to findings from the Student Teacher Achievement Ratio (STAR) project, an experiment conducted in Tennessee in the late 1980s. Researchers found significant achievement gains for students in very small kindergarten and first-grade classes. Swayed by the results, California legislators adopted a massive class-size reduction program that cost billions in its first decade. But the evaluation ultimately found no impact on student achievement.
What happened? Well, what “worked” on a limited scale in Tennessee played out very differently when adopted statewide in California. The “replication” didn’t actually replicate much beyond the notion of “smaller classes.” Where STAR’s small classes were 13 to 17 students, California’s small classes were substantially larger. STAR was a pilot program in a few hundred classrooms, minimizing the need for new teachers, while California’s statewide adoption required a tidal wave of new hires. In California, districts were forced to hire thousands of teachers who previously wouldn’t have made the cut, while schools cannibalized art rooms and libraries in order to find enough classrooms to house them. Children who would have had better teachers in slightly larger classrooms were now in slightly smaller classrooms with worse teachers. It’s no great shock that the results disappointed.
Research should inform education policy and practice, but it shouldn’t dictate it. Common sense, practical experience, personal relationships, and old-fashioned wisdom have a crucial role to play in determining when and how research can be usefully applied. The researchers who play the most constructive roles are those who understand and embrace that messy truth.