What educator hasn’t sometimes felt frustrated at the overreliance on test scores to measure the success of their students and their teaching?
Today’s guest post explores alternatives to that approach.
‘Expanding Rigor’
James Soland is an associate professor of research, statistics, and evaluation at the University of Virginia School of Education and Human Development whose work focuses on assessment (psychometrics), evaluation, and data use:
Educators know the drill: A new program rolls out, someone gathers test scores before and after, and they are told whether it “worked.” But what does that really tell us? Does it help improve teaching? Does it help educators and policymakers understand why it worked here but not there? Does it help us understand what students actually gained and experienced?
As I discuss in a recent blog at the Brookings Institution, right now, the field of education evaluation is fixated on one question—“what works?”—narrowly defined as whether a program causes measurable improvements in things like test scores. That is certainly valuable at times, but it misses too much of what matters in schools. It privileges what we can easily measure over what we ought to understand. And it treats school contexts as interchangeable backdrops rather than vital elements of success.
As teachers and school leaders, you know that learning is messy, local, and human. It unfolds in specific classrooms with specific students and adults doing real work together. The current dominant approach—evaluating programs like black boxes and judging them by a narrow set of outcomes—doesn’t capture that reality. It leaves out teacher expertise, student experience, school culture, and the many contextual conditions that make a strategy effective here but not somewhere else.
We need evaluation that learns from you—the educators in the trenches—not just about students’ test results. You are the people who see when a strategy sparks student curiosity, when it bumps up against local realities, or when it might not work when applied to a different set of students. You also know when a program is truly moving the needle, versus being done purely out of compliance.
What’s the problem with “black box” evaluations?
Most rigorous program evaluations focus on isolating causal effects: “Did X cause Y?” That’s mostly done with experiments or quasi-experiments using standardized outcomes that policymakers and researchers can compare across time and place. But this approach has two big limitations:
- It favors outcomes that are easy to quantify—like state test scores—over equally important outcomes that are harder to measure, like critical thinking, collaboration, or students’ sense of belonging. Those latter components are often at the heart of teachers’ daily decisions and matter deeply for long-term learning.
- It treats context—the school, community norms, teacher skills, resources, and culture—as something to control away, rather than a source of insight about how and why something works.
This narrow focus leads to evaluations that feel detached from reality. They tell you whether something worked somewhere but not why it worked, how it worked, and under what conditions it might work in your own context.
So what might a better evaluation look like? Consider a socio-emotional intervention (one designed to boost growth mindset or self-management skills, for example) that aims to improve that competency and, in turn, achievement. Here are some key elements:
1. Broaden outcomes beyond test scores.
Standardized tests capture important academic skills, but they miss socio-emotional growth, critical reasoning, cultural competence, and other dimensions of learning that teachers nurture every day. When evaluation counts these too—even if they’re harder to quantify—it aligns more closely with what matters for students.
In the socio-emotional-competency example, it would mean not only looking at achievement gains but also at changes in self-management or growth mindset, ideally using a survey measure designed to understand change over time. It would further involve asking teachers whether they think the intervention actually improved the competency or if it was more likely a measurement artifact (e.g., students better anticipating the “correct” answer on the survey after the intervention).
2. Mix numbers with narratives.
Rigorous causal work has its place—but it should sit next to rich qualitative evidence. That means intentionally gathering teacher perspectives, student voices, and descriptions of administrator experiences. Qualitative research has often been peripheral in program evaluation, but it helps us understand mechanisms—the how and why of what works—not just the if. In the case of the socio-emotional-competency intervention, interviews with teachers would ask if the teacher felt there was a valid, causal chain where the intervention increased self-management or growth mindset and that improvement then caused achievement gains.
Teachers would also discuss whether the intervention is easy enough to implement that it could become part of common practice. And if the intervention did not show gains (in the competency or in achievement), teachers would provide qualitative data on why not.
3. Make context part of the question, not something to control away.
Instead of treating local conditions as noise, good evaluation treats them as data. Knowing how a rural school engaged parents or how a multilingual classroom adapted a reading program can teach us about transportability and adaptation.
For the socio-emotional-competency intervention, that could look like asking teachers whether they felt there was support for the intervention (e.g., sufficient time to implement it well), whether there were bureaucratic hurdles, whether it did or did not work for students with particular learning challenges (e.g., students with a particular IEP), how their particular school and setting affected outcomes, etc.
4. Use mechanisms to guide improvement.
Instead of only reporting that an intervention “worked,” evaluations should articulate how it produced results. Was it because teachers had more collaboration time? Because students engaged more deeply with texts that reflected their lived experience? Because instructional coaching supported risk-taking? Because teachers recognized the value and bought into the strategy? These mechanisms—not just outcomes—are the critical lessons for replication and improvement.
In the socio-emotional-competency case, all of these mechanisms could emerge during teacher interviews, surveys, focus groups, or whatever was the most efficient use of their time. (Obviously, these additional data would be collected mainly in large-scale evaluations with sufficient resources to compensate teachers fairly.)
Putting contexts and conditions front and center with help from teachers
Teachers are constantly evaluating: They watch a lesson unfold, notice signs of engagement, troubleshoot misconceptions, and decide what to try next. That expertise—grounded in context and informed by deep knowledge of students—needs to be part of how we study educational effectiveness. By expanding our definition of evidence and privileging why and how as much as whether, we make evaluation more useful.
And I believe that, by expanding the aperture of the questions we ask, we just might move past a hyper-fixation on test scores.
Moving beyond test scores and black boxes doesn’t mean abandoning rigor. It means expanding rigor to include modes of inquiry that respect complexity without losing clarity. It means building evaluation systems that help educators and policymakers learn what works for whom, why it works, and what to do next.
Thanks to James for contributing his thoughts.
Consider contributing a question to be answered in a future post. You can send one to me at lferlazzo@epe.org. When you send it in, let me know if I can use your real name if it’s selected or if you’d prefer remaining anonymous and have a pseudonym in mind.
You can also contact me on X at @Larryferlazzo or on Bluesky at @larryferlazzo.bsky.social
Just a reminder: You can subscribe and receive updates from this blog via email. And if you missed any of the highlights from the first 13 years of this blog, you can see a categorized list here.