State differences in testing and accountability systems have frustrated researchers and families alike since the early days of the No Child Left Behind Act.
An earlier federal crosswalk study of state tests and the National Assessment of Educational Progress, a set of large, nationally sampled tests in core subjects like math and reading, found that states set their thresholds for proficiency at widely disparate levels. A 4th grader who was deemed a proficient reader in Arizona or Virginia could have significantly lower scores on NAEP than a student deemed proficient on reading tests in neighboring New Mexico or West Virginia, for example.
The Stanford Education Data Archive goes a step further than the federal NAEP study. In a four-year project, Sean Reardon, a professor of poverty and inequality in education, and his colleagues at Stanford University compiled a massive database of 215 million state test scores from 40 million students, covering every test taken by a public school student in grades 3 through 8 from 2009 to 2012, and disaggregated the results by race, grade, subject, and proficiency level. The data include more than 12,000 of some 14,000 U.S. school districts and 384 metropolitan areas, encompassing both regular and charter public schools. It is the largest and most comprehensive database of its kind to date.
(See our main story for more on what researchers are starting to learn from the database about where racial achievement gaps form and why.)
Translating ‘Proficiency’ Across States
The researchers ranked districts in each state, both overall and separately by racial/ethnic categories, by comparing their performance at multiple proficiency levels: “below basic,” “basic,” “proficient,” and “advanced.” This allows them to compare districts within a state. To compare districts across states, they “linked” state scores to the nationally representative NAEP test, said Andrew Ho, a Harvard University education professor who helped develop the method.
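The idea of placing district results from different state tests on a common NAEP scale can be illustrated with a minimal sketch. This is a simplified linear linking, not the researchers' actual procedure: it assumes each state's test-score mean and spread, and that state's NAEP mean and spread, are already known. The `StateScale` class, the `link_to_naep` function, and every number below are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class StateScale:
    """Hypothetical summary of one state's test and its NAEP results."""
    state_mean: float  # statewide mean on the state's own test
    state_sd: float    # statewide standard deviation on the state's own test
    naep_mean: float   # the same state's mean on NAEP
    naep_sd: float     # the same state's standard deviation on NAEP


def link_to_naep(district_mean: float, scale: StateScale) -> float:
    """Map a district mean from the state scale onto the NAEP scale.

    Step 1: express the district mean in standard-deviation units
            relative to its own state (a within-state comparison).
    Step 2: re-express that standing using the state's NAEP mean and
            spread (a cross-state comparison).
    """
    z = (district_mean - scale.state_mean) / scale.state_sd
    return scale.naep_mean + z * scale.naep_sd


# Two made-up states whose tests use entirely different score scales.
state_a = StateScale(state_mean=300.0, state_sd=50.0, naep_mean=220.0, naep_sd=35.0)
state_b = StateScale(state_mean=500.0, state_sd=100.0, naep_mean=215.0, naep_sd=36.0)

# A district one SD above its state mean in A, and one half an SD below in B.
district_a = link_to_naep(350.0, state_a)
district_b = link_to_naep(450.0, state_b)

print(f"District in state A on NAEP scale: {district_a:.1f}")
print(f"District in state B on NAEP scale: {district_b:.1f}")
print(f"Estimated gap in NAEP points: {district_a - district_b:.1f}")
```

Once both districts sit on the same scale, their difference can be read in NAEP points, which is the kind of cross-state comparison the linking is meant to support.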
The researchers were able to confirm that the system worked by comparing the differences they predicted among major U.S. urban districts to those districts’ actual performance on the NAEP’s most recent Trial Urban District Assessment, which compares NAEP results in 21 urban districts throughout the country. For example, the system accurately predicted that Chicago 4th graders would perform about a half-grade level, or 10 scale-score points, below New York 4th graders in reading in the 2013 TUDA.
Their validation checks show that the linking method enables them to conduct research on differences among districts across states, not just within states. “It’s statistically complicated, but it gives answers that are very accurate,” Reardon said.
The data “don’t tell us which school districts are better than others, more effective than others, because test scores of kids in a school district are the result of everything that’s contributed to the kid’s development since conception,” Reardon warned. “These [data] are really good for big picture stuff, but it shouldn’t be used to rank schools that are statistically very similar to each other.”
Researchers Can Build on Achievement Gap Data
A large dataset from the studies will be made available to researchers and the public, and Reardon said he is working to bring in education foundations to provide grants for those who want to analyze the data.
The statistical method could also be used going forward to add and analyze new test data, helping to compare educational achievement and progress across the country as states explore different testing regimes and accountability measures under the Every Student Succeeds Act.
A version of this news article first appeared in the Inside School Research blog.