What Federally Mandated State Tests Are Good For (And What They Aren’t) (Opinion)

Stuart Kahl

Stuart Kahl is an independent assessment consultant with 35 years of firsthand experience in designing, developing, and implementing state assessments and is the former CEO and founder of Measured Progress Inc., a company that operated state testing programs in more than half the states over several decades. He previously taught at the elementary, secondary, and graduate levels.

Federally mandated state testing, which did not take place last year because of the pandemic, is happening this spring. However, with the waivers states have been allowed, involving the identification of low-performing schools, the shortening of tests, and changes to the percentage of students to be tested, this marks the second year that continuity under the Every Student Succeeds Act has been broken. For that reason, this is an opportune time to start shifting the focus of state testing back toward a purpose for which the tests are more appropriately designed.

In recent years, school administrators and teachers have pressed state policymakers and testing officials for more immediate information from state tests to inform day-to-day instruction, while the tests still satisfy federal accountability requirements. Guiding day-to-day instruction is not something states’ summative-test results can or should be expected to do, despite the ESSA’s unmet requirements for quick, individual student results that are “interpretive, descriptive, and diagnostic.”

Many people have questioned the need for spring 2021 testing, viewing the main purpose of state assessments as school accountability for students’ academic performance, for which states determine the percentage of students who are proficient. This limited view overlooks the power of end-of-year state assessments for program evaluation and improvement.

Standardization in testing means comparability—a quality of state tests that enables local educators and residents in general to get an external perspective on how their local instructional programs are doing. How does our student performance compare with that of other schools serving similar populations?

State test results should raise questions that need further investigation to answer. Why are we underperforming in this subdomain of math? Why is this subgroup in our school underperforming their counterparts in other schools in our state? Are our new approaches in this area working? Finding the answers to such questions informs program-improvement efforts that may not immediately benefit the students tested but instead should benefit many more students in the future.

There is strong feeling among educators and noneducators alike that teachers and schools should not be penalized because of the impact of COVID-19 on student learning. I agree. But the need for program-evaluation information is greater than ever right now. How well have we served our students via online instruction, packets sent home, or any other method compared with other schools? What is the extent of learning loss during the pandemic?

Clearly, we should be prepared to see this spring’s results reflect a negative impact from COVID-19. Although the stakes have been relaxed this year, state testing officials would do well to maintain a focus on program evaluation not only now but also in the future after the stakes are reimplemented.

State tests are designed to produce reliable total test scores for students in mathematics, English/language arts, science, and other subjects states may test. But the tests generally do not yield reliable subtest scores—e.g., geometry or measurement in mathematics or physical or life science within science. Typically, subtest scores are based on just a handful of machine-scorable items hardly representing a good sampling of the content and skills in a subtest area. These areas are still very broad, so the sparse coverage of relevant content and skills makes subtest scores neither reliable nor valid for making important instructional decisions for individual students. For a test to be truly diagnostic, it would require multiple items addressing narrowly defined learning targets.
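The link between the number of items and the reliability of a score can be illustrated with the Spearman-Brown formula from classical test theory. The sketch below is only an illustration under stated assumptions: the 50-item test length and the 0.90 total-score reliability are hypothetical numbers, not figures from any state program, and the formula assumes that the items in a subtest behave like the items in the full test.

```python
def spearman_brown(reliability: float, length_ratio: float) -> float:
    """Predict reliability when a test's length is multiplied by length_ratio.

    Classical Spearman-Brown prophecy formula; assumes the retained or added
    items are statistically parallel to the original ones.
    """
    return (length_ratio * reliability) / (1 + (length_ratio - 1) * reliability)


# Hypothetical full test: 50 items with a total-score reliability of 0.90.
FULL_TEST_ITEMS = 50
FULL_TEST_RELIABILITY = 0.90

for subtest_items in (5, 10, 25, 50):
    ratio = subtest_items / FULL_TEST_ITEMS
    predicted = spearman_brown(FULL_TEST_RELIABILITY, ratio)
    print(f"{subtest_items:>2} items -> predicted reliability {predicted:.2f}")
```

Under these assumptions, a five-item subtest drawn from such a test would have a predicted reliability of roughly 0.47, far below what is usually considered acceptable for decisions about individual students.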

The U.S. Department of Education could offer guidance on ESSA that relaxes the unmet requirement for states to provide diagnostic reports for individual students, while still requiring the reporting of total test scores for all students. Diagnoses of students’ specific learning gaps are best left to other components of a balanced assessment system. School average subtest scores are more reliable than individual student subtest scores, even with the poor sampling of content. However, that poor sampling means that major programmatic decisions based on those scores are still questionable, and this is a validity issue. Patterns of scores that local educators observe over a few years would be a better basis for major changes to programs.

There’s a place for end-of-year summative testing in both program evaluation and accountability assessment. For these purposes, I’d want to know how students in a school perform near the end of a school year. ESSA now allows accountability assessment to take place during the course of the school year. Interim benchmark tests covering recently taught material and interim general achievement measures are appropriate for purposes of early warning to identify students and curricular areas needing additional attention before end-of-year testing. However, many of these tests have the same limitations as the state summative tests because of the sparse coverage of relevant content and skills. And they don’t reflect student achievement at the end of the school year.

There are testing-program designs that can shorten the testing time for individual students yet broaden the coverage of a subject-area domain for school-level results. Reliable total test scores could still be provided for individual students, while reliable school results in subtest areas and even finer breakdowns of content and skills could be reported, offering more valuable information for program evaluation. Innovative curriculum-embedded performance assessments a few times during the school year could tap higher-order cognitive skills that the efficient end-of-year tests shortchange and could yield immediate results from teacher scoring of student work on complex tasks or projects. These, too, could count toward federally required accountability results.
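One design of this kind, often called matrix sampling, gives every student a shared core of items plus one of several rotating blocks. The sketch below is a hypothetical illustration, not a description of any state's program: the item counts, the number of forms, and the function names are all assumptions chosen to make the arithmetic easy to follow.

```python
# Hypothetical item pool: 20 common items every student takes, plus
# 40 matrix items that are split across forms.
COMMON_ITEMS = [f"common_{i}" for i in range(20)]
MATRIX_ITEMS = [f"matrix_{i}" for i in range(40)]
NUM_FORMS = 4


def build_forms(common, matrix, num_forms):
    """Build test forms: each has all common items plus one slice of matrix items."""
    return [common + matrix[f::num_forms] for f in range(num_forms)]


def assign_forms(student_ids, forms):
    """Hand out forms in rotation so each form is used about equally often."""
    return {sid: forms[i % len(forms)] for i, sid in enumerate(student_ids)}


forms = build_forms(COMMON_ITEMS, MATRIX_ITEMS, NUM_FORMS)
students = [f"student_{i}" for i in range(100)]
assignment = assign_forms(students, forms)

# Each student answers far fewer items than the school as a whole covers.
items_per_student = {len(items) for items in assignment.values()}
school_coverage = set().union(*assignment.values())

print("Items per student:", items_per_student)
print("Distinct items covered at school level:", len(school_coverage))
```

In this sketch each student answers 30 items, while the school's results draw on all 60. The common items are what would support comparable total scores for individuals, and the rotating blocks are what would broaden coverage for school-level reporting.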

Challenges create opportunities. The pause in 2020 testing and waivers this year—because of the pandemic—make this a good time to renew our emphasis on state assessment as a powerful activity for informing program improvement. We should not think of it solely as a vehicle for producing accountability data or a source of information for real-time instructional use—something it cannot be. Since the collection of longitudinal student-achievement data for ESSA has to be restarted anyway, changes to assessment designs can be made this year or next that optimize their utility for program evaluation. Such changes would be a way to get state assessment back on track with respect to what it can legitimately do, not just for the short term but also for the foreseeable future.