This is how a national system of annual student testing might work.
A centerpiece of President Bush’s education plan, currently under discussion in Congress, is the proposal for annual math and literacy tests for all children in grades 3-8. The benefit envisioned is the improvement of schools. The costs are not negligible: if adopted, the testing plan would cost $400 million and would absorb many classroom hours that might otherwise be devoted to instruction. The proposal should therefore be judged on both its potential benefits and its costs, and should be formulated to maximize the likely benefits.
Like previous large-scale attempts at education reform, this proposal could be a lever for improvement, or it could be an expensive, time-consuming, misdirected, and frustrating failure. For this initiative, as for others, the devil is in the details.
Under what conditions would annual testing actually generate educational improvements? In what form should the annual testing be implemented to achieve its desired effects? Let’s begin with some necessary preconditions.
The conditions ensuring the greatest benefits from annual testing include an enhanced public understanding of how tests work and what they tell us. The public needs to understand, for example, that tests, by themselves, cannot improve educational outcomes. They can lead to improvement only if they become a stimulus to change in the educational system—a basis for improved curricula, upgraded instruction, better professional development for teachers, and better distribution of resources.
While holding school districts, schools, and teachers accountable is only fair, the public also needs to understand how financial resources, student demographics, and teacher preparation affect a school’s performance. Since these contextual factors are usually outside a school’s control, it is not fair to ignore them in comparing school outcomes. Accountability systems can work if they give schools specified goals and undercut easy excuses for failure. But the results of accountability testing can also be misleading. If we simply compare scores across schools without taking into account change over time, schools that have shown great improvement can look bad in comparison with schools where children score higher but make less progress.
How, then, should President Bush’s annual test be implemented to maximize the stimulus to educational improvement and to minimize damaging effects? We propose that the two crucial features of an effective annual testing system would be a mechanism for using the test results to distribute instructional resources and a mechanism for minimizing both teaching to the test and likely misinterpretations of the results.
Use scores for improving instruction. When administered in the context of an ongoing program of classroom-based assessment and professional development, properly selected and properly interpreted tests can do the following: provide information about children’s performance levels; identify the children who need extra instructional attention; and identify the classrooms in which teachers need extra instructional support.
But fulfilling these various functions requires selecting the appropriate tests, properly interpreting test results, and then actually using test results to inform instruction. Remarkably, the individuals responsible for making testing decisions typically know rather little about how to select, interpret, or use test results. We can hardly expect administering millions of tests to improve education if few educators know how to use the data.
It is a basic principle of test design that different functions require different tests. Our proposal violates this principle in suggesting that an accountability test could also provide instructional information. We suggest that the annual test should be designed as a screen, to identify children who need help mastering the basics of math and reading. The information about the number of children who achieve scores above the cutoff, if appropriately filtered (see below), can reflect school effectiveness. At the same time, this information identifies children who need further, more diagnostic assessments that can be used to help teachers decide what sort of instruction to provide. We propose that it also be used as a basis for distributing professional-development resources according to need.
Test scores within a classroom should become the basis for allocation to that classroom of professional development and support to the teacher. Thus, classrooms in which a very high percentage of children receive scores below the acceptable level would receive more aid, in the form of instructional mentoring or coaching for the teacher, help in administering follow-up assessments designed to guide instruction, and resources for supplementary materials or extra classroom personnel. Classrooms in which only a small percentage of children scored below the cutoff would receive less aid. Of course, if tests are to be used to target instructional support, then they must be administered in such a way that the information from them is available immediately and early in the school year. Thus, we argue that an early-fall administration of these tests is highly desirable.
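To make the allocation rule concrete, here is a minimal sketch in Python of how a district might rank classrooms by the share of children scoring below the screening cutoff. The classroom names, scores, and cutoff of 50 are invented for illustration, and the function names are hypothetical; this is one possible reading of the proposal, not a specification.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Classroom:
    name: str
    scores: List[int]  # early-fall screening scores, one per child (hypothetical)


def share_below_cutoff(scores: List[int], cutoff: int) -> float:
    """Fraction of children scoring below the cutoff (0.0 for an empty class)."""
    if not scores:
        return 0.0
    return sum(1 for s in scores if s < cutoff) / len(scores)


def rank_by_need(classrooms: List[Classroom], cutoff: int) -> List[Tuple[str, float]]:
    """Return (name, share_below_cutoff) pairs, neediest classroom first."""
    shares = [(c.name, share_below_cutoff(c.scores, cutoff)) for c in classrooms]
    return sorted(shares, key=lambda pair: pair[1], reverse=True)


# Invented data: the cutoff and all scores are assumptions for illustration.
classrooms = [
    Classroom("Room B", [72, 66, 80, 45, 90]),
    Classroom("Room A", [35, 42, 58, 61, 39]),
]
ranking = rank_by_need(classrooms, cutoff=50)
for name, share in ranking:
    print(f"{name}: {share:.0%} below cutoff")
```

A ranking of this kind could order the delivery of mentoring, follow-up assessments, and supplementary materials, with the classrooms at the top of the list served first.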
From data to information. Of course, tests can provide data about how schools are doing. But such data do not constitute information about school performance if we just compare test scores across schools. We need to compare children’s test scores across time. Since in some urban areas 30 percent to 50 percent of students in a classroom in April may not have been there in September, a school’s average test score is based on the performance of many children who have hardly received instruction in that school setting. Particularly in high-transiency settings, a school’s average test scores reflect who showed up on the day of testing, not how much the school has taught its children.
Furthermore, the huge differences in test performance between urban and suburban schools often point to the experiences children bring with them as much as the experiences schools provide. Finally, the financial resources available to suburban schools are much greater than those available to the schools which typically score poorly. In using test scores to judge schools, we must disaggregate the impact of student mobility, school resources, and the extent to which children arrive already knowing what the school is trying to teach.
We suggest that any test’s use for school accountability purposes must be limited to data from children who have been in the school for at least a year, and that schools should be held more accountable for their longer-enrolled students. Doing this would require student identification procedures so that students’ school histories could be established (amazingly enough, many large urban districts do not currently have this capacity), and it would require tests designed to be comparable across grades 3-8. Comparing scores across differently designed tests is extremely difficult, if not impossible. If we wish to invest in accountability, we need to invest in designing tests that can give us the information we need to make sound decisions.
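The difference between a raw school average and a measure limited to longer-enrolled students can be sketched in a few lines of Python. Everything specific here is an assumption for illustration: the record fields, the one-year threshold expressed as `min_years`, the invented scores, and above all the premise that last fall's and this fall's scores sit on a comparable scale, which is exactly what the test design discussed above would have to guarantee.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass
class StudentRecord:
    student_id: str                 # requires a unique student identifier
    years_enrolled: float           # time in this school (hypothetical field)
    score_last_fall: Optional[int]  # None if the child was not tested here
    score_this_fall: int


def mean_gain_for_stable_students(
    records: List[StudentRecord], min_years: float = 1.0
) -> Optional[float]:
    """Average score gain among students enrolled at least `min_years`.

    Returns None when no student qualifies.
    """
    gains = [
        r.score_this_fall - r.score_last_fall
        for r in records
        if r.years_enrolled >= min_years and r.score_last_fall is not None
    ]
    return mean(gains) if gains else None


# Invented records for illustration only.
records = [
    StudentRecord("s1", 2.0, 40, 52),
    StudentRecord("s2", 1.0, 55, 60),
    StudentRecord("s3", 0.3, None, 30),  # arrived mid-year: excluded from gains
]
stable_gain = mean_gain_for_stable_students(records)
raw_average = mean(r.score_this_fall for r in records)
print("Mean gain, students enrolled a year or more:", stable_gain)
print("Raw average of everyone tested this fall:", round(raw_average, 1))
```

The raw average is pulled down by a child the school has barely taught, while the gain figure describes only the students whose progress the school can reasonably answer for. Weighting longer-enrolled students more heavily, as suggested above, would be a further refinement that this sketch does not attempt.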
When to test? If testing is meant to improve instruction, then end-of-year tests are worthless. Results from tests administered in spring are not typically even available to teachers until the next fall, by which time the children whose test scores they receive have moved on. Furthermore, even if the test scores were available immediately, they would arrive too late in the school year for changes in instruction to have much effect.
An additional disadvantage of administering accountability assessments in the spring is that it creates both pressure and considerable opportunity to teach to the test. While President Bush may believe that teaching children to perform well on a math or a reading test is equivalent to teaching math or reading, this is simply not true. Tests, by their very design, reflect only a sample of what we want children to know. Teaching the sample is not equivalent to teaching the entire curriculum. While in very dysfunctional schools teaching to the test may be better than what goes on normally, in most schools it represents a narrowing of the curriculum and a waste of precious instructional time.
Making it work. For a system such as we propose to work, the annual screening tests selected would have to be relatively brief, standardized in administration, machine scoreable, and able to identify those children who need help in basic math and reading. If states are to choose their own tests, they would need a set of guidelines for selecting the screening test and guidance in prescribing appropriate follow-up assessments. A national test-review board might well be established to provide support in making these decisions.
If the tests are used to distribute professional development to those classrooms most in need, a coherent professional-development system, probably requiring increased funding, would be needed in every school district. Finally, as noted above, unique student identifiers that would make it possible to track individuals’ progress are needed for interpreting the data appropriately.
If a national system of annual testing is inevitable, experts in testing must be recruited to think creatively about how to make it serve both accountability and instructional needs. Teachers, principals, school board members, and the general public need information that can help them interpret test results appropriately.
The testing system must remain focused on upgrading instructional programs. Testing children does them no good unless it guides teachers in providing improved instruction, which in turn requires greatly enhanced professional development and support.
Annual tests should be one piece of an integrated system of ongoing classroom-based assessment and professional development, targeted where the need is greatest.
Catherine E. Snow is the Henry Lee Shattuck professor of education at Harvard University’s graduate school of education in Cambridge, Mass., and a member of the Board on Testing and Assessment. Jacqueline Jones is a visiting associate professor at the graduate school and a senior research scientist at the Educational Testing Service in Princeton, N.J.
A version of this article appeared in the April 25, 2001 edition of Education Week as Making a Silk Purse...