Much of the data from the nation’s longest ongoing assessment of student writing skills is unreliable and will be scrapped, the board that oversees the National Assessment of Educational Progress has determined.
The National Assessment Governing Board, which sets policy for NAEP, voted at its quarterly meeting this month to exclude the results of the 1999 long-term writing assessment from the trend report it is set to release this summer. Results from the tests administered in 1994 and 1996 will be removed from the board’s World Wide Web site.
The discovery also points to potential problems in other performance-based tests and in state testing programs that require students to write short answers and essays, especially when the results are used to make decisions about individual students or to examine trends over time.
“I’ve just lost confidence that the data are reliable,” Gary W. Phillips, the acting commissioner of the National Center for Education Statistics, said last week. “My big concern with [the data] is that the long-term trend in writing has really been [determined] just using five or six [test] items, and you just can’t carry two decades of achievement and measure progress on five or six items.”
The NAEP trend tests—which have been given to 4th, 8th, and 12th graders in math, science, and reading since the 1970s, and in writing since 1984—are designed to gauge changes in achievement in those subjects over time.
Unlike the so-called main NAEP—which periodically tests a national sample of students in core subjects based on the current curriculum—the trend tests use the same sets of questions and tasks from assessment to assessment. Their design is based on the curriculum and instruction patterns that were prevalent the first year the tests were given.
The writing trend test requires far less effort and resources than the main test. It costs approximately $450,000 each time it is administered, and about 11,000 students take it.
Mr. Phillips recommended that the governing board discard the writing results after learning there were errors associated with the scaling model used to score the tests. The problem is blamed primarily on the small number of questions, or prompts, used in the test: students were asked to complete only five or six writing tasks. The potential for problems had been discussed in previous years, according to Mr. Phillips, but no errors had been found in analyzing the data before now.
But late last week, Peggy G. Carr, the associate commissioner for assessment, said NCES hopes to salvage some of the 1999 data.
Assessing students’ ability in writing and other performance-based tasks with only a few questions has generally proved tricky, some assessment experts say.
“Students do better on some prompts than others, so you need a large number of prompts to get a clear picture of how a student is doing,” said Stephen P. Klein, a senior research scientist at the RAND Corp., a Santa Monica, Calif.-based research organization.
In the case of the NAEP trend test, comparing students’ writing achievement and how it may have changed over time is complicated by the limited number of writing samples, Mr. Phillips said.
Since 1984, the data indicate that there has generally been no significant change in the writing proficiency of 4th and 8th graders, though 8th grade scores appear to have declined slightly while 4th grade scores rose a few points. The 1992 test was an exception among 8th graders, whose scores showed a sharp and unexplained jump before dropping again in 1994. Twelfth graders, on the other hand, saw an average 7-point drop in their total score.
A new version of the main NAEP writing test was given in 1998 using 20 prompts at each grade level, and could provide a much more reliable measure of students’ skills in the subject. The test is to be given again in 2002.
Some testing experts have pushed for so-called performance-based tests as a more accurate gauge of what students know and are able to do. But the tests take more time to administer and score than the more common standardized, multiple-choice tests.
Even so, several states have incorporated more written-response portions into their testing programs. Administering and scoring the more complicated tests have proved challenging, said Rosemary Fitton, an assistant superintendent for the Washington state education department. That state has phased in tests in core subjects over the past three years, though they do not yet carry consequences for students or schools. Its writing test, which includes just two questions each for students in the 4th, 7th, and 10th grades, has been particularly difficult to implement.
“We have had problems because there are so few questions,” Ms. Fitton said. “We need to continue to try to use these types of assessments, but we need to be very careful about the types of decisions we’re making based on them, when we are unsure of their reliability.”
As more states attach high stakes to their tests—such as using scores to make graduation or promotion decisions—officials must take heed of potential deficiencies, educators say.
“This raises a big issue of how quickly and how superficially should we rely on performance-based, high-stakes data when we simply don’t know if it’s telling us what we need,” said Alan E. Farstrup, the executive director of the International Reading Association, based in Newark, Del. “Some very serious decisions are being made about kids’ lives based on instruments we’ve pressed into service for reasons that are laudable, but the consequences could be devastating.”
The main NAEP has run into problems of its own. In 1995, it was discovered that a computer program made mistakes in the scoring of the 1992 assessments in mathematics and reading and the 1994 reading and world geography tests. The achievement levels assigned for those tests were also flawed. NCES officials said at the time that the problem had only a negligible effect on the results.
The Educational Testing Service, which produces the national assessment under contract with the NCES, is responsible for the latest error, according to officials of the NAEP governing board. But the flaws in the testing procedures developed by the nonprofit, Princeton, N.J.-based test-maker were part of the inevitable learning process in creating assessments and were not due to neglect, said governing-board Chairman Mark D. Musick.
An ETS spokesman declined to comment on the matter, referring inquiries to the NCES and the governing board.
The trend test has created confusion because it is disconnected from other NAEP tests, and it has been criticized for not changing to reflect new instructional and curricular practices.
In its 1996 restructuring plan, the governing board said it would try to phase out the entire long-term assessment or find a better way to link it to the main program. The board has adjusted the frequency of the trend test from every two years to every four.
“We value NAEP’s long-term trend information,” Mr. Musick said. “But we have consciously broken the long-term trends when the board, along with NCES, determined that things changed significantly enough that it makes sense to do so.”
News about the writing-assessment data was a disappointment to some experts. But it should not be viewed as a defeat for performance-based tests, said Eva L. Baker, the director of the Center for Research on Evaluation, Standards, and Student Testing, based at the University of California, Los Angeles. “I think it’s really important that NAEP maintain its role as a model to provide open-ended responses for students,” Ms. Baker said. “It’s hard and it’s challenging, but it’s a much better way of [assessing students].”
A version of this article appeared in the March 15, 2000 edition of Education Week as NAEP Drops Long-Term Writing Data