Test Dilemma: Revisions Upset Trends in Data
By the time 2012 rolls around, Kentucky's residents will almost certainly have a testing system that describes how well the state's children are learning. What they won't know is how far they've progressed since 1992, the advent of the state's landmark school overhaul.
Because the state switched testing programs in 1998, it will lose the trend lines established in 1992 that many educators and policymakers expected to carry through the 20-year quest to improve student achievement. And when testing programs are reworked, or sometimes just tinkered with, states lose the ability to make direct comparisons of student achievement over time, a situation that every state with a testing and accountability program will likely face if it keeps its programs updated.
"People like me thought it would stay the same for 20 years, but that was naive," said Robert F. Sexton, the executive director of the Prichard Committee for Academic Excellence, a Lexington-based backer of Kentucky's school improvement efforts. "I don't think anybody thought of the nuances and the ins and outs over a 20-year period."
Comparing test scores from one test to another is like comparing race times on different marathon courses.
That's just what Kentucky did. The new testing program started assessing students' skills in some subjects at different grade levels. What's more, the new program de-emphasizes so-called performance questions, such as writing essays and showing mathematical reasoning step by step, which are hard to compare from one test to another.
New standards are being written for the new tests. Since 1998, though, the state has used the performance standards from the old test for its accountability system as an interim measure.
State officials in Indiana, New York, and Ohio are learning a lesson similar to Kentucky's as they begin to modify their own testing programs and the standards that outline how students should perform. While neither Indiana nor Ohio is overhauling its testing system as Kentucky did, both are adding new subjects and changing the grade levels at which they test. Consequently, they will be forced to re-evaluate their standards and find ways to connect student achievement from one testing system to another.
"Psychometrically, comparison can be done," said Mary Tiede Wilhemus, a spokeswoman for the Indiana education department. "But there has to be a caveat. There's no way around it."
"Ours may not be as dramatic" as Kentucky's, said Bob Bowers, Ohio's associate superintendent for curriculum and assessment, referring to his state's changes, "but we will have to adjust our trend line a little."
New York, meanwhile, abandoned trend data on the regents exam several times in the 136-year-old testing system's history.
By raising the standards and requiring all prospective graduates to pass the English and mathematics tests, the state essentially declared the previous test scores to be of historical note only, said Roseanne Y. DeFabio, the state's assistant commissioner for curriculum, instruction, and assessment.
"In every subject, we look at the number of students reaching the standards rather than trying to suggest that the exams are equated from the old to the new," she said.
'The Best for Kids'
The only way to preserve the historical data, testing experts say, is to ignore the evolving improvements in the world of assessment. But they recommend that states continually review their testing practices so they can incorporate all the advances in methods and update performance standards to meet changing expectations. If they need to sacrifice the longitudinal data in the process, they shouldn't hesitate, those experts say.
"If I have to pick between doing the best for kids or having a consistent trend line, I'm going to pick doing the best for kids," said Andrew C. Porter, the director of the Center for Education Research at the University of Wisconsin-Madison and an adviser to Kentucky officials.
But advocates of the testing and accountability movement that's under way nationwide are urging states to do everything they can to maintain data that compares achievement across time.
"The whole idea of how you continuously improve your standards and assessment and accountability while continuing to track kids over time is a huge issue," said Matthew Gandal, the vice president of Achieve, a Cambridge, Mass.-based coalition of governors and business executives.
Mr. Gandal encourages all states to change their testing systems and performance standards so they reflect the latest knowledge of testing experts. But he expects that most will be able to preserve the achievement data.
"I would guess it would be a rare state" that has to lose its trend data, he said.
Kentucky inaugurated its Kentucky Instructional Results Information System, or KIRIS, in 1992 and declared it would be the basis for monitoring schools' progress toward the 20-year goal that every student would reach the "proficient" level on the state's performance standards.
In the six-year life of the program, members of the public criticized KIRIS for not producing scores that could be compared against national norms, meaning how students across the country perform on similar tests. Researchers also said the scores weren't accurate enough to use in the state accountability system.
By 1998, the state legislature had replaced KIRIS with the Commonwealth Accountability Testing System, or CATS. The new program not only includes national norm-referenced sections, but has de-emphasized the performance assessments that made it difficult for KIRIS to produce an accurate gauge of student achievement. The program also changed the grade levels at which some subjects are tested.
After such changes, "a direct comparison [between KIRIS and CATS] is fraught with problems," said John P. Poggio, a professor of psychology at the University of Kansas, in Lawrence, and the vice chairman of the board of technical advisers to Kentucky.
"Everybody wants to do a comparison to what it looked like in 1998, to what it looks like in 2000," he said. "People have to recognize that this is a new program."
To set new standards for the new tests, the state has engaged 1,650 teachers over the past year. Their task was to decide what students should know and be able to do to meet each of the state's four performance categories: "novice," "apprentice," "proficient," and "distinguished."
The state school board reviewed that work last month, and is expected to take up the subject again this week.
As they review the teacher panels' reports, state officials are questioning why the proposed standards for CATS yield different results from the KIRIS ones. For example, 31 percent of elementary school students who took KIRIS in 1998 ranked as proficient in reading. Two years later, that proportion jumped to 52 percent under the standards proposed for CATS. By contrast, 26 percent of high schoolers scored as proficient in reading under KIRIS in 1998, but only 21 percent of the same age group would have achieved that level under CATS in 2000.
Helen W. Mountjoy, the chairwoman of the state board, said its members have to understand why the CATS results sometimes diverge from the ones on KIRIS before they adopt new standards.
But Mr. Poggio says the divergence arises because the former test's performance levels were incorrectly established, in contrast to the sophisticated and thorough process he says the state used to set the standards for CATS.
Has Improvement Occurred?
After the Kentucky state board decides how to reset the standards, it will need to figure out a way to explain the changes to the public, which is sure to be skeptical of wild fluctuations in scores.
Because so many teachers participated in setting the CATS standards—something that didn't happen with KIRIS—many of them will be able to explain the standards' content and meaning to their colleagues, Ms. Mountjoy said. "What it's done is provide a level of credibility that we didn't get [under KIRIS]."
But the board also needs to deal with the inevitable question: How can we tell whether student achievement has improved since 1998?
While education officials won't be able to make direct comparisons on the two testing systems, Ms. Mountjoy said, they can look for clues through scores from other programs, such as the National Assessment of Educational Progress.
For the CATS program, however, there's nothing to say other than that it's a new starting point.
"You have to belabor the point that this is where we stand today," Mr. Poggio said.
And even though schools' scores may vary from KIRIS to CATS, none of them will be close to reaching the goal that all Kentucky students attain the proficient level by 2014, the new target date for that achievement.
"Even if some schools get an artificial jump [from CATS], they still have this incredibly challenging goal in front of them," Mr. Sexton of the Prichard Committee said.
A Word of Advice
Testing programs tend to last only about six years before policymakers start fiddling with or overhauling them, according to Mr. Poggio.
In Kentucky, state leaders are hoping that the current round of standards-setting will last for a while.
"I'm not sure we'll make it to 2014," said Gene Wilhoit, the state commissioner of education. "But it'll be nice to have some stability in the system."
Vol. 20, Issue 33, Pages 1, 23
Published in Print: May 2, 2001, as Test Dilemma: Revisions Upset Trends in Data