Two Versions of 'Common' Test Eyed by State Consortium
State consortium to offer long, short versions of test
An unprecedented assessment project involving half the states is planning a significant shift: Instead of designing one test for all of them, it will offer a choice of a longer and a shorter version. The pivot came in response to some states’ resistance to spending more time and money on testing for the common standards.
The plan under discussion here last week among state education chiefs of the Smarter Balanced Assessment Consortium represents the collision of hope and reality, as states confront what is politically and fiscally palatable and figure out how that squares with the more in-depth—and potentially more valuable—approach to testing promised by the consortium.
“There is the dream, and there’s real life,” said one state assessment director attending the meeting. “We’re trying to bridge the two the best we can.”
The evolving two-pronged approach would give states the option of using a version of the Smarter Balanced test whose multiple sessions and classroom activities span nearly 6½ hours in grades 3-5, close to seven hours in grades 6-8, and eight hours in high school, or the group’s original version, which lasts about four hours longer in grades 3-8 and about five hours longer in high school.
Because the assessments would be built on the same blueprint, with a mix of multiple-choice, constructed-response, and technology-enhanced items, as well as lengthy performance tasks, the two versions would deliver comparable results, said Joe Willhoft, the executive director of the consortium. And both would produce the school-, district-, and state-level information needed to meet federal accountability requirements, he said.
Both versions would yield overall scores for each student in mathematics and English/language arts, as well as some results within each of those subjects, such as a separate score for students’ writing and research skills, or for their grasp of math concepts and procedures, Mr. Willhoft said.
But because a shorter version of the test is more limited in what it can validly say about an individual student’s performance, an extended version—with more items of each type—would be needed to make the finer-grained “claims” about each student’s learning in multiple areas of each subject that can yield richer portraits for teachers, parents, and state officials, Mr. Willhoft said.
It would be up to each state to choose which version of the assessment it uses. Early signs suggest that public antipathy toward testing and states’ tight fiscal straits are leading more than a few to consider the shorter version. It was pressure from chiefs within the Smarter Balanced consortium that prompted the group earlier this year to explore the option of two versions.
Mr. Willhoft said the 25-state consortium—whose members span the country, from California and Washington state to Missouri, South Carolina, and Maine—must respond to the needs of its states, or risk losing their membership. And if the consortium loses too many states, it can’t stay in operation. Federal rules require each consortium—Smarter Balanced and the 23-member Partnership for Assessment of Readiness for College and Careers, or PARCC—to have at least 15 members to qualify for federal funding. The two consortia are using $360 million in aid under the U.S. Department of Education’s Race to the Top program to design the tests and related projects.
“Having this shorter version at least keeps them in the game,” Mr. Willhoft said. “If all we had was the original, extended version, they might walk.”
PARCC officials said there is no discussion in that group about offering two versions of its test, though Smarter Balanced officials see such a discussion as inevitable in that group as well.
The pressure within Smarter Balanced to offer a shorter version is unsettling for the group’s biggest advocates, who contend that its vision, while lengthening testing in some states, offers immense promise to make tests a more meaningful gauge of achievement and also a form of instruction.
'An Audible Gasp'
Idaho’s current tests take three hours or less, said Carissa M. Miller, the co-chair of Smarter Balanced’s executive committee. So it’s no small thing to consider exams that could double—let alone quadruple—that amount of time.
“I presented that to district superintendents, and there was an audible gasp,” said Ms. Miller, Idaho’s deputy superintendent for assessment, content, and school choice.
But she and the state’s superintendent of public instruction, Tom Luna, believe so strongly in the value of the detailed information the longer version of the Smarter Balanced assessment will yield that they are working hard to win support from their fellow educators, she said.
“You asked for authentic assessments,” Ms. Miller said she tells them. “Authentic assessment takes time.”
Idaho has not yet decided which version of the test it will use. Neither has Missouri, according to state Commissioner of Education Chris L. Nicastro.
“There are many unanswered questions and a lot of anxiety about the tests,” she said. “The additional rigor and higher expectations of the common standards wouldn’t make it unreasonable to expect the tests to be a little bit longer. But still, we have some folks concerned about testing.”
The U.S. Department of Education, which must review and approve changes in either consortium’s assessment plan, is working with Smarter Balanced officials to refine the design of its two versions so the consortium can present them to its governing board for approval in late November, consortium officials said.
Ann Whalen, a top aide to U.S. Secretary of Education Arne Duncan, said the designs must meet key aims the department had in funding the project.
“While there are different ideas and approaches under discussion, at the end of the day, these assessments must measure critical thinking, paint a very clear picture of which students are doing well, and which need more help, indicate whether students are college- or career-ready, and give students and teachers the information they need to improve,” she wrote in an email. “This is an absolute priority for us and will help us better serve the needs of children.”
Some consortium members, and some of its closest advisers, worry privately that too many states will opt for the cheaper, shorter version of the test, leaving few—if any—to prove that the greater investment of time and resources in “tests worth teaching to” is worth it in the long run.
“They may opt for a shorter version, but what you lose in that is a greater ability to say detailed things about the depth of what students know and can do,” said Derek C. Briggs, a nationally recognized assessment expert from the University of Colorado at Boulder who serves on both consortia’s technical advisory committees.
“It’s a slippery slope,” he said. “Once you start down that path, you may start losing the advantages of a groundbreaking assessment system and it might start resembling the testing systems we have now.”
Experts cautioned that it can be daunting to build shorter and longer versions of a test without sacrificing the ability to compare results from one version with those from the other. It’s also difficult to create a shorter version that measures a set of standards as meaningfully and consistently as a longer version, they said. Doing so requires careful attention to a host of psychometric and statistical concerns.
Gregory J. Cizek, a professor of educational measurement and evaluation at the University of North Carolina at Chapel Hill, said there are many examples of multiple versions of tests in use, such as the Iowa Tests of Basic Skills, the TerraNova, and states’ modified assessments for students with disabilities. Longer versions of a test may deliver higher levels of reliability and validity, he said, but shorter versions can produce levels that are still quite acceptable.
The “prime validity target” in educational testing is content validity, the faithfulness with which the test measures the content of the standards, said Mr. Cizek, who serves on the Smarter Balanced technical advisory committee.
“A shorter test will reflect them a little bit less with fewer items to cover that terrain, so the validity is reduced a little bit,” he said.
The key, he said, is to take care to make only those claims about student performance that are appropriate to the validity of the assessment.
'Good Luck and Bad Luck'
Another nationally known assessment expert, who declined to be identified because of the politically sensitive nature of the consortium work, cautioned that a shorter version of a test will have more measurement error than a longer version.
That doesn’t cause a problem when making inferences about certain results, such as the average score of all students who took the tests, he said. But he said that for others—such as the proportions of students who scored at various achievement levels—it can cause significant distortion, tending to concentrate performance at the extreme ends of the spectrum.
“It’s like free throws in basketball,” he said. “If you give people five shots, some will get all five and some will get zero. If you let them shoot 100 times, hardly anyone will get zero or 100. With a short test, there is more spurious good luck and bad luck happening.”
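The expert’s free-throw analogy can be sketched in a quick simulation (a hypothetical illustration, not part of the consortium’s analysis): with the same underlying 70 percent success rate, a five-item sample produces far more all-or-nothing scores than a 100-item sample.

```python
import random

random.seed(0)

def extreme_fraction(n_shots, p=0.7, n_shooters=10_000):
    """Fraction of shooters who make either all or none of their shots."""
    extremes = 0
    for _ in range(n_shooters):
        made = sum(random.random() < p for _ in range(n_shots))
        if made in (0, n_shots):
            extremes += 1
    return extremes / n_shooters

short_rate = extreme_fraction(5)    # five "items": many perfect or zero scores
long_rate = extreme_fraction(100)   # 100 "items": scores cluster near 70 percent
print(short_rate, long_rate)
```

With a 70 percent success rate, roughly 17 percent of five-shot shooters land at an extreme (0.7^5 alone is about 0.17), while essentially no 100-shot shooter does, which is the "spurious good luck and bad luck" the expert describes.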
There are statistical methods that can be applied to enable sound results in such cases, this expert said. But he expressed doubt that states have the capacity to apply those methods consistently to ensure accurate, responsible interpretations of test results.
And when they move from interpreting the two versions of the tests for groups, as states are expected to do for accountability, to using them to make decisions about individual students—as they plan to do in deciding whether high school students are “college and career ready”—the risk increases, he said.
“Any inferences about an individual from a shorter test will be noisier and less reliable,” the expert said.
“If you’re going to make decisions about people,” he said, “you’d hate to make them based on a test where 30 percent of the time you would make a different decision if you used the long instead of the short version of the test.”
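The risk of flipped decisions can also be illustrated with a simple simulation (hypothetical cutoff, ability range, and test lengths, not the expert’s actual figures): students whose true ability sits near a pass/fail cutoff are classified differently by a short and a long test a noticeable share of the time.

```python
import random

random.seed(1)

CUT = 0.6  # hypothetical "ready" cutoff: 60 percent of items correct

def pct_correct(p, n_items):
    """Observed percent-correct for a student with true per-item chance p."""
    return sum(random.random() < p for _ in range(n_items)) / n_items

def disagreement_rate(n_short=25, n_long=100, n_students=5_000):
    """Share of students whose pass/fail call flips between a short
    and a long test of the same underlying ability."""
    flips = 0
    for _ in range(n_students):
        p = random.uniform(0.4, 0.8)  # true ability spread around the cutoff
        short_pass = pct_correct(p, n_short) >= CUT
        long_pass = pct_correct(p, n_long) >= CUT
        if short_pass != long_pass:
            flips += 1
    return flips / n_students

rate = disagreement_rate()
print(rate)
```

The exact rate depends on the assumed ability distribution, cutoff, and test lengths; the point is only that shortening a test measurably raises the chance that the same student lands on opposite sides of the cutoff.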
Vol. 32, Issue 04, Pages 1, 19. Published in Print: September 19, 2012, as Two Versions of 'Common' Test Eyed