Cutoff Scores Set for Common-Core Tests
In a move likely to cause political and academic stress in many states, a consortium that is designing assessments for the Common Core State Standards released data Monday projecting that more than half of students will fall short of the marks that connote grade-level skills on its tests of English/language arts and mathematics.
The Smarter Balanced Assessment Consortium test has four achievement categories. Students must score at Level 3 or higher to be considered proficient in the skills and knowledge for their grades. According to cut scores approved Friday night by the 22-state consortium, 41 percent of 11th graders will show proficiency in English/language arts, and 33 percent will do so in math. In elementary and middle school, 38 percent to 44 percent will meet the proficiency mark in English/language arts, and 32 percent to 39 percent will do so in math.
Level 4, the highest level of the 11th grade Smarter Balanced test, is meant to indicate readiness for entry-level, credit-bearing courses in college, and comes with an exemption from remedial coursework at many universities. Eleven percent of students would qualify for those exemptions.
The establishment of cut scores, known in the measurement field as “standard-setting,” marks one of the biggest milestones in the four-year-long project to design tests for the common standards. It is also the most flammable, since a central tenet of the initiative has been to ratchet up academic expectations to ensure that students are ready for college or good jobs. States that adopted the common core have anticipated tougher tests, but the new cut scores convert that abstract concern into something more concrete.
Smarter Balanced is one of two main state consortia that are using $360 million in federal funds to develop common-core tests. The other group, the Partnership for Assessment of Readiness for College and Careers, or PARCC, is waiting until next summer—after the tests are administered—to decide on its cut scores. Smarter Balanced officials emphasized that the figures released Monday are estimates, and that states would have “a much clearer picture” of student performance after the operational test is given in the spring.
More than 40 states have adopted the common standards—the product of an initiative led by the nation’s governors and state schools chiefs—and most belong to one of the assessment consortia. Seventeen of Smarter Balanced’s 22 members plan to use the consortium’s test this school year.
Smarter Balanced based its achievement projections on 4.2 million students’ performance on field-test items last spring. Using cut scores that were set in meetings with hundreds of educators in Dallas this fall, the consortium estimated how many students would score at each level on its test. Two people who took part in that process confirmed that the final cut scores approved by state chiefs, in consultation with top officials in their states, were very close to those recommended by the Dallas panels.
One participant said that when the standard-setting panelists saw the data projecting how many students would fall short of proficiency marks with their recommended cut scores, “there were some pretty large concerns. And it was very evident that this was going to be a problem from a political perspective.”
“The scores that came out of those rooms were close to the rigor level of NAEP,” said another participant, referring to the National Assessment of Educational Progress, a federally administered test given to a nationally representative sampling of students that is considered a gold standard in the industry. “That was sure to freak out some superintendents and governors.” He had anticipated that the state schools chiefs would lower the marks significantly before approving them, and he said he was “impressed and pleased” that they didn’t.
If the achievement projections hold true for the first operational test next spring, state officials will be faced with a daunting public relations task: convincing policymakers and parents that the results are a painful but temporary result of asking students to dig deeper intellectually so they will be better prepared for college or good jobs.
Managing the Message
Statements by Smarter Balanced officials previewed the kinds of arguments state officials will likely have to make.
“Because the new content standards set higher expectations for students and the new tests are designed to assess student performance against these higher expectations, the bar has been raised,” Joe Willhoft, the group’s executive director, said in a statement on Monday. “It’s not surprising that fewer students could score at Level 3 or higher. However, over time, the performance of students will improve.”
Many state officials cautioned against comparing the projected performance on the Smarter Balanced test with performance on their current tests, because the tests themselves are different, and they test different material. And indeed, some Smarter Balanced states are likely to see big drops.
California, the biggest state using the new test next spring, turned in proficiency rates as high as 65 percent in some grades on its most recent English/language arts tests. Two-thirds or more of Delaware’s students cleared the proficiency mark on its tests in both subjects in 2014.
Some experts cautioned against assigning too much meaning to the projected levels of student performance.
Daniel Koretz, a Harvard University professor of education who focuses on assessment, said studies show that the numbers of students who score at each proficiency level can vary greatly, depending on which method of setting cut scores is used.
“I would ask to what extent what we’re seeing is a difference in the way standards are set, and to what extent it’s the content of the test,” he said. “People typically misinterpret standards to mean more than they reasonably do. They think psychometricians have found a way to reveal the truth of the distinctions between ‘proficient’ and ‘not proficient.’ But it’s just an attempt to put a label on a description of student performance.”
Activists who oppose high-stakes standardized tests went further in their criticism of the anticipated Smarter Balanced performance levels.
“People should take this with a pound of salt,” said Monty Neill, the executive director of the National Center for Fair & Open Testing, or FairTest, an advocacy group based in Boston. “The deliberate intent is to create more difficult standards. So when the result is that your child, your school, your district doesn’t look as good, it’s because the test is made deliberately more difficult.
“The big issue is that we’re trying to control education through a limited set of standardized tests, and we know from No Child Left Behind that that doesn’t work,” he said, referring to the nearly 13-year-old federal law that made states’ annual testing of students the main lever of accountability for achievement.
How Scores Will Be Used
States in the Smarter Balanced consortium must report student performance on the test in order to meet accountability requirements. Those reports could pose a political challenge for states, and put keen pressure on districts, schools, and teachers. They could be accompanied by consequences as dire as school restructuring, depending on the details of each state’s accountability plan.
It is up to each state to decide how to use the test scores. The results could be factored into teachers’ evaluations, although some states have won delays in that requirement through waivers from the U.S. Department of Education. The results also could drive high-stakes decisions, such as grade-to-grade promotion and high school graduation. Few, if any, Smarter Balanced states plan to use them that way in 2014-15.
Advocates of the tests designed by Smarter Balanced and the other state consortium, PARCC, argue that the transitional stress of lower scores is justified by powerful payoffs.
Instead of each state giving its own test, half the states are giving the same two tests, allowing a shared concept of proficiency and an unprecedented level of cross-state comparison, they contend. And instead of gauging superficial knowledge through bubble sheets, the new exams plumb students’ skills more deeply, with lengthy performance tasks that require students to justify their conclusions in math and supply evidence for their interpretations in English/language arts, those advocates say.
“We have an opportunity to change what assessment means inside our classrooms, an opportunity to make it really be about improving teaching and learning,” said Deborah V.H. Sigman, a member of the Smarter Balanced executive committee and a deputy superintendent in California’s Rocklin Unified School District.
During closed-door discussions to consider the new cut scores, some state leaders voiced uneasiness about reducing the complexity of student performance to four categories, instead of expressing its range in scale scores. Vermont abstained from the vote because of such concerns. (New Hampshire abstained for other reasons.) In response to the concerns about interpretations of the scoring categories, Smarter Balanced states approved a position paper encouraging states to take a broader view when discussing student achievement.
“There is not a critical shift in student knowledge or understanding that occurs at a single cut-score point,” the paper said. “Achievement levels should be understood as representing approximations” of the levels at which students demonstrate mastery. States should consider additional data, such as grades and portfolios of student work, when evaluating student performance, the paper said.
Drawing the Line
The Smarter Balanced consortium established its cut scores in a lengthy process that began with defining what achievement should look like at each of the four levels. In September, it invited members of the public to rate the difficulty of test items online, and 2,600 did so.
In October, about 500 reviewers—mainly classroom teachers, principals, curriculum directors and higher education faculty—gathered in Dallas to study those descriptions of achievement and to review booklets of test items, arranged in order of difficulty. In separate panels by subject and grade level, they examined the items and decided the points that distinguished the four levels of achievement.
After rounds of discussion, the results were aggregated into cut scores for each grade and subject. The panelists considered performance data from national tests such as NAEP and the ACT college-entrance exam for comparison as well. Sixty of the Dallas panelists later reviewed the cut scores across grades for consistency.
Yvonne Johnson, a parent and PTA leader from Delaware, served on the 3rd grade math cut-score-setting panel. She said that in early rounds, she was inclined to set the cutoff points much lower than those many of her fellow panelists favored.
“We would be working on Level 2, and I’d want to set it on Page 7, and everyone else was at, like, Page 25 and above,” she said. “I thought, ‘Wow, are these appropriate? This is a very rigorous standard.’ But I learned that they’re aiming for something higher. I’m used to, ‘2 plus 2 is 4; you’re right.’ But here, you’d want the student to explain why he got that answer, to justify.”
In the end, Ms. Johnson said, she felt confident that the cut points were set at the “appropriate” level, with extended discussion, input, and agreement from all panelists.
Many in the field of educational measurement expressed concern, though, about Smarter Balanced’s decision to set cut scores based only on field-test data. More often, states establish those scores after the operational test is given, and that is what PARCC will do.
“It’s really bizarre to set cut scores based on field-test data,” said one state education department psychometrician. “You can’t possibly project” accurately what proportions of students will score at the four levels of the test. He and other assessment experts said that field-test data are not good predictors of performance on the operational test because students are unfamiliar with the test, and often, teachers have had less experience teaching the material that’s being tested.
And students might lack motivation to do their best on a field test, experts said.
“The good news is that whatever they’re anticipating now [in student-proficiency rates] will get better,” one assessment expert said of the Smarter Balanced approach.
Vol. 34, Issue 13, Pages 1, 11, 13
Published in Print: December 3, 2014, as Consortium Sets High Bars for Its Common-Core Tests