I’m waiting for my T-shirt to arrive in the mail. It’s the one that will say, on the back, “ETS Research Symposium on Through-Course Summative Assessment, Atlanta, Feb. 10-11, 2011.” And on the front, in bold letters, it will say: “So Many Questions, So Few Answers.”
Now, the fact that the questions far outnumbered the answers isn’t necessarily a bad thing, as several of the 100-plus attendees observed. This was, after all, a gathering of researchers, psychometricians, test developers, and state policy and assessment officials. And they were trying, collectively, to share thoughts, concerns, dilemmas, and encouragement with the two big consortia of states that are designing testing systems for the common standards. The leaders of ETS’s Center for K-12 Assessment & Performance Management, which organized the invitational event, figured that now, before the first requests for proposals have even gone out, is a good time to identify the challenges so that designers can bear them in mind when they sit down to build the tests.
And boy, as it turns out, there’s a lot to bear in mind. Using summative through-course assessment for a variety of accountability purposes poses a bunch of challenges that developers haven’t had to deal with before.
To back up and offer a quick refresher: You might recall that when the U.S. Department of Education dangled pots of money to design these tests, it required, among other things, that they be capable of producing information that could provide valuable feedback to teachers as they teach, as well as be used to gauge student growth, assess their readiness for college, and evaluate teacher and school effectiveness. It also required that the tests be far more authentic measurements of students’ knowledge than fill-in-the-bubble assessments. (For a complete description of what the feds required, see our story, or the Notice Inviting Applications published in the Federal Register.)
Those requirements led the two winning consortia, the SMARTER Balanced Assessment Consortium and the Partnership for Assessment of Readiness for College and Careers, to come up with outlines that represent a significant change from most states’ current testing systems. Both involve a distributed, or “through-course,” idea, meaning that students won’t sit down and take the entire test in one day. It will be broken up into components, over time, and the pieces will be rolled into one summative score. (PARCC’s design is far more pronounced in this way than is SMARTER Balanced’s. Take a look at the graphic or PowerPoint depictions of the proposals to get a better sense of this.)
The two-day discussion featured some highly technical stuff, coming from some big names in the business of educational measurement. (You can download the papers that were presented and, pretty soon, see videos of the discussions, on the center’s web page devoted to the symposium.)
Amid the algebraic equations and symbolic representations (which put me in mind of the stage name used by the singer Prince for a while), many intriguing questions were posed that provide much for the assessment consortia to consider.
For instance: If the components of a summative score are given four times during the year, as PARCC plans to do, or spread over a 12-week window at the end of the year, as SMARTER Balanced plans to do, how should test designers figure out what weight to assign to each component? Especially if some of those components are multiple-choice questions, some are constructed-response questions, and others are more extended performance tasks? A paper by ETS’s Randy Bennett, Michael Kane, and Brent Bridgeman waded into this question. In presenting the paper, Bridgeman called it a “gnarly” question.
A “clear learning progression” would suggest putting more weight on items given later in the year, when students are more likely to have learned the material. But not all skills are acquired in a clear, linear way, he said. For instance, he said, it wouldn’t make a huge difference if a student learned proper sentence structure before learning the rules about apostrophes, or vice versa, so assigning more weight to whichever one is taught later doesn’t necessarily make sense. Additionally, he noted, it might be worth giving more weight to skills most important for college or career readiness, regardless of when in the year they are acquired. He also discussed the difficulty of weighting a test that includes both multiple-choice and constructed-response items, noting that it could be tough to properly balance weights for multiple-choice items, which are more reliable, with constructed-response items, which are less so.
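To make the weighting question concrete, here is a minimal sketch of how the same four component scores could roll up into different summative scores depending on the weights. The component names, the scores, and both weighting schemes are hypothetical illustrations, not anything proposed in the paper or by either consortium.

```python
def composite_score(scores, weights):
    """Weighted average of component scores; weights are normalized to sum to 1."""
    if scores.keys() != weights.keys():
        raise ValueError("scores and weights must name the same components")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[name] * weights[name] / total for name in scores)


# Hypothetical student: four through-course components, each on a 0-100 scale.
scores = {"fall": 60.0, "winter": 70.0, "spring": 80.0, "end_of_year": 90.0}

# Scheme 1 (assumed): every component counts the same.
equal_weights = {"fall": 1, "winter": 1, "spring": 1, "end_of_year": 1}

# Scheme 2 (assumed): later components count more, as a clear
# learning progression might suggest.
later_heavier = {"fall": 1, "winter": 2, "spring": 3, "end_of_year": 4}

equal = composite_score(scores, equal_weights)
late = composite_score(scores, later_heavier)
print(f"equal weights: {equal:.1f}, later-heavier weights: {late:.1f}")
```

For a student who improves steadily, the later-heavier scheme yields the higher summative score, which is why the choice of weights is more than a technicality.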
Another for-instance: What’s the best way to design such a test so that its results are reliable? A paper by Michael J. Kolen of the University of Iowa took on this question. It noted that using just a small number of constructed-response questions can lead to a less reliable result. So he suggested developing constructed-response tasks that can be broken up into multiple components that can be separately scored. Additionally, Kolen explored the challenges of making a reliable test that consists of different types of items, and is given in chunks over time, since student proficiency might vary over that time period.
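One rough way to see why more separately scored pieces can help is the Spearman-Brown formula from classical test theory, which predicts the reliability of a test made of k comparable parts from the reliability of a single part. This is a textbook illustration, not necessarily the analysis in Kolen's paper, and the single-part reliability of 0.3 below is an assumed number. The formula also assumes the parts are parallel (equally reliable and measuring the same thing), which separately scored pieces of one task may not be.

```python
def spearman_brown(single_part_reliability, k):
    """Predicted reliability of a test built from k parallel parts."""
    r = single_part_reliability
    if not 0.0 <= r <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    if k < 1:
        raise ValueError("k must be at least 1")
    return k * r / (1 + (k - 1) * r)


# Assumed reliability of one separately scored part (illustrative only).
one_part = 0.3

for k in (1, 2, 4, 8):
    print(f"{k} scored parts -> predicted reliability {spearman_brown(one_part, k):.2f}")
```

Under these assumptions the predicted reliability climbs as scored parts are added, though each additional part adds less than the one before.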
Another paper, by UC-Berkeley’s P. David Pearson, the University of Washington’s Sheila Valencia, and the University of North Carolina’s Karen Wixson, dealt with how to assess reading comprehension, taking into account everything the field has learned about how students acquire those skills. Lauress Wise of the Human Resources Research Organization looked into how best to aggregate the multiple results from through-course tests. ETS’s Rebecca Zwick and the University of Maryland’s Robert J. Mislevy teamed up to explore ways to scale and link the through-course assessments.
John Sabatini of ETS shared what ETS has learned so far as it works on through-course formative and summative assessments as part of its CBAL initiative, and Andrew Ho of Harvard discussed the advantages and drawbacks of various growth models for through-course summative tests. Turning to the impact of new assessment systems on students and teachers, Stan Heffner, Ohio’s assessment chief, stressed the need for a model curriculum to serve as the “how” between the “what” of standards and the “how well” of tests.
It was Heffner whose early quip became a symbol both of the optimism and the skepticism attending these new assessments. He ticked down the list of purposes the new tests are supposed to serve (school accountability, teacher evaluation, adjusting instruction, etc.), and then added: “And we’re giving away free toasters, too.” That got a chuckle, and was repeated lightheartedly by subsequent presenters. But even as he made light of the high expectations, and had plenty of cautions for those designing the tests, Heffner said the enterprise was “an opportunity we can’t afford to miss.”
A version of this news article first appeared in the Curriculum Matters blog.