Vision, Reality Collide in Common-Core Tests
Political, technical, and financial factors constrain original assessment plans
In states across the country, field-testing of the exams that will measure students' mastery of the Common Core State Standards is well underway. Much attention is focusing on the questions that this "testing of the test" will inevitably raise about bandwidth, access for special populations, and standard-setting.
But behind those questions lurks a more conceptual one: In terms of overall execution, how do the exams crafted by the two main state testing coalitions—the Smarter Balanced Assessment Consortium and the Partnership for the Assessment of Readiness for College and Careers, or PARCC—stack up to what they promised in their $360 million bids for federal funding?
There are two ways to consider the question. One is a glass-half-full reading, which focuses on the exams' technological advances and embrace of performance-based assessment. On the flip side, a confluence of political, technical, and financial constraints has led to some scaling back of the ambitious plans the consortia first laid out.
With regard to technology, most students in 2014-15 will take the exams on computers rather than on bubble sheets, for instance. The Smarter Balanced assessment will adapt in difficulty to each student's skill level, potentially providing better information about strengths and weaknesses.
In addition, students taking the PARCC test will write essays drawing on multiple reading sources. And to a level not seen since the 1990s, students taking both exams will be engaged in "performance" items that ask them to analyze and apply knowledge, explain their mathematical reasoning, or conduct research.
Still, the exams, nearing their final stage, contain some notable changes from the designs initially put forward by the consortia nearly four years ago. Both have scaled back the length or complexity of some test elements, and their development of tools and supports for teachers has lagged behind the construction of the year-end tests that will be used to generate school ratings.
In sum, say testing experts, what the consortia have accomplished thus far is more like a first draft of their original goals.
"Both consortia will have tests in 2014-2015 that will be better than almost all existing state tests, if not all. Neither will be as good as promised in their response to the department's [request for proposals]," said Scott Marion, an associate director at the Dover, N.H.-based National Center for the Improvement of Educational Assessment, which advises both consortia. "But if they can survive until 2018, '19, '20, they actually might have something pretty good that comes close to living up to their promises."
The U.S. Department of Education's Race to the Top program envisioned an entirely different approach to testing, one that would provide more helpful and timely information to teachers and students.
Both PARCC and Smarter Balanced proposed testing systems that coupled extended, performance-based tasks with traditional items. And they promised to provide tools and resources that would help teachers translate year-end testing targets into instructional units.
Interviews with consortia officials and advisers paint a picture of somewhat different working cultures in the two organizations. PARCC, a more centralized body, took a conservative approach deeply grounded in measuring the common standards faithfully. Smarter Balanced, by contrast, was more loosely structured and open to experimentation, driven by its belief in the power of adaptive testing.
Both groups will continue to use some multiple-choice or machine-scored questions, but many of those items have been enhanced—allowing students to select multiple answers, for instance, or to drag and drop text from reading passages to cite evidence.
From the beginning, though, the development of the performance-based tasks has been a heavy lift. States initially came to the consortia with very different understandings about what such testing might entail.
"States that tended to have multiple choice thought all it meant was an open-ended question," said Shelbi Cole, the director of mathematics for Smarter Balanced. "Others, mainly those on the East Coast, thought, 'No, it's two-weeks long.' They were on really different sides of the continuum."
And the novelty of some of those formats often meant training and retraining item-writers.
"They needed time to innovate and learn to break free from old writing templates and rules," said Bonnie Hain, the former English/language arts senior adviser for PARCC.
One notable technological issue affected design and price point. Both consortia had expressed interest in using "artificial intelligence" scoring to ease the burden of hand-scoring answers. But as it became clear that AI scoring would not be ready to measure the evidence-based reading and writing skills demanded by the common core, both consortia decided to rely on trained educators to score students' responses to the performance-based tasks. (Each group plans to carry out additional studies of AI scoring, in the hope that it might become feasible in the future.)
A Tough Sell
In essence, testing experts say, the consortia faced one of the quandaries of performance-based assessment: It makes for longer, more expensive exams—a tough sell at a time when resistance to standardized testing and its effect on curriculum is growing from some quarters.
"The point the consortia are emphasizing is that it's very good testing in a sense, and will tell you things we haven't been able to tell you before," said Derek Briggs, a professor of research and evaluation methodology at the education school at the University of Colorado at Boulder who serves on the technical-advisory panels for both consortia. "But it's still a hard sell to a lot of parents and children and people who are already skeptical about testing."
Such constraints affected both consortia's initial proposals. PARCC early on discarded a plan to scatter three smaller tests, administered at equal intervals over the course of the year, in favor of one window for multiple performance tasks followed by a year-end, machine-scored component.
"There was a lot of sensitivity about not trying to influence implementation of the standards in terms of the curriculum and the sequence of instruction," said Jeffrey Nellhaus, the director of policy, research, and design for PARCC.
Smarter Balanced, meanwhile, reduced the number of performance tasks in each subject from three in the initial application to one, consisting of several steps.
"The price point people felt they could manage politically has meant we're doing less than we could have done, and it will not signal as firmly that we want kids to demonstrate their learning," said Linda Darling-Hammond, a Stanford University education professor who advises the Smarter Balanced consortium.
Smarter Balanced has kept, however, a classroom-based introduction and activity for each performance-based segment meant to help level the playing field for students who come to the exam with different levels of background knowledge.
Some of the consortia's decisions also reflect the parameters of the Education Department's grant criteria. The federal agency wanted the year-end tests to go live in the 2014-15 school year—a short timeline for producing the level of complexity demanded, testing experts say.
Meanwhile, the K-12 testing policy inscribed in the No Child Left Behind Act remained unchanged. That meant certain ideas—testing samples of students rather than every child, for instance—couldn't be entertained.
Those constraints, though, shouldn't detract from some real breakthroughs, according to testing experts. Performance testing in K-12 has never been done at the scale it will occur once the two groups' tests go live, they say. And the consortia's advances in that area directly respond to the instructional shifts in the common core.
The performance-based math items created by Smarter Balanced aim to measure whether students exhibit the set of mathematical practices identified in the standards, Ms. Cole noted, such as making sense of problems and persevering in solving them, and reasoning abstractly and quantitatively.
"States have claimed anything involving words is problem-solving," she said. "We are asking students to take some steps in terms of sense-making, which is very different from finding a keyword in a problem like 'altogether' and knowing that you have to add."
There are innovations in the exams' approaches to measuring reading skills, too. Traditionally, states' reading tests have relied on "commissioned" passages written explicitly for the exam—and stripped of interesting, varied syntactical features.
"We wanted authentic texts, because one of the critical things in the common core is that text should be rich and worthy of reading," said Ms. Hain, now a consultant to PARCC. "What you find with a lot of commissioned texts is that they're pablum. The structure is not worth discussing because it's the same old boring, dry, deductive statement, a main idea in the first paragraph, and then three details and a closing sentence."
As a result, she said, PARCC has committed to using only "permissioned" texts drawn from actual novels, books, and journal articles in its reading tests.
About a third of Smarter Balanced reading texts are commissioned, but that group, too, says that using authentic texts is a priority. It has struck an agreement with the Copyright Clearance Center, a Danvers, Mass.-based company that irons out copyright permissions for texts not yet in the public domain.
If many of the breakthroughs focus on the year-end tests, there is a general sense that the development of the supplemental, nonstandardized supports for educators, such as model units, videos, and formative assessments, lags behind.
"While it's not the case that they've done nothing on interim and formative assessment—they have—the first priority is the one with the most accountability in it. You could argue that clearly the formative and interim features have not gotten, in either consortia, the same degree of attention," said Mr. Briggs of the University of Colorado.
Smarter Balanced didn't contract with vendors to begin building its Digital Library until early 2013. That resource, which the group hopes to unveil this summer, will include online training modules, exemplar units, and teacher-submitted resources.
PARCC, meanwhile, is still seeking a contractor for its Partnership Resource Center, an online site that will host released test items, model curricula frameworks, and formative-assessment tools, some provided by states' own repositories.
Although many instructional experts support those efforts, they worry that they are coming too late, since teachers are facing instructional challenges now.
Teachers are aware of the end goals espoused in the common-core standards, but need more support in learning how to break them into manageable units, said Margaret Heritage, an assistant director for professional development at the National Center for Research on Evaluation, Standards, and Student Testing at the University of California, Los Angeles.
"My concern for teachers is getting a handle on these standards and understanding the depth of them, and what it's going to take to reach these deeper-level learnings the standards require," said Ms. Heritage, who sat on Smarter Balanced's formative-assessment advisory panel. PARCC does have an optional diagnostic exam, which teachers can use to better pinpoint students' weaknesses, said Mr. Nellhaus. And Smarter Balanced is now deep in the work of creating the teacher supports.
More than 1,400 K-12 teachers are now helping to generate, and to vet using common criteria, the resources for the Smarter Balanced Digital Library, according to Chrys Mursky, the group's director of professional learning.
Finally, a few elements remain open points of concern.
Smarter Balanced's adaptive-test model has raised a tricky policy dilemma: whether students who are demonstrably performing significantly above or below proficiency should be given test questions outside their grade level.
To date, the federal Education Department has forbidden that practice, citing the requirements of the NCLB law. Smarter Balanced plans to make its case to the agency, with the input of a variety of advocacy groups and assurances that it will institute plenty of safeguards, said Joe Willhoft, the executive director of Smarter Balanced.
"If we have a 4th grade student who is very good in math, we want to open up the pool for them to see harder items," he said. "But we don't want to give them something about the Pythagorean theorem. We want to be sure that if they get it wrong, it's because they don't know the math, not that they've just never seen it before."
Another element lies outside the consortia's direct control but is of equal import: Will districts' notoriously variable technological capacity be able to support a full testing schedule for all students?
"There is the fear certainly in some quarters that it will be an Obamacare-type disaster," said one consultant on the exams who spoke on the condition of anonymity because he continues to work with consortia officials. "If that occurs, it will be more the fault of the Education Department, which insisted on rushing it along."
Both consortia hope that any such glitches will occur during field-testing, allowing enough time for corrections, because any mistake after that point could be costly.
Indeed, support for common assessments seems less assured than for the standards themselves.
Only one state, Indiana, had reversed its adoption of the standards as of mid-April. But criticism of the testing has led several states, including Florida, Georgia, and Pennsylvania, to decide against using the consortia tests. And there are external pressures, too, as a variety of nonprofit and for-profit vendors begin to build suites of tests to compete for market share with the consortia products.
With such pressures looming, many in the assessment community hope the consortia's efforts will continue to grow stronger over time. The tests mark an important shift away from the basic skills that the NCLB-era exams tended to measure, they argue.
"It's important for people to give the consortia a little bit of charity, given the size of the task," said the University of Colorado's Mr. Briggs. "I worry that if they don't have it perfect from the start, then people will want to pull the plug. And then we'd be back to having assessments that look an awful lot like what we had before."
Vol. 33, Issue 29, Pages s8, s10, s12
Published in Print: April 23, 2014, as Vision, Reality Collide in Common Tests