The New Breed Of Assessments Getting Scrutiny

This is the first story in an occasional series that will examine trends in assessment and new ways of measuring what students know and are able to do.

Two years ago, California won acclaim for its pioneering assessments that asked students for answers in their own words. But today, lawmakers there are scrambling to redesign the testing program after the governor vetoed its continuation last fall.

Arizona this winter suspended an innovative statewide test that asked students to integrate knowledge across subjects because state officials feared the test was not measuring what it was designed to measure.

And in Kentucky, a recent report warned that the state’s trailblazing assessment system may not be reliable enough to determine rewards and sanctions for schools.

New forms of student assessment--portfolios, open-ended questions, and performance tasks--have exploded over the past decade. These new measures ask students to write essays, conduct science experiments, complete projects, and set up and solve problems in mathematics, rather than respond to multiple-choice questions.

Advocates of such assessments say they can improve instruction by offering models of good teaching and encouraging teachers to focus on more complex skills. Many of the tests are being created to measure whether students meet new academic standards developed by the states. Such tests also can provide richer information about what students know and how they learn.

But the early infatuation with these performance assessments has given way to intense scrutiny.

‘Unrealistic’ Enthusiasm?

Particularly when adopted on a large scale, the tests are proving costlier and more complicated to use than expected. And some policymakers now question whether the promise of performance-based assessments has outpaced the available technology.

“There was an initial period of enormous enthusiasm which, in my judgment, was often unrealistic,” said Daniel M. Koretz, a senior social scientist at the RAND Corporation. “And now people are going to have to start asking: Are we getting what we’re paying for?”

How much performance assessments cost to create and maintain is unclear. A report by the U.S. General Accounting Office looked at six states where school districts used both state-developed performance assessments and commercially developed multiple-choice tests. It found the performance assessments were typically almost twice as expensive, averaging $33 a student. Other studies put the estimates much higher.

Today, machine-scored, multiple-choice tests continue to dominate the market. But a 1994 study by the Educational Testing Service found the popularity of performance assessments spreading rapidly. In 1992-93, 38 state programs included writing samples, 17 used some form of performance assessment, and six collected student portfolios. However, most states were implementing the new assessments while continuing to administer traditional multiple-choice tests.

Requirements under the federal Goals 2000: Educate America Act and the Title I compensatory-education program also are pushing states to shift from norm-referenced, multiple-choice tests to more innovative assessments.

Changing Instruction

Ironically, the criticisms of performance assessments have blossomed just as researchers are amassing evidence that such tests may be changing instruction in positive ways.

In 1988, Vermont embarked on a statewide performance assessment that asked students to keep portfolios in writing and mathematics. After studying Vermont’s program for two years, Mr. Koretz concluded that “the effects of that program on instruction have been, on balance, substantial and positive.”

Between 70 percent and 89 percent of the teachers surveyed, for example, reported more discussion of math, explanation of math solutions, and writing about math in their classrooms since the advent of the portfolios.

Nearly three-fourths of the principals interviewed said the program had produced positive changes in instruction.

A study released this winter of Kentucky’s testing program--which included portfolios, performance tasks, multiple-choice, and open-ended questions--found that students there are writing more and doing more group work as a result of the program.

“Teachers, district assessment coordinators, and superintendents report almost unanimously that writing has improved,” noted the report by the Evaluation Center at Western Michigan University.

Lorraine M. McDonnell, a professor of political science at the University of California at Santa Barbara, said there were similar findings in a case study of 24 teachers in six Kentucky schools. Particularly at the elementary level, she said, “I see units that are much more thematic and conceptual; there are a lot more projects, a lot more group work.”

But other studies suggest that states have underestimated--and underfunded--the professional development needed so that teachers can teach in the ways that the tests promote.

“You almost can’t overestimate the amount of time and effort that’s gone into professional development” in Vermont, Mr. Koretz said. That is one reason he cautions against generalizing about the positive effects of performance assessments based on Vermont’s experience.

Mary Lee Smith, a professor of educational-policy studies at Arizona State University, conducted a case study of how instruction has changed at four Arizona schools since the start of the Arizona Student Assessment Program. She found that, with the exception of one suburban school that already was moving in the directions advocated by the state, there was little instructional change.

And in a statewide survey of about 1,350 teachers, she reported, “only about 20 percent felt that adequate professional development had been provided to them or to the teachers in their school.”

Without such assistance, researchers fear, the classroom changes resulting from the new measures will be superficial at best. Teachers could end up shaping lessons to the test format, just as they have done with multiple-choice tests, without focusing on teaching the underlying concepts.

Technical Hurdles

An equally big problem is to demonstrate that performance assessments can be valid, reliable, and fair when used on a large scale. States have run into a host of technical problems when they try to use such measures to compare achievement across students and schools.

In California, problems in how the statewide test was scored and administered led to inaccurate results for a number of schools. And a panel of experts cautioned against reporting scores for individual students.

Vermont reports portfolio results at the state level and for the state’s 60 supervisory unions, which range from individual school districts to clusters of districts. But it does not report results at the school level because of variations in how the portfolios have been implemented and scored.

And in Kentucky, researchers have cautioned against using results from the performance events or writing portfolios to make high-stakes decisions because of problems with reliability.

“The assessment community in California and throughout the nation is being pressed to deliver dependable information when the groundwork for accurate performance assessment has not been laid,” the panel that evaluated California’s program concluded.

Some studies have suggested that errors due to differences in how judges rate the assessments can be kept relatively small if students take on the same tasks under controlled circumstances, if care is taken in training the raters, and if the scoring criteria are well defined. States can provide more reliable scores for schools by using so-called matrix sampling, in which each student completes only a few tasks but more tasks are administered overall. In both Kentucky and Maryland, for example, the state reports scores for schools but not for individual students.

But the pressure to report how individual students fare on the exams is intense. The primary reason Gov. Pete Wilson of California gave for vetoing the California Learning Assessment System was that it did not provide results for individual students.

“The public isn’t sufficiently enlightened to know how a matrix-sampling design works,” said Kathy Kimball, the assistant executive director of the Commission on Student Learning in Washington State. “They want information about their own kids. And politically, the pressure is to get information about individual kids.”

Her commission probably will recommend that the state develop a large-scale, performance-based system that is tied to the state’s academic-content standards and provides individual scores in grades 4, 8, and 10.

In Georgia, the legislature, with the support of the state schools superintendent, may opt to use a multiple-choice, norm-referenced test for its state testing system, primarily because it can yield reliable, individual scores. Norm-referenced tests compare a student’s results to those of a peer group, not to a specified standard for performance or competence.

Validity Questioned

One reason performance assessments have become so popular is that it is assumed they are more valid than multiple-choice tests. For example, they can assess writing by asking students to actually write, rather than merely answer questions about grammar.

But it has proved difficult to document that the tests are measuring what they are supposed to measure. In many cases, Mr. Koretz said, the content and skills that the exams are supposed to tap have not been clearly specified.

Arizona, for instance, has a set of “essential skills” that school districts are supposed to teach. Districts administer a test known as Form A that demonstrates whether students have mastered the essential skills in reading, writing, and math. In grades 3, 8, and 12, a sample of the state’s students takes Form D, a statewide test that was presumed to measure the same skills in a more integrated fashion.

But a recent study found almost no correlation between the two assessments, suggesting that they are measuring different things. Teachers also have complained that the statewide test does not reflect what they are teaching.

Such concerns led state officials to suspend Form D and to ask whether either test is measuring what they thought it was measuring. It also has caused the state to re-examine the essential skills themselves to determine whether they are measurable.

Lisa Graham, Arizona’s new state superintendent of education, said: “I’m a strong believer in what a performance assessment can do, as an additional piece of information about what students know....But what I think happened is that we got so enamored of it, we moved out in front of the psychometric technology. And that’s not O.K.”

Signs of Retreat

“I think we’re already seeing some signs of a retreat from large-scale performance assessments on the part of states,” said Edward H. Haertel, a professor of education at Stanford University. “And where these tests are being used, states are being forced to proceed much more slowly than they had originally hoped.”

“Maybe that’s not surprising,” he said. “Legislators typically tend to minimize technical problems and try to get everything done very fast and on the cheap.”

Mr. Haertel predicted greater public scrutiny of the changes in the goals of schooling that the assessments imply.

In California, for example, critics of the statewide testing system have turned their attention to the curriculum frameworks that undergird the tests.

“What we see is that we have weak subject-matter frameworks,” argued Natalie R. Williams, the director of education affairs for the Claremont Institute, a conservative think tank there. “Our frameworks de-emphasize phonics in the English/language arts. They’re promoting a lot of new, new math.”

“We’ve gone way too far to the extreme to try to get to the higher-order thinking,” she added. “We’ve tossed out the basics to try to get there.”

Seeking a Balance

The solution, some experts suggest, may be to slow down and strike a better balance between performance assessments and more traditional measures.

Parents and politicians need to be brought into the test-development process from the start, these experts argue, and they need to have greater access to test questions.

The National Assessment of Educational Progress, the nation’s primary barometer of student achievement, uses a combination of multiple-choice and performance tasks that have proved valid and reliable.

Gary Phillips, the associate commissioner of the National Center for Education Statistics, which oversees NAEP, said: “The approach that NAEP has taken is to try to strike a balance between the desirability of doing performance assessments and the feasibility. This is why, on most of our assessments, we have a combination of multiple-choice items and performance items, with 50 percent to 60 percent of the students’ time spent on performance items.”

The center expects to release a report this summer on the technical issues related to performance assessments, including their reliability, validity, and fairness.

Some observers fear the backlash against such assessments will reduce the resources needed to make the necessary advances in technology. But few think the movement can be reversed.

Mark J. Fenster, an assistant professor at the Evaluation Center at Western Michigan University, said, “I think the driving force behind the changing assessments are policymakers and people in the business community, who just feel these are better tasks for children to be able to do--write essays, work in groups, work collaboratively--and they’re not as concerned about the technical issues.”

“We’ll never swing all the way back to nothing but machine-scorable tests,” agreed Lorrie A. Shepard, a professor of education at the University of Colorado at Boulder. “So I guess these little battles will have to be waged, and eventually practice will still take a step forward regardless of the war.”

The “Review Session” series is made possible by a grant from the John D. and Catherine T. MacArthur Foundation.

Lynn Olson

Lynn Olson was managing editor of special projects for Education Week. She also covered national policy (including “P-16” issues, NCLB standards, accountability, and reform), assessment, and testing.

A version of this article appeared in the March 22, 1995 edition of Education Week as The New Breed Of Assessments Getting Scrutiny