Benchmark Assessments Offer Regular Checkups On Student Achievement
School districts worried about how students will perform on end-of-the-year state tests are increasingly administering “benchmark assessments” throughout the year to measure students’ progress and provide teachers with data about how to adjust instruction.
Nearly seven in 10 superintendents surveyed for Education Week this past summer said they periodically give districtwide tests, and another 10 percent said they planned to do so this school year. Such tests typically are aligned to state or district standards for academic content and given three to five times during the year. Some are given as often as monthly.
Most benchmark assessments take about one hour each for reading and mathematics, and some cover other subjects as well. Extensive reporting systems break down test results by the same student categories required under the federal No Child Left Behind Act, such as race, income, disability, and English proficiency, in addition to providing progress reports at the district, school, classroom, and student levels.
“I do believe that three years from now, certainly five years from now, no one will remember a time when there weren’t benchmarks,” said Robert E. Slavin, the director of the Center for Data-Driven Reform in Education, at Johns Hopkins University.
That’s certainly what test vendors hope. Last year, Eduventures Inc., a market-research firm based in Boston, identified benchmark assessments as one of two high-growth areas in the assessment industry, alongside state exams, with a compound annual growth rate of greater than 15 percent. The company predicted that by 2006, what it called “the formative-assessment market”—using a term sometimes treated as a synonym for benchmark assessment—would generate $323 million in annual revenues for vendors.
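To give a rough sense of what those figures imply, the compound-growth arithmetic can be worked backward from the 2006 projection. The sketch below is illustrative only: the $323 million figure and the 15 percent rate come from the report as cited above, but the three-year window and the function names are assumptions made here, and the report's actual base-year revenue is not given in this article.

```python
def compound_growth(start_value: float, annual_rate: float, years: int) -> float:
    """Value after `years` of growth at a constant compound annual rate."""
    return start_value * (1 + annual_rate) ** years


def implied_start(end_value: float, annual_rate: float, years: int) -> float:
    """Starting value that would grow to `end_value` at the given rate."""
    return end_value / (1 + annual_rate) ** years


END_2006_MILLIONS = 323.0  # projected annual vendor revenue, per the report
RATE_FLOOR = 0.15          # report says "greater than 15 percent", so this is a floor
YEARS = 3                  # ASSUMPTION: a three-year window ending in 2006

# Because the true rate is above 15 percent, this is an upper bound
# on the implied starting revenue, not the report's own number.
base = implied_start(END_2006_MILLIONS, RATE_FLOOR, YEARS)
print(f"Implied starting market: at most about ${base:.0f} million")
```

Under those assumptions, a market that reaches $323 million after three years of growth above 15 percent a year would have started at no more than roughly $212 million.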
But while many assessment experts agree that the idea of frequent testing of students to monitor their learning and adjust instruction is sound, some also warn that districts should take a close look at what they’re getting for their money and how they are using such exams.
“You might say that the message here is, ‘Get a second opinion,’ ” said Grant Wiggins, the president of Authentic Education, a Hopewell, N.J.-based consulting service that works with districts.
It’s no secret why districts are turning to benchmark tests. The No Child Left Behind Act, signed into law by President Bush in January 2002, and states’ own accountability systems have created a high-stakes environment in which both districts and schools can face penalties for failing to meet performance targets.
In this standards-based environment, the feeling is that the sooner and more often schools have information about how they’re doing against the standards, the better.
“The reason that there is a boom in benchmark assessments is that most states and school systems are providing nothing more than autopsy reports right now,” said Douglas B. Reeves, the founder of the Center for Performance Assessment, a private consulting organization based in Denver that works with districts to design fair and rigorous assessments and classroom activities. “They tell you why the patient died at the end of the year, and then marveled that the patient didn’t get any better.”
Studies by the Washington-based Council of the Great City Schools, the Austin, Texas-based National Center for Educational Accountability, and others have found that one feature of high-achieving districts is their use of periodic benchmark assessments to track student achievement and make adjustments.
“Good formative assessments, good benchmark assessments,” Mr. Reeves said, “provide feedback throughout the year, and that is far more fair to principals and teachers, provided they are used wisely.”
In the past few years, according to Eduventures’ 2004 report, “Testing in Flux,” new competitors have flooded the formative-assessment market, including:
• Major test publishers, such as the New York City-based CTB/McGraw-Hill and the San Antonio-based Harcourt Assessment;
• Test-preparation companies, including the New York City-based Princeton Review;
• For-profit providers that specialize in linking assessment results with prescribed remediation plans and curricula, such as the San Diego-based Compass Learning and the New York City-based Kaplan K-12 Learning Services;
• Nonprofit organizations, such as the Portland, Ore.-based Northwest Evaluation Association; and
• Suppliers of “whole-school-reform models,” such as the New York City-based Edison Schools Inc. and Mr. Slavin’s Baltimore-based Success for All Foundation, which designed the 4Sight assessment series.
The products of such suppliers range from formatted tests linked to the standards in individual states, to item banks that districts and schools can use to develop their own assessments, to online testing, scoring, and reporting systems.
Skimming the Surface?
Lorrie A. Shepard, the dean of the school of education at the University of Colorado at Boulder, voices caution about the trend.
While “not all formal benchmarking systems are bad,” she said, she worries about the effects of using 15- or 20-item multiple-choice tests that mirror the format of state exams to drive classroom instruction.
Previous research by Ms. Shepard and others has found that students who do well on one set of standardized tests do not perform as well on other measures of the same content, suggesting that they have not acquired a deep understanding.
“The data-driven-instruction fad means earlier and earlier versions of external tests being administered at quarterly or monthly intervals,” Ms. Shepard said. “The result is a long list of discrete skill deficiencies requiring inexperienced teachers to give 1,000 mini-lessons.”
Good benchmark assessments, she suggested, should include rich representations of the content students are expected to master, be connected to specific teaching units, provide clear and specific feedback to teachers so that they know how to help students improve, and discourage narrow test-preparation strategies.
Rather than trying to assess everything, added Mr. Reeves, the best benchmark tests focus on the most important state or district content standards. And they provide results almost immediately, in simple, easy-to-use formats, he said.
The National Center for Educational Accountability stresses that good benchmark assessments measure performance “on the entire curriculum at a deep level of understanding.” They also begin before grade 3 in both reading and math and provide a process to ensure that data on student performance are reviewed and acted upon by both districts and schools, the center says. In addition to such tests, it adds, districts may provide unit or weekly assessments that principals and teachers can use to monitor student progress.
But in talking about benchmark assessments, not everyone means the same thing.
According to Mr. Slavin, some benchmark tests, like 4Sight, are designed primarily to predict students’ performance on end-of-the-year state exams. They measure the same set of knowledge and skills at several points during the school year to see if students are making progress and to provide an early warning of potential problems.
Other benchmarks are tied more closely to the curriculum, and to the knowledge and skills students are supposed to have learned by a particular time. For example, a skill-by-skill benchmark series in math might focus on fractions in November, decimals in January, geometry in March, and problem-solving in May, rather than testing all skills at the same time, Mr. Slavin said.
Such benchmarks serve as pacing guides for teachers and schools, providing information on whether students have learned the curriculum they’ve just been taught. Some companies claim their tests serve both purposes, predicting students’ ultimate success on state tests and gauging how they’re progressing through the curriculum.
Historically, vendors would design one set of benchmark tests for the entire country. Now they craft tests for each state, starting with the larger ones.
While not everyone means the same thing by the term, benchmark assessments typically:
• Are given periodically, from three times a year to as often as once a month;
• Focus on reading and mathematics skills, taking about an hour per subject;
• Reflect state or district academic-content standards; and
• Measure students' progress through the curriculum and/or on material in state exams.
Many companies also work with districts to design the districts’ own assessments, tied to state and district standards, or permit districts and schools to modify previously formatted exams. Some vendors provide large, computerized item banks that teachers and schools can use to create their own classroom tests and check students’ progress on state standards.
Stuart R. Kahl, the president of Measured Progress, a Dover, N.H.-based testing company, says that while item banks hold great promise, because they permit teachers to design tests that can be used during the ongoing flow of instruction, one issue is whether teachers are prepared to use them appropriately.
“Now we’re putting individual items in the hands of teachers,” he said, “saying, ‘You construct the test; make it as long or as short as you want.’ Do we think they have the understanding to know how much stock they can put in the generalizations they make from such exams?”
Some also worry that as vendors have rushed in, quality has not kept pace. The Eduventures report noted that many vendors have marketed formative assessments “on the basis of the quantity of exam items, as opposed to those items’ quality.” For example, companies may tout having tens of thousands of exam items, it said, although many of the items have not been extensively field-tested or undergone a rigorous psychometric review.
“I think vendors in our space have found it challenging,” said Marissa A. Larsen, the senior product manager for assessment at the Bloomington, Minn.-based Plato Learning Inc., whose eduTest online assessment system is now used in more than 3,000 schools.
While districts sometimes apply the same psychometric standards to benchmark tests that are applied to high-stakes state exams, she said, “in many cases, that’s not what vendors in this space are trying to do. If we did that, it would be well beyond what districts could afford to buy for formative systems.”
Critics also say that even the best benchmark assessments are more accurately described as “early warning” or “mini-summative” tests, rather than as true “formative” assessments, which are meant to help adjust teaching and learning as it’s occurring. In contrast, summative tests are designed to measure what students have learned after instruction on a subject is completed.
“Formative assessments are while you’re still teaching the topic, providing on-the-spot corrections,” said Mr. Kahl. “With benchmark assessments, you’re finished. You’ve moved on. Not that you don’t get individual student information, but at that stage, it’s remediation.”
What Is ‘Formative’?
Yet Eric Bassett, the research director for Eduventures, said the terms formative and benchmark assessments are often used interchangeably in the commercial education market.
And that, some critics say, is precisely the problem.
“I recognize that I’ve lost the battle over the meaning of the term ‘formative assessment,’ ” said Dylan Wiliam, a senior researcher at the Educational Testing Service, based in Princeton, N.J.
In the 1990s, he wrote an influential review that found that improving the formative assessments teachers used dramatically boosted student achievement and motivation. Now that same evidence, he fears, is being used to support claims about the long-term benefits of benchmark assessments that have yet to be proven. “There’s a lack of intellectual honesty there,” Mr. Wiliam said. “We just don’t know if this stuff works.”
He and others say the money, time, and energy invested in benchmark assessments could divert attention from the more potent lever of changing what teachers do in classrooms each day, such as the types of questions they ask students and how they comment on students’ papers.
“If you’re looking, as you should be, at the full range of development that you want kids to engage in, you’re going to have to look at their work products, their compositions, their math problem-solving, their science and social-studies performance,” said Mr. Slavin of Johns Hopkins.
Mr. Wiggins of Authentic Education said that while some commercially produced benchmark assessments are far from ideal, they’re better than nothing. “I would rather see a district mobilizing people to analyze results more frequently,” he said. “That’s all to the good.”
The key point, he and others stress, is what use is made of the data.
“It’s only a diagnosis,” Mr. Slavin said. “If you don’t do anything about it, it’s like going to the doctor and getting all the lab tests, and not taking the drug.”
Vol. 25, Issue 13, Pages 13-14. Published in Print: November 30, 2005, as Benchmark Assessments Offer Regular Checkups On Student Achievement