When local newspapers ranked schools in an affluent Maryland district on the basis of test scores, Broad Acres Elementary School always ended up near the bottom.
The 500-student school is in one of Montgomery County’s poorest neighborhoods, and more than 90 percent of its students qualify for free or reduced-price lunches.
“It was rather a discouraging and demoralizing situation for the teachers to always think that we weren’t doing as well as other schools,” says Principal Mary D’Ovidio.
But the rankings didn’t tell the whole story.
Two years ago, the suburban Washington district divided its elementary schools into five groups, based on the percent of their students who had ever qualified for lunch subsidies, a common measure of child poverty. Officials then compared the performance of schools within those groups.
Under the new analysis, Broad Acres shone. Its average test scores exceeded those of schools with similar child-poverty rates and were higher than those for many schools with fewer poor students.
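Montgomery County’s approach can be pictured with a few lines of code. This is only an illustrative sketch: the band boundaries, school names, and scores below are hypothetical, not the district’s actual data.

```python
# Sketch of a banded comparison: group schools by the share of students who
# have ever qualified for subsidized lunch, then rank each school only
# against schools in the same poverty band.

def assign_band(poverty_rate, n_bands=5):
    """Map a poverty rate in [0, 1] to one of n_bands groups (0 = lowest poverty)."""
    # A rate of exactly 1.0 falls into the top band.
    return min(int(poverty_rate * n_bands), n_bands - 1)

def rank_within_bands(schools, n_bands=5):
    """Return {band: [(school, score), ...]} sorted best-first within each band."""
    bands = {b: [] for b in range(n_bands)}
    for name, poverty_rate, avg_score in schools:
        bands[assign_band(poverty_rate, n_bands)].append((name, avg_score))
    for b in bands:
        bands[b].sort(key=lambda pair: pair[1], reverse=True)
    return bands

# Hypothetical data: (school, poverty rate, average test score).
schools = [
    ("Broad Acres", 0.92, 61.0),  # high poverty, strong score for its band
    ("Hillcrest",   0.95, 48.0),
    ("Oak Grove",   0.15, 72.0),
    ("Riverview",   0.10, 66.0),
]

bands = rank_within_bands(schools)
# Within the highest-poverty band, Broad Acres ranks first, even though its
# absolute score trails the low-poverty schools.
```

The point of the exercise is that a school is never compared against schools serving a very different population, only against its demographic peers.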
Thanks to the new analysis, “the teachers feel valued and that they’re doing a great job,” Ms. D’Ovidio says. “And that encourages them to try and strive even harder.” This year, the district recognized Broad Acres as one of 10 schools in the county that had made the most significant gains on state tests over the past four years.
The school’s example offers an instructive lesson for policymakers who want to hold schools accountable for the performance of their students.
The most common practice is to judge schools based on their students’ average test scores, or the percent who score at or above a certain level. Such measures may be useful for identifying students who need help. But the strong correlation between test scores and socioeconomic background has led many critics to argue that such rankings have more to do with the characteristics of the students who attend a school than with how well its educators are doing their jobs.
They argue that a much fairer way to assess the productivity of individual schools is to look at how much “value” the school adds by focusing on gains in its students’ test scores over time.
Other researchers take the notion a step further to argue that states and districts should actually try to weed out the influence of such nonschool factors as poverty and race, by adjusting test scores statistically.
Tennessee, for example, focuses on gains in student achievement to help judge the effectiveness of both schools and teachers. Districts such as Dallas and Minneapolis provide financial rewards to schools based, in part, on how much test scores improve. In March, the Consortium on Chicago School Research, an independent group of local organizations, recommended that the Windy City switch from identifying low-performing schools based on the percent of students who score at or above national norms to an approach that would focus more on student growth.
That same month, Britain announced that it will report the performance of secondary schools based on how far their students improved between the ages of 14 and 16, as well as on each school’s final results. It is the first time that any such “value added” element will be included in the country’s autumn ranking of schools.
Researchers have also used value-added techniques to identify schools or teachers that do an exceptionally good job of educating their students and to analyze what they do differently.
William L. Sanders, who designed the Tennessee Value-Added Assessment System, a statistical process for measuring the influence of districts, schools, and teachers on student learning, claims that focusing on student improvement rather than absolute scores is the “only fair, reasonable thing to do if you’re going to have an accountability system.” That way, he argues, educators are not held responsible for factors beyond their control.
Concerns About Complexity
Some educators, however, raise serious concerns about such an approach. Many worry that the sophisticated statistical techniques used to provide a value-based picture of learning are so complex that the public can’t understand them. Public support is considered a crucial element of accountability efforts, and states and districts have long been criticized for using language and statistics that confuse, rather than enlighten, parents and taxpayers.
And some educators worry that focusing solely on whether students have improved provides an incomplete picture of achievement.
They’re also concerned that if test scores are adjusted based on such factors as student poverty or race, it will send a chilling message that schools expect less of some children than of others simply because they are poor or members of a minority group.
“As an idea, it’s very appealing,” says Carol Ascher, a senior research scientist at the Institute for Education and Social Policy at New York University. “It feels very progressive. It feels fair.”
But, she adds, “the execution of it is so problematic.”
In arguing for the use of value-added considerations in judging school performance, experts cite three main problems with relying solely on average or absolute scores:
- Factors over which schools have no control, such as students’ race or income levels, are highly correlated with test scores. If an accountability system doesn’t account for such factors, schools like Broad Acres Elementary are often at a disadvantage.
Helen F. Ladd and Charles T. Clotfelter, two researchers at Duke University in Durham, N.C., analyzed the test scores of more than 41,000 5th graders in 571 elementary schools in South Carolina. They found little correlation between a school’s average test scores and other measures that focused more on school improvement. But there was a very high correlation between those test scores and the percent of students who were black or received subsidized school meals.
- Average or absolute scores don’t take into account the prior achievement levels of students entering a school. That puts schools whose students begin with lower average test scores at a disadvantage. And it’s particularly problematic for schools with highly mobile students. Judging schools based on the scores of students who have been there only for a few weeks doesn’t tell you much about the effectiveness of a school.
- Evaluating schools based on the percent of their students who score at or above a certain standard has additional problems. In particular, it creates an incentive for schools seeking to avoid being classified as “failing” or “low performing” to focus their efforts on those youngsters who perform just below the bar. There is less incentive for them to help students who are already over the bar or who are so far below it that there is little chance of their ever meeting the standard.
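The threshold incentive in the last point is easy to see in miniature. In the hypothetical sketch below, a school judged on its percent-above-the-bar gains the most by concentrating on the students just under the cutoff (the names, cutoff, and scores are invented for illustration).

```python
# With a fixed passing cutoff, only students scoring just below it can move
# the pass rate with a small amount of extra help -- the "bubble" students.

CUTOFF = 70

def bubble_students(scores, window=5):
    """Students within `window` points below the cutoff: the cheapest way
    to raise a pass rate, at the expense of everyone else."""
    return {s: v for s, v in scores.items() if CUTOFF - window <= v < CUTOFF}

scores = {"ana": 94, "ben": 68, "cal": 66, "dee": 31}
# Only ben and cal can lift the pass rate with a small push; ana is already
# over the bar, and dee is too far below it to matter to this metric.
```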
Such criticisms highlight just some of the problems inherent in the current wave of attempts to evaluate and judge schools. While those efforts may prevent schools from wallowing in mediocrity or failure year after year, a relentless focus on test scores and achievement can also send schools some dangerous messages.
Dangers of Accountability
If not carefully planned and implemented, accountability systems can lead schools to behaviors and strategies that contradict their stated purposes--what one researcher calls “perverse incentives against good teaching.”
A struggling school, for example, may be led to cast aside curriculum and the notion of treating all students fairly in a last-ditch effort to raise the test scores of the small proportion of students who fall just below an arbitrary cutoff point. Or, a too-rigid accountability system may cause students or teachers in the lowest-performing areas to lose hope or to feel they’re being blamed for factors their counterparts in well-to-do, high-achieving schools don’t face.
As a result, says Robert H. Meyer, an assistant professor of public policy at the University of Chicago, “many educators and scholars fear that poorly implemented performance indicators could ultimately be worse than no indicators at all.”
Schools may encourage low-achieving students to stay home on exam day, classify them as handicapped or non-English-speaking so that they are exempted from the tests, or treat them so badly that they are encouraged to drop out.
Equally problematic, Mr. Meyer argues, policymakers may mistakenly expand programs that don’t actually contribute much to students’ learning and shut down schools that do.
The Tennessee Model
Among all the states, Tennessee is probably the best known for its work on a value-added assessment system. Since 1990, the Tennessee Comprehensive Assessment Program has measured student performance in reading, math, language arts, science, and social studies each year in grades 2 to 8.
The Tennessee Value-Added Assessment System, which includes some 3 million records on the state’s schoolchildren, is able to use this information to track gains in the test scores of individual students over time.
Those individual scores are then computed to provide information about classrooms, grades, and schools. Based on that data, Mr. Sanders says, the state has found enormous variability in how much schools add to students’ achievement. “We have absolutely superior schools, average schools, and schools that need lots of improvement,” he says.
And he notes that the particular effectiveness of any school cannot be predicted based on its students’ race or poverty level.
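The gain-score idea at the heart of such a system can be sketched very simply. This is a bare-bones illustration with invented records; the actual Tennessee system uses a far more elaborate mixed statistical model over millions of records.

```python
# Track each student's score year over year, then average the gains per
# school: schools are judged on growth rather than on absolute level.

def school_gains(records):
    """records: list of (school, student, year, score).
    Returns {school: mean year-over-year gain across its students}."""
    by_student = {}
    for school, student, year, score in records:
        by_student.setdefault((school, student), []).append((year, score))
    totals = {}
    for (school, _), history in by_student.items():
        history.sort()  # chronological order
        for (y0, s0), (y1, s1) in zip(history, history[1:]):
            total, n = totals.get(school, (0.0, 0))
            totals[school] = (total + (s1 - s0), n + 1)
    return {school: total / n for school, (total, n) in totals.items()}

records = [
    # A low-scoring school whose students grow a lot...
    ("Broad Acres", "a", 1996, 40), ("Broad Acres", "a", 1997, 52),
    ("Broad Acres", "b", 1996, 45), ("Broad Acres", "b", 1997, 55),
    # ...versus a higher-scoring school whose students barely move.
    ("Oak Grove", "c", 1996, 70), ("Oak Grove", "c", 1997, 72),
]

gains = school_gains(records)
# Broad Acres averages +11 points of growth; Oak Grove averages +2,
# even though Oak Grove's absolute scores are higher.
```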
Joel D. Giffin, the principal of Maryville Middle School, a 1,030-student school in Maryville, Tenn., says educators there sit down and analyze such scores each year to look for strengths and weaknesses, and adjust their teaching strategies and curriculum accordingly.
But Mr. Giffin admits that many schools don’t use the scores as a diagnostic tool. “People see it as a threat,” he says. “Secondly, it really takes some hard work by somebody--in my case, I think it’s my responsibility--to know that information inside out and upside down.”
‘Takes Away the Excuses’
Dallas school officials also have calculated a school improvement index, which they use to provide financial rewards to schools.
It uses a sophisticated statistical approach known as hierarchical linear modeling to predict the test scores of individual students and groups of students, after taking into account factors that are beyond a school’s control. At the student level, those include such factors as gender, race, fluency in English, and poverty. At the school level, they include such things as student mobility, school overcrowding, and the average family education level. Schools whose students greatly exceed their predicted scores are eligible for rewards.
“The main reason to use value-added is because students start at different places,” says William J. Webster, the assistant superintendent for accountability and information systems. “If my goal is to have 90 percent of my students passing a particular test, for example, I have a very different problem if I have 70 percent of my students already passing that test versus if I have 20 percent passing.”
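The residual logic behind such a system can be shown with a deliberately simplified sketch: predict each school’s score from a factor outside its control with a one-variable least-squares fit, and treat the leftover (actual minus predicted) as “value added.” All data below are hypothetical, and the real Dallas model uses hierarchical linear modeling over many student- and school-level variables, not a single regression.

```python
# Residual-based "value added": a school's contribution is measured as how
# far its actual score lands above or below the score predicted from
# demographics alone.

def fit_line(xs, ys):
    """Ordinary least squares slope and intercept for y ~ a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def value_added(schools):
    """schools: list of (name, poverty_rate, avg_score).
    Returns {name: actual score minus score predicted from poverty alone}."""
    xs = [p for _, p, _ in schools]
    ys = [s for _, _, s in schools]
    a, b = fit_line(xs, ys)
    return {name: score - (a * p + b) for name, p, score in schools}

# Hypothetical schools: (name, poverty rate, average score).
schools = [
    ("A", 0.0, 80.0), ("B", 0.5, 60.0), ("C", 1.0, 40.0), ("D", 0.5, 70.0),
]
va = value_added(schools)
# School D scores well above what poverty alone predicts, so its residual
# is positive; the residuals across all schools sum to zero by construction.
```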
In Maryland, Montgomery County’s much simpler attempt to account for demographic factors does not result in rewards for individual schools.
But successful schools are recognized by their peers and encouraged to share their practices. Schools that don’t perform well compared with others with similar student populations are asked what they might do differently.
“It takes away the excuses,” says Steven G. Seleznow, the 127,000-student district’s associate superintendent for school administration.
Researchers at the University of Maryland, led by Willis D. Hawley, the dean of the college of education, have used a value-added analysis to compare elementary schools across the state in five categories: suburban, high income; suburban, moderate income; suburban, low income; urban, low income; and rural, low income.
They identified schools that score far better or worse than predicted, after controlling for such factors as school size and student poverty. By comparing two of the most successful schools in each group with one of the least successful, they have isolated more than a dozen characteristics associated with effective schools.
Among them are: a principal who is a strong instructional leader; good teachers; focused, ongoing staff development; and the continuous use of test data to identify school goals, modify instruction, and help shape practices.
Officials in Prince George’s County, Md., recently used a value-added analysis that identified the amount of teachers’ college education as one of the most important factors in determining student performance. The researchers in the 125,000-student district in the Washington suburbs found that as the level of teacher training in a school increased, scores on a 3rd grade reading test also went up.
The district doesn’t use such information to reward or penalize schools, but it is using the results from such studies to decide how to fix schools that have been identified by the state as low-performing.
Good Teacher, Bad Teacher
Some of the most debated findings stem from attempts to use value-added analyses at the level of individual teachers.
In Tennessee, for example, Mr. Sanders and his colleagues divided teachers in two unnamed metropolitan districts into five categories--from high to low effectiveness--based on whether their students scored much better or worse than anticipated on the state tests over a four-year period. Then they analyzed the math scores for students in the two districts.
They found that groups of students with comparable achievement levels in grade 2 had vastly different test scores by grade 5, and the difference was linked to the quality of their teachers.
Fifth graders who had three years of teachers who were deemed very ineffective averaged 54 to 60 percentile points lower than students who had a series of highly effective teachers. The effects of even one bad or good teacher were still reflected in test scores two years later.
“The single greatest effect on student performance is not race, it’s not poverty, it’s the effectiveness of the individual classroom teacher,” Mr. Sanders says. (“Bad News About Bad Teaching,” from Research Notes, Feb. 5, 1997.)
Researchers in the 150,000-student Dallas district have used a similar technique with similar results. (“Students’ Fortunes Rest With Assigned Teacher,” Feb. 18, 1998.)
In Tennessee, the data on the effectiveness of individual teachers are shared with that teacher and his or her principal and may be used as one part of a teacher’s evaluation. Dallas does not evaluate individual teachers based on test scores, but it does use such information to help identify where teachers need to improve.
In Boston, Bain & Co., a private consulting firm, analyzed the district’s efforts to redesign its high schools. As part of that study, it looked at changes in the scores of individual 10th graders on the Stanford 9, the test used by the district, from the spring of 1996 to the spring of 1997.
The researchers then ranked 100 teachers based on how much their students’ test scores had improved over the course of a year. David Bechofer, who headed the study, says highly effective teachers were able to produce six times as much growth in student learning as the least effective teachers, even though their students started with similar baseline scores.
“Here is a phenomenal feedback mechanism that can be used to help teachers identify when they’re being effective and when they’re not,” he says.
Those findings have been strongly contested by the Boston Teachers’ Union, an affiliate of the American Federation of Teachers, which says the sample size of 100 teachers was too small to draw such conclusions.
In the 50,000-student Minneapolis district, David J. Heistad, the director of research, evaluation, and assessment, has used a value-added analysis to identify 2nd grade teachers whose students made the most gains in reading scores.
The district then surveyed teachers about their instructional practices and found that the most effective teachers shared a number of common traits, such as providing students more time for oral and independent reading. Those findings have been shared with teachers throughout the system.
“What we’ve done is try to find the teachers who beat the odds, so that we can replicate those instructional practices that seem to be correlated with their success,” Mr. Heistad says. But he doesn’t recommend rewarding or penalizing individual teachers based on such value-added analyses.
Mr. Heistad predicts that more states and districts will eventually use value-added techniques because they’re a fairer and more diagnostic way to judge individual teachers and schools.
But if value-added analyses are so good, why are so few places using them now? In part, it’s because what sounds good in theory is awfully complicated in practice.
Ideally, states or districts that wanted to track the growth of individual students over time would test each student every year in the core academic subjects--an expensive proposition. In reality, few states or districts test students that often. And records on individual students are often incomplete or fragmented.
Places like Tennessee and Dallas then use sophisticated statistical models to determine how much growth they can expect from individual students each year. Indeed, the Dallas model is so complex that Ms. Ladd and Mr. Clotfelter, the Duke researchers, concluded, “In its attempt to be scrupulously fair to schools, Dallas had developed an approach that is incomprehensible to most participants in the process and to most outside observers.”
The problem is that “you have created this elaborate statistical model which is not very transparent,” says Ms. Ascher of NYU. “And it makes, basically, everybody a captive to the statistician.”
Need for Balance
Determining which factors to control for in such models is also the subject of much debate.
By controlling for a factor like race, for instance, some suggest that researchers are basically conceding that black students are not expected to achieve as much as white students. That, they argue, sets up a self-fulfilling prophecy.
“I think one has to be very careful in implementing such a system that it doesn’t create the perception that you’re setting different standards for different groups of kids,” says Criss Cloudt, the associate commissioner in the office of policy planning and research in the Texas Education Agency.
Many experts argue that states and districts should pay attention both to a school’s absolute academic performance and to whether it is contributing to its students’ growth.
Otherwise, funny things can happen. Schools with many high-performing students, for example, may have trouble producing large gains year after year. Schools that experience a dramatic drop in the preparation of their incoming students may contribute a lot to students’ learning but still see their average test scores decline.
Policymakers might not want to penalize such a school, but neither would they want to hold it up as a model. Schools with low test scores still need help, even if their students are doing better than expected.
“We don’t think you should use the value added only,” says John Q. Easton, the deputy director of the Consortium on Chicago School Research. “It’s just too complicated, and we see all these dozens of strange situations, where you don’t get a full enough picture of student achievement.”
A version of this article appeared in the May 13, 1998 edition of Education Week as A Question of Value