Made to Measure

By David J. Hoff — June 16, 1999

Come every spring, Texas students from the 3rd to the 10th grades take the Texas Assessment of Academic Skills.

They toil for at least two full days, filling in the bubbles for multiple-choice and true-false questions in mathematics and reading. Eighth graders spend another two days on social studies and science tests. To graduate, high school students must pass such tests in reading, writing, and math.

Also at stake are bragging rights for their schools, the reputations of their teachers, and even the values of the homes they live in.

Testing programs like the one in Texas are becoming standard among states as the century nears its end. Forty-eight states have testing systems, and most rely on the results to decide a wide range of consequential matters, including whether teachers will receive bonuses and whether a school will wear a badge of honor--or shame--for its scores. Texas and 18 other states also require students to pass an exam to qualify for a diploma.

This current generation of exams is the culmination of a century in which tests and assessments--ranging from IQ tests to SATs to statewide proficiency exams--have played an increasingly pervasive part in the lives of students and the operation of the schools they attend.

At the start of the 20th century, standardized testing was the province of urban districts, and it was employed mostly to measure how well students were learning. The test scores usually remained under wraps to all but a select number of administrators.

A hundred years later, tests are the tools of the state and federal governments, directing what is taught and how it is taught.

The deeply held belief in the value of such testing is rooted in the political ideals that shaped this country, argues Eva L. Baker, a co-director of the Center for Research on Evaluation, Standards, and Student Testing, a federally financed research project, and a professor at the University of California, Los Angeles.

Alexis de Tocqueville, the astute early-19th-century observer of the United States, noted that Americans believed in the “perfectibility of man,” Baker says, and the American obsession with testing in the 20th century reflects that creed. Educators continue to search for the perfect instrument to help them provide the best possible educational program for their students.

But the use of tests also raises a host of issues about another cherished American value: equal opportunity.

Since the civil rights and women’s rights movements of the 1960s and ‘70s, schools and colleges have been bombarded with complaints and lawsuits contending that standardized tests aren’t fair to minority students and girls. Tests’ biases produce scores that track students into special education and less challenging courses and away from competitive colleges, the critics claim.

School leaders, meanwhile, have complained that tests are unfair barometers of how well they are educating children and should not be the sole criterion used to make decisions on students’ futures, such as whether a 10th grader in Brownsville, Texas, or Newton, Mass., graduates from high school.

The Printed Exam

As early as 1845, schools began testing their students in a uniform way. At the time, oral examinations were state-of-the-art assessments.

That year, Boston became the first American district to print short-answer tests that would be used throughout its system, according to How Research Changed American Schools, by Robert M.W. Travers.

Students were tested in geography, grammar, history, rhetoric, and philosophy. The city’s school committee proposed to test every student in its schools, but only 20 or 30 students in each school--about 500 of the city’s 7,000 students per subject--actually took the exams.

In a reaction that will sound familiar to present-day educators, Boston school leaders were shocked by the poor performance. In no subject did the students tested answer even 40 percent of the questions correctly, Travers writes in his 1980 book.

The results, concluded a contemporary report on the exams, “show beyond all doubt, that a large proportion of the scholars in our first classes, boys and girls of 14 or 15 years of age, when called upon to write simple sentences, to express their thoughts on common subjects, without the aid of a dictionary or a master, cannot write, without such errors in grammar, in spelling, and in punctuation.”

The tests were given again in 1846, but by 1850, Boston had abandoned its strategy and reverted to nonstandardized exams that were based mostly on oral presentations, Travers writes.

In 1874, Portland, Maine, experimented with standardized testing for the first time, according to The One Best System: A History of American Urban Education, by David B. Tyack.

Samuel King, the Portland superintendent, created a uniform curriculum for the city’s schools and wrote a test to measure whether students successfully learned it.

“System, order, dispatch, and promptness have characterized the examinations and exerted a helpful influence over the pupils by stimulating them to be thoroughly prepared to meet their appointments and engagements,” the superintendent wrote about his exam, according to documents cited in Tyack’s 1974 book.

King published each student’s score in the newspaper, raising the ire of teachers and parents; the opposition led to his resignation in 1877. His successor never published scores, but he continued the tests--though he made them easier, so that more students passed than in the King era.

Rise of the IQ

The experiences of Boston and Portland were common in city districts until the turn of the century. But in the first two decades of the 20th century, education researchers such as Edward L. Thorndike started to create standardized tests to measure students on uniform scales in arithmetic, handwriting, and other subjects.

Standardized tests held great appeal because they removed the subjectivity of individual teachers’ grading methods. With an objective test, the experts said, a student’s score could be confidently compared with those of his classmate in the next seat and his counterparts in a city across the country.

All those tests had a single goal: to measure how well students performed against a prescribed curriculum. But in the early years of the century, psychologists started to draft a new form of test: one designed to measure innate ability and predict future performance, instead of evaluating whether students had mastered the material in a curriculum.

In 1904, the Paris school system hired Alfred Binet to design a test to identify students who had been unable to benefit from instruction. He devised a scale that predicted how well a child would learn and estimated his or her “mental age.” The American psychologists Henry Goddard, Lewis M. Terman, and others adapted Binet’s work in 1912 to create the Intelligence Quotient, or IQ--calculated by dividing a person’s “mental age” by his chronological age and multiplying the result by 100.
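
To illustrate the arithmetic with a hypothetical case: a 10-year-old who performed on the test at the level of a typical 12-year-old would have a mental age of 12, and thus an IQ of

(12 ÷ 10) × 100 = 120.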

Terman, a Stanford University psychologist, unveiled what is called the Stanford-Binet scale in 1916 in his book The Measurement of Intelligence. In it, he outlined how the Binet test was to be administered and explained how its results would yield data revealing a student’s innate intelligence.

In a crucial development, Terman’s newest version could be given to students using pencil and paper, making it much easier and cheaper for schools to administer than earlier versions, which required the services of a specially trained psychologist.

When Terman published the third volume of the test, he claimed that a student’s score would be constant over time. By 1922, Terman said that 250,000 students had taken the test.

IQ tests “are now being used to classify schoolchildren of all degrees of intelligence and are being developed as important aids in vocational guidance,” Harlan C. Hines, a professor at the University of Washington, wrote in the April 1922 edition of The American School Board Journal.

In at least one state, which Hines did not name, “high school pupils are being classified on the basis of intelligence tests alone,” he added.

Tracking by Test

Even though that was a blatant misuse of the test, it was common practice, say Robert Glaser and Edward Silver, two University of Pittsburgh researchers.

In the early part of the century, enrollment in the nation’s schools swelled, thanks to a flood of immigrants and compulsory-attendance laws. Administrators wanted a way to sort students by ability because they believed it was the best way to provide appropriate instruction.

Intelligence tests provided the scientific data needed to assign students to separate tracks, Glaser and Silver write in Review of Research in Education, 20th Edition.

Testing was a “convenient and powerful instrument of social control” in the early 1900s, they write, and has remained so throughout the century.

The widespread adoption of tracking reflected the consensus of social scientists at the time that intelligence was a hereditary trait. By using a scientific instrument to group students, went the thinking in those days, school officials could find the right place to serve students according to their innate abilities.

By the middle of the century, the Scholastic Aptitude Test became an annual rite for college applicants, and by the 1970s, students were classified for special education and remedial education in federal programs based on test scores.

Whether those tests should be used for such purposes was a question as early as the 1920s, and it is one that persists. Results from the earliest versions of the tests purported to show that recent immigrants from southern Europe were less intelligent than others; the findings were used to set quotas restricting immigration from Italy, Greece, and other Mediterranean countries. Later, a disproportionate number of black students who scored poorly on such tests ended up in special education programs, where they failed to catch up with their peers.

Those uses, according to Glaser, Silver, and other testing experts, are part of a pattern of inappropriate and unfair use of such exams. The University of Washington’s Hines recognized the potential for problems early on.

“The writer has come to feel that a test loses its value and becomes a dangerous weapon in the hands of the untrained,” Hines wrote in the conclusion of his 1922 article.

The following year, Terman created an assessment very different from his intelligence tests. His product was similar to Boston’s and Portland’s standardized tests of the 19th century, as well as others that started to hit national markets in the first two decades of this century.

In his Stanford Achievement Test, Terman set out to measure student achievement in specific subjects across several grades. With it, a school could measure how well students in grades 2-8 performed.

More important, the Stanford results could be compared with a national sample of 350,000 students, according to Lewis M. Terman: Pioneer in Psychological Testing, a 1988 biography by Henry L. Minton. The sample was far bigger than any from similar tests available at the time.

The form and substance of the Stanford exam “foreshadowed the future,” according to Glaser and Silver. Throughout the rest of the century, standardized tests would be modeled after Terman’s work.

Measuring Content Mastery

One such test was the Iowa Tests of Basic Skills, one of the three most routinely administered standardized tests in schools today.

The testing program started as a scholarship competition run by the University of Iowa in 1929. In the first year, the test had sections dedicated to grammar, English and American literature, world history, American history, algebra, geometry, science, physics, typing, and stenography, according to The Iowa Testing Programs, a history of the program written by Julia J. Peterson.

Like many tests of the era, the Iowa exams were based on textbooks commonly used throughout the state. To speed scoring, the authors relied on multiple-choice, true-false, matching, and fill-in-the-blank questions; the only free-response questions were on the math exam. The tests could be graded so quickly that students knew their scores and whether they advanced to the next round of competition before leaving the auditorium.

By 1935, the scholarship program had been replaced by a battery of tests--the Iowa Tests of Basic Skills. Like the Stanford Achievement Test, the ITBS covered all grade levels and spanned the basic curriculum: reading, language, study skills, and arithmetic. Each district received a confidential report giving the number of correct answers and ranking its students against the mean of a statewide sample. The designer of the Iowa exam, E.F. Lindquist, intended that it be “diagnostic in character” and able to evaluate whether a district’s curriculum was working well, Peterson writes in her 1983 book.

Within four years of the program’s start, 30,000 Iowa students had taken the basic-skills tests. The tests began their spread across the country, with Kansas City, Mo., and South Carolina purchasing them for use in their classrooms.

The market for the tests continued to grow. In 1940, the University of Iowa testing bureau contracted with the Houghton Mifflin Co. in Boston to distribute the ITBS nationally. The first four versions brought the university $330,000 in royalties.

The Iowa and Stanford tests were not the only such achievement tests of the era. Throughout the ’30s, competitors such as the California Test Bureau also grew at rapid rates. (See a related story, “Quiz Biz.”)

“Publishers had come to recognize that testing provided a new and lucrative market which developed rapidly during the 1930s, not just in the three R’s but in the knowledge areas of the curriculum,” Travers writes in his history of educational research.

A Validation

The 1940s saw the increased use of another standardized test: the Scholastic Aptitude Test. Like IQ tests, the SAT was designed--then as now--to predict performance. But instead of sorting students within schools, it sorted them among colleges and universities.

The standardized version of the SAT was born in the 1920s when Harvard University decided to offer scholarships to underprivileged students and others who hadn’t attended New England’s elite preparatory schools.

Harvard contracted with the College Entrance Examination Board to create a new test to select the recipients. Since 1900, the board--now known simply as the College Board--had given essay exams in fields such as rhetoric, Greek, and other elements of the traditional prep-school curriculum.

Though the College Board continued to offer the essay tests for 10 years after the first SAT was given in 1926, the new exam soon became the standard hoop college applicants needed to jump through.

When servicemen returned from the Second World War and began flocking to colleges under the GI Bill, the importance of the SAT grew along with enrollments.

“It was truly the return of the GIs after World War II that spurred the growth in college enrollment and, therefore, admissions testing,” says Brian P. O’Reilly, the College Board’s executive director of guidance and admissions-testing programs.

When the College Board established the 800-point scale for each of the SAT’s two sections in 1941, 10,000 students took the test, O’Reilly says. Seven years later, that number had doubled. In 1964, more than a million students took it; the figure had jumped to 2 million by 1967.

The SAT became the preferred method of evaluating college applicants because its maker, the Educational Testing Service, working with the International Business Machines Corp., used machines to scan answer sheets quickly. Even in the days before computer scanners, contestants in the Iowa scholarship program could learn their scores within a few hours; but the number of students in that program paled in comparison with the flood of postwar college applicants.

The importance of the SAT, Baker and her UCLA colleague Regie Stites wrote in a 1991 essay on testing trends, is that it “legitimated” multiple-choice tests, especially among the highly educated class of people who grew up to be policymakers.

“Our best and brightest, and their influential parents, accepted the validity of such tests for college admissions,” they say. “Thus, the experience of being tested successfully themselves bred not contempt but reaffirmation of the accuracy of the measure for the use by others.”

But critics have long charged that the SAT is not a fair measure for the rest of society.

“The data are clear” that the SAT discriminates against minorities and sometimes girls, says Monty Neill, the executive director of the Center for Fair & Open Testing, known as FairTest. “It hasn’t changed substantially [since the beginning].” FairTest was founded in the 1980s by civil rights and consumer advocates to monitor testing practices and campaign against those its leaders deemed unfair.

The biggest problem the bias creates may not be in the admissions process, Neill suggests, but in students’ decisions about where to apply. A student who scores a combined 1000 on the test will shy away from applying to schools that publish an average SAT score of 1100, he says. “It reinforces the existing hierarchies of race and class,” he asserts.

The College Board maintains that the SAT is the second-best predictor of how well a student will perform in college.

“We’re very comfortable in knowing high school grades are the [best predictor], but only by a tiny little bit over test scores,” O’Reilly says. “Nothing else contributes very much.”

High school grades, of course, are derived in large part from the tests prepared by teachers--the tests that “count” in students’ eyes. Little research is available to show how these assessments have changed over the century.

What SAT scores are not a good barometer of, O’Reilly adds, is school quality.

Since the 1970s, real estate agents, newspaper editors, and U.S. secretaries of education have cited SAT data in an attempt to judge the quality of schools. When scores started to decline in the 1960s, critics pointed to them as a sign that the quality of education was declining. In the 1980s, the U.S. Department of Education annually used states’ SAT scores to compare their educational performance.

But such comparisons are meaningless, O’Reilly and other researchers say, because they ignore the fact that SAT-takers are a self-selected group rather than a consistent sample representing the population as a whole.

“Trying to compare schools just on SAT scores ... ignores a whole lot of other things that are going on in the schools,” O’Reilly says.

The Federal Plunge

The SAT set the stage for the next step in the use of testing. While the SAT, IQ tests, and other selection tests held high stakes for individual students throughout the century, until the 1960s there were no assessments that held consequences for the people who ran schools.

Testing programs such as the Iowa and Stanford ones had been prevalent since the 1930s, but their results were intended to inform teachers on how to instruct students, not to rate how well they themselves did their jobs.

“We used to have standardized achievement tests [in the 1950s], but they were never used to judge the quality of schooling,” says W. James Popham, a former science teacher who went on to become a leading testing expert as a professor at the University of California, Los Angeles.

Before the 1970s, test scores were considered “for internal use only” and rarely were reported to state or federal officials, or even the public, Joy A. Frechtling, then the director of educational accountability for the Montgomery County, Md., schools, writes in a 1989 text on educational measurement.

All that changed when the federal government started to play an increasingly important role in subsidizing schools and wanted to see returns on its investment.

In 1965, the U.S. Office of Education contracted with the sociologist James S. Coleman to study whether American schools offered equal opportunity to white and black students.

The resulting report remains one of the biggest and most significant studies of educational achievement ever conducted. Coleman and his team surveyed 570,000 students and 60,000 teachers throughout the country. They found that students’ family backgrounds and the socioeconomic makeup of their schools were more meaningful factors in student achievement than the quality of their schools.

Those findings have been debated ever since, but the study created a prototype for conducting education research that put test scores at center stage.

“The Coleman report formally reduced the question of how well schools serve low-income and minority students to a single criterion, student performance on multiple-choice tests of basic skills,” Baker and Stites write.

That assumption began to drive how the federal government ran its growing investment in precollegiate education. After President Lyndon B. Johnson signed the Elementary and Secondary Education Act in 1965, its program to help schools with high concentrations of poor children soon began to reflect a test-driven definition of success.

To qualify for Title I money, the federal government said, school districts had to show results. The government created the Title I Evaluation and Reporting System--also called TIERS. It required schools to evaluate their federal programs using norm-referenced tests--which compare students against a national sample--and it contributed to the “substantial expansion” in their use throughout the decade, according to a 1998 paper by Robert L. Linn. He is a co-director of the Center for Research on Evaluation, Standards, and Student Testing, or CRESST, at UCLA and a professor of education at the University of Colorado at Boulder.
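
To make “norm-referenced” concrete with invented numbers: such a test reports standing rather than mastery. If a student answered 42 of 60 questions correctly, and 75 percent of the students in the national norming sample scored below 42, the student would be reported at the 75th percentile--a ranking against peers, not a judgment that a fixed body of material has been learned.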

TIERS encouraged schools to test Title I students twice a year, Linn writes, because the best way to prove academic growth was to compare a student’s scores in the fall against those in the spring. Studies showed that schools following that pattern exhibited more student improvement than those that tested once a school year.

Nine years after the inception of Title I, the federal government ordered a different kind of testing in a new special education law.

The 1975 Education for All Handicapped Children Act required schools to test slow learners to determine whether they would qualify for individualized services under the program. The law even mandated that schools assess every potentially disabled child twice, Linn writes in the 1989 edition of Educational Measurement. That provision was included to ensure accuracy, but, like TIERS, it doubled the time dedicated to testing.

It also reinforced the role of IQ tests and other psychological instruments designed to test innate ability.

Minimum Competency

At the same time, support was growing among federal leaders for a new assessment designed to provide a snapshot of national student achievement.

U.S. Commissioner of Education Francis Keppel formed a panel of experts in 1963 to explore ways of crafting such an assessment system.

“The nation could find out about school buildings or discover how many years children stay in school; it had no satisfactory way of assessing whether the time spent in school was effective,” Keppel wrote in 1966, a year after leaving his federal post.

To lead the committee, Keppel named Ralph W. Tyler, who had led the evaluation of the landmark Eight-Year Study of progressive education at the secondary school level 30 years earlier.

The Tyler committee recommended a regular sampling of students in basic subjects for the program that became known as the National Assessment of Educational Progress. Because local school administrators objected to reporting a breakdown of scores by state, Tyler’s panel proposed that the scores be reported by four regions.

While the compromise was necessary to assuage worries that the federal test would drive state and local curriculum decisions, it also meant NAEP results would be useless for evaluating the effectiveness of schools, according to Maris A. Vinovskis, a University of Michigan historian who wrote a history of the program in 1998.

The lack of comparable state data forced federal officials to rely on other measures, such as SAT scores, for the “Wall Chart” the Education Department used in the 1980s to grade states.

Consequently, the job of evaluating school districts fell to states.

Starting in the early ‘70s, states set out to define the minimum students should know before they graduated. To measure whether individuals met those basic standards, states created so-called minimum-competency tests.

The standards were “the most minimum imaginable,” according to Popham, the UCLA professor emeritus.

From 1973 to 1983, the number of states with minimum-competency tests grew from two to 34, Linn writes in a 1998 essay published by CRESST about the growth of testing during the past 50 years.

A 1977 North Carolina law, for example, called for a testing system to fulfill three purposes: ensure that high school graduates “possess minimum skills,” identify their strengths and weaknesses, and hold schools accountable for what they teach students.

Like the IQ tests and SATs before them, the minimum-competency tests raised questions of racial fairness.

In 1977--the first year of Florida’s test--75 percent of white students passed on the first try, Linn writes, compared with 60 percent of Hispanic students and fewer than 25 percent of their African-American counterparts.

Two decades later, minority students had narrowed the gap but not closed it, according to data cited by Linn. By 1984, 70 percent of black students were passing on the first try, but by then, almost 90 percent of white test-takers were doing so. Since then, the African-American passing rate has declined gradually, while white students have continued to score at about the same level.

Once states committed to attaching what are now called “high stakes” to student testing, they didn’t turn back. If anything, they have upped the ante over the past 16 years.

In 1983, a federal panel released the influential report A Nation at Risk. It declared the nation’s schools woefully inadequate and called for ways of measuring whether students had mastered a rigorous curriculum.

The call set off efforts not only to attach significant consequences to statewide tests, but also to increase their difficulty.

Most states--as they have whenever the number of students being tested rises dramatically--relied on standardized tests, setting off a debate over whether norm-referenced tests should be used for decisions such as school sanctions and rewards or student promotion and graduation.

“A good norm-referenced test will give you in great detail, skill by skill, a child’s strengths and weaknesses,” says Maureen DiMarco, the vice president for educational and governmental affairs for Riverside Publishing, the division of Houghton Mifflin that distributes the Iowa tests.

On an aggregate level, such tests can play a role in accountability decisions “if it’s part of a system, not the sole characteristic,” adds DiMarco, who was a high-profile education adviser to former California Gov. Pete Wilson. “It can be the predominant one. It’s going to be your strongest and most objective measure.”

In the late 1980s and early 1990s, some states experimented with so-called authentic assessments based on portfolios of student work and exam questions that required students to write essays--even in subjects such as science and math that traditionally eschewed them.

California abandoned its program after conservatives, DiMarco among them, charged that test questions pried into students’ personal beliefs, and traditionalists complained that a student could score well on a math problem by writing a high-quality essay while failing to comprehend the mathematical principles behind it. In 1996, Vermont added a standardized test to supplement its portfolio-based assessments after scoring was found to be unreliable for individual results. Kentucky is also adding standardized tests to its assessment package.

Maryland, meanwhile, continues to rely on its performance-oriented system.

The dependence on standardized tests hasn’t changed much in the ‘90s, testing experts say, even though states declare that their tests are aligned with curriculum standards they have adopted.

So general are many of the standards, however, that test writers need to do little more than revise off-the-shelf products to satisfy the needs of states, says Popham, who occasionally competes for contracts as an independent testing consultant.

“The present watchword of alignment is mostly a farce,” asserts Baker, the co-director of the federal testing center at UCLA.

“More often than not, they look like warmed-over versions” of standardized tests, Popham says. “The mentality [test publishers] bring when they create a test is what they know, and they don’t know [anything].”

The Texas test, Popham points out, is written by Harcourt Brace, one of the three leading test publishers; Harcourt Brace’s Stanford Achievement Test-9th Edition is used in several other states. And the California Test Bureau--now owned by the publishing conglomerate McGraw-Hill--is in charge of Kentucky’s testing program.

Even those who support the concept of aligning content standards with the test say the match isn’t always perfect.

“It’s still an open question: What does an aligned system look like?” says Matt Gandal, the director of standards and assessments for Achieve, Inc., a group of corporate and state leaders pushing for increased student achievement. “Rhetorically, everybody is there. But in reality, what does it look like?”

High-Stakes Testing

While states were raising the difficulty level on their own tests, the federal government raised the stakes on NAEP. Starting in 1990, the national assessment began producing scores on a state-by-state basis.

For many states, NAEP results carry significant weight--in substantive as well as public relations terms. California’s poor showing on the 1994 NAEP reading test, for instance, armed critics of the state’s whole-language philosophy, who succeeded in forcing the state to shift toward phonics instruction. This year, Kentucky has had to defend its gains, deemed statistically significant, against critics who say the improvement occurred because the state excluded a higher percentage of students in 1998 than in 1994.

And if President Clinton’s proposal for a voluntary national test becomes a reality, the NAEP tests might have consequences for individual 4th and 8th graders. The plan is currently being studied by the board that oversees NAEP but is unlikely to overcome deep-seated opposition by both Republicans and Democrats in Congress.

Nothing like the recent occurrences in Kentucky and California would have happened at the beginning of the century--for the simple reason that test scores were not revealed to the public.

In 1925, for example, Texas undertook what probably was “at the time one of the most extensive investigations of the operations of a state’s schools and department of education ever made,” according to a history of the states’ role in education compiled by the Council of Chief State School Officers.

The study reported on student access and bureaucratic structure, but it didn’t mention test scores. Indeed, the council’s 1969 book detailing the history of the state’s role in education makes no mention of the testing system of the era.

But by 1980, Texas started relying on test scores in making significant decisions. That year, it started requiring minimum-competency tests in reading, mathematics, and writing.

In 1990, the state introduced the Texas Assessment of Academic Skills, known as TAAS. Unlike its predecessors, the TAAS metes out consequences for everyone in schools: students, teachers, administrators, and board members.

Still, testing experts question whether programs such as TAAS should be the primary criteria for defining successful schools.

“They have items in there that do a terrible job at measuring school quality,” Popham says. “What is being measured is what kids come to school with, not what they learn there.”

And most academic experts warn against relying on a single test to make critical decisions, such as whether a student will move to the next grade or graduate from high school.

“A test score, like other sources of information, is not exact,” a National Academy of Sciences panel wrote in a 1998 report. “It is an estimate of the student’s understanding or mastery at a particular time. Therefore, high-stakes educational decisions should not be made solely or automatically on the basis of a single test score, but should also take other relevant information into account.”

Testing advocates agree that test scores should not be the sole factor in accountability decisions. But they do look to tests to play a leading role.

“They have to be at the center of accountability policies,” Gandal says. “They are one of the only reliable indicators of what students are learning. It doesn’t mean they can’t be supplemented very well by what teachers and schools bring to the equation.”

But calls for moderation are unlikely to sway policymakers and the public. Policymakers are hungry for data to prove that their schools are succeeding, and they are relying on test results for a variety of report cards, teacher bonuses, and penalties for school officials whose students score poorly.

The public--from newspaper reporters to real estate agents--uses test data as “a vicious tool for self-interest,” says Sherman Dorn, an assistant professor of educational history at the University of South Florida in Tampa.

Now, school administrators are the ones on the hot seat.

“History has played a cruel joke on school administrators,” Dorn says. “They used those tests to track students sometimes in vicious ways ... and now they’re used against them.”

A version of this article appeared in the June 16, 1999 edition of Education Week as Made to Measure
