Drowning In Lake Wobegon
Test scores are malleable as putty. They can convey good news or bad, signal improvement or decline, trigger smugness or alarm. And the more they are twisted and tugged, the stronger our reactions are apt to be.
For better and--mostly--for worse, in this era of "spin control," the conclusions we draw from test scores are seldom the product of our own impartial analysis of objective data sets. Rather, they are effects deftly contrived by individuals and organizations that wish us to be complacent or worried, encouraged or depressed.
Score manipulators have bountiful opportunities: selecting the test instrument; administering it; scoring and analyzing the data; and framing the report or press release that "spins" the results.
Practically everyone with access to that sequence has an interest in making the results look rosy. Whether it's an alderman seeking re-election, a superintendent promoting a bond issue, a principal wanting to attract families to his school, a teacher eager for her pupils to shine, a state education department seeking credit for its latest reforms, a teachers' union struggling to fend off vouchers, a test publisher hoping to expand its trade, or a chamber of commerce wooing new industry--few welcome bad news about student performance. But because American education, by and large, has nothing akin to an independent auditor, its shareholders--parents, voters, taxpayers--have no place else to turn to find out how their children and schools are doing.
Indeed, if we seek the source of widespread complacency about our own children and local schools--never mind that the nation is said to be "at risk"--we don't have to look far. Most folks base their conclusions on the information at hand and, at least with respect to test scores, that generally means information furnished by parties who have something to lose if the news is glum.
Thus, in a celebrated 1987 episode, we found most states and localities washed by the waters of Lake Wobegon, the humorist Garrison Keillor's mythical place where "all the children are above average." That was when the West Virginia physician John J. Cannell jolted the education community with word that he "had surveyed all 50 states and discovered that no state is below average at the elementary level on any of the six major nationally normed, commercially available tests." He also announced that 90 percent of local school districts claimed score averages that exceeded the national average and that "more than 70 percent of the students tested nationwide are told they are performing above the national average."
This riveting allegation turned out to be essentially correct. The U.S. Education Department, where I worked at the time, invited several of the nation's foremost psychometricians to see if they could replicate Dr. Cannell's findings. Which they did. It turned out that most states and localities, in tacit or overt collusion with test publishers, were availing themselves of most of the points of leverage noted above--and were also "teaching to the test." The one sin they were not committing, so far as I know, was using imaginary test-takers to alter the results. That's a more recent phenomenon, about which more below.
The Cannell report had several repercussions. It helped to discredit "standardized" testing. It prompted a number of states to develop new schemes for measuring and reporting student progress. It deepened public mistrust of the "education establishment." It helped catalyze the "national goals" process and, for a time at least, a serious campaign for national testing. And it underscored the need to expand and improve the National Assessment of Educational Progress, the closest thing we had--and have--to a trustworthy report card on U.S. educational achievement.
NAEP works pretty well. It tests representative samples of 4th, 8th, and 12th graders. Its secure exams cannot be "taught to." Nobody knows which youngsters will get what questions, so it's impossible to select one's NAEP test to match a particular curriculum. It yields reliable trend data over time. It's been moving steadily away from multiple-choice formats and making greater use of open-ended and "performance" items.
The test content is determined by an independent governing board, which uses an elaborate process for seeking national "consensus" about the knowledge and skills important enough to assess in each subject. And the results can be reported either in relation to national averages (norm-referenced) or--recently--in relation to standards set by the governing board (akin to criterion-referencing). Perhaps most useful of all, Congress has given NAEP limited permission to generate state-level results for jurisdictions that wish to participate. Though this has only been done in a couple of subjects and grade levels so far, the results are immensely helpful to governors, legislators, and others tracking the progress of state and national reform efforts.
NAEP is, in effect, the country's nearest approximation of an "independent audit" of educational achievement and, like accountants' reports on the financial health of companies whose books they've examined, much of its value depends on its integrity and credibility.
Now, however, Lake Wobegon is rising again, and this time it's splashing NAEP. Worse, those raising the H2O level turn out to be--egad--the U.S. Education Department and its prime NAEP contractor, the giant Educational Testing Service.
This tale has several parts. The first dates to early March, when senior assessment officials at the National Center for Education Statistics advised the NAEP governing board that they were considering "adjusting" state scores to account for demographic and socioeconomic differences in state populations. (See Education Week, March 9, 1994.) A briefing paper prepared for the board analogized this to "the handicapping principle used in golf" and noted that "states vary in average proficiency due to a variety of factors, some of which are not in their control." The idea was--as statisticians say--to "control" for some of those factors. Data manipulations would answer such questions as "What would the states' proficiency distributions look like if the state population characteristics were similar to the nation's?"
The "characteristics" they proposed to "adjust" for included race, parents' education, type of community, and so forth. "So, for example, jurisdiction A has a very large percentage of minority students. The adjusted national score says, if the nation had as many minority students as jurisdiction A then it, too, would be achieving less well. ..."
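In statistical terms, what the briefing paper described is a reweighting exercise, sometimes called direct standardization: recompute a state's average as if its subgroup mix matched the nation's. The following minimal Python sketch uses invented group labels and numbers--it is not the actual N.C.E.S. or E.T.S. procedure, and real NAEP adjustments involve many more factors--but it shows the arithmetic at the heart of the idea:

```python
# Illustrative sketch of demographic "adjustment" by reweighting.
# Group names, shares, and scores are all hypothetical.

def reweighted_mean(subgroup_means, shares):
    """Average the subgroup means using the given population shares."""
    return sum(subgroup_means[g] * shares[g] for g in shares)

# A hypothetical state whose demographics differ from the nation's.
state_means = {"group_a": 270, "group_b": 240}      # observed subgroup averages
state_shares = {"group_a": 0.9, "group_b": 0.1}     # the state's actual mix
national_shares = {"group_a": 0.7, "group_b": 0.3}  # the nation's mix

observed = reweighted_mean(state_means, state_shares)    # about 267
adjusted = reweighted_mean(state_means, national_shares) # about 261
```

Note what the "adjusted" figure is: the answer to a hypothetical question about students who do not live in the state, not a measurement of the students who actually took the test.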
Why do this? The main beneficiaries would be states whose relative rankings rise when their scores are based on a hypothetically more "representative" student population instead of the youngsters who actually live there and take the test. The National Center for Education Statistics' commissioner, Emerson Elliott, told a reporter that the idea originated with the former California state superintendent of public instruction, Bill Honig, whose state ranked low and who allegedly felt that non-adjusted scores were "unfair" to jurisdictions with lots of minority youngsters. Keeping the statisticians busy also appeared to be part of the N.C.E.S. rationale: "We want to make more use of [NAEP] data," Mr. Elliott said. And some suspected the influence of the new activist leadership of the Education Department's office for civil rights.
But just about everyone else thought this a dreadful idea. Particularly distasteful was the notion of adjusting NAEP scores by ethnicity, which is reminiscent of the lamentable practice of "race norming" in employment-selection tests. It struck governing-board members--especially those belonging to minority groups--as patronizing and deterministic: it implies that minority youngsters ought not be expected to do as well on tests as white children, and it masks the educational malpractice that is too often their lot in school.
The board resolved unanimously that such score adjustments should not be made to NAEP state results, noting that to do so "would be contrary to the strong national commitment to encouraging high standards for all children." A strong "no" was also voiced by the group of state testing directors that advises the N.C.E.S. on these matters. Seemingly chastened, Commissioner Elliott told a journalist that his agency had shelved the idea, at least for now. "We're not going to be adjusting any state scores," he said at the end of April.
But board policy controls only the "official" NAEP reports issued by the government, not the uses that others may make of the data. And it turns out that the Educational Testing Service, which conducts NAEP for the government, serves as primary repository of NAEP data, and ghostwrites the official reports, was busily engaged in NAEP-score adjustment on behalf of other clients. That's the second part of the story.
On April 14 of this year, the New Jersey Education Association issued a press release boasting that "New Jersey's public school students are performing at world-class levels in mathematics and are among the best in the nation in reading, according to an E.T.S. research report." It seems the N.J.E.A. had engaged the services of Howard Wainer, a principal research scientist at the E.T.S., and paid $17,000 for him to re-analyze New Jersey's NAEP results. (While that contract may have covered the direct costs of his work, one doubts it could have been done so cheaply had not millions of federal dollars already stocked the E.T.S. computers with NAEP data.) The fruits of Mr. Wainer's labors appeared in a pair of reports bearing the E.T.S. logo and imprimatur. They prompted articles in New Jersey newspapers with headlines like "Jersey 8th Graders Above the Average in Math," "Higher Grade for N.J. Schools," and "Study Says N.J. Making the Grade."
Claiming he was seeking to "place all states on a level playing field," Mr. Wainer applied manipulations that moved the Garden State from 14th to fourth place in 8th-grade math and from fifth place to third in 4th-grade reading. (Although the District of Columbia took part in the assessment, Mr. Wainer omitted it from his list of adjusted scores. To include it would have moved New Jersey down a peg because the district, once its scores were "adjusted" for demographic factors, came out near the top--in sharp contrast to its actual NAEP rank at the bottom.)
A further statistical technique, linking New Jersey's NAEP results to the 1991 international math assessment (also conducted by the E.T.S. with federal funds), led Mr. Wainer to report that whereas "the United States finished near the bottom ... New Jersey's students' performance was sixth among all nations participating."
What to make of this? The Philadelphia Inquirer dryly observed that Mr. Wainer's findings "happen to complement the theme of a $300,000 advertising campaign now being launched by the N.J.E.A. that 'public schools work.'" Mr. Wainer's work also undermines the credibility of "official" NAEP reports (and the governing board's non-adjustment policy), by asserting that "between-state comparisons ... can yield misleading inferences if one is not acutely aware of the differences in the demographic makeup of all of the constituent units. This is of such a complex nature that it is impossible to keep things straight without some formal adjustment."
Legally, the private corporation named E.T.S. may do whatever it likes for its various clients, even when those turn out to be teachers' unions seeking to persuade voters and taxpayers that everything is hunky-dory with public schools as they are. Even when the conclusions set forth in privately funded E.T.S. reports turn out to contradict the "official" reports that the E.T.S. is drafting for the federal government. Even when the Governor, say, finds herself citing official NAEP reports as evidence that the state needs to undertake bold reforms that the teachers' union, citing other E.T.S.-produced NAEP reports, is fiercely resisting.
It's legal, but is it ethical? Is it good for children, especially minority children? For educational quality? For the long-term stature of NAEP as a reliable independent audit? For the prospects that Congress will want to continue supporting NAEP's precarious (and fairly costly) foray into state-level reporting? What is the likely response of states whose scores decline as a result of these manipulations?
Score adjustment is a genie that, once released from its bottle, may do untold mischief. Done with enough imagination--and massaging enough different factors--it can place almost every state "above the national average." Try one adjustment. If the results don't please, try another. As this practice becomes commonplace, it will come to be seen as routine, then as legitimate. The E.T.S. has already undertaken revisions of Hawaii's NAEP math results. (The adjustment caused Hawaii to rise from 37th to 35th among the states in 8th-grade math. Unmentioned in the report is that the adjustment dropped the state from 31st to 36th in 4th-grade math.)
Regrettably, the federal government has not wholly abandoned this approach, either. N.C.E.S. officials have told the NAEP governing board that they're considering other ways of "standardizing" results, such as charts showing how much better or worse "than predicted" each state fared. Such "predictions," of course, would be based on demographic and socioeconomic factors, yielding the very sort of differential expectations that board members find abhorrent.
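Charts of that sort would rest on a regression: fit state averages against demographic factors, then judge each state by its residual. A hypothetical one-predictor sketch in Python--the states, the predictor, and every number are invented, and the N.C.E.S. model, if built, would surely be more elaborate--makes the mechanics plain:

```python
# Sketch of "better or worse than predicted" scoring via simple regression.
# All data below are invented for illustration.

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x with one predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

# x: hypothetical share of parents with college degrees, per state
# y: the state's mean score
x = [0.20, 0.30, 0.40, 0.50]
y = [250, 258, 262, 272]

a, b = fit_line(x, y)
# Each residual is the state's score minus what demographics "predict."
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
```

A positive residual reads as "better than demographics predict"--which is exactly the differential expectation the governing board found abhorrent, now built into the chart itself.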
The board's meeting last month also brought an extended discussion of what to do about handicapped and non-English-speaking youngsters who are sometimes excluded from the NAEP test-taking population. (See Education Week, May 25, 1994.) N.C.E.S. officials recommended research into various ways of incorporating their results. The board was favorably disposed. Some strategies to be explored are as straightforward as developing different forms of the test and altering the circumstances of test administration so that more youngsters can participate. But one approach under consideration by the N.C.E.S. entails the estimation of scores of children who never actually take the test--basing these estimates on the youngsters' perceived characteristics--and the "imputation" of those hypothetical results into the data base.
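The imputation idea can be sketched simply: assign each excluded student an estimated score--here, the average of tested students with similar characteristics--and pool those estimates with the real results. This Python illustration uses invented data and the crudest possible estimator; actual imputation methods are more sophisticated, but the essential move is the same:

```python
# Minimal sketch of score "imputation" for students who never took the test.
# Groups and scores are invented for illustration.

def impute(tested, excluded_groups):
    """Give each excluded student the mean score of tested students
    in the same characteristic group."""
    by_group = {}
    for group, score in tested:
        by_group.setdefault(group, []).append(score)
    group_mean = {g: sum(s) / len(s) for g, s in by_group.items()}
    return [(g, group_mean[g]) for g in excluded_groups]

tested = [("a", 260), ("a", 250), ("b", 230)]  # students who sat for the exam
excluded = ["b", "b"]                          # students who did not

imputed = impute(tested, excluded)             # two estimated "b" scores
scores = [s for _, s in tested] + [s for _, s in imputed]
overall = sum(scores) / len(scores)            # 240.0 -- the reported average
                                               # now rests partly on scores
                                               # nobody ever earned
```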
At that point, Lake Wobegon may overflow. And we may need the likes of George Orwell to chronicle the havoc that follows.
Vol. 13, Issue 38, Pages 31, 35