
Some Caveats on Comparing S.A.T. Scores

By George H. Hanford -- October 08, 1986

With the College Board’s recent release of Scholastic Aptitude Test scores for 1986, public interest in the scores is at its annual peak, and, once again, the urge to make comparisons of educational quality--particularly on a state-by-state basis--becomes irresistible.

As recently as 15 years ago, the notion that S.A.T. scores would be used as a measure of statewide educational assessment would not have occurred to us at the College Board. Although the test had been in use for nearly 50 years, no one had suggested that it might become a kind of national educational benchmark; an indicator, if you will, of the Gross Educational Product stated in terms of scores of college-bound high-school seniors.

The idea that the S.A.T. might be used in statewide educational assessment arose from its somewhat ambiguous role as a measure of national educational assessment.

Colleges using the S.A.T. in admissions and secondary schools with substantial numbers of applicants had, for some time, been receiving summary statistics about, respectively, their applicant pools and test-taking cohorts. In the early 1970’s, the College Board decided to formalize and enhance these reports in what would become known as the Summary Reporting Service. The service had three elements that are salient for this discussion:

• Beginning in 1971, students registering for the S.A.T. and achievement tests were asked to report a good deal more about themselves--for example, their career interests, academic goals, family makeup and income, and ethnicity--than they had in the past. The assumption was that such information, together with their scores, would yield valuable insights about them as groups--in school, in the applicant pool, and from year to year.

• Whereas the earlier reports included scores for all test-takers, whether sophomore, junior, or senior, the new reports dealt solely with graduating seniors.

• The reporting service generated mean scores for the three “reference groups” to which the colleges and schools could relate their summary figures. These reference groups were the national population of seniors who had taken the S.A.T. and the comparable populations within each College Board region and within each state. State mean scores were available to colleges and schools within each state only with the consent of the chief state school officer.

The national college-bound-senior population taking the S.A.T. represented about a third of all high-school graduates at that time, and two-thirds of those who went on to college. It seemed to the board that the characteristics of this group--their interests, ambitions, and, of course, national mean scores--might be of public interest.

So, the newly available data on the national (not state) level were released for the first time in 1973. In publishing the data, the board warned that they were limited in meaning. The data reported nothing about the two-thirds of graduating seniors who did not take the test, nor about the 18-year-olds who did not graduate from high school.

In 1974, an alert education reporter noticed that the scores had dropped from the previous year. He asked for the figures for earlier years, and was thus able to take public note of the fact that since 1963 there had been a gradual, steady decline involving the score on at least one section each year. At that time, the aggregate decline on the 200-to-800-point scale was 41 points on the verbal section and 24 points on the mathematical.

Caveats notwithstanding, a change of that magnitude displayed by such a large proportion of high-school seniors indicated that something was happening. To find the explanation, the College Board, as the sponsor of the S.A.T., and the Educational Testing Service, as developer and administrator of the test, jointly recruited a distinguished panel under the leadership of former U.S. Secretary of Labor Willard Wirtz.

That panel’s report, delivered in 1977, had both good news and bad news. The good news was that a large part of the decline occurred because a larger and more diverse pool of students was taking the test. The bad news was that the remainder of the decline was caused by a variety of factors, the most salient of which might be summed up as a dilution of substance and softening of standards in high-school curricula.

Following the first rush of interest in the score declines and the panel’s report, the scores continued to go down. Public interest waned. But when scores did not decline in 1981, interest revived and speculation abounded. Was the great decline over?

In the flurry of interest that followed, the media asked for scores by states. Consistent with its longtime position that scores are the confidential property of the individual school or jurisdiction to which they refer, the board declined to provide them. Nevertheless, in the fall of 1981, one enterprising publication, invoking “sunshine” laws where necessary, managed to assemble a table of state scores--and got some wrong.

Therefore, to ensure accuracy, and not entirely with the enthusiastic assent of the chief state school officers, the board undertook in 1982 to publish state-by-state scores. At the same time, we issued a further caveat:

“It is the College Board’s position that comparison of states, districts, schools, or any other subgroups on the basis of S.A.T. scores alone is invalid. The board discourages any use of S.A.T. scores as a measure of overall performance of teachers, schools, or state educational systems.”

The board also explained the strong effect that participation rates have on state mean scores, as an example of one factor that could skew scores without revealing anything about educational quality.

In 1984, U.S. Secretary of Education Terrel H. Bell published for the first time the “wall chart” of comparative educational indices by state, including S.A.T. and American College Testing Program scores. He took the occasion to make two major points:

• The continued rise in S.A.T. scores nationally indicated that the “educational reform movement,” which he dated from the publication of “A Nation at Risk,” was indeed working.

• Educational quality in the states as measured by a variety of indices was not necessarily dependent upon the resources applied, especially by the federal government.

The Secretary, of course, was using the state S.A.T. scores in precisely the way that the board cautioned they should not be used. Through a progression of individually innocuous moves, therefore, a big shift had taken place: from the S.A.T. as one of a number of national indicators to the S.A.T. as one of a number of measures of relative quality among states.

The absurdity of comparing a large industrial state such as New Jersey--where most colleges and universities require the S.A.T., 65 percent of high-school graduates take it, and the mean scores are 425 on the verbal section and 489 on the mathematical--with a state such as Iowa--where few institutions require the test, only 3 percent of high-school graduates take it, and the mean scores are 521 on the verbal and 576 on the mathematical--should be apparent.

In all fairness, Secretary Bell did show good judgment, as did his successor, William J. Bennett (who repeated the wall chart), in presenting along with the scores a number of other factors such as graduation rate, pupil-teacher ratio, expenditures per pupil, incidence of poverty, and minority enrollment. He also distinguished between states in which the majority of students took the S.A.T. and those in which the majority took the A.C.T. assessment. But the data provided simply were not enough to determine whether a state-to-state score difference signified a real educational difference.

The critical point is that use of S.A.T. scores to compare one state with another, rather than to compare a state with itself over time, is not valid. How then should the test be used for statewide assessment?

The key, in the board’s view, is in the distinction between measurement and evaluation, between what the S.A.T. measures and what may be indicated by changes in scores over time. Using score changes in comparisons within the state or within individual jurisdictions over time, in conjunction with changes in other indicators, is a legitimate form of evaluation.

What if, along with rising S.A.T. scores, dropout rates are declining, enrollments in honors courses are increasing, reading scores are improving, and college admissions are going up? That is a picture that reflects a positive evaluation. If the trends are reversed, the picture would be negative. If the trends are mixed, so is the picture.

The point is that if all other measures were improving, declining S.A.T. scores would not necessarily call for a negative evaluation. Nor, if the other measures were going down, would rising scores necessarily offset the bad news.

It should be noted, however, that even used in this way, statewide scores can be misleading if there are wide differences in conditions among districts.

To maintain clarity on the proper use of the S.A.T. for educational assessment, I have stressed various caveats. But there definitely are positive reasons for using the test appropriately.

As we as a nation move ahead with our efforts to improve education, it is increasingly clear that some of the things we want to achieve--or that we want our students to achieve--may not be measurable directly.

This is particularly true of attempts to specify learning outcomes in reasoning, critical thinking, and communication skills. The desired outcomes are relatively simple to state, and it is no great feat to determine if they have been achieved, but they are most difficult to measure. Such objectives include:

• “The ability to distinguish between fact and opinion.”

• “The ability to vary one’s writing style, including vocabulary and sentence structure, for different readers and purposes.”

• “The ability to separate one’s personal opinions and assumptions from a writer’s.”

Those statements are taken from “Academic Preparation for College--What Students Need To Know and Be Able To Do,” a publication from the College Board’s Educational Equality Project. The booklet is being adopted by a growing number of districts and states as the summary specification of their education-reform goals.

At the request of those users, the board has conducted work on the questions of how to assess progress toward those goals. To a large degree, it has come to the conclusion that assessment will have to be in the form of evaluation rather than measurement.

Our work has revealed that among the evaluation tools, the S.A.T. is highly useful--in a way that is supplementary to, but different from, the test’s primary intended use as an admissions examination.

Because the test measures certain learned mental skills that cut across and are relatively independent of specific classroom subjects, it can be useful in evaluating the attainment of certain “higher-order” skills or competencies--in reading, writing, mathematics, and reasoning--that do not readily lend themselves to more direct measurement. At the minimum, it is useful, in conjunction with other indicators, in evaluating whether students or groups of students in fact have improved their predicted performance in college.

That is the major message to be drawn from this discussion: Although the S.A.T. was not originally intended as a tool for evaluation, wisely used it may be helpful. Changes in group scores within jurisdictions can be used as a factor in the evaluation process. In that sense, the test may become more useful as the progress--and process--we seek to evaluate places greater emphasis on competencies or performance outcomes that cut across the curriculum, and that do not lend themselves to quantification.

But the numbers the S.A.T. produces exude an aura of precision out of proportion to their true significance. Therefore, it is important to ensure that policymakers at the state or institutional level do not use the scores inappropriately, if not incorrectly.

Above all, education policymakers must be consistent in what they do, for no one will believe them when they say declining scores are no indication of the health of the enterprise--if they alternately have used rising scores to claim that it is thriving.