Before we get too immersed in the details of precisely why standardized-test scores have increased or decreased in a specific school or within a district, several overarching and critically important points should be understood concerning the basic underpinnings of all such assessment tools.
As one of the founding members of the Association of Black Psychologists, the group largely responsible for minimizing the use of IQ tests in schools, I continue to find the testing debate fascinating, even though it has taken several politically irresponsible turns of late. Why is it, for example, that those who know the least about learning and child development are the most vocal about the value of standardized tests, and are often the ones most listened to? Why is it that those parents whose children have been declared the “winners” on such tests continue to pressure schools and politicians to expand their use and importance? Several factors that warrant inclusion in this debate will help answer important parts of these questions.
Standardized tests were not brought down from the mountaintop tucked under Moses’ arm next to the Ten Commandments. They are not divine creations. Commercial testing companies develop them with several clearly stated objectives driving their structure and architecture. Standardized tests were never intended to measure educational quality, nor to serve as gauges of teaching excellence. In many ways, they are grossly inappropriate evidence-seeking instruments for evaluating “quality” in any school. This is a clear case of educational mistaken identity.
There are several fallacies in our thinking about these high-stakes tests. Consider, for example, the predictive value of the SAT, one of the first and most influential standardized tests. That value was originally based on the test’s correlation with success during the freshman year of college only, not over an entire college career. Instruction in the mostly New England-based, college-preparatory, males-only schools for which the test was devised aligned perfectly with the first-year coursework in Ivy League colleges, so these tests received wide acceptance as predictors of college success. In many ways, the Ivy League freshman year was intended to be the logical “next step” for these students, and the very purpose of a college-preparatory education. Ivy League schools were the colleges to which “college preparation” for these boys had been directed. The test’s results regularly overpredicted college success for men in general and underpredicted the college performance of women. Over time, a pattern of higher SAT scores for boys, but higher grade point averages for girls (in high school and later in college), came to disturb educators and parents of girls.
Seldom was it mentioned, either, that these test scores were more honestly reflections of the economic advantages and disadvantages seen in American society. High scores have had a high correlation to socioeconomic characteristics such as the parents’ occupation or level of education, the family’s income bracket, and the location of students’ elementary and secondary schools (the highly predictable “ZIP Code” factor). Family income plays such a prominent role in test scores that some testing analysts have facetiously proposed gauging something they call the “Volvo Effect” as a way to save vast amounts of money on standardized tests. Simply count the number of Volvos, sport utility vehicles, and comparably priced luxury cars used to transport students to and from a given school, and use that figure to measure “school quality.”
Suburban students who take Advanced Placement and other college-preparatory courses also consistently earn higher scores on the SAT and the ACT exams than do their inner-city counterparts. It is not unusual, however, for inner-city high schools to offer fewer than a fifth of the number of AP courses that suburban schools in the same district offer. Thus an inner-city student with high academic potential is given limited access to advanced levels of skills development. Such “ZIP Code” circumstances have a negative impact on his or her college-entrance-exam scores.
Let’s be honest. If poor, inner-city children consistently outscored children from wealthy suburban homes on standardized tests, is anyone naive enough to believe that we would still insist on using these tests as indicators of success? Would we continue to advocate the use of such tests if there were evidence that they presented inner-city students with a sizable edge in the distribution of future job opportunities? We would either abandon such a test or drastically modify it until it generated more “acceptable” results.
Standardized tests, by design, were never intended to be “accountability” yardsticks. In fact, they do precious little to measure educational accountability. The chief goal of standardized tests is to spread students along the performance continuum in ways beneficial to those who deem the tests’ results valuable. The more the test (and each test item) spreads out test-takers, the more valuable it is in differentiating students.
Before the first bubble is filled in, we know by historical data (and ZIP Code) how certain schools will stack up in the standardized-testing process. We can predict test results with fairly high degrees of accuracy.
Worst of all, we have tacitly agreed which students we will allow to fall to the lower end of this sorting process. Even as we announce that we will “leave no child behind,” we are aware that many students will be left far behind, and we have made an advance determination of who they are and where they live.
Statistically, some schools and students must fall toward the lower end of the performance range. Yet, in low-performing schools, these students’ administrators and teachers are now being unwisely and unfairly threatened with job termination. Many of us are asking, “Why would we use tests that are carefully and deliberately designed to produce performance variances, and then punish schools that help prove the tests have met their goals?” Only in Garrison Keillor’s fictitious Lake Wobegon will one find a place where “all the kids are above average.” Notwithstanding the rhetoric of many politicians, critics of public education, and other modestly informed contributors to this discussion, a goal of 100 percent of our students scoring above the statistical average is impossible on any test.
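The Lake Wobegon point can be checked with simple arithmetic. The sketch below is only an illustration: the ten scores are made up, and the helper name `share_above` is hypothetical rather than drawn from any testing program. It counts what fraction of a group scores strictly above the group's own average.

```python
from statistics import mean, median

def share_above(scores, center):
    """Fraction of scores that fall strictly above the given center value."""
    return sum(s > center for s in scores) / len(scores)

# Made-up scores for ten hypothetical students (not real test data).
scores = [52, 61, 67, 70, 74, 78, 83, 88, 91, 96]

print("share above the mean:  ", share_above(scores, mean(scores)))
print("share above the median:", share_above(scores, median(scores)))

# Even if every student earned the same high score, nobody would be
# "above average," because the average would rise to that same score.
identical = [95] * 10
print("share above the mean when all scores match:",
      share_above(identical, mean(identical)))
```

However the made-up numbers are changed, the share above the mean stays below 100 percent, and the share above the median never exceeds half.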
Producing a wide range of student scores is essential on standardized tests. If most scores were bunched together around one area, there would be no way to make any judgments about hundreds of thousands of test-takers. Variations in scores are vitally important. Standardized tests must produce what appear to be achievement, intelligence, or performance differences, or students cannot be assigned a place or rank. But precisely what those differences truly indicate is a better question to ask.
A third factor to consider is this: Test items that are impervious to high-quality classroom instruction are the items most likely to remain on a standardized test. Only items that show evidence of helping to distribute youngsters across the performance spectrum are allowed to remain. Important content and skills mastered by most kids at a given grade level will invariably be replaced by items that some students get right and many others get wrong. Test developers know that standardized-test scores reflect as much of what a child has learned outside school as what he or she has learned in school, rendering test items that fall into the first group far more valuable during test construction.
Suppose, for example, that all of the country’s 3rd grade teachers did an excellent job of teaching a particular mathematical concept, and that, consequently, all 3rd grade students gave the correct response to a test item reflecting that concept. That particular test item would have to be eliminated from the test, because it would not promote the distribution of student respondents.
If masterful teachers achieve a high efficiency level in teaching a particular skill, and 98 percent of the students taking the test respond with correct answers, that test item will be removed during a revision of that standardized test. It does not contribute to the goal of score variation. Conversely, if 98 percent of the students gave an incorrect response, the item would have little testing value either. But if 50 percent of the students answered the item incorrectly, then it would help in distributing students along the performance continuum. We need to remember, however, that 98 percent correct would tell us that our teachers have been engaged in effective instruction. Ironically, in the business of test construction, such success would backfire on teachers.
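One way to see why an item answered correctly by half the students spreads scores the most is to treat each item as scored 1 for right and 0 for wrong. Under that assumption, the item's variance is p × (1 − p), where p is the share answering correctly. The sketch below applies this textbook simplification; it is not the procedure of any particular testing company, and the function name `item_variance` is hypothetical.

```python
def item_variance(p_correct: float) -> float:
    """Variance of a right/wrong item (scored 1 or 0) when a fraction
    p_correct of test-takers answers it correctly."""
    if not 0.0 <= p_correct <= 1.0:
        raise ValueError("p_correct must be between 0 and 1")
    return p_correct * (1.0 - p_correct)

# Compare the three cases discussed above: 98%, 50%, and 2% correct.
for p in (0.98, 0.50, 0.02):
    print(f"{p:.0%} correct -> variance {item_variance(p):.4f}")
```

The item that 98 percent answer correctly and the item that 98 percent miss contribute the same small variance (about 0.02), while the 50 percent item contributes the maximum possible (0.25), which is why it is the one most likely to survive a revision.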
Several years ago, a study conducted at the University of Michigan showed that there was a 20 percent to 50 percent correlation between classroom instruction and test content, meaning that there was a 50 percent to 80 percent mismatch between what was being tested and what was being taught. This mismatch helped with the variations in scores and was thus beneficial to test-makers.
The last point to consider is that no one has yet been able to calibrate accurately the precise degree to which any given factor (teacher effectiveness, a principal’s being an instructional leader, primary language background, socioeconomic status, even health and nutrition) affects the final test scores. In cases where a considerable amount of compounded causality is obvious, such as among poor students in inner cities, how can we blame, praise, or punish teachers and schools?
It is widely acknowledged by test-development experts that higher socioeconomic backgrounds give students a positive boost in standardized-test achievement. When a test question asks, “What instrument would you use to look closely at the moon?” children from poor, inner-city environments may never have seen a telescope in school or at home. Growing up in an environment in which exposure to certain kinds of information is unlikely thus penalizes students on these tests. A child from a high-income suburban environment, on the other hand, has likely seen and used a telescope in his own home, in a neighbor’s home, or at a planetarium, or has learned about telescopes while watching the Discovery Channel with Mom and Dad, or through a host of other opportunities largely unavailable to the child from the lower socioeconomic setting.
Likewise, children from limited-English backgrounds invariably score lower on reading and language fluency tests. On mathematics tests, they may perform well on computational portions but demonstrate more difficulty with word problems. When students cannot reduce a particular word problem to the same computation form in which they have already demonstrated proficiency, it is clear that language background interferes with problem-solving. Spanish-speaking children in California typically perform better on mathematics test items less dependent on language background.
If a child enjoys the luxury of having his own bedroom, reads frequently, and is read to, he again benefits from a “suburban advantage.” When we know that living in a specific environment influences test-score outcomes, why pretend that we are measuring school quality? Perpetuating the notion that these are fair and unbiased tests is knowingly subscribing to a myth.
Our preoccupation with multiple-choice, fill-in-the-bubble standardized testing has taken on dangerous new dimensions with the introduction of these accountability and educational-quality justifications for its use. Can any meaningful concept be reduced to a bubble response? Can such a reduction then be used as a valid assessment of superior levels of cognition? As Louis Albert, the former vice president of the American Association for Higher Education, has said, “It is deep and long-lasting learning that we are after.”
Much of what is found on standardized-assessment tools would not suggest such deep understanding. High ideational complexity, inventiveness, applying one’s ingenuity and creativity in problem-solving (genuinely high standards) cannot be converted into the “bubblized” property of standardized tests.
And what about the cultivation of important traits such as perseverance, intuition, adaptability, responsibility, sensitivity, empathy, self-control, honesty, trustworthiness, healthy self-confidence, motivation, effective communication skills, open-mindedness, generosity, creativity, originality, cooperativeness, kindness, commitment, loyalty, friendliness, emotional maturity, and inventiveness? While none of these can be “measured” on a standardized test, any parent, prospective employer, or educator would gladly exchange 100 individuals who tested high in long division for one who exhibited such characteristics.
That which is quantifiable is sometimes devoid of significant educational, personal, or social value. And the assessment tools currently being used are not capturing the best indicators of the traits, characteristics, and skills we need to encourage in our young people. Although these may defy easy or precise calibration, they may be of far greater educational value for students in the long run. Let’s give more time and attention in our schools to the many other talents that matter in an enlightened society. It’s time for a serious examination of our unbridled faith in standardized tests.
Kenneth A. Wesson is the executive assistant to the chancellor of the San Jose/Evergreen Community College District in San Jose, Calif.
A version of this article appeared in the November 22, 2000 edition of Education Week as The ‘Volvo Effect’