Billiards, Bubbles, And Better Tests

By Gerald W. Bracey — October 01, 1992

Ever since multiple-choice tests were developed around 1920, they have suffered at the hands of critics. Perhaps no one has caught their essential weaknesses more innocently or more humorously than one T.C. Batty in a letter to The Times of London, published March 15, 1959:

Among the “odd one out” type of questions which my son had to answer for a school entrance examination was, “Which is the odd one out among cricket, football, billiards, and hockey?”

I say billiards because it is the only one played indoors. A colleague says football because it is the only one in which the ball is not struck with an implement. A neighbor says cricket because in all the other games the object is to put the ball into a net; and my son, with the confidence of nine summers, plumps for hockey “because it is the only one that is a girl’s game.” Could any of your readers put me out of my misery by stating what is the correct answer, and further enlighten me by explaining how questions of this sort prove anything, especially when the scholar has merely to underline the odd one out without giving a reason?

Perhaps there is a remarkable subtlety behind all this. Is the question designed to test what a child of 9 may or may not know about billiards—proficiency at which may still be regarded as the sign of a misspent youth?

Yours faithfully, T.C. Batty

Readers did not put Batty out of his misery. Indeed, they added to his woes as they penned letters to The Times, choosing among the various alternatives for reasons Batty and his friends had not thought of. No one, apparently, tried to explain what a correct answer with no reason given proved. Nor did Batty’s letter have any trans-Atlantic influence: Multiple-choice test use continued to grow until, by 1989, American children were bubbling in answer sheets at least 100 million times a year.

Recently, though, another form of testing, testing that often does require students to give reasons for their answers, has garnered much interest and attention. Often referred to as “authentic testing,” it is perhaps better referred to as “high-fidelity assessment,” or HFA. Such testing aims for high fidelity in the same way that a high-fidelity sound system is true to the ultimate in sound—live performance. High-fidelity assessment attempts to be faithful to important goals of education and the curriculum in ways that multiple-choice testing does not. The goal of this kind of assessment is to produce and judge complex performances that are meaningful in their own right, not merely to permit the selection of one answer from a set of four or five predetermined alternatives.

Not surprisingly, teachers have received high-fidelity assessment warmly. It fits what teachers think of themselves as doing in ways that multiple-choice testing, with its odd format and remote and difficult-to-interpret scores (often arriving months after testing), does not.

Although high-fidelity assessment holds much promise, its newness also means it has some problems. Multiple-choice tests have been around for 70 years, HFA only five. Thus the technology of multiple-choice testing is mature, while that of HFA is quite young. It will require years, even decades, of research and development to create a wholly satisfactory HFA technology.

In the meantime, teachers may find themselves in a state or district that adopts some form of high-fidelity assessment. How can teachers evaluate the quality of these tests? What questions should teachers ask in order to judge their validity? A large, federally funded project headquartered at the University of California at Los Angeles and the University of Colorado has been the center of much research and development activity concerning authentic assessments. One product of this effort is a set of questions teachers can use to evaluate a given approach to HFA.

All of the questions reflect a concern with some aspect of test validity. The concept of test validity has changed drastically in recent years. Once, a valid test was one that measured what it said it measured. In addition, a test that predicted some future outcome was said to have “predictive validity.” Now, however, validity is a much expanded concept. Each of the following questions represents one facet of that concept:

1. What are the consequences of using this assessment?

In the past, people seldom spoke of test consequences. The advent in the 1970s of minimum-competency testing—testing that had obvious consequences for students and sometimes teachers and administrators—changed that. If the test has negative consequences for children, of course, then it must be abandoned. There are, however, other kinds of test-use consequences that teachers must attend to. Suppose, for example, teachers spend more time teaching content that is in the assessment and less time teaching content that is not. Is this good or bad? It might be either. In getting athletes ready for a game, coaches teach to the “test.” Can educators construct analogous HFAs where teaching to the test is good educational practice, not cheating?

Consider this example: James knows that half of the students from his school are accepted at the public university nearby. Also, half are accepted at the local private college. James thinks this adds up to 100 percent, so he will surely be accepted at one or the other institution. Explain why James may be wrong. Use a diagram in your explanation.

This problem does not seem directly teachable in the way that rote application of an algorithm is teachable. In teaching about this problem, it seems that teachers would be helping children learn to think, to weigh probabilities—an important skill in real life. Note that the problem also deals with Batty’s concern that children provide reasons for their answers.
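The reasoning the James problem asks for is the inclusion-exclusion rule: counts of overlapping groups do not simply add. The sketch below illustrates this; the class size and the overlap figure are invented for the example, not taken from the article.

```python
# Illustration of the James problem via inclusion-exclusion.
# All numbers are hypothetical, chosen only to make the point.

def accepted_somewhere(at_university, at_college, at_both):
    """Count students accepted at one institution or the other.

    By inclusion-exclusion: |A or B| = |A| + |B| - |A and B|.
    Students accepted at both schools would otherwise be counted twice.
    """
    return at_university + at_college - at_both

total = 100              # assumed class size
at_university = 50       # half accepted at the public university
at_college = 50          # half accepted at the private college
at_both = 30             # suppose 30 strong students are accepted at both

print(accepted_somewhere(at_university, at_college, at_both))  # prints 70, not 100
```

If the same students tend to be accepted at both institutions, the two halves overlap, and James is not guaranteed acceptance anywhere; only when the two groups are completely disjoint does the total reach 100 percent.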

As another example of consequences of test use, suppose an assessment includes a portfolio of a student’s best work, with “best” being a collective judgment of teachers. In such a situation, students might spend too much time perfecting the entries for their portfolio to the neglect of other skills. It might also turn out that teachers have to spend too much of their time judging the portfolios to justify using them. On the other hand, some districts have found that rendering such judgments is a highly useful staff-development tool: It gets teachers talking among themselves about what really constitutes good writing.

Some school reformers have argued that American students would be better off if they learned less in more depth. Teachers must carefully judge time tradeoffs to determine if the assessment takes too much time.

2. Is this assessment fair?

Some critics have argued that the differences in test performance on current achievement tests among different ethnic groups indicate that the tests are culturally biased. Some people have expressed the hope that HFA might reduce the differences commonly found among ethnic groups. Early evidence, however, doesn’t show this to be likely. Nor does it seem likely that “functionally equivalent” tests suitable for different cultures will be developed. Indeed, that is probably an impossible task. The assessments must, therefore, be able to withstand scrutiny for unfairness.

If students have not had an opportunity to learn the material tested, then the assessment is unfair. A recent news story described how school districts in Indiana and Michigan were sending outdated textbooks to poor rural districts in Alabama and Mississippi. Even these texts were described by local Alabama school officials as a “Godsend.” These rural students will not likely score as well on any assessments as students in a wealthy suburb with up-to-date materials. An assessment that made such comparisons or that held teachers solely accountable for low scores would be unfair. If taking a test contributes to a feeling of hopelessness in economically distressed children, that constitutes another negative consequence of the test.

HFA measures complex behavior in students; teachers will have to judge that behavior. Therefore, teachers must receive adequate training so that different teachers render very similar judgments. The cost—in both time and money—of this training is yet another consequence by which the assessments must be evaluated.

3. Does the assessment cover high-quality content?

The tasks selected for assessment should be worthy of the time the students and teachers spend on them. Assessments should be systematically reviewed for quality in light of the best current understanding of the content area. Did subject-matter experts review these assessments?

4. Will the results from this task generalize or transfer to other tasks?

No teacher wants children to have simply what Alfred North Whitehead called “inert knowledge.” In an assessment, we want to know that students can demonstrate knowledge and skills in situations other than the one actually used in the assessment. This is particularly important in HFA: Because such assessments are time-consuming, there will be few of them. Providing evidence for transfer, however, is going to be difficult because research in cognitive psychology indicates that behavior varies from setting to setting. Other research finds little transfer even when problems are quite similar. Still, teachers should look for evidence that the assessment developer has tried to find indications of transfer.

5. Does this assessment assess a cognitively complex task?

Critics of multiple-choice tests argue that they emphasize facts or tiny, well-structured problems too much. Teachers should ask what it takes to perform well on an assessment. They should look to see whether the assessments involve greater emphasis on problem solving, critical thinking, and reasoning. If they don’t, there is little justification for their time and cost.

6. Does this assessment cover an adequate range of content?

No coach would ever select a quarterback solely on how fast he or she ran the 40-yard dash. Other information is important. If tests, especially high-stakes tests, do not cover certain topics, both teachers and students tend to undervalue and understudy those topics—a consequence noted in the first question. One way to avoid such curriculum deflection is to ensure that the assessment has an adequate range of content coverage.

7. Is this assessment meaningful?

Critics claim that tests don’t engage students in meaningful material and problem solving. High-fidelity assessments should, and they should be evaluated for the extent to which they do. If they bore students, that, too, is a consequence of their use which must be evaluated.

8. Does this assessment cost too much?

One great appeal of multiple-choice tests is their low cost. To critics, of course, even that money is wasted since they argue that the tests do not provide meaningful, useful information. Still, HFA often requires great investments of time and money. An assessment must be evaluated to see if the information it provides justifies such costs. HFA should provide teachers with much more information and more meaningful information about students than multiple-choice tests.

It is not likely that any current assessment system adequately meets all these criteria. Indeed, the criteria are sufficiently stringent and difficult that it is not clear that any assessment system can meet all of them. Together, though, they form a new way of scrutinizing the kinds of examinations that our children take. And they can add to the wisdom educators use to develop or select those examinations.

A version of this article appeared in the September 12, 1984 edition of Education Week as Billiards, Bubbles, And Better Tests