Commentary: Reconsidering Standards and Assessment
With all the current talk of the need for "raising standards" in education and establishing "national standards," we must exercise a bit of caution--not only because no one thinks their standards are too low, but because too many people mean nothing more than that test scores should be raised.
To see test scores as the key indicator of educational well-being is a glib response to the problem of standards. Is it too cranky to say that, in the last 20 years, the most massive investment in testing ever undertaken has coincided with a palpable decrease in the quality of education? Can an increase in testing ever yield improved quality in schools? To suggest that it can is akin to saying that more accounting results in higher-quality products or services in business, or that more taking of one's temperature will lead to better health.
Let us re-inject some common sense into the debate about school quality. Let us reconsider just what we mean when we speak of "standards" before we embark on yet another quest delivering only more standardized data collection instead of better schooling.
Standards refer to qualities, not quantities. As the history of the word reminds us, a "standard" is a set of values around which we rally; we "defend" standards. (The "standard" was the flag held aloft in battle, used to identify and orient the troops of a particular king.) A psychometric tactic has caused us to lose sight of quality: Thinking of "standards" as the setting of a cutoff hides the fact that standards represent differences in kind, not degree--desirable behaviors, not the best typical behavior (or "superior mediocrity," as John Dewey once termed it).
What are the signs of a student or school with high standards? Most observers' answer to this question would have nothing to do with the degree to which traditional "content" has been learned. A typical response would suggest that students with high standards are diligent, thoughtful, engaged, persistent, and thorough--no matter what they learn.
Such a description does not mean that students with "high standards" are merely those who happen to be in the first quintile on tests; this utterly confuses cause and effect. Rather, their work and conduct regularly display qualitative differences from those of their peers. Students with high standards resist the tendency to be satisfied with slapdash work or merely "correct" answers.
And to equate high standards in a school with high test scores is fallacious reasoning. Many quick and even gifted students are disengaged from their work. They do not perform to high standards, even if their scores seem to say otherwise--just as many schools, blessed with bright children from well-to-do families, can hide their sins behind high test scores. Other students, because of poor prior schooling or slow learning styles, may have mastered less content than their peers; so what? Their daily work may still be regularly done to the highest standards.
I do not mean that there isn't a "core" worth learning. But while a curricular framework provides standards for ensuring that students are given high-quality assignments, it can provide no guarantee that the work students produce is of high quality. That depends on their receiving exemplary assignments and assessments. When multiple-choice tests drive instruction, many students go from one teacher to the next without gaining the self-discipline required for working at a high standard. Many teachers then cite the self-fulfilling prophecy that kids cannot work to higher standards, and the vicious cycle continues.
Standards are revealed in the everyday behaviors and policies of a person or school. To have "high standards" in intellectual affairs is to live out one's virtues with consistency--particularly in the face of daily hassles or bureaucratic constraints. I am interested not in whether students can cram for a high-pressure test but in what they are wont to do when the local and state authorities aren't watching.
The good school is a community on a quest for excellence; it seeks out and encourages its laggards. We would therefore expect to find "quality" in institutions by the coherence of the institution's overall performance--the habits and behaviors regularly revealed by all its members. A school with high standards, then, is recognizable not simply by the work of its best students and teachers but by the small gap between the work of its best and the work of its worst. Where is an accountability program that honors this basic truth about institutional quality?
We regularly confuse "standardized" tests with tests checking whether students' work is "up to standard"; we confuse "standards" with "standardized measures." A "standardized" test provides a uniform set of procedures and questions for purposes of making "valid" comparisons. It is a different matter--a matter of intellectual values--to ask: "What are 'standards of intellectual conduct' and do students display them in whatever work they do? What 'measures' will do most justice to our 'standards'?"
Seeing if we are "up to standard" does not require measures that are rigidly "standardized." Psychometric considerations now so dominate test design that the demand for uniform "measures" has corrupted the "standards." By defining the standard in terms of a standardized measure, we have turned quality on its head.
This statement seems less polemical and more insightful when we recapture common sense. Recall how the best colleges, graduate and professional schools, and businesses determine whether student or employee work is up to standard. Students and employees are judged on their own idiosyncratic, contextualized performance. The process may be uniform, but the questions and possible answers are not standardized. In most other countries, in fact, the major student assessments require extensive, open-ended writing; the tasks are of high quality, and do not require and will not yield machine-readable answers.
Answers to multiple-choice test questions cannot reveal students' qualities of mind and action. I learn little about students' standards from a test in which the right answer need only be recognized among the choices offered--whether or not students choose it correctly. Standards are revealed in the manner by which work is approached and completed, over time. The only "quality" visible in such test results is the rightness or wrongness of answers. No multiple-choice item can show whether a student's answer derived from thoughtful understanding and good habits, dumb luck, or native cleverness hiding bad habits; no "norming" process can substitute for examining the student's habits directly.
At the highest levels of policymaking, there is confusion on this point. Many educators still erroneously think that changing test items--the "input"--enables us to better guarantee the quality of students' work--the "output." For example, according to the Dec. 13, 1989, issue of Education Week, the governing board of the National Assessment of Educational Progress "is considering a plan to set national goals for student performance. ... Under the plan, the board would determine the skills and knowledge, as measured by NAEP test items, that ought to be mastered at each grade level."
The staff proposal "recommends that the board establish an advisory panel to examine the actual questions on the 1990 math assessment and determine which ones students need to answer correctly in order to reach the different performance levels."
This is confused thinking, driven only by a desire for expediency in assessment. The NAEP scales are a great idea, though still needing technical and empirical work; they are, in principle, necessary but not sufficient for judging quality. Compare this use of such scales with their use in diving or gymnastics: We do not assume that higher quality automatically results from adequately tackling more difficult work. The athlete's score is determined by multiplying the degree of difficulty by the score for the quality of the performance.
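The diving analogy above can be made concrete with a toy calculation. This is a hypothetical illustration of the multiplicative principle the text describes, not any federation's actual scoring table; all numbers are invented:

```python
# Hypothetical sketch of the diving/gymnastics scoring principle: the score
# is the degree of difficulty multiplied by the judged quality of execution.
# Difficulty alone guarantees nothing; all values below are invented.

def performance_score(difficulty: float, execution: float) -> float:
    """Score = degree of difficulty x judged quality of the performance."""
    return difficulty * execution

# A hard dive done poorly can score below an easier dive done well.
hard_but_sloppy = performance_score(difficulty=3.5, execution=4.0)   # 14.0
easier_but_clean = performance_score(difficulty=2.0, execution=9.5)  # 19.0
assert easier_but_clean > hard_but_sloppy
```

The point of the multiplication is exactly the author's: tackling more difficult work raises the ceiling, but only the quality of the resulting performance determines the score.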
Until we have tests that center on the qualities of students' answers, we will lack the evidence necessary to judge whether a high score is good enough.
The technology of multiple-choice tests means that test "items" can never meet intellectual standards, even if they meet psychometric standards. "Items"--as the very word implies--can never be exemplary assessment tasks, even if they sample from an exemplary domain of "content." Students need only recognize the right answer from the multiple choices.
But can they produce high-quality work when the cue of the four choices is removed? Do they possess the skill required to call forth and integrate the bits into a whole, or the good judgment to know which element of their repertoire is required when? Do they have the discipline to persist in fashioning that whole?
The question for quality-assessment design must therefore be: What kinds of tasks are worth undertaking?
Building tests on the framework of a bell curve of results ensures that the system as a whole will never have high standards. Raising standards means involving students and adults in recalibrating their efforts against specified criteria of masterful performance, and judging success by the progress they all make in moving toward exemplary performance.
There is no reason that deliberate instruction should yield a standard spread of test results statewide, as if educational effects were random. Indeed, we will have succeeded in raising standards only when we alter the shape of that curve: the bulk of the scores should shift toward the high end.
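The contrast the preceding paragraphs draw can be sketched as two judgment rules: a norm-referenced judgment ranks a student against the cohort's curve, while a criterion-referenced one compares the work to a fixed standard. The scores and cutoff below are invented for illustration:

```python
# Sketch of norm-referenced vs. criterion-referenced judgment.
# All scores and the cutoff are invented for illustration.

def percentile(score, cohort):
    """Norm-referenced: fraction of the cohort this score exceeds."""
    return sum(s < score for s in cohort) / len(cohort)

def meets_standard(score, cutoff):
    """Criterion-referenced: does the work reach the fixed standard?"""
    return score >= cutoff

cohort = [55, 60, 62, 64, 66, 68, 70]
improved = [s + 20 for s in cohort]  # the whole cohort improves

# Against a fixed standard, everyone can now succeed...
print([meets_standard(s, cutoff=75) for s in improved])  # all True
# ...but percentile ranks are forced to stay spread out regardless:
print(percentile(improved[0], improved))  # still 0.0 for the lowest scorer
```

This is the sense in which building tests on a bell curve of results guarantees that the system as a whole can never register collective progress.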
True progress depends on identifying shared, stable standards that illuminate daily work. At present, there are neither incentives nor structures to ensure that grades correspond to fixed performance criteria. There is no "inter-rater reliability" among teachers: What gets an 86 in one room can get a 72 down the hall, never mind at a different school. It was precisely the unreliability of the transcript that led to standardized tests in the first place.
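The missing "inter-rater reliability" the paragraph above describes can be sketched with a toy agreement check. The scores below reuse the essay's 86-versus-72 example but are otherwise invented; real assessment programs use statistics such as Cohen's kappa rather than this simple exact-agreement rate:

```python
# Minimal sketch of checking inter-rater reliability: how often do two
# raters give (nearly) the same score to the same papers? Scores and the
# tolerance are invented for illustration.

def agreement_rate(rater_a, rater_b, tolerance=0):
    """Fraction of papers on which the two raters' scores differ by at
    most `tolerance` points."""
    pairs = list(zip(rater_a, rater_b))
    agree = sum(1 for a, b in pairs if abs(a - b) <= tolerance)
    return agree / len(pairs)

# The essay's example: an 86 in one room, a 72 down the hall.
room_one = [86, 90, 78, 65]
room_two = [72, 88, 80, 64]
print(agreement_rate(room_one, room_two, tolerance=3))  # 0.75
```

Without some such check tying grades to fixed performance criteria, an 86 simply does not mean the same thing from one classroom to the next.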
Then, to the confusion of practically everyone, states and districts periodically "re-norm" their standardized tests, and scores go down. How can student or teacher performance improve under such conditions?
To assume that schools of "high standards" are those where teachers grade on a steep standard curve--as many "demanding" teachers do, and think they ought to--can only increase the gap between best and worst scores without improving quality.

The most effective strategy for raising school standards is to devise a system that rewards schools for grading with criterion-referenced standards and achieving a high degree of inter-rater reliability, and that rewards students for progress according to these standards.
Standards cannot be raised unless they are demystified. Our years of being subjects of and then abettors to "secure" tests have dulled us to the foolishness of using secret standards and measures.
Secrecy is dysfunctional if the aim is bettering performance. Imagine trying to improve as a basketball player if one were "tested" by playing the heretofore secret game on the last day of a basketball "course."
The use of tests from outside vendors--and the "security" that protects the product's marketability, not just its validity--ensures that neither students nor teachers possess what they most need to raise standards: models of exemplary assessment and performance. They must know what mastery at exemplary tasks actually looks like--very different from our current practice of summarizing the curriculum and the general outcomes we intend, and following up with "secure" tests. The ultimate test of whether standards are demystified is whether students and teachers can accurately assess their own work on a regular basis; to do so, they must be able to compare their performance with exemplary work.
Higher standards do not mean higher dropout rates. The claim that they do, made by many critics of school-reform efforts, betrays a fundamental confusion about what a "standard" is. A standard is an exemplar; whether few, many, or all students can meet or choose to meet it is an independent issue, calling for separate strategies and incentives. We must first have a rich vision of the possible.
At least give NAEP credit for trying to invent stable criterion-referenced scales, despite the technical nightmares of doing so with matrix sampling and cross-age, multiple-choice testing. In NAEP's mathematics results, we discover, for instance, that only 6 percent of American students can work at "level 350," the highest level of problem-solving found in the test. The point is not to lower this standard because it is too hard to meet, but to set targets for the number of students who will meet the appropriate standard in the years ahead. (My enthusiasm for these scales should not, however, be construed as endorsement of the tests themselves.)
Any successful school reform must reinforce the belief of students, teachers, and parents that the tasks and scoring criteria are in reach and worth mastering--and not simply because the state says so. The aim of assessment must be for students to internalize not "our" standards but "the" standards--just as happens in judging diving and debate.
Immature or weak students respond to such a system: Watch them play sports, musical instruments, and computer games. They may "fail" regularly at replicating the performances of their models and heroes, but the absence of invidious personal comparisons and the opportunity to gauge progress according to clear standards ensure that the proper incentives exist for their continued striving toward competence.
The commitment of President Bush and the nation's governors to establish national goals for education raises the question of how progress toward such goals should be assessed. If we intend to measure the academic achievement of students and programs, norm-referenced, multiple-choice tests will not do. For a more accurate gauge--one that won't narrow the curriculum--we need performance assessments.
Also called "authentic" or "alternative" assessments, these forms of evaluation directly measure actual performance in academic subjects. Norm-referenced, multiple-choice tests, in contrast, measure only test-taking skills directly, and little but good guessing indirectly.
Performance assessment focuses on activity, as opposed to the passive bubble-filling of multiple-choice tests. Its most familiar form is writing, already a standard feature of program assessment in 28 states. A writing component is currently being tested by the Educational Testing Service for use in the Scholastic Aptitude Test. Other techniques of authentic evaluation are also gaining ground. Open-ended questions in mathematics require students to write extended answers or draw their response to a problem. The California Assessment Program has included these kinds of problems in its 12th-grade test for the past two years, and in 1990, two-sevenths of the National Assessment of Educational Progress mathematics questions will be open-ended.
New York State assessed the science skills of its 4th graders in May 1989 with the performance-based Manipulative Skills Test. Connecticut has prepared a spectrum of performance tests, including vocational-skills assessments that were designed with the help of students' potential employers--business and industry. That state is now developing science and mathematics performance assessments that will evaluate its "common core of learning."
And the use of portfolios--collections of student work in any subject--is winning widespread attention. Vermont now includes them in its state assessment. The ETS is developing workshops on portfolios in writing, and Harvard University's Project Zero is working with arts teachers in Pittsburgh public schools on portfolios of creative writing. This approach could prove to be the most powerful of all performance assessments, if a reliable way of using it for accountability can be devised.
The Coalition of Essential Schools is spreading the use of "exhibitions" that demonstrate student mastery of whole curricular units. The Matsushita Foundation recently held a conference to urge its partnership schools in seven states to change to performance assessment as part of restructuring.
The variety of performance assessments is itself a virtue. Multiple-choice tests are all the same, whether they measure grammar, geography, mathematics, or reading, evoking from students glazed ennui, if not physical absence. But performance testing has what experts call "face validity": There is an obvious correspondence between one's understanding of a subject and the means of testing it. Since subjects differ, so do tests. They can also differ within a subject: Some history assessments are essays; others are mock trials.
There are two other reasons why performance assessments are gaining favor: They evaluate thinking, a major concern of American businesses dismayed by employees who cannot solve simple problems; and they enlarge rather than constrict the possibilities for classroom teaching.
It is a simple fact of life that assessment drives teaching: What gets tested is what gets taught. This is fine when testing involves, for example, the writing of essays that require the thoughtful combination of concepts with a wide range of details. But when teachers know that their students will only have to fill bubbles with No. 2 pencils, they concentrate on teaching discrete facts. Multiple-choice testing and its corollary, multiple-choice teaching, are major contributors to the boredom that pervades American classrooms.
In describing the undesirable tests, I have carefully avoided the term "standardized." Performance assessments can and should be standardized--in the sense of providing standards of achievement. Grading is not difficult with performance assessments. The products--whether written, drawn, recorded, or even videotaped--are scored according to a rubric that lays out the qualities required for each grade. Even when scores are assigned by direct observers of the performance--in the case, for example, of science experiments scrutinized by trained teachers--the results can be aggregated and statistically manipulated, just as conventional test scores are.
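As a sketch of the rubric scoring and aggregation just described: each product is judged against criterion descriptions, and the resulting scores can then be aggregated and manipulated like conventional test scores. The rubric levels, descriptors, and scores here are invented, not drawn from any actual assessment program:

```python
# Hypothetical four-level writing rubric and a simple aggregation of the
# scores it yields. Levels, descriptors, and scores are all invented.

RUBRIC = {
    4: "thesis sustained with well-chosen, integrated detail",
    3: "clear thesis, adequate supporting detail",
    2: "thesis present but thinly supported",
    1: "no discernible thesis or support",
}

def summarize(scores):
    """Aggregate rubric scores the way conventional test scores are."""
    return {
        "n": len(scores),
        "mean": sum(scores) / len(scores),
        "at_or_above_3": sum(s >= 3 for s in scores) / len(scores),
    }

essay_scores = [4, 3, 3, 2, 4, 1, 3]  # one trained rater's judgments
print(summarize(essay_scores))
```

The statistical machinery is unremarkable; what changes is that each number stands for a described quality of work rather than a count of right answers.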
A related benefit of performance assessment lies in the common practice of having groups of teachers grade the tests: These teachers gain the enormous advantage of seeing the effects of their instruction. Because they thus face fundamental questions of teaching, their participation in the design, administration, and scoring of performance assessments is perhaps the most effective form of professional development. The money that a state or school district would have spent on commercial tests can be put into assessment strategies that foster professional growth. It's a short cut to teacher empowerment.
This extra benefit will be lost if districts buy the packaged "performance assessments" that test publishers--who, like the ETS, have seen the handwriting on the wall--are now issuing. Many educators will have noted, for example, the colorful advertisement for the Stanford Writing Assessment Program offered by the Psychological Corporation, Harcourt Brace Jovanovich. According to the ad, which associates writing with jungle adventure, this testing package uses holistic and analytic scoring, and returns the results to teachers as a basis for language-arts activities. But the teachers will not have read the papers or enjoyed the intellectual challenge of developing scoring guides to capture the essence of good writing.
The federal government has also sensed the wind. It now mandates nationally norm-referenced tests for reporting progress in Chapter 1 programs. But it has a committee of psychometric experts looking at possible equivalence between the required tests and the performance assessments state and local authorities might wish to use.
NAEP is moving toward performance assessment, but the elements it has incorporated so far--writing samples, open-ended tasks in mathematics and reading, and even a "writing portfolio" in 1990--are best characterized as tokens. The assessment is not yet the flexible, comprehensive instrument needed to authentically reflect progress toward national goals. In fairness, NAEP would need adequate funding from the Congress to do the job. Its puny budget is $4 million short of the amount needed to analyze data collected in 1990.
California is the first state to have declared a policy of shifting its testing program to performance assessments. Regional and county administrators have attended conferences to introduce them to the idea and enlist their help with the experiments and field tests needed to develop the tests. The development of the 8th-grade writing assessment has set a precedent for grassroots participation. In the handbook for the tests, four pages are needed to list the names of all the teachers and administrators who assisted with its development, field testing, and annual scoring.
A forthcoming statement to be signed by at least 20 organizations will urge President Bush and the governors not to use multiple-choice tests to measure progress toward national goals, but to require performance assessments. The cover letter will be signed by the Center for Fair and Open Testing, the Council for Basic Education, and the National Association for the Advancement of Colored People.
What may make the climate more favorable now for the shift to performance assessment is the widespread perception that teaching and testing are out of sync. The writing process introduced across the country by the National Writing Project has made multiple-choice tests of writing laughable. The pressure to introduce "thinking skills" has become pressure to evaluate them appropriately. The spate of revelations about what our students don't know has forced examination of what they're being taught and how.
A switch to performance assessment has the potential to benefit American education in three ways: It reveals the presence or absence of thoughtfulness and understanding, not simply of memorization; it requires the teaching of a thinking curriculum to all students; and, by involving them in assessment, it empowers teachers.
Grant Wiggins is director of research for Consultants on Learning, Assessment, and School Structure (CLASS), based in Rochester, N.Y.