Carnegie-Funded Project Aims To Improve Teacher Assessments
As the planning group charged with creating a national teacher-certification board tries to complete its work before the next school year, researchers at Stanford University are seeking "creative'' ways to tackle the proposed board's stickiest problem: how to measure a teacher's knowledge and ability.
Backed by an $817,000 grant from the Carnegie Corporation of New York, the Stanford University research team is in the midst of a 15-month project to develop sample exercises that will provide a better gauge of the complexities of teaching than current teacher tests.
The set of exercises, or "prototypes,'' would put teachers through such representative—but seldom measured—tasks as evaluating a textbook, managing a disruptive student, and making out a lesson plan.
The project is not aimed at creating a national teacher test, said Lee S. Shulman, the principal investigator and professor of education at Stanford. But the prototypes developed, he said, should help the board make "the tough choices it's going to have to make about what kinds of assessments to commission.''
The Stanford researchers have focused their efforts on two very narrow sections that might be included in such national teacher tests: the capacities needed to teach fractions to elementary-school students and the American Revolution to high-school students.
The narrow focus was chosen, Mr. Shulman said, because, "if we can't demonstrate that we can do one of these extremely well, it's unlikely we'll be able to do a whole assessment.''
The work of the 33-member planning group for the national board, which held its fourth meeting last week, is expected to be completed by late summer or fall.
By Christmas, Mr. Shulman and his research team will give Carnegie, in addition to the two prototypes, 16 commissioned papers on various aspects of assessment; a detailed description of how they went about their work; and case studies of exemplary mathematics and history teachers. (See related story, page 19.)
'More Like Teaching'
According to Suzanne Wilson, the project's director, the exercises the research team envisions "would look more like teaching than what now exists'' in other teacher tests.
For example, one part of the assessment might include "the critical analysis of textbooks,'' she said, "because that's something that teachers have to do.''
"They use textbooks. They think about them. They adapt them. They modify them for their teaching.''
"Similarly,'' she said, "you can think about having them grade papers, evaluate student performance, evaluate other teachers' performance, or perform themselves.''
For example, in one potential exercise, teachers would be asked to review a videotape of themselves teaching a lesson and then to discuss their objectives and the activities that had led up to the lesson. Scoring would be based on their reasoning, as well as on their classroom performance.
In another scenario, teacher candidates would watch a videotape of a class, during which students would ask the teacher questions. The tape would pause periodically, giving the candidates time to explain how they might respond.
A third proposed exercise, in high-school history, focuses on how teachers use source materials. After receiving documents from a historical period, test-takers might be asked to edit them for use in class, for instance, or to explain the misconceptions that students often have of a particular historical event.
No One Answer
Unlike traditional multiple-choice tests, the exercises are based on the premise that teaching is complex, and that there is no single right way to teach, the researchers said.
According to Ms. Wilson, "There are multiple ways of teaching something,'' and teachers should be given a chance to justify and explain their choices.
"In part, this project is an effort to produce a much more complex account of teaching, and a corresponding set of assessments,'' explained Gary Sykes, the project's associate director and professor of teacher education at Michigan State University.
"There are really two ways of thinking about it,'' he said. "If you're a critic, you will be able to look at this effort against an ideal and say, 'You've fallen short. Teaching is still much more complicated than you've been able to figure out how to measure.'''
"If you're a pragmatist,'' he continued, "you'll say, 'This effort is bound to produce a leap forward from where we are now.'''
Current teacher tests, Mr. Sykes argued, "leave out an awful lot that is associated with good teaching.''
Such tests are based primarily on research into the generic teaching skills that effective teachers share.
By attempting to span many teaching situations, Mr. Sykes said, such research "sacrificed ... attention to specific contextual factors: the kinds of kids the teachers were encountering; the particular subject they were trying to impart; and so on.''
Nevertheless, he noted, policymakers rushed to translate that research into tests, evaluation guidelines, and rating forms.
Used alone, he said, such tests have "tended to trivialize and reduce teaching to a few measurable, but 'scientifically validated,' kinds of skills.''
"That's about where we are now,'' he said. "But everybody is recognizing that a lot of that just sells teaching terribly short.''
Added Mr. Shulman: "The missing feature of teaching, in all of the research, and in the policies built upon research, has been the recognition that teachers are engaged in teaching something in particular, to particular kinds of kids.''
"The essence of teaching,'' he contended, "is taking something that you already understand and transforming it in a way that makes it meaningful to young people.''
Some researchers now speculate that one reason so many minority candidates fail teacher tests is that the exams are too far removed from classroom situations.
While not aimed specifically at minorities, Mr. Shulman said, exercises that emphasize a teacher's performance should help to reduce minority failure rates.
"I think it's fair to say that minority performance is one of the central preoccupations of our work,'' he added, "because, if this whole standards-setting and assessment process further exacerbates the already unacceptable circumstances of minority participation in the teaching force, then the results will be totally unacceptable.''
"We've simply got to figure out how to simultaneously raise standards and increase minority participation in the field,'' he said, "and I'm not prepared to view those as contradictory goals or intractable problems.''
San Diego Team
In May, the research team is planning a "working seminar'' that will focus on the question of minority performance. In addition, it has subcontracted with a group of professors at the University of California at San Diego to examine the problem.
"In our design activities all along,'' Mr. Shulman said, "we have kept asking ourselves, 'What are the features of any exercise that we design that are most likely to have an adverse impact on minority performance, and how could they be avoided?'''
Edward Haertel, another associate director of the research project and a professor of education at Stanford, said that in creating the exercises the researchers are trying to exclude activities that are not central to the act of teaching, such as talking in academic terms or giving extended prose responses.
The San Diego team will test the exercises on minority candidates. Based on the outcomes, they will modify the exercises to weed out features that are not intrinsic to the task of teaching, but that do lower minority scores. They will also try to highlight ways to minimize bias in future exercises.
"All of these things, taken together, by no means will ensure that passing rates are going to be equal for all identifiable groups,'' Mr. Haertel said. "All that we ask is that the assessment accurately reflect the proficiency of teachers in the classroom in the same way, irrespective of any personal characteristics.''
To help design and critique the two prototypes, the researchers are relying on a 21-member teacher advisory board and two expert panels—one in elementary mathematics and one in secondary-school social studies and history.
Serving as co-chairmen for each panel are a scholar in the field and a practitioner. Members include subject-matter specialists, teachers, teacher educators, cognitive psychologists, philosophers, and testing experts.
Minority groups are represented on all three panels, and in the case studies of exemplary teachers.
Researchers at the University of California at Berkeley and at the University of Pittsburgh also have subcontracts with the Stanford team to assist in, among other tasks, the development of instructional materials and the case studies.
In addition, the project has tried to build on the innovative assessments now used or being developed by other professions, including architecture, the Foreign Service, law, and medicine.
"We're trying not to copy or emulate any of them,'' Mr. Shulman said, "but to learn from their experiences.''
In September, the Stanford team served as host for a four-day workshop on assessment technologies to identify the range of methods used to license and certify practitioners in other occupations.
"What we envision,'' Mr. Haertel said, "would be comparable in sophistication and complexity to what's done in some other areas, like architecture, where the trainees who become certified actually prepare plans for a building and defend those plans, as well as completing knowledge examinations in a number of specific areas.''
Among the more innovative methods being tried in other fields are videotapes that interject a candidate into a simulated work situation; "in baskets'' and "out baskets'' that require a candidate to make decisions based on real-life documents; and structured small-group interactions that measure, among other things, a candidate's ability to work collaboratively and to play a leadership role.
Figuring out how to score such exercises presents a difficult task, the researchers acknowledge.
"Our sense is that we will score most of these exercises in a more or less holistic fashion—which is to say, we might rate performance either as pass-fail, or on a scale of two or at most four levels of proficiency,'' Mr. Haertel said.
Each exercise would also be rated on only one or two dimensions.
Mr. Sykes compared such scoring to the kind of qualitative decisions now made by judges in ice-skating or diving competitions.
"The judges have carefully written criteria,'' he said. "But when they go into the short and long programs, style—that ineffable quality in human performance—becomes paramount.''
He argued that there are many techniques for improving the reliability with which judges make such decisions.
"You should read the stuff about livestock judging,'' he said. "There are whole training programs for these people—carefully developed criteria, and so on—because that's big business.''
"We don't spend one-tenth the amount of money in our society figuring out how to judge good teaching that we do in preparing people to judge livestock, for God's sake.''
Another decision the researchers have grappled with is whether to gear their exercises toward new teachers, those with a few years of experience, or seasoned veterans.
Eventually, the proposed national board will make all such choices. But while the researchers are trying not to pre-empt the board's prerogative, they have had to make some tentative decisions to proceed with their work.
Currently, most of the exercises are geared toward good teachers with two to four years of classroom experience.
"That might mean,'' Mr. Shulman said, "that, if you took a random sample of teachers who had been out for two or three years and gave them this full set of assessments, perhaps 60 percent of them would pass the first time, but 40 percent might have to retake it.''
"It's really up to the board,'' he added. "I like the idea of a test occurring about this time, after a couple of years of work, so that the values represented in the assessment can get reflected in the workplace for beginning teachers.''
In addition, he argued, timing the tests to occur only a few years after graduation from a teacher-education program would mean that the exams would be likely to influence what those programs look like.
Currently, about a dozen exercises are at various stages of development, and they are being pilot-tested with teachers in California, Connecticut, Oregon, Michigan, and Pennsylvania.
Field testing of the elementary-math exercises will occur in late July; the history exercises will be tested in early August.
Each field test will bring about 20 teachers to Stanford for four days—some to serve as evaluators and observers, and some as test-takers.
They will include beginning teachers, minority teachers, experienced teachers, and a small number of people who have no training or experience in teaching, but who are well-versed in a particular subject area.
The idea, Mr. Haertel said, is to see "if there are predictable differences in the levels of performance of expert teachers, who've been practicing for a long time, versus beginning teachers, versus non-teachers.''
"We'll look to see whether teachers who are taking the assessment in their own area of expertise do better than teachers who are taking the assessments out of field. That way, we hope to identify possible weaknesses.''
Added Mr. Sykes: "There's a great shibboleth that strong subject-matter knowledge is all that's needed to teach. Let's start to test that assumption.''
"This effort has got to legitimize the concept of expertise in teaching,'' he said, "because that has been a millstone around this profession's neck for years.''
Mr. Shulman predicted that it will be four years before the first teacher can walk into an assessment center and take a board exam, and many more years before the full range of tests is in place.
One possible vision of the assessments, presented by the researchers, is a three-stage process: part one would focus on understanding the content of the subject matter; part two would examine the capacities needed to teach that content; and part three would involve direct observations of practice by carefully trained observers. So far, the research has focused primarily on part two.
"Our intention,'' Mr. Shulman said, "is to go through another full cycle in two more subject areas that are sufficiently different from elementary math and secondary-school social studies that they're likely to require different specifications.''
"Right now, we're thinking about elementary reading and writing and secondary biology,'' he said.
Until those are completed, he added, "it's going to be hard to know how generalizable a given prototype is, outside of that subject area. We're already getting a fairly clear notion that you can't generalize from elementary math to secondary history.''
Mr. Shulman argued, however, that the whole certification process cannot rest on assessment alone.
"We're beginning to think more and more about the need to complement assessment—under controlled conditions—with the documentation of performance during a residency,'' he said.
Studies of what a residency or internship in teaching might look like are currently under way. But, he said, "there's essentially nothing on how you would collect data on the characteristics of accomplishments during a residency.''
If the board could devise a certification process that linked performance on an assessment with documented performance in the field, he argued, it would "be superior to the assessments in any other profession.''
Build in Flight
An uneasy alliance that includes the teachers' unions, governors, businessmen, and legislators is currently supporting the creation of a national board, and many of its members are under political pressure to deliver relatively quick results.
Mr. Shulman said, however, that he is not feeling a lot of time pressure. Carnegie paid for the project, he said, "because it is research worth supporting, whether or not there is a national board.''
"The work I'm doing is not dependent on the existence of a board, any more than the board will be dependent on our work.''
"What we're betting on,'' he added, "is that doing the slow work of getting the prototypes established will really accelerate the subsequent development.''
One problem is trying to develop an assessment—or even pieces of an exam—without a complete map of what knowledge teachers need to possess, either generically or within particular content areas.
Many people, including Mr. Shulman, have been working to codify or delineate a knowledge base for teaching.
"The process of developing the assessment is itself an effort to begin to codify the knowledge base,'' Mr. Sykes said.
Mr. Shulman predicted that, in the next four to five years, a number of competing codifications will surface. Even so, he noted, the first board-administered tests are sure to be "terribly imperfect,'' in part because they will be based on a "very shaky'' knowledge base.
But one thing is certain, Mr. Sykes said: "Teachers are so defensive about these tests now that, if the early wave of test-takers don't walk out and say to their colleagues—'Geez, that was terrific; it was interesting; I learned something from it; it really felt like teaching; yeah, you should know that stuff if you're going to teach; I'm not sure how well I did on it, but that was O.K.'—if that doesn't happen, this thing is dead in the water.''
"If it's another, 'Come in, sit down, take a standard kind of test that they've all been used to,''' he argued, "forget it.''
Ultimately, Mr. Haertel said, "the proof of the examination will be if board-certified teachers are found, in practice, to be highly regarded by their colleagues and to produce superior results with students.''
But even the best test—and the best national board—cannot by themselves reform teaching and the problems of schools, Mr. Shulman cautioned.
"If five years from now all that has happened as a result of all these reform documents is the design of the world's state-of-the-art assessment,'' he said, "then I will declare myself as having wasted the last five or six years, and so will my colleagues.''
"That simply is not the answer,'' he added. "This activity makes sense only in conjunction with quite serious reconstruction of teacher preparation and of the character of the workplace. Absent those two parallel developments, the board will be of very little moment.''