A report analyzing Vermont’s pioneering assessment system has found severe problems with it and raised serious questions about alternative forms of assessment.
The Vermont system, which is being closely watched by educators around the country, is the first statewide assessment program to measure student achievement in part on the basis of portfolios.
But the report by the RAND Corporation, expected to be released this week, found that the “rater reliability” in scoring the portfolios--the extent to which scorers agreed about the quality of a student’s work--was very low. The researchers urged the state to release the assessment results only at the state level.
Daniel M. Koretz, a senior social scientist at RAND and the report’s author, said the low levels of reliability indicate that the scores are essentially meaningless, since a different set of raters could come up with a completely different set of scores.
“If you’re not rating reliably, you’re not rating,” he said. “You can’t measure anything unless you measure it reliably.”
The report recommends a number of changes that could boost reliability, including improving the training of teachers who rate the portfolios. Mr. Koretz noted that Vermont is considering taking steps in those directions.
But he cautioned that reliability might continue to pose a problem in Vermont’s program, as well as in other types of performance assessments and portfolios, in which teachers rate diverse student work.
Advocates of alternative assessment need to gather data on the quality of the new methods before using them to replace traditional tests, Mr. Koretz said.
“I hope this will cause them to be more realistic,” he continued. “People’s expectations [for alternative assessments] are too unrealistic, across the board. This is one signal for them to calm down.”
Commissioner of Education Richard P. Mills said that, in light of the RAND report, Vermont would report only statewide results of the portfolio portion of the assessment. Officials originally had planned to make public the results for each supervisory union, or group of districts.
Mr. Mills added that the state is committed to continuing with the new system and plans to seek a doubling of the program’s budget next year. Despite the reliability problem, he suggested, the assessment program has reaped other benefits, including improvements in curriculum and instruction.
“Teachers report that Vermont is changing curriculum, assessment, and instruction all at the same time, and for the better,” Mr. Mills said. “They also have told me in great detail how terribly difficult this is, and how important it is for us to persist and get it right. We will.”
Created in 1988, the assessment program is Vermont’s first statewide testing program. It has drawn national attention at a time when a number of states and school districts are considering alternatives to traditional multiple-choice tests.
Under the program, 4th and 8th graders are assessed in writing and mathematics in three ways: a uniform test, which is a standardized test consisting of both multiple-choice and open-ended questions; a portfolio of classroom work completed throughout the year; and a “best piece” chosen by the student from the portfolio.
The program began on a pilot basis in the 1990-91 school year in 144 schools, and results were released statewide. It expanded to most, but not all, schools in the state last year.
Low Reliability Found
In a preliminary report on the first year of implementation, RAND researchers found that teachers and administrators considered the assessment time-consuming. But the study, conducted for the federally funded National Center for Research on Evaluation, Standards, and Student Testing, also found that educators felt the program had led to positive changes in curriculum and a greater understanding of student abilities. (See Education Week, Sept. 9, 1992.)
The new report, which examined how well the system fared as a measure of student performance, offers a far less positive picture.
Unlike traditional tests, in which computers register whether a student has filled in the correct bubble, the portfolios were scored by teachers, who evaluated each portfolio according to several criteria on a four-point scale.
For writing, the criteria consisted of purpose, organization, details, voice, and usage. For math, they were language of math, math representations, presentation, understanding of math, procedures, decisions, and outcomes.
The study used a standard statistical measure known as a reliability coefficient, which measures the extent to which two raters rank a student’s work the same. Under such a measure, no agreement would be zero, while total agreement would be 1.00.
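As a rough illustration of how such a coefficient can be computed, the sketch below treats it as the Pearson correlation between two raters' scores for the same set of portfolios. That choice is an assumption: the article does not say which statistic the RAND researchers used, and a correlation can also fall below zero when raters systematically disagree. The function name and the scores are hypothetical, not data from the Vermont program.

```python
from math import sqrt


def reliability_coefficient(rater_a, rater_b):
    """Pearson correlation between two raters' scores for the same portfolios.

    Assumption: this stands in for the "reliability coefficient" described
    in the article; the report's exact statistic is not specified there.
    """
    if len(rater_a) != len(rater_b) or len(rater_a) < 2:
        raise ValueError("need two equal-length score lists with at least 2 items")

    n = len(rater_a)
    mean_a = sum(rater_a) / n
    mean_b = sum(rater_b) / n

    # Co-variation of the two raters around their own mean scores.
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(rater_a, rater_b))
    var_a = sum((a - mean_a) ** 2 for a in rater_a)
    var_b = sum((b - mean_b) ** 2 for b in rater_b)

    if var_a == 0 or var_b == 0:
        raise ValueError("undefined when a rater gives every portfolio the same score")

    return cov / sqrt(var_a * var_b)


# Hypothetical four-point-scale scores for the same portfolios.
same_scores = reliability_coefficient([1, 2, 3, 4, 2, 3], [1, 2, 3, 4, 2, 3])
mixed_scores = reliability_coefficient([1, 2, 3, 4], [2, 1, 4, 3])
```

With these made-up scores, two raters who agree on every portfolio produce a coefficient of 1.00, while the second pair, who are off by one point on each portfolio, land at about 0.6.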
The study found that reliability in the writing portfolios was quite low. Depending on the criterion on which the raters were evaluating, reliability coefficients ranged from 0.28 to 0.57. The reliability coefficients of the state’s uniform test in writing, by contrast, ranged from 0.67 in grade 8 to 0.75 in grade 4.
“For a variety of reasons, such as the variability of tasks used, it may be unrealistic to expect a portfolio program to reach as high a level of reliability as a standardized performance-assessment program ...,” the report states. “However, the reliabilities obtained in Vermont in 1992 are sufficiently low to limit severely the uses to which the results can be put.”
On the positive side, the study also found no evidence that teachers assigned higher or lower scores to their own students than did other raters.
In math, the study found similarly low reliability coefficients. Using a different measure of reliability, however, the study found that math raters assigned the same score to a portfolio about 60 percent of the time, compared with less than half the time for writing portfolios.
This discrepancy occurred, the report notes, because math scores tended to be concentrated at one or two points on the four-point scale.
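A small worked example shows how the two measures can diverge when scores bunch up. The sketch below compares the share of portfolios given identical scores with a correlation-based coefficient. The scores are invented for illustration and are not Vermont data, and using the Pearson correlation as the coefficient is an assumption, since the article does not name the report's statistic.

```python
from math import sqrt


def exact_agreement(rater_a, rater_b):
    """Share of portfolios to which both raters gave the identical score."""
    matches = sum(1 for a, b in zip(rater_a, rater_b) if a == b)
    return matches / len(rater_a)


def pearson(rater_a, rater_b):
    """Pearson correlation, used here as an assumed reliability coefficient."""
    n = len(rater_a)
    mean_a = sum(rater_a) / n
    mean_b = sum(rater_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(rater_a, rater_b))
    var_a = sum((a - mean_a) ** 2 for a in rater_a)
    var_b = sum((b - mean_b) ** 2 for b in rater_b)
    return cov / sqrt(var_a * var_b)


# Hypothetical scores: nearly every portfolio gets a 2 on the four-point scale.
rater_a = [2, 2, 2, 2, 2, 2, 2, 2, 3, 2]
rater_b = [2, 2, 2, 2, 2, 2, 2, 2, 2, 3]

agreement = exact_agreement(rater_a, rater_b)  # 8 of 10 portfolios match
coefficient = pearson(rater_a, rater_b)        # slightly negative here
```

In this invented case the raters give the same score 80 percent of the time, yet the coefficient is about -0.11. Because almost everyone receives a 2, the raters match largely by default, and the only two portfolios on which either rater departs from a 2 are ones where they disagree.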
Inadequate Training Seen
In examining possible reasons for the low levels of reliability, the RAND report suggests that the complex scoring scales may have contributed to the problem.
State officials said they are considering simplifying the scoring system.
“It’s difficult for evaluators to use all the criteria, and rate them all separately,” said Douglas I. Tudhope, the chairman of the state board of education. “We have to look at that.”
On a related point, the report also suggests that the training of the raters may have been inadequate.
Because the goal of the assessment program was to improve professional development as well as to evaluate student performance, Mr. Koretz pointed out, the state trained as many teachers as possible.
Other school systems that have conducted portfolio assessments, such as Pittsburgh, attained much higher levels of reliability by employing a relatively small number of highly trained raters, he noted.
But the effect of Vermont’s policy, Mr. Koretz said, was to produce a number of raters who may have been unsure of what high levels of performance looked like. This was particularly true in math, he said, where teachers accustomed to traditional forms of instruction were unfamiliar with such terms as “math representation,” which indicates the extent to which students depict math concepts through graphs or drawings.
‘A Little Setback’
Vermont teachers involved in the project agreed that the training was inadequate.
“There wasn’t enough, and it came too late,” said Clare Forseth, a math teacher at Marion Cross Elementary School in Norwich. “What has to happen is that teachers have to actually understand the criteria enough to have students understand it.”
Commissioner Mills said that state officials would consider all of the report’s recommendations and make changes in the program. He said the state education department has asked RAND to evaluate the program again next year to determine if the reliability has improved.
But he stressed that the state is committed to portfolio assessment. He noted that the state board voted last week to request a doubling, to $841,000, of the program’s budget.
“This was a little setback,” said Mr. Tudhope. “You can’t expect these things to be perfect the first time around.”
A version of this article appeared in the December 16, 1992 edition of Education Week as “RAND Study Finds Serious Problems in Vt. Portfolio Program.”