When Tests Don’t Measure Well What They Appear to be Measuring

By Deborah Meier, April 23, 2009

Dear Diane,

Your Tuesday column set out good reasons for rejecting the use of standardized test scores to reward teachers. Every year I sent a letter home to parents on the 10 factors that influenced their child’s reading scores. I’ll send it to you one day soon; it still applies.

Incidentally, some readers may not realize that the high-scoring nations, if they use standardized tests at all, use a different kind from those you were describing. Their exams often consist of written and oral cross-examination, with grades determined by well-qualified judges. (The international scores we read about, readers should realize, are the results of low-stakes tests given on a sampled basis.)

In short: the varied U.S. tests whose scores we so often hear about don’t measure well what they appear to be measuring. I’m not talking about the short-term versus long-term memory issue that lies behind the charming little comedy routine about the five-minute university (which just tests you on what college students remember two years later). The best rationale for national standards and tests lies precisely in the lack of equivalence among current “standardized” state tests. If standardized tests were used properly, two different reading tests for students in the seventh month of 4th grade would be largely interchangeable, unless there was some fundamental philosophical disagreement about the nature of reading. The sole merit of such tests is that one is comparing oranges to oranges. (This is also another good reason not to test young children in the process of learning to read, where scores must reflect the method of teaching, not the achievement of reading.)

Psychometric design of multiple-choice items requires some reasonable alternate answers that pick up reasonable alternate viewpoints, rather than simple-minded rights and wrongs. It also requires test-makers to eliminate questions that don’t properly discriminate. Note: “Discriminate” here has a narrow psychometric meaning. (Jay Rosner of Princeton Review found unused passages and questions in the pool of potential SAT items that black students got right more often than white students; those items didn’t discriminate properly “statistically.”)

As E.D. Hirsch and I both note, such tests also abound in passages that require knowledge to which neither home nor school has equally exposed kids (a problem that Hirsch and I want to solve in different ways). All of these “faults” are built into the requirement to rank-order along a particular curve. These are not designed as pass/fail tests. Reliable psychometrics can only rank you by percentile, nothing more and nothing less: X percent of students taking the test at the same time and under the same conditions got a higher number of “right” answers. There are no statistical methods to arrive at proficiency, etc. Those are “subjective,” i.e., human judgments.

Data distortion—as anyone studying our current economic crisis can tell us—is a serious problem. As in economics, so, too, in education. The “way” we report data can also distort it, as the term “grade level” has done. Which is why I am so often baffled about international comparisons: who is quoting what? (E.g. I’m skeptical when I note that China scored high on one of the recent tests, given that a high percentage of kids in China aren’t in schools at all, above all in rural China—which is still immense.)

When it comes to NYC test scores, I’m more of an expert. Or I was, when tests used to come to schools with the publisher’s background information, including a warning not to prep kids. And before we began to believe in test miracles (scores that went up by leaps and bounds one year, and down another). I used to be amused at how schools that contained district gifted programs bragged about getting higher scores than their sister schools, a connection that reporters somehow overlooked in their stories.

Yes, Diane, the intellectual discipline needed to exercise good judgment can be the enemy of improved standardized test scores. The Coalition of Essential Schools embarked 20 years ago on a different path—which included “standards” of a different sort. We “invented” examinations that sought to judge students by publicly accessible exercises of judgment by adults. A panel “judged”—and documented—how students defended their actual work in a variety of fields. They did so in ways that seemed appropriate both to the “discipline” and the mission of the school.

The Rhee/Klein/Duncan/et al. traveling show would be amusing if it weren’t potentially influential. Our educational leaders have an immense capacity to promote their views with the financial support of precisely the big-money boys whose accountability to the American public in their own sphere of expertise has proven to be so shamefully inept. Inept at best, and corrupt at worst. They have transferred the same mindset now to a field they know precious little about. Juan Gonzalez’s “exposé” about the funds provided to Al Sharpton’s alliance (National Action Network) by hedge-fund allies of Bloomberg helps explain that “odd” coalition of test “believers.” But we can’t all be attending closely to everything, and repeated untruths or half-truths can become “common sense.”

Mike Rose’s latest blog posts are treasures, and belong (alas) in an entirely different world from the one we mostly blog about, Diane. Mike’s close attention to children’s learning seems passé. His original book, “Lives on the Boundary,” is a must re-read. He describes why the kind of education we are intensifying today led to the high college dropout rate in the ’70s and ’80s, when he wrote it. The students who were arriving, even at selective colleges like UCLA, were woefully unprepared for the fundamental work of “higher” education. KIPP, I fear, will discover Rose’s point too late. Gerald Graff’s “Clueless in Academe” is a newer and differently oriented book making a similar point, and it includes a flattering chapter on the late CPESS (Central Park East Secondary School). Yes, we could all get higher “scores” as we become a stupider nation.

The Manhattan Institute study is just another example of how we selectively pick and choose “data,” just as Goldman Sachs’ latest data, heralding better times for them, is based on a decision to change the calendar for comparison purposes! It reminds me of how we increased high school attendance some years ago: by counting attendance in third period instead of first.

Best,
Deb

The opinions expressed in Bridging Differences are strictly those of the author(s) and do not reflect the opinions or endorsement of Editorial Projects in Education, or any of its publications.