Education Opinion

Bad Science I: Bad Measures

By Robert E. Slavin — July 19, 2012

“My multiple choice test on bike riding was very reliable.
How come none of my kids can ride a bike?”

As an advocate for evidence-based reform in education, I’m always celebrating the glorious possibilities of basing educational policies and practices on the findings of “rigorous” research. Who could disagree? For this idea to have any bite, however, it is important to understand what I mean by “rigorous.”

In general, a rigorous study evaluating an educational program is one that compares, say, some number of teachers or schools in an experimental program using program X, to others in a control group of very similar characteristics using program Y, which may just be traditional education. Clear enough so far.

One problem arises when we ask, “On what measures should programs X and Y be compared?” Often, this debate revolves around measures felt to be insensitive to real learning gains, as when a study of a science program uses a multiple choice science test. Such studies tend to understate likely program effects.

An even bigger problem occurs when experimenters make up their own measures that are closely linked to the experimental program (X) but not the control program (Y). For example, imagine that a researcher develops a vocabulary-building treatment for English learners and then creates a test around the words emphasized in the program (words that may never even have been introduced to the control group). Or, imagine that a researcher develops a science program that spends twice as much time as usual on properties of light, and then develops a test with a heavy emphasis on the very concepts about light added in the extra time. Or a researcher introduces a topic earlier than usual (such as topics of mathematics in preschool) and then uses a measure of that content, to which the control group was never exposed. In each of these cases, the experimental group has a huge advantage over the control group, simply because it received a lot more teaching on the topic being assessed.
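The inflation this paragraph describes is easy to see in a toy simulation (my illustration, not from the article; all names and numbers here are invented). Suppose both groups gain the same amount of general skill, but the treatment group also gets a score boost on an experimenter-made test simply because it was taught the tested items:

```python
import random

random.seed(0)

def simulate_group(n=100):
    """General skill scores for one group; both groups learn equally well."""
    return [random.gauss(50, 10) for _ in range(n)]

control = simulate_group()
treatment = simulate_group()

# Neutral, widely accepted test: scores reflect general skill only.
neutral_gap = sum(treatment) / len(treatment) - sum(control) / len(control)

# Experimenter-aligned test: treatment students gain a bonus purely because
# the tested content was taught to them and never shown to the control group.
taught_content_bonus = 15  # assumed size of the exposure advantage
aligned_treatment = [s + taught_content_bonus for s in treatment]
aligned_gap = (sum(aligned_treatment) / len(aligned_treatment)
               - sum(control) / len(control))

print(f"Gap on neutral test: {neutral_gap:.1f}")
print(f"Gap on aligned test: {aligned_gap:.1f}")
```

The entire difference between the two reported “effects” comes from content exposure, not from the program teaching anyone more effectively — which is exactly why aligned measures overstate impact.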

There are simple solutions to this problem: hold the content of instruction constant while varying the methods, or use widely accepted measures not developed by the experimenter. Studies using measures that are fair to both the experimental group and the control group tend to report much smaller impacts, but those impacts are a lot more believable than ones from studies using measures slanted toward the experimental treatment.

Illustration: Slavin, R.E. (2007). Educational research in the age of accountability. Boston: Allyn & Bacon. Reprinted with permission of the author.

Next Week: Bad Science II: Brief, Small, and Artificial Studies

The opinions expressed in Sputnik are strictly those of the author(s) and do not reflect the opinions or endorsement of Editorial Projects in Education, or any of its publications.