The headline in EdSurge’s newsletter was “HUGE STUDY VALIDATES ALGEBRA PROGRAM.” Their blurb:

“It could be the blended structure, or the product, or some other factor that made the difference, but one way or another a bunch of kids using Cognitive Tutor kicked butt in one of the biggest blended studies to date, conducted by the DOE and RAND corporation.”

I followed the link above to Carnegie Learning’s website about the study. Carnegie Learning publishes Cognitive Tutor, which is a computer-based, adaptive learning tool that assesses student competencies and provides instruction, problems, and hints designed to optimize individual student learning. The site about the study is AlgebraEffectiveness.org and the headline is “DOE/RAND Study Confirms Algebra Learning Doubles.”

These are pretty promising headlines—I’d love to see a curricular approach that doubles algebra learning—so I dashed off to the DOE/RAND study to read the details. As is often the case, reading the study is deflating after such triumphant headlines. The results from the DOE/RAND study are encouraging, but certainly more mixed than “product doubles learning.”

First, some kudos. The RAND team did a lovely job with the study, which is methodologically rigorous and carefully, fairly written. The study is a randomized trial of the use of Cognitive Tutor:Algebra involving 147 schools in 7 states. Kudos to the DOE for funding large-scale, real-world efficacy trials, and kudos to Carnegie Learning for putting the product up for rigorous third-party evaluation. (And kudos to the folks at EdSurge for pointing people to an important study, even if I’ll quibble with their headline.)

Let me try to summarize the study and findings, and then suggest why I think the top level message is “mixed, but encouraging findings for Cognitive Tutor:Algebra and blended learning approaches” rather than “product doubles learning.”

It’s probably best to understand the study as four studies: one study per year over two years in high school and one study per year over two years in middle schools. 147 schools representing over 18,000 students participated. Schools were matched in pairs with other very similar schools. In each pairing, one school was randomly assigned to be the treatment school, where they used Cognitive Tutor:Algebra in their classes after their teachers received a short summer training. The other school taught algebra as they always had. All students were given multiple-choice pre-tests and post-tests at the beginning and end of the year, and the study compared the gains of students in the control conditions with gains of students in the treatment conditions.

The first caveat in the report comes early on. Normally in a randomized trial, you report briefly on the effects of your randomization. Ideally, on all observable measures, students in control and treatment conditions should be identical. This section is usually brief because randomization often works. In this case, however, in both middle and high school studies, the treatment students scored lower on the pre-tests than the students in the control condition. In the high school, treatment students scored about .12 standard deviations below control students, and in the middle school, treatment students scored about .33 standard deviations below control students.

This is important: arguably, students in the treatment groups are not exactly the same as students in the control group. Ideally, these groups would be as identical as possible so we can attribute any differences in algebra score gains entirely to the intervention (the Cognitive Tutor curriculum). Now, we have to wonder if any differences that we see are not because of the intervention, but because the kids were fundamentally different to begin with.

After examining implementation data, the report authors concluded that this was just random luck rather than some kind of systematic implementation issue. While the high school difference is considered “within acceptable limits,” the discussion section of the report begins: “It is necessary to be cautious in interpreting these results because mean student pretest scores were lower in the treatment group than the control group.” (I think it might have been helpful for the authors to explain exactly how readers should be cautious. It also might be helpful, if this is important, to repeat these cautions in the conclusion.)

So, cautiously, we move on to evaluate the findings. Here’s the most important pair of tables.

Table 4 is data on the high school studies. Table 5 is data on the middle school studies. Cohort 1 is the first year; cohort 2 is the second year. The *estimates* are the estimates of the differences in gains between the treatment group and the control group, as measured in standard deviation units. Positive estimates mean that the treatment (Cognitive Tutor) group did better. The p-value summarizes a test that evaluates whether the differences between treatment and control are statistically significant. When the p-value is above .05, many researchers will treat the estimates as zero, or assume no difference between the control and treatment groups. There are four models, because each statistical model evaluates the data in a different way. Model 1 doesn’t control for any predictors, including students’ previous performance. This model is a kind of straight difference between treatment and control. Model 4 controls for pre-test scores and a bunch of other things, and it’s the most appropriate model to evaluate.
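For readers who want to see the “.05 rule” in action, here is a minimal sketch of how an estimate and its standard error turn into a p-value. It assumes a simple normal approximation, which is not necessarily how the study’s models compute their p-values. The function names and the example numbers are hypothetical and are not taken from the report.

```python
from statistics import NormalDist


def two_sided_p_value(estimate, standard_error):
    """Two-sided p-value for an estimate, using a normal approximation.

    This is an illustrative simplification, not the study's actual method.
    """
    z = estimate / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def is_significant(estimate, standard_error, alpha=0.05):
    """Apply the conventional threshold: significant only if p < alpha."""
    return two_sided_p_value(estimate, standard_error) < alpha


# Hypothetical numbers for illustration only (NOT from the report):
# a small estimate relative to its standard error, and a large one.
small = {"estimate": 0.05, "standard_error": 0.10}
large = {"estimate": 0.30, "standard_error": 0.10}

print(is_significant(**small))  # small relative to its noise: treated as zero
print(is_significant(**large))  # large relative to its noise: treated as real
```

The point of the sketch is that the same estimate can be “significant” or not depending on how noisy it is, which is why an estimate with a p-value above .05 gets read as “no detectable difference” rather than as a small real effect.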

So in year one, without controlling for covariates, students in high schools using Cognitive Tutor scored .19 standard deviation units below students in their regular algebra class. Controlling for covariates, we estimate that students using Cognitive Tutor would score .1 standard deviation units below control students, though this difference is not statistically significant. We might say something like “the impact trends negative, but we should treat it as zero.” That is not a triumph. In the middle school in year one, without controlling for covariates, students using Cognitive Tutor scored .2 standard deviation units below students in regular classes; controlled for covariates, the scores were identical. Again, not a triumph.

In year two, students in the treatment group did better. In high school, students using Cognitive Tutor scored .21 standard deviation units better than students in the control group. In middle school, there were no significant differences between the two groups, but the results trended positive.

Now, you might think that the improvements in year 2 were due to teachers better understanding how to use Cognitive Tutor, but that doesn’t appear to be the case. Since teachers move around, some teachers in year 2 were actually in their first year of using Cognitive Tutor. The study authors compared the results of teachers in year 2 who had used Cognitive Tutor the year before and those who hadn’t and found no difference. In other words, there was no evidence that the gains in year two were due to teachers getting better at using the new algebra approach. It seems better just to say that in year 1, the program did not work, and in year 2, it did in high school, but not middle school.

So here’s the round up: **High School**: *Year 1*, no difference, trending negative for Cognitive Tutor. *Year 2*, students using Cognitive Tutor do better. **Middle School**: *Year 1*, no difference. *Year 2*, no difference, trending positive for Cognitive Tutor.

Where does the concept of doubling come from? So a typical gain between pre-test and post-test is about .21 standard deviation units. In year two of the study, students in high school, in the treatment group, scored .21 standard deviation units above that. So they didn’t “double” in the sense that students completed Algebra I and Algebra II in the same year; they doubled in the sense that the gain scores between a pre and post test are about twice as high for Cognitive Tutor students in one year of the study. The “doubling” language doesn’t come directly from the study; that’s from the curriculum publisher. The study authors frame the gains this way: “The effect size of approximately 0.20 is educationally meaningful - equivalent to moving an algebra I student from the 50th to the 58th percentile.”
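The two framings above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming test scores are roughly normally distributed; the function name is my own and the only inputs are the figures quoted in this post.

```python
from statistics import NormalDist


def percentile_after_shift(effect_size, start_percentile=50.0):
    """Percentile reached by a student who starts at `start_percentile`
    and moves up by `effect_size` standard deviations.

    Assumes scores follow a normal distribution (an assumption of this sketch).
    """
    std_normal = NormalDist()
    start_z = std_normal.inv_cdf(start_percentile / 100)
    return 100 * std_normal.cdf(start_z + effect_size)


# The study authors' framing: an effect size of about 0.20.
print(round(percentile_after_shift(0.20)))  # 58

# The publisher's framing: typical gain of .21 plus an extra .21.
typical_gain = 0.21
treatment_effect = 0.21
ratio = (typical_gain + treatment_effect) / typical_gain
print(ratio)  # 2.0
```

Both numbers describe the same result. “Doubles” is a ratio of two small gains, while “50th to 58th percentile” describes where a typical student ends up relative to peers.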

So here’s my top level summary: “Middle school students using Cognitive Tutor performed no better than students in a regular algebra class. In a two year study of high school students, one year Cognitive Tutor students performed the same as students in a regular algebra class, and in another year they scored better. In the year that students in the Cognitive Tutor class scored better, the gains were equivalent to moving an algebra I student from the 50th to the 58th percentile.”

I suppose that’s not quite as catchy as “kicked butt” or “doubled their learning.”

(If anyone out there thinks I’ve mis-stated the findings, I welcome alternative perspectives in the comments, or send me a link on Twitter.)

Now my summary isn’t necessarily a strike against Cognitive Tutor. They did demonstrate the efficacy of their product for some students in one of two years of a very carefully conducted experimental trial. New approaches are always at a disadvantage to established approaches since they require teachers and students to learn new patterns, tools, and routines, so positive results early on might suggest even greater gains for schools that commit to conducting professional learning, looking at student work, and lesson study around promising new approaches. If you are an advocate for the potential of blended learning, this study was good news for you.

But where I’m flummoxed is how we are supposed to provide practicing educators with the tools to evaluate these kinds of findings. I know you can’t sell a curriculum product or a newsletter with headlines like “HUGE STUDY PARTIALLY VALIDATES ALGEBRA PROGRAM, PARTIALLY DOESN’T.” I don’t expect Carnegie Learning to build a web site that says “Major study shows no significant impact of Cognitive Tutor in middle schools!” But it also isn’t clear to me who in the system is incentivized to provide disinterested, broadly-accessible, readable summaries of important studies that help educators make careful decisions with scarce resources based on careful interpretation of existing evidence.

If we could make that happen, that would kick butt.

*For regular updates, follow me on Twitter at @bjfr and for my publications, C.V., and online portfolio, visit EdTechResearcher.*