Comparing Paper-Pencil and Computer Test Scores: 7 Key Research Studies

By Benjamin Herold, February 04, 2016

News that millions of students who took the PARCC exams on computers tended to score worse than those who took the tests on paper raises an important question:

Do the computer-based exams that are increasingly prevalent in K-12 measure the same things as more traditional paper-based tests?

Read Education Week’s coverage: PARCC Scores Lower for Students Who Took Exams on Computer

Broadly speaking, it’s a dilemma that researchers and psychometricians have been wrestling with for at least the past 20 years, said Derek Briggs, a professor of research and evaluation methodology at the University of Colorado at Boulder.

“But it’s really hit within the last two years, with Smarter Balanced and PARCC and even states that are going online with their own versions [of those exams],” said Briggs, who serves on the technical advisory committees for both the Partnership for Assessment of Readiness for College and Careers and the Smarter Balanced Assessment Consortium. Both consortia created exams aligned with the new Common Core State Standards that were administered across multiple states during the 2014-15 school year.

On one hand, Briggs noted, computer- and paper-based versions of an exam shouldn’t necessarily be expected to measure the same things, or have comparable results. Part of the motivation for pouring hundreds of millions of federal dollars into the new consortia exams, after all, was to use technology to create better tests that elicit more evidence of students’ critical thinking skills, ability to model and solve problems, and so forth.

But the reality is that in some states and districts, the technology infrastructure doesn’t exist to support administration of the computer-based exams. Not all children have the same access to technology at home and in school, nor do their teachers use technology in the classroom in the same ways, even when it is present. And some students are much more familiar than others with basic elements of a typical computer-based exam’s digital interface: how to scroll through a window, how to use word-processing features such as cutting and pasting, and how to drag and drop items on a screen, for example.

As a result, there is a mounting body of evidence that some students tend to do worse on computer-based versions of an exam, for reasons that have more to do with their familiarity with technology than with their academic knowledge and skills.

For states and school districts expected to use the results of exams such as PARCC to make instructional decisions, for accountability purposes, and possibly even as a graduation requirement and evaluation measure, that’s a big problem.

For psychometricians, it’s a big, multi-pronged challenge.

The first step is to figure out the exact cause of the differences, said Lauress L. Wise, the immediate past president of the National Council on Measurement in Education, which sets standards for best practice in assessment. That can be trickier than it sounds, especially because differences between the student populations who take different versions of an exam may play a large role in any score discrepancies by format.

From there, Wise said, it’s often a matter of making the right adjustment to students’ scores.

“Kids tend to score a little bit higher on paper and pencil, and [the differences show up] a bit more frequent on open-ended questions that require more comprehensive responses and in subjects like geometry, where it’s a little easier to see and manipulate [problems] on paper than on a computer,” Wise said.

“The tendency is to make an adjustment, even though it means that kids who gave the same answer in one [format] might get a little higher score than kids who gave the same answer on [another format.]”

For a deeper understanding of the issues behind the type of “mode effect” that appears to be a widespread problem with PARCC results, here are seven research studies worth reading.

1. Online Assessment and the Comparability of Score Meaning (ETS, 2003)

Written by Randy Elliot Bennett, one of the leaders in the field, this overview explores a range of mode-comparability issues. “It should be a matter of indifference to the examinee whether the test is administered on computer or paper, or whether it is taken on a large-screen display or a small one,” Bennett wrote more than a decade ago.

“Although the promise of online assessment is substantial, states are encountering significant issues, including ones of measurement and fairness,” the paper reads. “Particularly distressing is the potential for such variation [in testing conditions] to unfairly affect population groups, such as females, minority-group members, or students attending schools in poor neighborhoods.”

2. Maintaining Score Equivalence as Tests Transition Online: Issues, Approaches, and Trends (Pearson, 2008)

The authors of this paper, originally presented at the National Council on Measurement in Education, highlight the “mixed findings” from studies about the impact of test-administration mode on student reading and math scores, saying they “promote ambiguity” and make life difficult for policymakers.

The answer, they say, is quasi-experimental designs carried out by testing entities (such as state departments of education). The preferred technique, the paper suggests, is a matched-samples comparability analysis, through which researchers are able to create comparable groups of test-takers on each mode of administration, then compare how they performed.
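To make the idea of a matched-samples comparison concrete, here is a minimal sketch in Python. It is not the procedure from the Pearson paper. It assumes a simple exact match on two hypothetical background fields (`grade` and `prior_band`), and the student records are invented purely for illustration. Real comparability studies would likely match on more characteristics and use more careful statistical methods.

```python
from collections import defaultdict
from statistics import mean


def matched_mode_effect(paper, computer, keys=("grade", "prior_band")):
    """Estimate a mode effect (computer minus paper) from exactly matched pairs.

    Each record is a dict holding the matching keys plus a "score" field.
    All field names are hypothetical, chosen only for this illustration.
    """
    # Bucket paper test-takers by their matching profile.
    pool = defaultdict(list)
    for student in paper:
        pool[tuple(student[k] for k in keys)].append(student["score"])

    # Pair each computer test-taker with one unused paper test-taker who
    # shares the same profile; students without a match are dropped.
    diffs = []
    for student in computer:
        candidates = pool[tuple(student[k] for k in keys)]
        if candidates:
            diffs.append(student["score"] - candidates.pop())

    if not diffs:
        raise ValueError("no matched pairs found")
    return mean(diffs), len(diffs)


# Invented records, for illustration only (not real test data).
paper = [
    {"grade": 5, "prior_band": "high", "score": 750},
    {"grade": 5, "prior_band": "low", "score": 700},
    {"grade": 6, "prior_band": "low", "score": 710},
]
computer = [
    {"grade": 5, "prior_band": "high", "score": 744},
    {"grade": 5, "prior_band": "low", "score": 698},
    {"grade": 7, "prior_band": "low", "score": 720},
]

effect, n_pairs = matched_mode_effect(paper, computer)
print(effect, n_pairs)
```

A negative result means the matched computer test-takers scored lower, on average, than their paper counterparts. Because unmatched students are dropped, the estimate describes only the kinds of students who appear in both groups.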

3. Does It Matter If I Take My Mathematics Test on Computer? A Second Empirical Study of Mode Effects in NAEP (Journal of Technology, Learning, and Assessment, 2008)

Randy Elliot Bennett is also the lead author on this paper, which looked at results from a 2001 National Center for Education Statistics investigation of new technology for administering the NAEP math exam.

“Results showed that the computer-based mathematics test was significantly harder statistically than the paper-based test,” according to the study. “In addition, computer facility predicted online mathematics test performance after controlling for performance on a paper-based mathematics test, suggesting that degree of familiarity with computers may matter when taking a computer-based mathematics test in NAEP.”

4. The Nation’s Report Card: Writing 2011 (NCES, 2014)

As NCES moved to administer its first computer-based writing assessment, it also tracked in this study how 24,100 8th graders and 28,100 12th graders performed. Doug Levin, then the director of the State Educational Technology Directors Association, summed up the findings well in this 2014 blog post, which reads:

“Students who had greater access to technology in and out of school, and had teachers that required its use for school assignments, used technology in more powerful ways” and “scored significantly higher on the NAEP writing achievement test,” Levin wrote. “Such clear and direct relationships are few and far between in education—and these findings raise many implications for states and districts as they shift to online assessment.”

5. Performance of Fourth-Grade Students in the 2012 NAEP Computer-Based Writing Pilot Assessment (NCES, 2015)

This working paper found that high-performing 4th graders who took NAEP’s computer-based pilot writing exam in 2012 scored “substantively higher on the computer” than similar students who had taken the exam on paper in 2010. Low- and middle-performing students did not similarly benefit from taking the exam on computers, raising concerns that computer-based exams might widen achievement gaps.

Likely key to the score differences, said Sheida White, one of the report’s authors whom I interviewed this month, is the role of “facilitative” computer skills such as keyboarding ability and word-processing skills.

“When a student [who has those skills] is generating an essay, their cognitive resources are focused on their word choices, their sentence structure, and how to make their sentences more interesting and varied - not trying to find letters on a keyboard, or the technical aspects of the computer,” White said.

6. Mathematics Minnesota Comprehensive Assessment-Series III (MCA-III) Mode Comparability Study Report (Minnesota Department of Education & Pearson, 2012)

This state-level examination of mode effects on exams administered in spring and summer of 2011 used the matched-samples comparability analysis technique described above.

“Although the results indicated the presence of relatively small overall mode effects that favored the paper administration, these effects were observed for a minority of items common to the paper and online forms,” the study reads.

7. Comparability of Student Scores Obtained from Paper and Computer Administrations (Oregon Department of Education, 2007)

This state-level mode-comparability study looked across math, reading, and science tests administered both by computer and by paper.

“Results suggest that average scores and standard errors are quite similar across [computer] and paper tests. Although the difference were still quite small (less than a half a scale score point), 3rd graders tended to show slightly larger differences,” the paper reads. “This study provides evidence that scores are comparable across [Oregon’s computer] and paper delivery modes.”

Photo: A student at Marshall Simonds Middle School in Burlington, Mass., reviews a question on a PARCC practice test before 2014 field-testing of the computer-based assessments. (Gretchen Ertl for Education Week, file)

Library intern Connor Smith provided research assistance.

An earlier version of this story incorrectly identified researcher Randy Elliot Bennett of ETS.

A version of this news article first appeared in the Digital Education blog.