Assessment Opinion

What’s a High-Quality Assessment Item?

By Robert Rothman, June 27, 2016

In case you haven’t noticed, the debate over testing has become quite heated. Parents, teachers, public officials, and advocates have been arguing over how many tests students have to take, how long the tests are, how the results should count, and much else.

The Every Student Succeeds Act (ESSA) settled some of the issues. The law maintains the requirement for testing every student in grades three through eight every year in reading and mathematics, and in at least three grade spans in science. But the law also provides funds for states to conduct audits of their testing systems (presumably, with an eye toward reducing unnecessary tests). And it authorizes states to expand accountability systems to include measures other than test scores and leaves it up to states to determine how much test scores will count as a measure of school performance.

Missing in much of the debate is any discussion of the tests themselves. It’s as though the debate is test or no test, as if one test is just the same as another. The real issue, though, is what do the tests measure and how do they measure it? Surely, parents and teachers would be more supportive of high-quality tests that actually measured what’s worth knowing; the problem is with tests that don’t provide useful information about meaningful learning and that create perverse incentives for schools to focus on knowledge and skills that are less than meaningful.

A new report from Understanding Language/Stanford Center on Assessment, Learning and Equity (UL/SCALE), a research center at Stanford University, provides important insights that could place the focus of the debate over testing where it belongs: on the quality of the tests themselves.

The report’s authors examined English language arts, mathematics, science, and history/social studies test items from a number of large-scale assessments, including state assessments, Advanced Placement tests, the National Assessment of Educational Progress (NAEP), and the Programme for International Student Assessment (PISA). The test items included both multiple-choice and open-ended formats.

The goal was to identify items that, in the authors’ words, provided a “valid assessment of central disciplinary understandings and skills.” That is, the items measured what was most important in each discipline, in ways that closely represent work in the discipline. Based on their analysis, the authors came up with these “design features” of high-quality assessment items:

  1. Items focus on core disciplinary knowledge, concepts, and/or skills. The items measure what is most important in each subject area, such as using evidence from texts in English language arts or emphasizing big ideas in science.
  2. Items integrate disciplinary knowledge, understandings, and/or skills. The items go beyond random facts or generic skills to integrate conceptual understanding and knowledge. In history, for example, the items measure knowledge within its relevant context, and that knowledge is integrated with disciplinary practices such as analysis of sources.
  3. The item prompt and materials (texts, other sources) are presented in a way that maximizes student access and engagement and reduces bias. The items are worded as simply, concisely, and clearly as possible, while minimizing so-called “construct-irrelevant” factors, such as overly complex writing demands, that distract from measurement of subject-area knowledge and skills. This feature is particularly important for English language learners, the authors note.
  4. In constructed-response and extended-response items, the item is open-ended enough to allow for a variety of student responses. The scoring criteria should examine students’ reasoning rather than whether the student arrived at a single right response.
  5. Items require students to work with source materials that are authentic to the discipline in a way that replicates the work of the discipline. In history, for example, students might interpret primary source materials to analyze a particular historical event. In science, students might manipulate variables to test a set of hypotheses.
  6. The use of technology-enhanced items is purposeful; that is, the technology elevates the cognitive complexity of the item or makes the item more accessible.

The good news is that items with these design features already exist in large-scale assessments now in use. The challenge for state policymakers is to use these criteria in examining their own assessments or choosing new ones. If that happens, this report will have helped advance the debate over assessment to a higher, more meaningful level.

The opinions expressed in Learning Deeply are strictly those of the author(s) and do not reflect the opinions or endorsement of Editorial Projects in Education, or any of its publications.