Accountability Opinion

Educational Testing: A Brief Glossary

By skoolboy — July 02, 2008 5 min read

While you’re waiting for Dan Koretz’ book on testing to arrive – I think eduwonkette and I should get some kind of consideration for shilling for this book so often here – here’s a brief skoolboy’s-eye view on testing. Actual psychometricians are welcome to correct what I have to say.

Tests are typically designed to compare the performance of students (whether as individuals, or as members of a group) either to an external standard for performance or to one another. Tests that compare students to an external standard are called criterion-referenced tests; those that compare students to one another are called norm-referenced tests. Even though criterion-referenced tests are intended to hold students’ performance up to an external standard, there is often a strong temptation to compare the performance of individual students and groups of students on such tests, as if they were norm-referenced.

A typical standardized test of academic performance will have a series of items to which students respond, generally in either a multiple-choice format or a constructed-response format, in which students write out their own answer rather than choosing from a list. There’s usually only one right answer to a multiple-choice item, whereas constructed-response items may be scored so that students get partial credit if they demonstrate partial mastery of the skill or competency that the item is intended to represent. For any test-taker, we can add up the number of right answers, plus the scores on the constructed-response items, to derive the student’s raw score on the test. A test with 45 multiple-choice items would have raw scores ranging from 0 to 45.
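As a rough illustration of the raw-score arithmetic described above, here is a minimal Python sketch. The answer key, the student's responses, and the partial-credit points are all made up for the example, and the function name `raw_score` is hypothetical; real tests may weight items differently.

```python
def raw_score(mc_responses, answer_key, cr_points):
    """Count correct multiple-choice answers and add constructed-response points."""
    if len(mc_responses) != len(answer_key):
        raise ValueError("responses and answer key must have the same length")
    mc_correct = sum(given == correct
                     for given, correct in zip(mc_responses, answer_key))
    return mc_correct + sum(cr_points)

# Hypothetical data: five multiple-choice items, two constructed-response items.
answer_key = ["B", "D", "A", "C", "A"]
student = ["B", "D", "C", "C", "A"]   # misses the third item
cr_points = [2, 1]                     # partial credit earned on each item

print(raw_score(student, answer_key, cr_points))  # 4 correct + 3 points = 7
```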

For individual test items, we can look at the proportion of test-takers who answered the item correctly, which is referred to as the item difficulty or p-value. This has nothing to do with the p-values used in tests of statistical significance; the p simply stands for the proportion of examinees who got the item right. Some test items are more difficult than others, and hence items will have varying p-values, with harder items having lower ones.
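To make the item p-value concrete, here is a small Python sketch that computes it from a table of scored responses. The data are invented (four examinees, three items, 1 for right and 0 for wrong), and `item_p_values` is a hypothetical helper name.

```python
def item_p_values(scored):
    """Return, for each item, the proportion of examinees who answered it correctly.

    `scored` is a list of rows, one per examinee, each a list of 0/1 item scores.
    """
    n_examinees = len(scored)
    # zip(*scored) turns rows (examinees) into columns (items).
    return [sum(item_column) / n_examinees for item_column in zip(*scored)]

# Hypothetical scored responses: rows are examinees, columns are items.
scored = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 0, 0],
]

p_values = item_p_values(scored)
print(p_values)  # [1.0, 0.5, 0.25]
```

In this made-up data the first item is the easiest (everyone got it right) and the third is the hardest.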

Raw scores are rarely interpretable, in part because they are a function of the difficulty of the items. For this reason, they are typically transformed into scale scores, which are designed to generate a score that will mean the same thing from one version of a test to the next, or from one year to the next. The scale for scale scores is arbitrary; the SAT is reported on a scale ranging from 200 to 800, whereas the NAEP scale ranges from 0 to 500.

The process of transforming raw scores into scale scores is computationally intensive, generally using a technique known as Item Response Theory (IRT), which simultaneously estimates the difficulty of an item, how well the item discriminates between high and lower performers, and the performance of the examinee. An examinee who successfully answers highly difficult items that discriminate between high and low performers will be judged to have more ability, and hence a higher scale score, than an examinee who gets the difficult items wrong.
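One common way to write down the relationship among these three quantities is the two-parameter logistic (2PL) model. The post does not say which IRT model any particular test uses, so treat the sketch below as an assumption for illustration only. It shows just the item response function, not the estimation step, and the item values are invented. Note that difficulty here sits on the same scale as ability, unlike the proportion-correct p-value.

```python
import math

def prob_correct(ability, discrimination, difficulty):
    """2PL item response function: chance that an examinee answers an item correctly."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))

# Hypothetical item: difficulty 1.0, discrimination 1.5 (illustrative values only).
for ability in (-1.0, 0.0, 1.0, 2.0):
    print(ability, round(prob_correct(ability, 1.5, 1.0), 3))
```

Under this model, an examinee whose ability equals the item's difficulty has a 50 percent chance of a right answer, and a larger discrimination value makes the probability climb more steeply as ability passes the difficulty. Operational scoring estimates the item parameters and the examinees' abilities jointly from all the responses, which is the computationally intensive part.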

There’s no one right way to transform raw scores into scale scores, and it’s always a process of estimation, which is sometimes obscured by the fact that scores are reported as definite quantities. (A little skoolboy editorializing here…) The expansion of testing hastened by NCLB has placed a lot of pressure on states, and their testing contractors, to construct scale scores for a test that represent the same level of performance from one year to the next (a process known as test equating). Much of this is done under great time pressure, and shielded from public view. The process is complicated by the fact that states typically don’t want to release the actual test items they use, because then they can’t use them in subsequent assessments as anchor items that are common across different forms of a test, since students’ performance on such items could change due to practice.

Some tests are vertically equated, which means that a given score on the fourth-grade version of a test represents the same level of performance as that same score on the fifth-grade version of the test. In a vertically-equated test, if the average scale score is the same for fourth-graders as it is for fifth-graders, we’d infer that the fifth-graders haven’t learned anything during fifth-grade.

Proficiency scores represent expert judgments about what level of scale score performance should describe a student as proficient or not proficient at the underlying skill or competency that the test is measuring. For example, NAEP defines three levels of proficiency for each subject at each of the grades tested (4th, 8th and 12th): basic, proficient, and advanced. Cut scores divide the scale scores into categories that represent these proficiency levels, with students classified as below basic, basic, proficient, or advanced. These proficiency scores do not distinguish variations in students’ performance within the category; one student could be really, really advanced and another just advanced, and whereas a scale score would record that difference, a proficiency score would simply classify both students as advanced. The fact that proficiency levels are determined by expert judgment, and not by the properties of the test itself, means that they are arbitrary; the level of performance designated as proficient on NAEP may not correspond to the level of performance designated as proficient on an NCLB-mandated state test. Many researchers (including Dan Koretz, eduwonkette, and me) are concerned that the focus on proficiency demanded by NCLB accountability policies has the unintended consequence of concentrating the attention of school leaders and practitioners on a narrow range of the test-score distribution, right around the cut score for the category of “proficient,” to the detriment of students who are either well below or well above that threshold. Such a focus is a political judgment, not a psychometric one, and there are arguments both for and against it.
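The way cut scores turn a scale score into a proficiency category can be sketched in a few lines of Python. The cut scores below are hypothetical numbers chosen for the example, not NAEP's or any state's actual values, and the sketch assumes that a score exactly at a cut counts as reaching the higher level.

```python
from bisect import bisect_right

# Hypothetical cut scores (NOT real NAEP or state values): the minimum
# scale score needed to reach basic, proficient, and advanced.
CUT_SCORES = [200, 240, 280]
LEVELS = ["below basic", "basic", "proficient", "advanced"]

def proficiency_level(scale_score, cuts=CUT_SCORES, levels=LEVELS):
    """Classify a scale score by counting how many cut scores it meets or exceeds."""
    return levels[bisect_right(cuts, scale_score)]

for score in (199, 240, 285, 340):
    print(score, proficiency_level(score))
```

With these made-up cuts, scale scores of 285 and 340 both come out as advanced, which is the loss of information within a category that the paragraph above describes.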

I’ll update this as more knowledgeable readers weigh in. If experts in measurement were to judge proficiency thresholds for knowledge about testing, I’d probably be classified as basic; Dan Koretz is definitely advanced. For a lively and readable treatment of these kinds of issues, get his book!

The opinions expressed in eduwonkette are strictly those of the author(s) and do not reflect the opinions or endorsement of Editorial Projects in Education, or any of its publications.
