With a federal testing deadline looming, now is the time for states to improve their science assessments.
Beginning just three years from now, with the 2007-08 school year, states will be required under the federal No Child Left Behind Act to test students in science at least once in each of three grade spans: 3-5, 6-9, and 10-12. These assessments, like those already in place for reading and math, are to be of high quality and aligned with each state’s academic standards. How well states meet this requirement will influence the breadth and depth of the science content students are taught. But there are indications that much work remains to be done.
When carefully designed and properly administered, large-scale state science tests have the potential to provide educators and the public with useful information about what students know and can do. Unfortunately, few states have developed high-quality science assessments that are closely aligned with challenging content standards.
Producing a test that can accurately measure performance against rigorous science standards is a complex task. What concerns us, however, are tests that present a misleading picture of students’ knowledge of science.
Although not all states provide publicly released science-test items or sample tests, we were able to collect and review content standards and released test items in science from 39 states. Of those, we subjected the test items from 22 of the more populous states to close scrutiny. Here are some of the problem areas we found:
Science “lite.” Many states do not appear to challenge students to learn science deeply. High school science tests are often based on middle-grades content, and middle-grades tests on elementary-grade content. This often occurs because the tests are built on general or vague content standards.
A good science assessment should include test items at varying levels of difficulty, so that achievement can be measured accurately across a wide range of student performance. Some easy items are to be expected, but it was nonetheless surprising to find that items presenting little challenge made up the preponderance of items on some tests.
The following example from a state high school graduation exam tests a simple concept:
As a football play begins, a lineman from one team pushes the opposing lineman backward. This is an example of
A. a balanced force
B. the force of gravity
C. an unbalanced force
D. the force of friction
While this test item has several technical and scientific flaws, perhaps a more serious concern is that the concept underlying it is more appropriate for middle-grades students and is based on a very general standard: “The student will understand concepts of force and motion.” A broad learning objective underlies the standard: “Relate Newton’s three laws of motion to real-world applications.”
Unclear standards. Clear and specific standards matter. “Describe common ecological relationships among species,” for example, could mean that students are to know that cats chase rats, or it could mean something far deeper. No performance level can be discerned from such a general standard.
Standards that have no clear interpretation give test developers little guidance about the expected level of student performance, and they permit topics to be tested without depth, or without consistency of depth across topics. Teaching science, writing science textbooks, and designing science tests based on vague standards can lead to instructional roulette, in which figuring out what students must learn becomes a game of chance.
Items that don’t require science knowledge. Some state assessments include information in the question stem of a test item that should already be known by the student. The item becomes not a test of science knowledge, but a test of how students use or process the information that has just been given.
In its worst manifestation, this “instructional first aid” relieves the science curriculum of the burden of actually teaching students anything specific. A question about the parts of a pair of scissors, from a 9th grade state assessment, illustrates the problem:
A lever is a bar that turns about a fixed point called a fulcrum. A pair of scissors is made of two levers that move in opposition [an illustration, with the parts of the scissors marked A-D, is shown]. Which of the following points is the fulcrum of the two levers?
A. point A
B. point B
C. point C
D. point D
Science as background scenery. Some states avoid directly testing whether students have learned meaningful foundation knowledge, aiming instead to measure how well students “relate and use knowledge.” Testing on use of knowledge, rather than possession of knowledge, can lead to test items that are poorly grounded in the content of science, as this writing prompt from an 8th grade state assessment illustrates:
Your school’s Academic Team has chosen Archimedes as its mascot, and for the team shirt you have created a new symbol to represent Archimedes and his discoveries. The team members have asked you to attend their next meeting to inform them about your symbol. Write a speech to read to the team members, which describes and explains your symbol and tells why it is appropriate for the team.
Perhaps designing an Archimedes logo for a shirt does relate and use knowledge of Archimedes, but it also tends to treat science as a “scenic” background rather than a central element of the test. Another example from a 5th grade assessment asks students to measure the length of a caterpillar in a picture. The item tests only the skill of measuring, not knowledge about the living organism or its development.
Technical-design flaws. At the turn of the last century, the famous horse “Clever Hans” appeared to solve complicated problems in mathematics. Clever Hans would be asked a question by his owner and would paw the ground a certain number of times to indicate his answer. While observers believed Clever Hans was really listening to the problem and solving it, the horse was actually pawing the ground until he received a subtle cue from his owner to stop.
Defective test items can similarly guide students who do not know the answer to the correct response, thereby inflating performance. Our review found that items on science assessments are not always carefully developed and suffer needlessly from poor technical design. Four technical-design flaws recur:
- Repetition cues. Repetition of words in the question stem can lead students to the correct answer when an answer choice echoes those words, as in the following item:
After Nita connected a copper wire from one terminal of a battery to the other, the wire became very hot. Why did the wire get hot?
A. The circuit was not complete.
B. Air around the circuit became electrified.
C. Chemical energy in the battery produced vibrations.
D. Electrical energy was changed into heat energy.
In this example, the word “hot” appears twice in the item question and cues the correct answer. Only one answer, the correct one (D), contains a word related to “hot.” When a test item provides cues, the student may not need to think about the science required to answer the question, and the test result may be deceptive.
- Implausible choices. Another technical defect is the inclusion of completely implausible answer choices, which reduces the difficulty of a test item. In the example below, choice A (Calidris alba) and choice C (Quercus rubra) are not plausible because they do not contain the cue words “Canis” or “lupus.” Once a student excludes these two incorrect foils, or distracters, he or she has a 50 percent chance of selecting the correct answer from the remaining choices, B or D:
To which of these organisms is the gray wolf (Canis lupus) most clearly related?
A. Calidris alba
B. Anarhichas lupus
C. Quercus rubra
D. Canis familiaris
In fact, item analysis on the examination form shows that only 3 percent of students selected either of the two implausible distracters (A and C), 26 percent selected the single plausible distracter (B), and 68 percent selected the correct answer, D. Student performance on the item might have been lower than 68 percent had it been free of technical defects and cues.
- Logic cues. A third technical defect is a pair of nearly identical answer choices, only one of which can be true. The next example, from a state high school graduation exam, offers diametrically opposed statements as its first two choices. The object of the item is to identify the statement that is false; because choices A and B cannot both be true, students can focus their attention on those two and summarily ignore the other distracters (C and D), increasing their chance of selecting the correct answer:
All of the following statements about the scientific name of an organism are true except
A. The genus name is listed first.
B. The species name is listed first.
C. The genus name begins with a capital letter.
D. The species name does not begin with a capital letter.
- Writing style. Students may also divine the correct answer on a test item by responding to differences in writing style. In the example below, answer D is the only one written with a qualification (“... that it could contain”) and is also the wordiest answer:
The statement that the relative humidity is 50 percent means that
A. The chance of rain is 50 percent.
B. The atmosphere contains 50 kilograms of water per cubic kilometer.
C. The clouds contain 50 grams of water per liter.
D. The atmosphere contains 50 percent of the amount of water that it could contain at its present temperature.
Even if students don’t know the correct answer, they may sense that more effort went into writing answer choice D, and that the writing seems more cautious and scientific. As a result, students may be drawn to the correct answer for reasons unrelated to knowledge of science. Stylistic considerations are subtle, most likely influencing students at a subconscious level.
A state’s academic standards represent its promise to students of a sound, basic education in science. But most states have not developed science assessments that adequately measure how well that promise is being kept.
Some may suggest that performance-based tests are the solution for producing challenging science assessments. Concerns about the cost, scoring, and validity of alternatives to multiple-choice testing, however, limit their viability. Moreover, performance assessments and traditional assessments alike can suffer from the unclear standards and technical flaws we have described.
So what can states do to improve their science standards and assessments? Here are three suggestions:
1. All states should carefully review the clarity of their science standards and the associated learning objectives. A well-crafted science test provides an operational definition of a state’s standards by specifying in measurable terms what students should know. That is possible only when the standards are clear enough to tell test developers what level of performance is expected of students.
2. There is no perfect test, but state assessments often appear to have been written unprofessionally, with many technical defects such as internal and external cuing. States must demand higher standards of technical quality from testing contractors and assessment writers.
3. High-quality state assessment programs in science must measure foundational knowledge in the elementary and middle grades and significantly increase the expectations placed on students in high school. Policymakers need to be willing to test students with many items that are difficult, and to set a high standard for achievement.
If states take these steps and begin now to improve their existing science tests, then the depth and breadth of science that could be learned in our schools, and that is learned in the schools of many other nations, may be reflected nationwide in the science assessments of 2007.