Education Opinion

Response: ‘Getting What You Pay For’ In Teacher Evaluations

By Larry Ferlazzo — November 07, 2014 19 min read

(This is the last post in a three-part series on this topic. You can see Part One here and Part Two here.)

This week’s question-of-the-week is:

If we don’t want to use student test scores as part of a teacher evaluation, then what are alternatives?

Part One in this series included responses from American Federation of Teachers President Randi Weingarten, California Teachers Association President Dean Vogel, and 2012 National Teacher Of The Year Rebecca Mieliwocki.

Part Two featured contributions from Julian Vasquez Heilig (with Lisa Hernandez), Ben Spielberg, David Berliner and Paul Bruno.

This last post in the series shares commentaries from W. James Popham, Barnett Berry, Pia Lindquist Wong, Rick Stiggins and Derek Cabrera, along with thoughts from readers.

You can also listen to a ten-minute conversation I had with Julian and Ben on this topic at my BAM! RadioShow.

I also thought readers might be interested in a piece I wrote for The Washington Post a few years ago titled The Best Kind Of Teacher Evaluation.

Response From W. James Popham

W. James Popham is a UCLA Emeritus Professor and the author of Corwin’s 2013 book, Evaluating America’s Teachers: Mission Possible? He is also the author of several books published by ASCD on assessment:

Some educators don’t want to use test-elicited evidence of student growth in the evaluation of teachers. This is a major mistake. What should be opposed, with zeal, is the evaluation of teachers when inappropriate kinds of test-score results are used. Let’s face it, most teachers--including me--went into this game to help kids learn. Accordingly, failing to employ some evidence regarding students’ learning to evaluate us is sublimely wrong-headed.

Today’s problem, however, is that teacher evaluators often rely on students’ performances on state-approved standardized accountability tests. And those tests are accompanied by no evidence whatsoever that they actually distinguish between well taught and badly taught students. When teachers are evaluated, the tests used must be instructionally sensitive. Such tests are demonstrably able to differentiate between effectively and ineffectively taught students.

If properly assessed, I regard evidence of students’ learning as the most important evidence to employ when evaluating a teacher. Yet, because current federal preferences appropriately call for “multiple measures,” two other kinds of evaluative evidence can also be used in appraising a teacher.

Classroom Observations:
When properly trained evaluators appropriately observe what’s going on in a teacher’s classroom, the resultant evidence can contribute to accurate teacher evaluation. Regrettably, what’s going on now in the U.S. is far from appropriate. The chief deficit is that observers are trying to observe too many things. Wherever classroom-observation evidence plays a prominent role in the evaluation of a teacher, the observation forms being used typically ask observers to spot the presence, absence, or quality of several dozen factors--sometimes more than 50 such factors. These forms were originally created to be used in a formative, that is, improvement context, not in today’s summatively oriented teacher evaluations. Although classroom observers can do a decent job working with a half-dozen or so observable factors, reliance on too many factors diminishes the quality of the resultant data. Lean observations win; cluttered observations lose.

Student Affect: Finally, a key factor in evaluating a teacher is the impact the teacher has on students’ attitudes, interests, and values. By employing anonymously completed pre-instruction and post-instruction inventories assessing such variables as students’ interest in the content being taught, important evaluative evidence is collected. The results of such anonymous inventories, however, must be used to arrive at group-focused shifts in students’ affect, not shifts in individual students’ affect.

Classroom-observation evidence, changes in students’ affect, and the right kinds of test-scores can contribute to defensible teacher evaluation.

Response From Barnett Berry

Barnett Berry (@BarnettCTQ) is founder, partner, and CEO at the Center for Teaching Quality (@teachingquality), a national nonprofit that connects, readies, and mobilizes teachers to transform teaching. Barnett, whose most recent book is Teacherpreneurs: Innovative Teachers Who Lead But Don’t Leave, has been a scholar of, and advocate for, teachers and the profession for decades:

Secretary of Education Arne Duncan recently outlined a slight shift in his teacher evaluation policies, offering states a one-year delay in using student test results to assess teacher effectiveness. Teachers need time to adjust to new college- and career-ready standards, and tests need to be better aligned, he explained. #truth

But so far Duncan has stopped short of acknowledging that value-added measures (VAMs) based on test scores are extremely unstable. And teachers rarely receive usable and timely feedback from this data. #YIKES After all, the purpose of evaluation shouldn’t be to label supposedly “good” and “bad” teachers. Instead, evaluation should be just one piece of a professional learning system that supports all teachers to improve throughout their careers.

Many alternative approaches are available to Duncan and state policymakers:

1) Peer-to-peer accountability systems. In top-performing nations, trained administrators and master teachers assess their teaching colleagues’ performance and suggest improvements. Evaluation is powered by practitioners’ professional judgment rather than rigid VAM data and “fly-by” observations. Such systems are working well in American districts like Montgomery County (MD), Poway (CA), and San Juan Unified (CA).

2) Portfolios and self-assessment. Teachers could assemble videos and analyses of their teaching, along with representative samples of student work, in a process akin to that of the National Board for Professional Teaching Standards. Teachers in Singapore assemble evidence of their impact on the whole child (not just test performance).

3) Evidence of data-driven improvement. Teachers could be judged on how they use student learning results (including but not limited to traditional test scores) to improve their teaching.

4) Contributions to their schools and community. Teacher assessments could account for practitioners’ roles in their local context, encouraging effective collaboration within and beyond school walls as well as the spread of expertise among teaching colleagues.

We can do this. Other nations have relied on American researchers’ work to build effective evaluation policies within holistic professional learning systems. And new digital tools like Show Evidence make it easier than ever to implement alternative approaches, allowing teachers to electronically assemble portfolio data and artifacts related to specific aspects of effective teaching. Accomplished teachers (including Kentucky teacherpreneur Ali Wright and many, many others in the CTQ Collaboratory) can offer expert counsel on improving evaluation systems. Implementation is within reach. All we need is the political will.

Perhaps--thanks to deep conversation with educators nationwide--Secretary Duncan is poised to retrieve his legacy, advancing the teaching profession all kids deserve. Or will he leave that monumental achievement to his successor?

* A recent analysis of New York City VAM scores shows that the average error margin is plus or minus 30 percentile points. Matthew DiCarlo notes: “That puts the ‘true score’ (which we can’t know) of a 50th percentile teacher at somewhere between the 20th and 80th percentile--an incredible 60 point spread.”

Response From Pia Lindquist Wong

Pia Lindquist Wong, Ph.D., is a Professor in and Chair of the newly formed Department of Teaching Credentials at California State University, Sacramento (CSUS). From 2000 until 2008, she was the Project Director for the Equity Network, an urban school reform and teacher preparation partnership. Her key publications include Prioritizing urban children, their teachers, and schools through professional development schools (2009, co-edited with Ronald Glass) and Education and Democracy: Paulo Freire, Education Reform and Social Movements (1998, co-authored with Maria Pilar O’Cadiz and Carlos Alberto Torres):

When I think about using test scores to evaluate the effectiveness of a teacher, I am reminded of a conversation I had with my daughter 3 years ago when she was in 8th grade when we were reviewing her California Standards Test scores. This wasn’t something that we as a family made a big deal about, but because it was the second time she had scored 100% on the statistics and probability section of the math exam (only 5 problems, I should add), I mentioned that she must be pretty good at or even really like statistics and probability. Her response to me - “what’s that?!” Surely we don’t want to make high stakes policy decisions and take high stakes programmatic actions based on results that are either meaningless or indecipherable to teachers and students.

At the same time, I am very concerned with the larger policy questions of teacher performance and teacher effectiveness. On a personal level, I am concerned since my own children are/were in public high school. On a professional/personal level I am also concerned since my children and many others are/were taught by teachers who graduated from the credential program that I teach in and many of the teachers at their school also mentor current candidates in that same program. In these contexts, I (with my university and K-12 colleagues) wrestle constantly with questions about teacher effectiveness, how to measure high quality performance, and what the predictive value is of the various tasks, projects and experiences our pre-service candidates complete for our program.

On some days, it seems appealing to boil all of this down to a score. Indeed the allure (false though it is) of a process where we could “tinker” with inputs and consequently increase scores by predictable quartiles is powerful. But we know that we would simply be deceiving ourselves and debasing our profession. Instead, we should be asking the kinds of questions that honor the situational and contextual nature of teaching and do justice to its complexities and multiple layers. We should also be asking to include a range of perspectives in this process, including those not normally present in this dialogue. It is unlikely that we can or even want to eliminate large-scale, system-level accountability measures, but these must be balanced by local, contextualized measures that reflect the lived experiences of the stakeholders directly involved in the teaching/learning process and provide useful information to guide future decisions and actions.

Perhaps we should include students in the teacher performance evaluation process. For example, according to my children and their peers, “good” teachers make excellent use of class time, they actively create classroom community, they are organized, their grading policies are consistent and not arbitrary, they are fair and humane, they offer second chances, their directions are clear, their assignments are challenging but engaging and relevant, they provide specific and timely feedback, they communicate an interest in their students as students and people, they express a genuine sense of caring, they are firm but they are flexible, and they are interesting people. Systematically surveying students (perhaps 4th grade and older) could yield feedback on their classroom experience in ways that a test score cannot and does not.

We could also gather input from other stakeholders in this process - e.g., teachers who “receive” students in subsequent grades or for subjects taken during the same school year. Parents could also be a source of feedback to teachers and the local system (grade level, school or district level) as a whole. While we often worry that these kinds of processes can devolve into popularity contests, there are certainly ways to solicit this input so that the results are valid and reliable.

We can also construct more direct measures of what a teacher’s efforts “render” in a given academic period that could approximate what we think we are getting with value added measures. Teachers - by subject area or grade level or some other logical category - could select specific tasks, assignments or projects, standardize the evaluation criteria and then assess students’ work over time or at select points in time. This kind of a task-specific growth model could be used to evaluate a specific teacher’s effectiveness. It could also inform the larger group; for example, if one teacher had especially compelling growth, his/her strategies and practices could be illuminated for the learning benefit of other teachers. While this approach may seem unrealistically grassroots, a high quality task could certainly be created that aligned to important benchmarks and standards so that student performance on that task was linked and maybe even predictive of other kinds of performance on standardized measures.

It is worth considering these alternatives to using standardized test scores to evaluate teacher performance. Unlike standardized test scores, these methods can provide comprehensive information about a teacher’s performance on a range of domains we know are critical to teacher (and student) success. They also provide more qualitative information that can be clearly and immediately actionable. Overall, the testing route will be less costly than other options...but be wary, because in teacher evaluation, like other efforts, you get what you pay for!

Response From Rick Stiggins

Rick Stiggins is retired founder and president of the Assessment Training Institute. He has devoted his career to understanding the task demands of classroom assessment and to helping teachers and school leaders learn how to meet the challenges of that level of assessment. He is the author of Defensible Teacher Evaluation. Connect with him at www.rickstiggins.com:

First, it is a very good idea for you to avoid using standardized test scores in this context. I have written extensively about the reasons why in my recent Corwin publication, Defensible Teacher Evaluation. These tests have not been validated for the purpose of teacher evaluation; their ability to detect differences in the quality of instruction has never been researched, let alone proven. Further, reliance on annual tests allows a full year for a wide variety of factors beyond the control of the teacher to influence student growth: school climate, student characteristics, curriculum materials, impact of previous teachers, and home and family background factors. Teacher factors account for barely 10% of the variance in student performance on these tests. So it is patently unfair to hold teachers accountable to growth defined in this manner.

However, this does not mean we cannot factor student achievement growth into the teacher evaluation process. I believe we can and should. But we must tap proper sources of evidence of growth to make this work. The subtitle of my book is Student Growth through Classroom Assessment. We can make this work to the benefit of students, teachers, and the school community, but only if certain conditions are satisfied. Here’s the success scenario:

Each teacher/supervisor team identifies at least one high priority achievement standard in each of the teacher’s assigned subjects. The teacher develops verifiably high-quality classroom assessments of each of those standards. When instruction is offered on one of the priority standards, it is bracketed by pre- and post-testing to document changes in student achievement. That evidence is reviewed immediately with the supervisor to establish its credibility and to plan for any necessary re-teaching. At the end of the year each teacher assembles the standard-by-standard evidence into a portfolio advancing as strong an argument as possible of the efficacy of his or her instructional practices, also addressing any extraneous factors beyond the teacher’s control that may have influenced learning. A carefully-selected and well-trained committee reviews each portfolio with the teacher to render a judgment about this facet of that teacher’s performance. This process honors the professionalism of all involved by affording teachers the opportunity to build their own case for the efficacy of their instruction.

However, this can only work if all teachers and supervisors are sufficiently assessment literate to develop and use high-quality classroom assessments. This condition is rarely satisfied in schools today due to the ongoing lack of pre- and in-service training. But that problem is fixable; very effective and inexpensive professional development programs are readily available to fill this gap in competence.

Response From Derek Cabrera

Derek Cabrera holds a PhD from Cornell University, where he has also taught. He is the author of six books and an internationally recognized expert in cognition, systems, and learning. Derek is currently co-founder and senior research scientist at Cabrera Research Lab in Ithaca, New York. He is the co-author of Thinking at Every Desk: Four Simple Skills to Transform Your Classroom (W. W. Norton; 2012). Visit him at cabreraresearch.org:

This is one of those cases where the question itself might underlie the problem. We need to think differently about how we evaluate teachers and it’s not as simple and clear-cut as some might think. For example, many educational leaders, even those who think of themselves as fairly innovative and thoughtful, will begin listing what they think are creative solutions to not using student test scores to evaluate teachers: engagement, enrollment, attendance, graduation rates, etc. The problem is the underlying assumption that teachers have enough influence over students that the teacher’s professional evaluation should be based on the student’s performance. But, how could we not use student performance as the way to evaluate teachers? What else would you use? Well, this incredulous chorus is in part because of our own errors in thinking about the profession of teaching; we wouldn’t make the same flawed assumptions or be quite as incredulous if we were talking about doctors, lawyers, or other professional groups.

We don’t evaluate doctors by the number of patients who die, nor by how healthy their patients or communities are. The truth is, we understand that people make choices, have genetic histories, and cultural context and that doctors don’t have quite as much influence over their patients’ health as we might like.

If we are ever to liberate teaching from its gendered roots then we must work to professionalize the field of teaching similar to the fields of law and medicine. There are many systems that need to be in place for this to occur, not the least of which is the lacking importance our culture places on the profession itself. Like any group, teachers desire agency and the political power to influence the profession they love. Collective action is critically important for teachers, and it can come from a variety of powerful places. The future of teacher collective power is in the professionalization and self-governance of their burgeoning profession. I think, for example, that there should be a “teacher bar association” and various other systems for teachers to govern best practices, malpractices, and take a leadership role in pushing research and innovation in their field forward. They should not be evaluated by non-teachers without professional knowledge of teaching any more than a heart surgeon should be evaluated by a plumber. We must move the profession of teaching toward self-governance and the reputation of teaching toward one of society’s highest paid and most competitive opportunities.

We can all provide a list of test score alternatives for teacher evaluation, but the truth is that such a list actually causes us to focus on the wrong question. The question we should be asking ourselves is not “how will we evaluate teachers?” but how will we elevate the profession of teaching in our society to such a high degree that teachers govern themselves and the general public trusts them to do so.

Responses From Readers

Extra Credit?:

Since politicians and elected officials make our evaluation systems, how about we have ALL politicians (Public Servants) be required to visit classrooms for 40 hours per nine weeks - visiting elementary for 40 hours - middle school for 40 hours and high school for 40 hours with the final 40 hours of time spent in teacher professional development courses. This might help them understand what our jobs are like before they create some outlandish evaluation system.

Another way to measure me: Put me in charge of passing and failing, not a test. Give me the social value to educate and not a test. This would be a great way to see if I am doing my job effectively, you know... like Finland does!

Steve Shapiro:

Hold the principal accountable for a school’s overall academic performance (including standardized test scores) but provide that administrator with greater leeway on staffing decisions as well as other budgeting issues. Competent administrators should have a fairly good sense of which staff members are contributing to a school’s success and which ones may not be effective in their role(s) within the school. That administrator should be able to consider standardized test results along with many other metrics when evaluating staff and should have a better understanding of the context from which those test scores emanate. This is essentially how management of human resources occurs in the private sector.

Thanks to James, Barnett, Pia, Rick and Derek, and to readers, for their contributions!

Please feel free to leave a comment with your reactions to the topic or directly to anything that has been said in this post.

Consider contributing a question to be answered in a future post. You can send one to me at lferlazzo@epe.org. When you send it in, let me know if I can use your real name if it’s selected or if you’d prefer remaining anonymous and have a pseudonym in mind.

You can also contact me on Twitter at @Larryferlazzo.

Anyone whose question is selected for this weekly column can choose one free book from a number of education publishers.

Just a reminder -- you can subscribe and receive updates from this blog via email or RSS Reader... And, if you missed any of the highlights from the first three years of this blog, you can see a categorized list below. You won’t see posts from this school year in those compilations, but you can review those new ones by clicking on the monthly archives link on this blog’s sidebar:

Student Motivation

Implementing The Common Core

Teaching Reading and Writing

Parent Engagement In Schools

Teaching Social Studies

Best Ways To Begin & End The School Year

Teaching English Language Learners

Using Tech In The Classroom

Education Policy Issues

Teacher & Administrator Leadership

Instructional Strategies


Teaching Math & Science

Brain-Based Learning

School Relationships

Author Interviews

Professional Development

Education Week has published a collection of posts from this blog -- along with new material -- in an ebook form. It’s titled Classroom Management Q&As: Expert Strategies for Teaching.

Watch for the next question-of-the-week in a few days....

The opinions expressed in Classroom Q&A With Larry Ferlazzo are strictly those of the author(s) and do not reflect the opinions or endorsement of Editorial Projects in Education, or any of its publications.