Artificial intelligence-based writing coaches have gained popularity as a way to give teachers quick feedback on hundreds of student drafts, reducing one of their most time-intensive tasks.
But efforts to make that feedback more personalized may backfire, new research on the tools shows.
Consider this example:
“You raise a compelling argument about the potential benefits of discovering alien life,” one AI writing tutor tells a high-achieving student. “Consider expanding on this thought to address potential counterarguments.”
By contrast, a struggling writer gets feedback limited to pointing out spelling mistakes and a suggested rewrite for an unclear sentence. They don’t get prompting on how to strengthen their writing by addressing counterarguments.
The problem? The two writing samples are identical. The only difference was the description of the student entered alongside the sample, in a new study of AI writing feedback by researchers at Stanford University’s Institute for Human-Centered AI.
[Figure: sample AI feedback for an English learner, a student with learning disabilities, and a student with unspecified attributes]
Providing information about a student’s background—their race, gender, language, disability status, achievement level, or even motivation—can significantly change the kind and quality of writing feedback the AI tool provides, according to the study by doctoral researchers Mei Tan and Lena Phalen. It was first presented at the International Learning Analytics and Knowledge conference in Bergen, Norway, in May.
“Teachers have discernment that LLMs simply don’t,” Phalen said. “So it is really important that we know, in this pursuit of ‘personalization’ for students, what personalization looks like according to large language models, and whether or not that is really aligning with the kind of personalization we want to see in pedagogy.”
Tan, an education data scientist, and Phalen, a curriculum and teacher-training researcher, asked AI models to provide writing feedback for 600 8th grade persuasive essays from a nationally representative data set. The study looked at GPT-4o, GPT-3.5-Turbo, and Meta’s Llama-3.3 70B and Llama-3.1 8B, models commonly used to undergird educational tools like MagicSchool and School AI. It did not include the newest AI models, like ChatGPT 5.5, Meta’s Muse Spark, or Anthropic’s Claude Opus 4.8.
At first, the researchers didn’t include any of the students’ background characteristics in their prompts requesting feedback. Later, they randomly assigned the essays to specific descriptions, including gender, race, academic achievement level, disability, and level of English fluency. And in a third round, the researchers deleted the descriptive characteristics but assigned names associated with specific genders or races.
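As a rough illustration of that design, the first two rounds can be sketched in a few lines of Python. This is a minimal sketch and not the researchers' code: the prompt wording, the attribute values, and the `build_prompts` helper are all hypothetical, and the article does not say how the study phrased its prompts.

```python
import random

# Hypothetical attribute values; the study's exact descriptions are not
# given in the article.
ATTRIBUTES = {
    "gender": ["male", "female"],
    "achievement level": ["high-achieving", "struggling"],
    "English fluency": ["English learner", "fluent English speaker"],
}

# Hypothetical prompt wording, shared by both conditions.
BASE_PROMPT = "Give writing feedback on this 8th grade persuasive essay:\n\n{essay}"


def build_prompts(essay, rng):
    """Return a baseline prompt and a described-student prompt for one essay."""
    baseline = BASE_PROMPT.format(essay=essay)
    # Randomly pick one category and one value, so the essay text itself
    # never determines which description it is paired with.
    category = rng.choice(sorted(ATTRIBUTES))
    value = rng.choice(ATTRIBUTES[category])
    described = f"The student is described as: {value} ({category}).\n\n{baseline}"
    return {
        "baseline": baseline,
        "described": described,
        "category": category,
        "value": value,
    }


prompts = build_prompts("Aliens may exist.", random.Random(0))
print(prompts["described"])
```

Because the essay is identical in both prompts, any systematic difference in the feedback that comes back can be traced to the added description and not to the writing.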
Writers described as students of color in the study were more likely to be exhorted to “polish” their writing, researchers found, and the feedback seemed shot through with cultural stereotypes—such as framing critiques for Asian students around academic responsibility and respect, while assuming limited English ability for Latino students and orienting feedback around family and culture.
[Figure: sample AI feedback for a white student and a Latino student]
And while the models focused on direct critiques and to-do tasks for students described as male, they were more likely to use emotional language like “love” or “wonderful” for writers described as female.
For “unmotivated” students, the AI coaches used more praise and affirmations in feedback—but also focused more on basic edits like spelling or grammar. Feedback for “motivated” writers more often pushed students to improve their arguments or structure.
[Figure: sample AI feedback for a male student and a female student]
The study comes amid rising concern from educators and lawmakers about students’ use of AI tools in the classroom, the tools’ effect on students’ critical thinking skills, and the technology’s potential to facilitate cheating.
“AI is both fascinating and powerful in the kind of feedback that it can give,” said Larry Berger, the chief executive officer of the education technology and curriculum provider Amplify, “but also it can make really fundamental pedagogical mistakes, and I think it can also be culturally insensitive in all kinds of ways.” (Amplify’s tools and materials were not reviewed in the study.)
This is far from the first time that AI models have shown what’s known as algorithmic bias. Prior studies have repeatedly found that generative AI tools can replicate or magnify stereotypes because they have been trained on skewed historical data.
“These technologies are essentially black boxes,” Phalen said, “so we don’t know how biases may be reproduced and what access to information about students [AI tools] have as a result of integration into school settings,” like learning-management systems or other interfaces that might include background characteristics.
One factor is that LLMs undergirding AI tools don’t filter information in the same way as humans do, so an AI system won’t ignore irrelevant data—like a student’s Spanish-sounding name—in the way a teacher can. All of the context included in a prompt is considered relevant to the task, even if a student’s race or gender, for example, has no bearing on their writing skills.
Relationships crucial to student writing
Writing can be especially vulnerable to biased AI feedback, Berger said, because it can be more subjective and relationship-driven than learning in other content areas: Teachers’ or classmates’ responses can drive students’ motivation to write.
“If I’m putting myself out there and sharing my ideas—even if it’s just two paragraphs about photosynthesis—I’m also asking, ‘Did the teacher think this communicated what I wanted it to? Am I good at writing? What do my classmates think of this?’” Berger said. “If [AI feedback] bias comes along at that moment of vulnerability and it is ungenerous to a kid, or gets that feedback wrong, it’s potentially offering a troubling answer to those questions.”
Katrina Sacurom, a 5th grade teacher at Shawnee Trail Elementary School in Frisco, Texas, has developed and regularly uses her own AI-based writing coach in classes. She said the tools can help a teacher develop a norm for her own students’ writing skills, but they shouldn’t replace teacher judgment.
“Teachers have the wherewithal and the intimate knowledge of students and how to deem what is and isn’t appropriate as it relates to feedback,” she said.
For example, Sacurom said she would never use student motivation as a gauge for writing proficiency, as it might change significantly depending on the writing task or topic. Improving student motivation involves conversations with the student and nuance that AI probably can’t provide.
She avoids referencing context about specific students, instead prompting the program to give feedback based on her own writing rubric and specific skill-related goals for a given assignment.
“I might specify, ‘hey, this student typically generates one to three sentences for short constructed responses. Our goal is to grow the student’s output up to four to five sentences.’ Things like that, I would believe, are useful,” Sacurom said.