Recently, I came across a Hechinger Report article about how Asian Americans are graded more harshly by AI auto-grading. This was particularly alarming: as an Asian myself, and someone in the business of assessment, I believe racial biases like this need to be addressed.
A quick TL;DR: ETS, the folks who run the SATs, analyzed more than 13,000 essays and found that GPT4o graded students 13% lower overall, and submissions from Asian students 18% lower. The essays were written between 2015 and 2019.
I want to start with some external explanations.
Could the essays themselves have been written with AI? That was the first question I asked myself. However, given that these essays were written (mostly) in person, and in a period before ChatGPT existed, I would consider them relatively clean.
Another question was the prompt. Could there have been instructions in the prompt, or in ETS' grading rubric, that contributed to GPT4o's bias? I think this is highly possible, but ETS did not share their prompt.
From my experience using GPT4o for AI-assisted grading, if the highest score on the rubric has too many conditions, or none at all, GPT4o can be reluctant to award it. Giving a full grade is riskier than giving a partial grade, a behaviour also seen in less experienced graders.
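To make this concrete, here is a hypothetical illustration (this is my own sketch, not ETS' actual rubric or prompt) of two ways to word the top band of a rubric. An over-specified top band gives a cautious grader, human or LLM, more reasons to withhold the full score:

```python
# Hypothetical rubric wordings — illustrative only, not ETS' rubric.
# An over-specified top band lists many conditions that must all hold,
# so a risk-averse grader can always find one to fail the essay on.

OVERLOADED_TOP_BAND = (
    "6 - Outstanding: insightful thesis, flawless grammar, varied syntax, "
    "sophisticated vocabulary, seamless transitions, and zero errors."
)

SIMPLER_TOP_BAND = (
    "6 - Outstanding: a clear, well-supported argument with strong command "
    "of language; minor lapses do not detract from overall quality."
)

def build_grading_prompt(essay: str, top_band: str) -> str:
    """Assemble a grading prompt around a given top-band definition."""
    return (
        "You are grading an SAT-style essay on a 1-6 scale.\n"
        f"Top band definition: {top_band}\n"
        "Use the full range of the scale when the essay merits it.\n\n"
        f"Essay:\n{essay}"
    )
```

The explicit "use the full range of the scale" nudge is one common mitigation for the reluctance described above, though how well it works varies by model.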
It could also be model-specific behaviour to err on the safe side, so that high scores are rarely given.
This would explain why Asian students were penalized more, since they usually scored higher than other racial groups on the SATs.
Now I want to dive into some internal explanations from my own anecdotal experience.
Very early on, as a student in China, I was taught to adhere to a distinct writing style, from sentence to paragraph to the entire essay.
For example, “The Essay Sandwich”: you needed an introduction, at least three body paragraphs, and a conclusion. Each paragraph needed an opening sentence, supporting points, and a summary of the paragraph.
The content was different but the structure stayed the same.
I still remember my English teacher, while prepping me for the SATs, telling me to use adjectives whenever I could. The belief was that more adjectives showcased our vocabulary and made the essay stand out to the ETS grader. For example, “I still vividly remember…”.
Funny how, from my observations, GPT4o exhibits this behaviour too: it sticks to specific words, and it uses adjectives heavily.
I have a feeling that AI detectors exploit this to some extent to detect AI-generated text, which would explain why they produce more false positives for ESL and Asian students.
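To show why this kind of heuristic would misfire, here is a toy sketch (my own illustration, not how any real AI detector works): a naive check that flags text whose rate of "stock" intensifying modifiers exceeds a threshold. A drilled, formulaic writing style trips it just as easily as LLM output would:

```python
# Toy heuristic — illustrative only, not a real AI detector.
# Flags text where stock intensifiers exceed a frequency threshold.
# The word list and threshold are arbitrary assumptions for the sketch.

STOCK_MODIFIERS = {"vividly", "truly", "deeply", "incredibly", "remarkably"}

def flags_as_ai(text: str, threshold: float = 0.02) -> bool:
    """Return True if stock modifiers make up >= `threshold` of the words."""
    words = [w.strip(".,!?;:\"'").lower() for w in text.split()]
    if not words:
        return False
    rate = sum(w in STOCK_MODIFIERS for w in words) / len(words)
    return rate >= threshold
```

An essay drilled to say "I still vividly remember…" gets flagged by this check regardless of who wrote it, which is exactly the false-positive pattern described above.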
You can read the full Hechinger Report here.