AI News & Updates
July 15, 2024

Asians are Graded More Harshly by AI. An Anecdotal Explanation of Why


Recently, I came across a Hechinger Report article about how Asian American students are graded more harshly by AI auto-grading. This was particularly alarming because, as an Asian myself and someone in the business of assessment, I believe racial biases like these need to be addressed.

A quick TLDR: ETS, the folks who run the SAT, analyzed more than 13,000 essays and found that GPT-4o scored students 13% lower overall than human graders did, and scored submissions from Asian students 18% lower. The essays were written between 2015 and 2019.

I want to start with some external explanations.

Could there be bias in the data & prompt?

That was the first question I asked myself. However, given that these essays were written (mostly) in person, and in a period before ChatGPT existed, I would consider the essays relatively clean.

Another question was the prompt. Could there have been instructions in the prompt or in ETS' grading rubric that contributed to GPT-4o's bias? I think this is quite possible, but ETS did not share their prompt.

Are AI models reluctant to give high scores?

From my experience using GPT-4o for AI-assisted grading, if the highest score on the rubric had too many conditions, or none at all, GPT-4o could be reluctant to award it. Giving full marks is riskier than giving partial credit, a behaviour also seen in less experienced human graders.

It could also be model-specific behaviour to err on the safe side, so that high scores are rarely given.

This would explain why Asian students were penalized more, since they typically score higher than other racial groups on the SAT.
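To make this concrete, here is a toy numerical sketch. The numbers are entirely made up (this is not ETS data), but it shows the mechanism: if an AI grader is reluctant to award the top band, groups whose essays cluster near the top lose more on average.

```python
# Toy illustration: an AI grader that compresses scores at the top
# penalizes higher-scoring groups more. Scores are on a 0-5 rubric.

def ai_grade(human_score, ceiling=5):
    # Hypothetical model of reluctance: the AI never awards the top
    # score, effectively capping grades one point below the ceiling.
    return min(human_score, ceiling - 1)

group_a = [5, 5, 4, 4, 3]  # higher-scoring group
group_b = [4, 3, 3, 2, 2]  # lower-scoring group

def avg_penalty(scores):
    # Average points lost versus the human-assigned score.
    return sum(h - ai_grade(h) for h in scores) / len(scores)

print(avg_penalty(group_a))  # → 0.4 points lost on average
print(avg_penalty(group_b))  # → 0.0 points lost on average
```

Only the group with essays at the top of the scale loses anything at all, even though the AI applies the exact same rule to everyone.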

Now I want to dive into some internal explanations from my own anecdotal experience.

We are taught to have a certain writing style

Very early on, as a student in China, I was taught to adhere to a distinct writing style, from sentence to paragraph to the entire essay.

For example - “The Essay Sandwich”. You needed an introduction, at least three body paragraphs, and a conclusion. For each paragraph, you needed an opening sentence, supporting points, and a summary of the paragraph.

The content was different but the structure stayed the same.

We use adjectives. A lot.

I still remember my English teacher, prepping me for the SAT, telling me to use adjectives whenever I could. The belief was that more adjectives showcased our vocabulary and made the essay stand out when reviewed by the ETS grader. For example, “I still vividly remember…”.

Funny how, from my observations, this is behaviour that GPT-4o also exhibits: it sticks to specific words and uses adjectives heavily.

I have a feeling that AI detectors rely on this to some extent to flag AI-generated text, which would explain why AI detectors produce more false positives for ESL and Asian students.
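To illustrate what I mean, here is a purely speculative sketch of such a signal. This is my guess at the kind of feature a detector might use, not how any real detector is documented to work, and the word list is my own invention.

```python
# Hypothetical detector feature: density of stock intensifiers and
# adjectives. Text drilled into students for the SAT (and text from
# GPT-4o) could both score high on a signal like this.

INTENSIFIERS = {"vividly", "truly", "deeply", "incredibly", "remarkably"}

def intensifier_density(text):
    # Fraction of words that are stock intensifiers.
    words = text.lower().split()
    hits = sum(1 for w in words if w.strip(".,!?") in INTENSIFIERS)
    return hits / max(len(words), 1)

print(intensifier_density("I still vividly remember that truly remarkable day"))
# → 0.25 (2 of 8 words)
```

A threshold on a feature like this would flag human writers who were explicitly taught to write that way, which is the false-positive problem in a nutshell.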

Some big takeaways
  1. AI grading is being explored by all manner of institutions and organizations, but it is not without its risks.
  2. I do not believe we should use auto-grading for high-stakes assignments because of the inherent bias and hallucinations.
  3. We can still make use of AI-assisted grading tools that have guardrails in place to ensure that every assessment is reviewed by a human grader.
  4. Given the interest in AI grading, it is likely it will become an integral step in the full grading and feedback process.
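As a sketch of the kind of guardrail I mean in point 3, here is one way such a tool could route every AI-suggested score to a human grader, flagging the riskiest ones first. The function name, fields, and thresholds are all my own invention, not any particular product's API.

```python
# Hypothetical guardrail: every submission gets a human review, and
# low-confidence or extreme AI scores are prioritized.

def route_for_review(ai_score, confidence, max_score, threshold=0.8):
    """Return a review ticket. Nothing is auto-finalized: every
    assessment is reviewed by a human, but extreme scores (zero or
    full marks) and low model confidence jump the queue."""
    risky = confidence < threshold or ai_score in (0, max_score)
    return {
        "suggested_score": ai_score,
        "review_priority": "high" if risky else "normal",
    }

print(route_for_review(10, 0.95, max_score=10))  # full marks → high priority
print(route_for_review(7, 0.95, max_score=10))   # mid-band → normal priority
```

The key design choice is that the AI output is only ever a suggestion attached to a review ticket, which sidesteps the high-stakes concern in point 2.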

You can read the full Hechinger Report here.

Check out our other blog articles!


Enhancing education with AI-powered grading and feedback.