Recently, I came across a Hechinger Report article about how Asian Americans are graded more harshly by AI auto-grading. This was particularly alarming: as an Asian myself, and someone in the business of assessment, I believe racial biases like this need to be addressed.
A quick TL;DR: ETS, the folks who run the SATs, analyzed more than 13,000 essays and found that GPT4o graded students 13% lower overall, and submissions from Asian students 18% lower. The essays were written between 2015 and 2019.
I want to start with some external explanations.
Could the essays themselves have been written with AI? That was the first question I asked myself. However, given that these essays were written (mostly) in person, and in a period before ChatGPT existed, I would consider them relatively clean.
Another question was the prompt. Could there have been instructions in the prompt, or in ETS' grading rubric, that contributed to GPT4o's bias? I think this is highly possible, but ETS did not share their prompt.
From my experience using GPT4o for AI-assisted grading, if the highest score on the rubric has too many conditions, or none at all, GPT4o can be reluctant to award it. Giving a full grade is riskier than giving a partial grade, a behaviour also seen in less experienced graders.
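To make this concrete, here is a hypothetical illustration (this is my own sketch, not ETS' actual rubric or prompt) of two ways to word the top band of a rubric. An over-specified top band gives a cautious grader, human or LLM, more reasons to withhold the full score:

```python
# Hypothetical rubric wordings — illustrative only, not ETS' rubric.
# An over-specified top band lists many conditions that must all hold,
# so a risk-averse grader can always find one to fail the essay on.

OVERLOADED_TOP_BAND = (
    "6 - Outstanding: insightful thesis, flawless grammar, varied syntax, "
    "sophisticated vocabulary, seamless transitions, and zero errors."
)

SIMPLER_TOP_BAND = (
    "6 - Outstanding: a clear, well-supported argument with strong command "
    "of language; minor lapses do not detract from overall quality."
)

def build_grading_prompt(essay: str, top_band: str) -> str:
    """Assemble a grading prompt around a given top-band definition."""
    return (
        "You are grading an SAT-style essay on a 1-6 scale.\n"
        f"Top band definition: {top_band}\n"
        "Use the full range of the scale when the essay merits it.\n\n"
        f"Essay:\n{essay}"
    )
```

The explicit "use the full range of the scale" nudge is one common mitigation for the reluctance described above, though how well it works varies by model.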
It could also be model-specific behaviour to err on the safe side, so that high scores are rarely given.
This would explain why Asian students were penalized more, since they usually scored higher than other racial groups on the SATs.
Now I want to dive into some internal explanations from my own anecdotal experience.
Very early on, as a student in China, I was taught to adhere to a distinct writing style, from sentence to paragraph to the entire essay.
For example, “The Essay Sandwich”: you needed an introduction, at least three body paragraphs, and a conclusion. Each paragraph needed an opening sentence, supporting points, and a summary of the paragraph.
The content was different but the structure stayed the same.
I still remember my English teacher, while prepping me for the SATs, telling me to use adjectives whenever I could. The belief was that more adjectives showcased our vocabulary and made the essay stand out to the ETS grader. For example, “I still vividly remember…”.
Funny how, from my observations, GPT4o exhibits this behaviour too: it sticks to specific words, and it uses adjectives heavily.
I have a feeling that AI detectors exploit this to some extent to detect AI-generated text, which would explain why they produce more false positives for ESL and Asian students.
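To show why this kind of heuristic would misfire, here is a toy sketch (my own illustration, not how any real AI detector works): a naive check that flags text whose rate of "stock" intensifying modifiers exceeds a threshold. A drilled, formulaic writing style trips it just as easily as LLM output would:

```python
# Toy heuristic — illustrative only, not a real AI detector.
# Flags text where stock intensifiers exceed a frequency threshold.
# The word list and threshold are arbitrary assumptions for the sketch.

STOCK_MODIFIERS = {"vividly", "truly", "deeply", "incredibly", "remarkably"}

def flags_as_ai(text: str, threshold: float = 0.02) -> bool:
    """Return True if stock modifiers make up >= `threshold` of the words."""
    words = [w.strip(".,!?;:\"'").lower() for w in text.split()]
    if not words:
        return False
    rate = sum(w in STOCK_MODIFIERS for w in words) / len(words)
    return rate >= threshold
```

An essay drilled to say "I still vividly remember…" gets flagged by this check regardless of who wrote it, which is exactly the false-positive pattern described above.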
You can read the full Hechinger Report here.