Recently, Anthropic claimed that its latest model, Claude 3, beat GPT-4 on various benchmarks. If you don’t know about Anthropic, here is some quick context.
The three main players leading the GenAI space are:
1) OpenAI - ChatGPT
2) Google - Gemini
3) Anthropic - Claude
There are other players in the space, but as an average GenAI enjoyer, you will probably get funnelled to one of these three. So it’s a big achievement when one model beats another, especially when things are as competitive as they are now.
One thing that bugs me is how most publications are just saying ‘WOW - look: Claude beat ChatGPT’. It’s the type of hype/clickbait marketing I hate reading. Just search on Google and YouTube and you will see the exact type of content I am talking about.
After going through the entire announcement and the online discussion boards, I noticed something that smelled very fishy to me and to many other people. At the bottom of the page, Anthropic disclosed:
1) Their ‘engineers have worked to optimize prompts’
2) ‘A newer GPT-4T model’ reported higher scores
GPT-4T refers to GPT-4 Turbo, which is the latest model available to most users via ChatGPT Plus and the API.
Hang on - this doesn’t sound so fair, right? Imagine comparing a 2020 Corolla to a 2019 Civic. I am no car snob, but that doesn’t sound like a fair comparison. Shouldn’t you test the most recent Claude model against the most recent GPT or Gemini model?
I’ve tested different versions of Claude and they are decently capable, but Anthropic knew exactly what they were doing when they released that announcement. They knew there’d be no headlines if they compared Claude 3 to GPT-4T.
This reminds me of when the AI detectors claimed 99% accuracy, but when you read the fine print, their tests were run only on a specific subset of data. Yet every instructor I’ve spoken to says the real-world accuracy is definitely not 99%.
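To make that concrete, here is a minimal Python sketch of how the same detector can post very different accuracy numbers depending on the test set. Every number in it is made up for illustration: the labels, the predictions, and the `accuracy` helper are my own assumptions, not data or code from any real detector or vendor.

```python
def accuracy(predictions, labels):
    """Return the fraction of predictions that match the true labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical data: 1 = "AI-written", 0 = "human-written".

# A curated subset: clear-cut samples where the detector looks perfect.
curated_labels = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
curated_predictions = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]

# A more realistic mix (for example, lightly edited AI text or unusual
# human writing styles), where the same detector makes mistakes.
realistic_labels = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
realistic_predictions = [1, 0, 1, 0, 0, 1, 0, 0, 1, 1]

curated_acc = accuracy(curated_predictions, curated_labels)
realistic_acc = accuracy(realistic_predictions, realistic_labels)

print(f"Accuracy on curated subset:  {curated_acc:.0%}")
print(f"Accuracy on realistic mix:   {realistic_acc:.0%}")
```

The headline number only tells you how the tool did on the data it was tested with. The same logic applies to model benchmarks: the result depends on which prompts and which model versions were chosen.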
I guess my point is that it’s better to be skeptical when it comes to claims like this, and to always do your own testing and experiments under realistic conditions.
Chris Du
CEO, TimelyGrader.ai
Enhancing education with AI-powered grading and feedback.