GPT-4 outperforms its rivals in new AI benchmark suite GPT-Fathom

Researchers from ByteDance and the University of Illinois have developed GPT-Fathom, a benchmark suite that evaluates models under consistent settings. Its results indicate that GPT-4, the engine behind the paid version of ChatGPT, significantly outperforms other leading LLMs, including its biggest competitor, Claude 2.

For the latest advancements in AI, look here first.

https://preview.redd.it/v4fo8zser0sb1.png?width=1292&format=png&auto=webp&s=7e29fe9ac1af3efcb936ee61e9202717eed7e702

GPT-Fathom's breakthrough

The new benchmark suite, GPT-Fathom, addresses inconsistent evaluation settings and prompt sensitivity, two common sources of inconsistency in LLM evaluation.
In a comparison using GPT-Fathom, GPT-4 outperformed more than ten leading LLMs, crushing the competition in most benchmarks. The results also show significant performance leaps from GPT-3 to its successors.
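To make the "consistent settings" idea concrete, here is a minimal Python sketch. It is not GPT-Fathom's actual code: the `EvalSetting` and `Result` classes, their fields, and the `rank` helper are all hypothetical names chosen for illustration. The point is only that scores are ranked against each other when they were produced under the same setting, and rejected otherwise.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalSetting:
    """Hypothetical bundle of settings that should stay fixed across models."""
    num_shots: int
    chain_of_thought: bool
    prompt_template: str


@dataclass(frozen=True)
class Result:
    """One model's score on one benchmark, tagged with the setting used."""
    model: str
    benchmark: str
    score: float
    setting: EvalSetting


def comparable(results):
    """Return True if every result was produced under the same setting."""
    return len({r.setting for r in results}) <= 1


def rank(results):
    """Rank models by score, refusing to compare across mixed settings."""
    if not comparable(results):
        raise ValueError("results were produced under different settings")
    ordered = sorted(results, key=lambda r: r.score, reverse=True)
    return [r.model for r in ordered]
```

Because the dataclasses are frozen, settings are hashable and can be compared as a set, so a single mismatched shot count or prompt template is enough to block the comparison.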

Performance specifics

The gap in performance was especially pronounced against Claude 2, ChatGPT's biggest rival.
GPT-4's Advanced Data Analysis model exhibited superior performance in coding, giving it an edge over Llama 2, the current best-performing open-source model.
Llama 2-70B showed comparable or better performance than gpt-3.5-turbo-0613 in safety and comprehension but displayed worse performance in "Mathematics", "Coding", and "Multilingualism".

The seesaw effect

The research team noted a 'seesaw effect' where an improvement in one area can lead to degradation in another.
For instance, GPT-4 saw a performance drop on the Multilingual Grade School Math (MGSM) benchmark, despite improving its performance significantly on the text comprehension benchmark DROP.
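A rough way to spot this pattern is to compare two versions of a model benchmark by benchmark and pair every gain with every loss. The Python sketch below assumes scores are stored as plain dictionaries where higher is better. The function name `seesaw_pairs` is made up for this example, and the numbers are placeholders, not figures from the paper.

```python
def seesaw_pairs(old, new, tolerance=0.0):
    """Find (improved, regressed) benchmark pairs between two model versions.

    old and new map benchmark name -> score (higher is better).
    Changes no larger than tolerance are treated as noise and ignored.
    """
    shared = sorted(old.keys() & new.keys())
    delta = {b: new[b] - old[b] for b in shared}
    improved = [b for b in shared if delta[b] > tolerance]
    regressed = [b for b in shared if delta[b] < -tolerance]
    return [(up, down) for up in improved for down in regressed]


# Placeholder scores for illustration only, NOT figures from the paper.
old_version = {"DROP": 70.0, "MGSM": 80.0, "OTHER": 50.0}
new_version = {"DROP": 85.0, "MGSM": 65.0, "OTHER": 50.0}

print(seesaw_pairs(old_version, new_version))
```

An empty list means no benchmark regressed while another improved. A non-zero `tolerance` helps avoid flagging run-to-run noise as a seesaw.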

(source)

P.S. If you like this kind of analysis, I write a free newsletter that tracks the most relevant news and developments in AI. Professionals from Meta, Google, and OpenAI are already reading it.

submitted by /u/AIsupercharged