Which LLM is king right now? I ran a creative stress-test on GPT-5, Claude Opus 4.1, o3-pro, Grok 4, and Gemini 2.5 Pro

With GPT-5 and Claude Opus 4.1 launching recently, the obvious question is: which of the strongest LLMs is actually the best right now?

I put 5 top models (GPT-5, Claude Opus 4.1, o3-pro, Grok 4, Gemini 2.5 Pro) through the same ultimate stress-test:

Write a 650-word scripted debate where Cleopatra and Einstein suddenly appear in 2025 and argue about whether TikTok is good or bad for society. Rules: strict alternating lines (starting with Cleopatra), one era-specific joke each, one historical reference each, end with a surprising common agreement, and include a detailed “how I planned this” section.

Why this prompt?

Because it forces models to juggle things they have historically struggled with:

Complexity – multiple constraints, strict format, and length.
Creativity – humor + deep, thematic debate.
Rule-following – miss one rule and the output fails.
Character voice – Cleopatra and Einstein need to sound authentic.
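Some of these rules can be checked mechanically rather than by eye. Here is a minimal sketch of such a check, not what I actually ran. It assumes each spoken line looks like "Cleopatra: ..." or "Einstein: ...", counts only the dialogue lines toward the word target, and allows a 10% tolerance around 650 words. The function names, line format, and tolerance are all my own assumptions. Jokes, historical references, and voice still need a human read.

```python
import re

# Assumed line format: "Cleopatra: text" or "Einstein: text" (hypothetical).
SPEAKER_LINE = re.compile(r"^(Cleopatra|Einstein)\s*:\s*(.+)$")


def parse_turns(script: str) -> list[tuple[str, str]]:
    """Return (speaker, text) pairs for every line that matches the format."""
    turns = []
    for raw in script.splitlines():
        # Drop markdown bold markers so "**Cleopatra:** hi" still matches.
        line = raw.strip().replace("*", "")
        match = SPEAKER_LINE.match(line)
        if match:
            turns.append((match.group(1), match.group(2)))
    return turns


def check_debate(script: str, target_words: int = 650,
                 tolerance: float = 0.10) -> dict:
    """Check the mechanically verifiable rules from the prompt."""
    turns = parse_turns(script)
    speakers = [speaker for speaker, _ in turns]
    # Only dialogue lines are counted; the planning section is ignored.
    word_count = sum(len(text.split()) for _, text in turns)
    return {
        "starts_with_cleopatra": bool(speakers) and speakers[0] == "Cleopatra",
        # Strict alternation: no speaker appears twice in a row.
        "alternates": len(speakers) >= 2
        and all(a != b for a, b in zip(speakers, speakers[1:])),
        "word_count": word_count,
        "length_ok": abs(word_count - target_words) <= target_words * tolerance,
    }
```

Run `check_debate(output_text)` on each model's answer and compare the resulting dictionaries side by side.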

The results

All 5 models nailed the structure (which surprised me; I expected some answers to run short or long), but they differed wildly in tone, depth, and style:

GPT-5 - Did great with nuance and structure. Rich metaphors, era-authentic humor, even policy ideas. Dense but brilliant.
Claude Opus 4.1 - Quick, humorous chat with memorable touches like "Schrödinger’s TikTok". Super readable and charming.
o3-pro - Flowery language (TikTok as a banquet, "photon vlogs"), which I'm usually not a fan of. Playful and quirky.
Grok 4 - Clear and direct analogies. Easiest to follow but not as deep as other models.
Gemini 2.5 Pro - Philosophical and poetic ("timeless hunger for recognition"), but not overdoing it, with subtle humor thrown in.

What they all agreed on

TikTok isn’t inherently good or bad: its impact depends on human intent, wisdom, and education. Tech is neutral. It just mirrors timeless human desires. Not sure I'm on board with the "tech is neutral" stance.

Bottom line

Want depth & elegance? → GPT-5
Want playful banter? → Claude Opus 4.1
Want wild creativity? → o3-pro
Want clarity? → Grok 4
Want philosophy? → Gemini 2.5 Pro

Technical performance

All models were called through their APIs with API keys, so this isn't the default web app behavior.
All chats started at the exact same moment.
Opus 4.1 started generating almost immediately, in under a second.
Gemini 2.5 Pro started shortly after.
Grok 4 started after a short pause, behind the two above.
o3-pro took a veeeery long time to generate an answer. I didn't time it, but it was probably around 2 minutes.
GPT-5 - I almost gave up on it. I tried maybe 20 times until it finally went through. The API either didn't respond at all or timed out after a long while.
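Since I didn't time anything precisely, here is a rough sketch of how the time to first output and the retries could be measured next time. It is not what I used. `stream_fn` is a hypothetical placeholder for whatever function streams text chunks from a given provider (no real SDK call is shown), and the retry count and backoff values are arbitrary assumptions.

```python
import time
from typing import Callable, Iterable, Optional


def time_to_first_chunk(
    stream_fn: Callable[[str], Iterable[str]],  # hypothetical provider wrapper
    prompt: str,
    max_attempts: int = 3,
    backoff_s: float = 1.0,
) -> Optional[dict]:
    """Measure latency to the first streamed chunk, retrying on failures.

    Returns None if every attempt times out, fails to connect,
    or yields an empty stream.
    """
    for attempt in range(1, max_attempts + 1):
        start = time.perf_counter()
        try:
            chunks = iter(stream_fn(prompt))
            first = next(chunks)  # blocks until the model starts answering
            first_chunk_s = time.perf_counter() - start
            text = first + "".join(chunks)  # drain the rest of the stream
            total_s = time.perf_counter() - start
            return {
                "attempts": attempt,
                "first_chunk_s": first_chunk_s,
                "total_s": total_s,
                "text": text,
            }
        except (TimeoutError, ConnectionError, StopIteration):
            if attempt < max_attempts:
                # Linear backoff before the next try.
                time.sleep(backoff_s * attempt)
    return None
```

Wrapping each provider's streaming call in a function with this shape would give comparable numbers for "started almost immediately" versus "around 2 minutes", plus a count of how many attempts a flaky endpoint needed.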

Full side-by-side outputs + very detailed summary (similarities, differences, strong sides, etc.): https://modelarena.ai/s/_EBUxCel6a

submitted by /u/_sabon_