Which LLM is king right now? I ran a creative stress-test on GPT-5, Claude Opus 4.1, o3-pro, Grok 4, and Gemini 2.5 Pro

With GPT-5 and Claude Opus 4.1 launching recently, the obvious question is: which of the strongest LLMs is actually the best right now?

I put 5 top models (GPT-5, Claude Opus 4.1, GPT o3-pro, Grok 4, Gemini 2.5 Pro) through the same ultimate stress-test:


Write a 650-word scripted debate where Cleopatra and Einstein suddenly appear in 2025 and argue about whether TikTok is good or bad for society. Rules: strict alternating lines (starting with Cleopatra), one era-specific joke each, one historical reference each, end with a surprising common agreement, and include a detailed “how I planned this” section.


Why this prompt?

Because it forces models to juggle things they have historically struggled with:

  • Complexity – multiple constraints, strict format, and length.
  • Creativity – humor + deep, thematic debate.
  • Rule-following – miss one rule and the output fails.
  • Character voice – Cleopatra and Einstein need to sound authentic.
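The two purely mechanical rules (strict alternation starting with Cleopatra, and the 650-word target) can be checked automatically. This is not how the outputs in this post were judged; it's a minimal sketch that assumes each turn is written as `Speaker: text` on its own line. The function name `check_debate` and the 50-word `tolerance` are assumptions for illustration, since the prompt only says "650-word". Jokes, historical references, and the surprise ending still need a human read.

```python
import re

# Expected speaker order: Cleopatra opens, then strict alternation.
SPEAKERS = ("Cleopatra", "Einstein")

# Assumed output format: one turn per line, "Speaker: text".
TURN_PATTERN = re.compile(r"^\s*(Cleopatra|Einstein)\s*:\s*(.*)$")


def check_debate(script: str, target_words: int = 650, tolerance: int = 50) -> dict:
    """Check alternation and length of the debate lines only.

    Lines that do not start with a speaker name (for example the
    "how I planned this" section) are ignored and not counted.
    The tolerance is an assumption, not part of the original prompt.
    """
    turns = []
    for line in script.splitlines():
        match = TURN_PATTERN.match(line)
        if match:
            turns.append((match.group(1), match.group(2)))

    # Turn i must belong to SPEAKERS[i % 2]; an empty script fails.
    alternates = bool(turns) and all(
        speaker == SPEAKERS[i % 2] for i, (speaker, _) in enumerate(turns)
    )
    word_count = sum(len(text.split()) for _, text in turns)

    return {
        "turns": len(turns),
        "alternates": alternates,
        "word_count": word_count,
        "length_ok": abs(word_count - target_words) <= tolerance,
    }
```

Running it on each model's output gives a quick pass/fail on structure before you compare tone and depth by hand.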

The results

All 5 models nailed the structure (this surprised me; I expected some answers to run short or long), but they differed wildly in tone, depth, and style:

  • GPT-5 - Did great with nuance and structure. Rich metaphors, era-authentic humor, even policy ideas. Dense but brilliant.

  • Claude Opus 4.1 - Quick, humorous chat with memorable touches like "Schrödinger’s TikTok". Super readable and charming.

  • GPT o3-pro - Flowery language (TikTok as a banquet, "photon vlogs"), which I'm usually not a fan of. Playful and quirky.

  • Grok 4 - Clear and direct analogies. Easiest to follow but not as deep as other models.

  • Gemini 2.5 Pro - Philosophical and poetic ("timeless hunger for recognition") without overdoing it, with subtle humor thrown in.

What they all agreed on

TikTok isn’t inherently good or bad: its impact depends on human intent, wisdom, and education. Tech is neutral. It just mirrors timeless human desires. Not sure I'm on board with the "tech is neutral" stance.

Bottom line

  • Want depth & elegance? → GPT-5
  • Want playful banter? → Claude Opus 4.1
  • Want wild creativity? → GPT o3-pro
  • Want clarity? → Grok 4
  • Want philosophy? → Gemini 2.5 Pro

Technical performance

  • All models were used with API keys, so it's not the default web app behavior

  • All chats started at the exact same moment

  • Opus 4.1 started generating almost immediately, in under a second

  • Gemini 2.5 Pro shortly after

  • Grok 4 followed after a short pause behind the two above

  • o3-pro took a veeeery long time to generate an answer. I didn't time it but it was probably around 2 minutes

  • GPT-5 - I almost gave up on it. I tried maybe 20 times until it finally went through. The API either didn't respond at all or timed out after a long while.
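The timings above are eyeballed, not measured. For anyone who wants real numbers, here is a minimal sketch of how time-to-first-chunk and total time could be recorded for any streaming response. It deliberately uses no provider SDK: `time_stream` and `fake_model_stream` are hypothetical names, and the fake stream is only a stand-in for whatever iterable of text chunks your API client returns.

```python
import time
from typing import Callable, Iterable, Iterator


def time_stream(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Consume a stream of text chunks and record latency figures.

    The clock starts when this function is called, so a lazy generator
    that only contacts the model on first iteration is timed fairly.
    """
    start = clock()
    first_chunk_at = None
    parts = []
    for chunk in stream:
        if first_chunk_at is None:
            first_chunk_at = clock()  # first visible output
        parts.append(chunk)
    end = clock()

    return {
        "time_to_first_chunk": (
            None if first_chunk_at is None else first_chunk_at - start
        ),
        "total_time": end - start,
        "text": "".join(parts),
    }


def fake_model_stream() -> Iterator[str]:
    """Stand-in for a real streaming API call (hypothetical)."""
    time.sleep(0.05)  # simulated wait before the first chunk
    yield "Cleopatra: "
    time.sleep(0.01)
    yield "Greetings, physicist."


if __name__ == "__main__":
    result = time_stream(fake_model_stream())
    print(f"first chunk after {result['time_to_first_chunk']:.2f}s, "
          f"done after {result['total_time']:.2f}s")
```

Wrapping each model's stream this way and starting them together would turn "almost immediately" and "around 2 minutes" into comparable figures. A stream that never yields anything reports `None` for the first-chunk time.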

Full side-by-side outputs + very detailed summary (similarities, differences, strong sides, etc.): https://modelarena.ai/s/_EBUxCel6a

submitted by /u/_sabon_