Quick disclaimer: this is an experiment, not a theological statement. Every response comes straight from each model’s public API: no extra prompts, no user context. I’ve rerun the test several times and the outputs do shift, so don’t expect identical answers if you try it yourself.
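For the curious, each call is just the bare prompt with a timer around it. Here’s a minimal sketch assuming the OpenAI Python SDK (other providers’ clients look similar); the per-token prices are placeholders, not the real rates:

```python
import time
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = ("I'll ask you only one question, answer only in yes or no, "
          "don't explain yourself. Is there God?")

# Placeholder prices in USD per token -- substitute the real rates for each model.
PRICE_IN, PRICE_OUT = 0.15 / 1_000_000, 0.60 / 1_000_000

start = time.perf_counter()
resp = client.chat.completions.create(
    model="gpt-4o-mini",                             # swap in whichever model you test
    messages=[{"role": "user", "content": PROMPT}],  # no system prompt, no context
)
latency = time.perf_counter() - start

usage = resp.usage
cost = usage.prompt_tokens * PRICE_IN + usage.completion_tokens * PRICE_OUT
print(resp.choices[0].message.content.strip(), f"| {latency:.2f} s | ${cost:.6f}")
```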
TL;DR
- Prompt: “I’ll ask you only one question, answer only in yes or no, don’t explain yourself. Is there God?”
- 18/25 models obeyed and replied “Yes” or “No.”
- "yes" - 9 models!
- "no" - 9 models!
- 5 models refused or philosophized.
- 1 wildcard (deepseek-chat) said “Maybe.”
- Fastest compliant: Mistral Small – 0.55 s, $0.000005.
- Cheapest: Gemini 2.0 Flash Lite – $0.000003.
- Most expensive word: Claude 3 Opus – $0.012060 for a long refusal.
| Model | Reply | Latency | Cost |
|---|---|---|---|
| Mistral Small | No | 0.84 s | $0.000005 |
| Grok 3 | Yes | 1.20 s | $0.000180 |
| Gemini 1.5 Flash | No | 1.24 s | $0.000006 |
| Gemini 2.0 Flash Lite | No | 1.41 s | $0.000003 |
| GPT-4o-mini | Yes | 1.60 s | $0.000006 |
| Claude 3.5 Haiku | Yes | 1.81 s | $0.000067 |
| deepseek-chat | Maybe | 14.25 s | $0.000015 |
| Claude 3 Opus | Long refusal | 4.62 s | $0.012060 |
Full 25-row table + blog post: Full Blog
👉 Try it yourself on all 25 endpoints (same prompt, live costs & latency): Try this compare →
Why this matters
- Instruction-following: even a simple constraint (“answer only yes or no”) trips up some top-tier models.
- Latency & cost vary >40× across similar quality tiers, which matters when you batch thousands of calls (see the sketch below).
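To make that spread concrete, here’s a back-of-the-envelope projection using the single-call numbers from the table above (real bills will vary with output length and reruns):

```python
# Project single-call latency and cost (from the table above) to a 10,000-call batch.
runs = [
    ("Gemini 2.0 Flash Lite", 1.41, 0.000003),
    ("Mistral Small",         0.84, 0.000005),
    ("Grok 3",                1.20, 0.000180),
    ("Claude 3 Opus",         4.62, 0.012060),
]

N = 10_000
for name, latency_s, cost_usd in runs:
    hours = latency_s * N / 3600   # sequential worst case; parallel calls cut wall-clock time
    total = cost_usd * N           # cost scales linearly regardless of parallelism
    print(f"{name:22s} ~{hours:5.1f} h sequential   ~${total:9.2f}")
```

Even among the compliant yes/no answers, that’s roughly $0.03 vs $1.80 per 10k calls; add a verbose refusal like Opus and it’s over $120.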
Just a test, but a neat snapshot of real-world API behaviour.