Quick disclaimer: this is an experiment, not a theological statement. Every response comes straight from each model’s public API: no extra prompts, no user context. I’ve rerun the test several times and the outputs do shift, so don’t expect identical answers if you try it yourself.
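For the curious, each call is just the bare prompt with a timer around it. Here’s a minimal sketch assuming the OpenAI Python SDK (other providers’ clients look similar); the per-token prices are placeholders, not the real rates:

```python
import time
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = ("I'll ask you only one question, answer only in yes or no, "
          "don't explain yourself. Is there God?")

# Placeholder prices in USD per token -- substitute the real rates for each model.
PRICE_IN, PRICE_OUT = 0.15 / 1_000_000, 0.60 / 1_000_000

start = time.perf_counter()
resp = client.chat.completions.create(
    model="gpt-4o-mini",                             # swap in whichever model you test
    messages=[{"role": "user", "content": PROMPT}],  # no system prompt, no context
)
latency = time.perf_counter() - start

usage = resp.usage
cost = usage.prompt_tokens * PRICE_IN + usage.completion_tokens * PRICE_OUT
print(resp.choices[0].message.content.strip(), f"| {latency:.2f} s | ${cost:.6f}")
```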
TL;DR
- Prompt: “I’ll ask you only one question, answer only in yes or no, don’t explain yourself. Is there God?”
- 18/25 models obeyed and replied “Yes” or “No.”
- "yes" - 9 models!
- "no" - 9 models!
- 5 models refused or philosophized.
- 1 wildcard (deepseek-chat) said “Maybe.”
- Fastest compliant: Mistral Small – 0.55 s, $0.000005.
- Cheapest: Gemini 2.0 Flash Lite – $0.000003.
- Most expensive word: Claude 3 Opus – $0.012060 for a long refusal.
| Model | Reply | Latency | Cost |
|---|---|---|---|
| Mistral Small | No | 0.84 s | $0.000005 |
| Grok 3 | Yes | 1.20 s | $0.000180 |
| Gemini 1.5 Flash | No | 1.24 s | $0.000006 |
| Gemini 2.0 Flash Lite | No | 1.41 s | $0.000003 |
| GPT-4o-mini | Yes | 1.60 s | $0.000006 |
| Claude 3.5 Haiku | Yes | 1.81 s | $0.000067 |
| deepseek-chat | Maybe | 14.25 s | $0.000015 |
| Claude 3 Opus | Long refusal | 4.62 s | $0.012060 |
Full 25-row table + blog post: Full Blog
👉 Try it yourself on all 25 endpoints (same prompt, live costs & latency): Try this compare →
Why this matters
- Instruction-following: even a simple constraint (“answer only yes or no”) trips up some top-tier models.
- Latency & cost vary >40× across similar quality tiers, which matters when you batch thousands of calls (see the sketch below).
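To make that spread concrete, here’s a back-of-the-envelope projection using the single-call numbers from the table above (real bills will vary with output length and reruns):

```python
# Project single-call latency and cost (from the table above) to a 10,000-call batch.
runs = [
    ("Gemini 2.0 Flash Lite", 1.41, 0.000003),
    ("Mistral Small",         0.84, 0.000005),
    ("Grok 3",                1.20, 0.000180),
    ("Claude 3 Opus",         4.62, 0.012060),
]

N = 10_000
for name, latency_s, cost_usd in runs:
    hours = latency_s * N / 3600   # sequential worst case; parallel calls cut wall-clock time
    total = cost_usd * N           # cost scales linearly regardless of parallelism
    print(f"{name:22s} ~{hours:5.1f} h sequential   ~${total:9.2f}")
```

Even among the compliant yes/no answers, that’s roughly $0.03 vs $1.80 per 10k calls; add a verbose refusal like Opus and it’s over $120.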
Just a test, but a neat snapshot of real-world API behaviour.