Gemini 3 Flash scores a 91% hallucination rate on Artificial Analysis's AA-Omniscience benchmark!?
Can you actually use this for anything serious?
I wonder if the reason Anthropic models are so good at coding is that they hallucinate much less. Seems critical when you need precise, reliable output.
AA-Omniscience Hallucination Rate (lower is better) measures how often the model answers incorrectly when it should have refused or admitted to not knowing the answer. It is defined as the proportion of incorrect answers out of all non-correct responses, i.e. incorrect / (incorrect + partial answers + not attempted).
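The metric above is a simple ratio over answer categories. A minimal sketch in Python, using made-up illustrative counts (not real benchmark data):

```python
def hallucination_rate(incorrect: int, partial: int, not_attempted: int) -> float:
    """Hallucination rate as defined above: incorrect answers as a share
    of all non-correct responses (incorrect + partial + not attempted)."""
    non_correct = incorrect + partial + not_attempted
    if non_correct == 0:
        return 0.0  # no non-correct responses, nothing to hallucinate on
    return incorrect / non_correct

# Hypothetical counts for illustration:
# 91 incorrect, 4 partial, 5 not attempted -> 91 / 100 = 0.91
print(hallucination_rate(91, 4, 5))  # → 0.91
```

Note that a model can score well here simply by refusing or saying "I don't know" more often: those responses land in the denominator but not the numerator, which is why the metric rewards calibrated abstention rather than raw knowledge.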
Notable Model Scores (from lowest to highest hallucination rate):
- Claude 4.5 Haiku: 26%
- Claude 4.5 Sonnet: 48%
- GPT-5.1 (high): 51%
- Claude 4.5 Opus: 58%
- Grok 4.1: 64%
- DeepSeek V3.2: 82%
- Llama 4 Maverick: 88%
- Gemini 2.5 Flash (Sep): 88%
- Gemini 3 Flash: 91%
- GLM-4.6: 93%
Credit: amix3k