Is Humanity’s Last Exam a benchmark that measures real intelligence for AGI?
With Grok 4 and Gemini 3 the models have become really good at the known benchmarks like ARC-AGI and HLE. But is it really a proof of intelligence? Does acing these benchmarks truly show capabilities for original research and real understanding? I ask …