<span class="vcard">/u/SpecialistBuffalo580</span>

Is Humanity’s Last Exam a benchmark that measures real intelligence for AGI?

With Grok 4 and Gemini 3, models have become really good at well-known benchmarks like ARC-AGI and HLE. But is that really proof of intelligence? Does acing these benchmarks truly show a capacity for original research and real understanding? I ask …