Is Humanity’s Last Exam a benchmark that measures real intelligence for AGI?

With Grok 4 and Gemini 3, the models have become really good at well-known benchmarks like ARC-AGI and HLE. But is that really proof of intelligence? Does acing these benchmarks truly show a capacity for original research and real understanding? I ask because I saw an interview with Demis Hassabis from two months ago, and he said, with 50% confidence, that AGI will arrive in 5 to 10 years. He also called the "PhD level" label that some people attribute to the models "nonsense".

submitted by /u/SpecialistBuffalo580