/u/mrconter1

First AI Benchmark Solved Before Release: The Zero Barrier Has Been Crossed

DiceBench: A Simple Task Humans Fundamentally Cannot Do (but AI Might)

When AI Beats Us In Every Test We Can Create: A Simple Definition for Human-Level AGI

H-Matched: A website tracking the shrinking gap between AI and human performance

Hi! I wanted to share a website I made that tracks how quickly AI systems catch up to human-level performance on benchmarks. I noticed this 'catch-up time' has been shrinking dramatically – from taking 6+ years with ImageNet to just months with…
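A minimal sketch of how such a "catch-up time" metric could be computed, assuming each benchmark record carries a release date and the date AI first reached human-level performance; the dates below are illustrative placeholders, not the site's actual data:

```python
from datetime import date

# Hypothetical records: (benchmark, release date, date AI first reached
# human-level performance). Dates are rough illustrations only.
BENCHMARKS = [
    ("ImageNet", date(2009, 1, 1), date(2015, 2, 1)),
    ("SQuAD 1.1", date(2016, 6, 1), date(2018, 10, 1)),
]

def catch_up_days(released: date, matched: date) -> int:
    """Days between a benchmark's release and AI matching human performance."""
    return (matched - released).days

for name, released, matched in BENCHMARKS:
    print(f"{name}: {catch_up_days(released, matched)} days")
```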

HuggingFace Paper Explorer: View Top AI Papers from Past Week and Month

Hi! I've created a simple tool that extends HuggingFace's daily papers page, allowing you to explore top AI research papers from the past week and month, not just today. It's a straightforward wrapper that aggregates and sorts papers, makin…
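A rough sketch of the kind of aggregation such a wrapper might perform, assuming the unofficial JSON endpoint behind the HuggingFace daily papers page and a `date` query parameter; the endpoint, parameters, and field names here are assumptions, not the tool's actual code:

```python
from datetime import date, timedelta

import requests

# Assumed (unofficial) endpoint behind the HuggingFace daily papers page.
API = "https://huggingface.co/api/daily_papers"

def papers_for_day(day: date) -> list[dict]:
    """Fetch one day's paper entries (assumed to return a JSON list)."""
    resp = requests.get(API, params={"date": day.isoformat()}, timeout=30)
    resp.raise_for_status()
    return resp.json()

def top_papers(days: int = 7, limit: int = 20) -> list[dict]:
    """Aggregate the last `days` days and sort by upvotes (assumed field)."""
    today = date.today()
    entries = []
    for offset in range(days):
        entries.extend(papers_for_day(today - timedelta(days=offset)))
    entries.sort(key=lambda e: e.get("paper", {}).get("upvotes", 0), reverse=True)
    return entries[:limit]

if __name__ == "__main__":
    for entry in top_papers():
        paper = entry.get("paper", {})
        print(paper.get("upvotes", 0), paper.get("title", "?"))
```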

BenchmarkAggregator: Comprehensive LLM testing from GPQA Diamond to Chatbot Arena, with effortless expansion

BenchmarkAggregator is an open-source framework for comprehensive LLM evaluation across cutting-edge benchmarks like GPQA Diamond, MMLU Pro, and Chatbot Arena. It offers unbiased comparisons of all major language models, testing both depth and br…
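One way to picture the "effortless expansion" idea is a registry of benchmark plug-ins whose normalized scores are averaged into one aggregate; the class and method names below are illustrative only, not the project's API:

```python
from abc import ABC, abstractmethod
from statistics import mean
from typing import Callable

# A model is treated here as a prompt -> completion callable (an assumption).
Model = Callable[[str], str]

class Benchmark(ABC):
    name: str = "unnamed"

    @abstractmethod
    def evaluate(self, model: Model) -> float:
        """Return a normalized score in [0, 1] for the given model."""

REGISTRY: list[Benchmark] = []

def register(benchmark: Benchmark) -> None:
    """Adding a new benchmark is a single registration call."""
    REGISTRY.append(benchmark)

def aggregate(model: Model) -> dict[str, float]:
    """Run every registered benchmark and report per-benchmark and mean scores."""
    scores = {b.name: b.evaluate(model) for b in REGISTRY}
    scores["mean"] = mean(scores.values())
    return scores
```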

ModelClash: Dynamic LLM Evaluation Through AI Duels

Hi! I've developed ModelClash, an open-source framework for LLM evaluation that could offer some potential advantages over static benchmarks:
- Automatic challenge generation, reducing manual effort
- Should scale with advancing model capabiliti…
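A minimal sketch of what one "duel" round might look like, assuming a creator model invents a challenge that both models then attempt and a judge decides correctness; the scoring rule and all names here are assumptions for illustration, not ModelClash's actual design:

```python
from typing import Callable

# A model is treated as a prompt -> text callable (an assumption).
Model = Callable[[str], str]

def duel_round(creator: Model, opponent: Model,
               judge: Callable[[str, str], bool]) -> dict[str, int]:
    """One duel: the creator invents a challenge, both models attempt it.

    `judge(challenge, answer)` is assumed to decide whether an answer is
    correct, e.g. by running generated test cases.
    """
    challenge = creator("Invent a hard but solvable problem and state it clearly.")
    creator_ok = judge(challenge, creator(challenge))
    opponent_ok = judge(challenge, opponent(challenge))

    scores = {"creator": 0, "opponent": 0}
    # Assumed scoring: the creator is rewarded only for challenges it can
    # solve itself but the opponent cannot; the opponent is rewarded for
    # solving the creator's challenge.
    if creator_ok and not opponent_ok:
        scores["creator"] += 1
    if opponent_ok:
        scores["opponent"] += 1
    return scores
```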