The ways we measure LLMs' abilities, and thereby predict their impact, are seriously flawed. Nearly every AI benchmark has significant shortcomings:
https://www.lesswrong.com/posts/aFW63qvHxDxg3J8ks/nobody-is-doing-ai-benchmarking-right