- Anthropic and Inflection AI release competitive generative models.
- Current benchmarks fail to reflect the real-world use of AI models.
- GPQA and HellaSwag were criticized for their lack of real-world applicability.
- Evaluation crises in the industry due to outdated benchmarks.
- MMLU's relevance was questioned due to the potential for rote memorization.
Read more:
https://techcrunch.com/2024/03/07/heres-why-most-ai-benchmarks-tell-us-so-little/
[link] [comments]