This is more of a general question for the entire community (developers, end users, curious individuals).
How do you see evals and benchmarking? Do they really factor into your decision to use a certain AI model? Are AI model releases (such as Llama 4 or Grok 3) over-optimizing for benchmark performance?
For people actively building or using AI products, what role do evals play? Do you rely on the same public benchmarks reported in release results, or do you run something else, like your own evals?
I see this being discussed more and more frequently when it comes to generative AI.
Would love to know your thoughts!