LLM overkill is real: I analyzed 12 benchmarks to find the right-sized model for each use case πŸ€–

Hey there!

With the recent explosion of open-source models and benchmarks, I noticed many newcomers struggling to make sense of it all. So I built a simple "model matchmaker" to help beginners understand what matters for different use cases.

TL;DR: After building two popular LLM price comparison tools (4,000+ users), WhatLLM and LLM API Showdown, I created something new: LLM Selector.

βœ“ It’s a tool that helps you find the perfect open-source model for your specific needs.
βœ“ Currently analyzing 11 models across 12 benchmarks (and counting).

While building the first two, I realized something: before thinking about providers or pricing, people need to find the right model. With all the recent releases, choosing the right model for your specific use case has become surprisingly complex.

## The benchmark puzzle

We've got metrics everywhere:

  • Technical: HumanEval, EvalPlus, MATH, API-Bank, BFCL
  • Knowledge: MMLU, GPQA, ARC, GSM8K
  • Communication: ChatBot Arena, MT-Bench, IF-Eval

For someone new to AI, it's not obvious which ones matter for their specific needs.

## A simple approach

Instead of diving into complex comparisons, the tool does four things (sketched in code right after this list):

  1. Groups benchmarks by use case
  2. Weighs primary metrics 2x more than secondary ones
  3. Adjusts for basic requirements (latency, context, etc.)
  4. Normalizes scores for easier comparison
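
Roughly, the scoring boils down to a weighted average over normalized benchmark scores. Here's a minimal sketch of that idea in TypeScript. The benchmark grouping, the 2x/1x weights, and the min-max normalization are assumptions for illustration, not the tool's exact internals:

```typescript
// Minimal sketch of the scoring idea. The grouping, weights, and min-max
// normalization here are illustrative assumptions, not the tool's exact code.

type BenchmarkScores = Record<string, number>; // benchmark name -> raw score

interface UseCaseProfile {
  primary: string[];   // weighted 2x
  secondary: string[]; // weighted 1x
}

// Assumed grouping for the "Content Generation" use case.
const contentGeneration: UseCaseProfile = {
  primary: ["MMLU", "ChatBot Arena"],
  secondary: ["MT-Bench", "IF-Eval"],
};

// Min-max normalize a benchmark across the compared models so that
// percentages (e.g. MMLU) and ELO ratings (e.g. ChatBot Arena) share a 0-100 scale.
function normalize(models: BenchmarkScores[], benchmark: string, score: number): number {
  const values = models.map((m) => m[benchmark]).filter((v) => v !== undefined);
  const min = Math.min(...values);
  const max = Math.max(...values);
  return max === min ? 100 : ((score - min) / (max - min)) * 100;
}

// Weighted average: primary metrics count twice as much as secondary ones.
function scoreModel(model: BenchmarkScores, all: BenchmarkScores[], profile: UseCaseProfile): number {
  let total = 0;
  let weightSum = 0;
  for (const [benchmarks, weight] of [[profile.primary, 2], [profile.secondary, 1]] as const) {
    for (const b of benchmarks) {
      if (model[b] === undefined) continue; // skip benchmarks the model wasn't evaluated on
      total += weight * normalize(all, b, model[b]);
      weightSum += weight;
    }
  }
  return weightSum === 0 ? 0 : total / weightSum;
}
```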

Example: Creative Writing Use Case

Let's break down a real comparison; a usage sketch of the scoring follows the results.

Input:
  • Use Case: Content Generation
  • Requirement: Long Context Support

How the tool analyzes this:

1. Primary Metrics (2x weight):
   • MMLU: depth of knowledge
   • ChatBot Arena: writing capability
2. Secondary Metrics (1x weight):
   • MT-Bench: language quality
   • IF-Eval: instruction following

Top Results:

1. Llama-3.1-70B (Score: 89.3)
   • MMLU: 86.0% • ChatBot Arena: 1247 ELO • Strength: balanced knowledge/creativity
2. Gemma-2-27B (Score: 84.6)
   • MMLU: 75.2% • ChatBot Arena: 1219 ELO • Strength: efficient performance
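
Continuing the sketch above, here's how that scoring function could be fed the numbers from this comparison. The MT-Bench and IF-Eval values below are made-up placeholders, and the real tool normalizes across all 11 models, so the output won't reproduce the 89.3 / 84.6 above exactly:

```typescript
// Usage sketch with the numbers from this example. MT-Bench and IF-Eval values
// are placeholders; the real tool scores across the full model set, not just these two.
const llama31_70b: BenchmarkScores = { "MMLU": 86.0, "ChatBot Arena": 1247, "MT-Bench": 8.9, "IF-Eval": 87.5 };
const gemma2_27b: BenchmarkScores = { "MMLU": 75.2, "ChatBot Arena": 1219, "MT-Bench": 8.6, "IF-Eval": 80.0 };

const compared = [llama31_70b, gemma2_27b];
for (const model of compared) {
  console.log(scoreModel(model, compared, contentGeneration).toFixed(1));
}
```

Note that min-max normalization over just two models degenerates to 0 or 100 per benchmark, which is why the actual scores come from comparing the full model set.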

## Important Notes

- V1 with limited models (more coming soon)
- Benchmarks β‰  real-world performance (and this is an example calculation)
- Your results may vary
- Experienced users: consider this a starting point
- Open-source models only for now
- Just one API provider for now; I'll add the ones from my previous apps and combine them all

## Try It Out

πŸ”— https://llmselector.vercel.app/

Built with v0 + Vercel + Claude

Share your experience:
- Which models should I add next?
- What features would help most?
- How do you currently choose models?

submitted by /u/medi6