hey there!
With the recent explosion of open-source models and benchmarks, I noticed many newcomers struggling to make sense of it all. So I built a simple "model matchmaker" to help beginners understand what matters for different use cases.
TL;DR: After building two popular LLM price comparison tools (4,000+ users), WhatLLM and LLM API Showdown, I created something new: LLM Selector
✓ It’s a tool that helps you find the perfect open-source model for your specific needs.
✓ Currently analyzing 11 models across 12 benchmarks (and counting).
While building the first two, I realized something: before thinking about providers or pricing, people need to find the right model first. With all the recent releases, choosing the right model for your specific use case has become surprisingly complex.
## The Benchmark puzzle
We've got metrics everywhere:
- Technical: HumanEval, EvalPlus, MATH, API-Bank, BFCL
- Knowledge: MMLU, GPQA, ARC, GSM8K
- Communication: ChatBot Arena, MT-Bench, IF-Eval
For someone new to AI, it's not obvious which ones matter for their specific needs.
## A simple approach
Instead of diving into complex comparisons, the tool:
- Groups benchmarks by use case
- Weighs primary metrics 2x more than secondary ones
- Adjusts for basic requirements (latency, context, etc.)
- Normalizes scores for easier comparison (rough sketch of the scoring logic below)
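To make the weighting idea concrete, here's a minimal Python sketch of how a normalized, weighted average like this could work. The function names, the min-max normalization, and the way requirements would be layered on top are my assumptions for illustration, not the tool's actual code.

```python
# Hypothetical sketch: normalize each benchmark to a 0-100 scale, then take a
# weighted average where primary metrics count twice as much as secondary ones.

def normalize(value, lo, hi):
    """Min-max normalize a raw benchmark value to 0-100 (assumed scheme)."""
    return 100 * (value - lo) / (hi - lo)

def composite_score(primary, secondary):
    """primary / secondary: lists of already-normalized scores (0-100).
    Primary metrics get 2x weight, secondary metrics 1x."""
    weighted_sum = 2 * sum(primary) + sum(secondary)
    total_weight = 2 * len(primary) + len(secondary)
    return weighted_sum / total_weight
```

Requirement filters (minimum context length, latency) would then sit on top of a composite like this, either excluding models outright or nudging their scores.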
## Example: Creative Writing Use Case
Let's break down a real comparison:
Input:
- Use Case: Content Generation
- Requirement: Long Context Support
How the tool analyzes this (a worked illustration follows the results below):
- Primary Metrics (2x weight):
  - MMLU: Shows depth of knowledge
  - ChatBot Arena: Writing capability
- Secondary Metrics (1x weight):
  - MT-Bench: Language quality
  - IF-Eval: Following instructions
Top Results:
- Llama-3.1-70B (Score: 89.3)
  - MMLU: 86.0%
  - ChatBot Arena: 1247 ELO
  - Strength: Balanced knowledge/creativity
- Gemma-2-27B (Score: 84.6)
  - MMLU: 75.2%
  - ChatBot Arena: 1219 ELO
  - Strength: Efficient performance
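To show the 2x/1x mechanics with numbers, the snippet below plugs in the Llama-3.1-70B figures from above plus made-up MT-Bench and IF-Eval values. The normalization ranges are invented, so it will not reproduce the tool's 89.3; it only illustrates how primary and secondary metrics combine.

```python
# Illustrative only: normalization bounds and secondary-metric values are invented,
# so this will not reproduce the tool's 89.3 for Llama-3.1-70B.
mmlu = 86.0                                    # already on a 0-100 scale
arena = 100 * (1247 - 1000) / (1300 - 1000)    # assumed ELO range for scaling
mt_bench = 100 * 8.9 / 10                      # hypothetical MT-Bench score
if_eval = 85.0                                 # hypothetical IF-Eval score

primary = [mmlu, arena]                        # 2x weight
secondary = [mt_bench, if_eval]                # 1x weight
score = (2 * sum(primary) + sum(secondary)) / (2 * len(primary) + len(secondary))
print(round(score, 1))                         # ~85.1 with these made-up inputs
```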
## Important Notes
- V1 with limited models (more coming soon)
- Benchmarks ≠ real-world performance (and this is an example calculation)
- Your results may vary
- Experienced users: consider this a starting point
- Open-source models only for now
- Just added one API provider for now; will add the ones from my previous apps and combine them all
## Try It Out
🔗
Built with v0 + Vercel + Claude
Share your experience:
- Which models should I add next?
- What features would help most?
- How do you currently choose models?