this post was submitted on 08 Nov 2024
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/artificial by /u/medi6 on 2024-11-07 15:19:03+00:00.


hey there!

With the recent explosion of open-source models and benchmarks, I noticed many newcomers struggling to make sense of it all. So I built a simple "model matchmaker" to help beginners understand what matters for different use cases.

TL;DR: After building two popular LLM price comparison tools (4,000+ users), WhatLLM and LLM API Showdown, I created something new: LLM Selector

✓  It’s a tool that helps you find the perfect open-source model for your specific needs.

✓  Currently analyzing 11 models across 12 benchmarks (and counting). 

While building the first two, I realized something: before thinking about providers or pricing, people need to find the right model first. With all the recent releases, choosing the right model for your specific use case has become surprisingly complex.

## The Benchmark puzzle

We've got metrics everywhere:

  • Technical: HumanEval, EvalPlus, MATH, API-Bank, BFCL
  • Knowledge: MMLU, GPQA, ARC, GSM8K
  • Communication: ChatBot Arena, MT-Bench, IF-Eval

For someone new to AI, it's not obvious which ones matter for their specific needs.

## A simple approach

Instead of diving into complex comparisons, the tool:

  1. Groups benchmarks by use case
  2. Weights primary metrics at 2x the secondary ones
  3. Adjusts for basic requirements (latency, context, etc.)
  4. Normalizes scores for easier comparison
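The four steps above can be sketched roughly like this. To be clear, the function, weight values, and normalization ranges below are my illustration of the idea, not the tool's actual code:

```python
# Toy sketch: normalize each benchmark to 0-1 using an assumed range,
# then take a weighted average with primary metrics counted twice.

def score_model(benchmarks, primary, secondary, ranges):
    """benchmarks: {name: raw score}; ranges: {name: (min, max)}."""
    def norm(name):
        lo, hi = ranges[name]
        return (benchmarks[name] - lo) / (hi - lo)

    weighted = [(norm(m), 2.0) for m in primary] + [(norm(m), 1.0) for m in secondary]
    total_weight = sum(w for _, w in weighted)
    return 100 * sum(v * w for v, w in weighted) / total_weight

# Made-up numbers purely for illustration:
scores = {"MMLU": 86.0, "ChatBot Arena": 1247, "MT-Bench": 8.9, "IF-Eval": 85.0}
ranges = {"MMLU": (0, 100), "ChatBot Arena": (800, 1400),
          "MT-Bench": (0, 10), "IF-Eval": (0, 100)}
print(score_model(scores, ["MMLU", "ChatBot Arena"], ["MT-Bench", "IF-Eval"], ranges))
```

Normalizing first matters because the raw scales differ wildly (percentages vs. ELO ratings); without it, the Arena number would swamp everything else.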

### Example: Creative Writing Use Case

Let's break down a real comparison:

Input:

  • Use Case: Content Generation
  • Requirement: Long Context Support

How the tool analyzes this:

  1. Primary Metrics (2x weight):
    • MMLU: shows depth of knowledge
    • ChatBot Arena: writing capability

  2. Secondary Metrics (1x weight):
    • MT-Bench: language quality
    • IF-Eval: following instructions

Top Results:

  1. Llama-3.1-70B (Score: 89.3)
    • MMLU: 86.0%
    • ChatBot Arena: 1247 ELO
    • Strength: balanced knowledge/creativity

  2. Gemma-2-27B (Score: 84.6)
    • MMLU: 75.2%
    • ChatBot Arena: 1219 ELO
    • Strength: efficient performance
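For intuition, here's what plugging just the two primary metrics into a toy version of that weighting looks like. The Arena normalization range is my assumption, and the real scores above also fold in the secondary metrics and requirement adjustments, so this won't reproduce 89.3 / 84.6 — but the ranking comes out the same:

```python
# Toy comparison on the two primary metrics only; the Arena ELO
# range used for normalization is an assumption, not the tool's.
ARENA_MIN, ARENA_MAX = 800, 1400

def primary_score(mmlu_pct, arena_elo):
    mmlu = mmlu_pct / 100
    arena = (arena_elo - ARENA_MIN) / (ARENA_MAX - ARENA_MIN)
    # Both metrics are primary (2x weight), so the equal weights cancel out.
    return 100 * (mmlu + arena) / 2

results = {
    "Llama-3.1-70B": primary_score(86.0, 1247),
    "Gemma-2-27B": primary_score(75.2, 1219),
}
for name, s in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {s:.1f}")
```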

## Important Notes

  • V1 with limited models (more coming soon) 

  • Benchmarks ≠ real-world performance (and this is an example calculation)

  • Your results may vary 

  • Experienced users: consider this a starting point 

  • Open source models only for now

  • Just one API provider added for now; I'll add the ones from my previous apps and combine them all

## Try It Out

🔗 

Built with v0 + Vercel + Claude

Share your experience:

  • Which models should I add next?

  • What features would help most?

  • How do you currently choose models?
