hey there!
With the recent explosion of open-source models and benchmarks, I noticed many newcomers struggling to make sense of it all. So I built a simple "model matchmaker" to help beginners understand what matters for different use cases.
TL;DR: After building two popular LLM price comparison tools (4,000+ users), WhatLLM and LLM API Showdown, I created something new: LLM Selector
✓ It’s a tool that helps you find the perfect open-source model for your specific needs.
✓ Currently analyzing 11 models across 12 benchmarks (and counting).
While building the first two, I realized something: before thinking about providers or pricing, people need to find the right model first. With all the recent releases, choosing the right model for your specific use case has become surprisingly complex.
## The Benchmark puzzle
We've got metrics everywhere:
- Technical: HumanEval, EvalPlus, MATH, API-Bank, BFCL
- Knowledge: MMLU, GPQA, ARC, GSM8K
- Communication: ChatBot Arena, MT-Bench, IF-Eval
For someone new to AI, it's not obvious which ones matter for their specific needs.
## A simple approach
Instead of diving into complex comparisons, the tool:
- Groups benchmarks by use case
- Weighs primary metrics 2x more than secondary ones
- Adjusts for basic requirements (latency, context, etc.)
- Normalizes scores for easier comparison (rough sketch of the scoring logic below)
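To make the weighting idea concrete, here's a minimal Python sketch of how a normalized, weighted average like this could work. The function names, the min-max normalization, and the way requirements would be layered on top are my assumptions for illustration, not the tool's actual code.

```python
# Hypothetical sketch: normalize each benchmark to a 0-100 scale, then take a
# weighted average where primary metrics count twice as much as secondary ones.

def normalize(value, lo, hi):
    """Min-max normalize a raw benchmark value to 0-100 (assumed scheme)."""
    return 100 * (value - lo) / (hi - lo)

def composite_score(primary, secondary):
    """primary / secondary: lists of already-normalized scores (0-100).
    Primary metrics get 2x weight, secondary metrics 1x."""
    weighted_sum = 2 * sum(primary) + sum(secondary)
    total_weight = 2 * len(primary) + len(secondary)
    return weighted_sum / total_weight
```

Requirement filters (minimum context length, latency) would then sit on top of a composite like this, either excluding models outright or nudging their scores.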
## Example: Creative Writing Use Case
Let's break down a real comparison:
Input:
- Use Case: Content Generation
- Requirement: Long Context Support
How the tool analyzes this (a worked illustration follows the results below):
- Primary Metrics (2x weight):
  - MMLU: Shows depth of knowledge
  - ChatBot Arena: Writing capability
- Secondary Metrics (1x weight):
  - MT-Bench: Language quality
  - IF-Eval: Following instructions
Top Results:
- Llama-3.1-70B (Score: 89.3)
  - MMLU: 86.0%
  - ChatBot Arena: 1247 ELO
  - Strength: Balanced knowledge/creativity
- Gemma-2-27B (Score: 84.6)
  - MMLU: 75.2%
  - ChatBot Arena: 1219 ELO
  - Strength: Efficient performance
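To show the 2x/1x mechanics with numbers, the snippet below plugs in the Llama-3.1-70B figures from above plus made-up MT-Bench and IF-Eval values. The normalization ranges are invented, so it will not reproduce the tool's 89.3; it only illustrates how primary and secondary metrics combine.

```python
# Illustrative only: normalization bounds and secondary-metric values are invented,
# so this will not reproduce the tool's 89.3 for Llama-3.1-70B.
mmlu = 86.0                                    # already on a 0-100 scale
arena = 100 * (1247 - 1000) / (1300 - 1000)    # assumed ELO range for scaling
mt_bench = 100 * 8.9 / 10                      # hypothetical MT-Bench score
if_eval = 85.0                                 # hypothetical IF-Eval score

primary = [mmlu, arena]                        # 2x weight
secondary = [mt_bench, if_eval]                # 1x weight
score = (2 * sum(primary) + sum(secondary)) / (2 * len(primary) + len(secondary))
print(round(score, 1))                         # ~85.1 with these made-up inputs
```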
## Important Notes
- V1 with limited models (more coming soon)
- Benchmarks ≠ real-world performance (and this is an example calculation)
- Your results may vary
- Experienced users: consider this a starting point
- Open-source models only for now
- Just added one API provider for now; will add the ones from my previous apps and combine them all
## Try It Out
🔗
Built with v0 + Vercel + Claude
Share your experience:
- Which models should I add next?
- What features would help most?
- How do you currently choose models?