WeSearch

You don't need all the LLM benchmarks

#machine learning · #benchmarks · #language models
⚡ TL;DR · AI summary

The article discusses the redundancy of various benchmarks used to evaluate language models, suggesting that many can be skipped without losing predictive power. It highlights that a small subset of benchmarks can effectively predict performance across a wide range of tasks. The author proposes a method for selecting these benchmarks based on statistical principles, emphasizing the importance of efficient benchmarking in model evaluation.

Original article
Smola
Opening excerpt (first ~120 words)

Every time a new model comes out, somebody runs it on MMLU (57 subjects), MTEB (56 tasks), HELM, the Open LLM Leaderboard, AlpacaEval, LiveBench, BigCodeBench, WildBench, Arena-Hard, MT-Bench, and a dozen others. That’s days of GPU time and a lot of human babysitting.

But if you’ve ever stared at a leaderboard for ten minutes you already know the dirty secret: the columns are wildly correlated. If a model is good at one math benchmark it’s good at all of them.

So how much of this can we just skip? A lot, as it turns out. On MMLU, 5 subjects out of 57 predict the remaining 52 with \(R^2 \approx 0.91\), across 5,452 models, with 10-fold cross-validation. The eigenspectrum of the score covariance tells the same story: two components capture 90% of the variance on MMLU, six on MTEB.

Excerpt limited to ~120 words for fair-use compliance. The full article is at Smola.
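
The two measurements in the excerpt (how many covariance components reach 90% of the variance, and how well a few columns predict the rest under cross-validation) can be illustrated with a small sketch. Everything here is an assumption made for illustration: the score matrix is synthetic, the function names (`make_scores`, `components_for_variance`, `cv_r2`, `greedy_select`) are hypothetical, and ordinary least squares with greedy forward selection stands in for whatever selection method the full article actually uses.

```python
import numpy as np

def make_scores(n_models=400, n_tasks=20, n_factors=2, noise=0.05, seed=0):
    """Synthetic (models x tasks) scores driven by a few latent skills (assumption)."""
    rng = np.random.default_rng(seed)
    skills = rng.normal(size=(n_models, n_factors))
    loadings = rng.normal(size=(n_factors, n_tasks))
    return skills @ loadings + noise * rng.normal(size=(n_models, n_tasks))

def components_for_variance(scores, target=0.90):
    """Smallest number of covariance eigen-components whose variance share >= target."""
    cov = np.cov(scores, rowvar=False)           # tasks x tasks covariance
    eigvals = np.linalg.eigvalsh(cov)[::-1]      # eigenvalues, largest first
    eigvals = np.clip(eigvals, 0.0, None)        # guard against tiny negatives
    share = np.cumsum(eigvals) / eigvals.sum()   # cumulative explained variance
    return int(np.searchsorted(share, target) + 1)

def cv_r2(scores, chosen, n_folds=10, seed=0):
    """Pooled out-of-fold R^2 for predicting the unchosen columns from the chosen ones."""
    rng = np.random.default_rng(seed)
    n = scores.shape[0]
    rest = [j for j in range(scores.shape[1]) if j not in chosen]
    preds = np.empty((n, len(rest)))
    for test_idx in np.array_split(rng.permutation(n), n_folds):
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        # Linear model with intercept, fitted on the training models only.
        x_train = np.column_stack([np.ones(train.sum()), scores[train][:, chosen]])
        coef, *_ = np.linalg.lstsq(x_train, scores[train][:, rest], rcond=None)
        x_test = np.column_stack([np.ones(len(test_idx)), scores[test_idx][:, chosen]])
        preds[test_idx] = x_test @ coef
    y = scores[:, rest]
    ss_res = ((y - preds) ** 2).sum()
    ss_tot = ((y - y.mean(axis=0)) ** 2).sum()
    return 1.0 - ss_res / ss_tot

def greedy_select(scores, k, n_folds=10):
    """Forward selection: repeatedly add the column that most improves cv_r2."""
    chosen = []
    for _ in range(k):
        candidates = [j for j in range(scores.shape[1]) if j not in chosen]
        best = max(candidates, key=lambda j: cv_r2(scores, chosen + [j], n_folds))
        chosen.append(best)
    return chosen

scores = make_scores()
print("components for 90% variance:", components_for_variance(scores))
subset = greedy_select(scores, k=3)
print("selected columns:", subset, "cross-validated R^2:", round(cv_r2(scores, subset), 3))
```

On real leaderboard data, `scores` would be replaced by the actual models-by-benchmarks matrix. The synthetic data is low-rank by construction, so the numbers it prints say nothing about MMLU or MTEB.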
