You don't need all the LLM benchmarks
The article discusses the redundancy of various benchmarks used to evaluate language models, suggesting that many can be skipped without losing predictive power. It highlights that a small subset of benchmarks can effectively predict performance across a wide range of tasks. The author proposes a method for selecting these benchmarks based on statistical principles, emphasizing the importance of efficient benchmarking in model evaluation.
- Many language model evaluations rely on numerous benchmarks that are often highly correlated.
- A small number of subjects can predict the performance of a larger set of benchmarks with high accuracy.
- The article introduces a statistical approach to select the most informative benchmarks for evaluation.
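To make the second and third points concrete, here is a minimal sketch of predicting held-out benchmark columns from a few kept ones. It is not the article's method, which this summary does not describe. The sketch assumes ordinary least squares as the predictor, greedy forward selection as the subset picker, and a synthetic low-rank score matrix in place of real leaderboard data. All function names (`make_synthetic_scores`, `cv_r2`, `greedy_select`) are hypothetical.

```python
import numpy as np

def make_synthetic_scores(n_models=300, n_subjects=57, n_factors=2, noise=0.05, seed=0):
    """Synthetic stand-in for a leaderboard (assumption): a few latent
    skill factors drive every subject score, plus independent noise."""
    rng = np.random.default_rng(seed)
    skills = rng.normal(size=(n_models, n_factors))
    loadings = rng.normal(size=(n_factors, n_subjects))
    return skills @ loadings + noise * rng.normal(size=(n_models, n_subjects))

def fit_predict(X_train, Y_train, X_test):
    """Ordinary least squares with an intercept, all targets fitted at once."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    coef, *_ = np.linalg.lstsq(A_train, Y_train, rcond=None)
    return A_test @ coef

def cv_r2(scores, kept, n_folds=10, seed=0):
    """Pooled cross-validated R^2 for predicting every subject not in
    `kept` from the subjects in `kept`, with folds taken over models."""
    rest = [j for j in range(scores.shape[1]) if j not in kept]
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(scores)), n_folds)
    preds = np.empty((len(scores), len(rest)))
    for test_idx in folds:
        train_mask = np.ones(len(scores), dtype=bool)
        train_mask[test_idx] = False
        train = scores[train_mask]
        preds[test_idx] = fit_predict(
            train[:, kept], train[:, rest], scores[test_idx][:, kept]
        )
    target = scores[:, rest]
    ss_res = ((target - preds) ** 2).sum()
    ss_tot = ((target - target.mean(axis=0)) ** 2).sum()
    return 1.0 - ss_res / ss_tot

def greedy_select(scores, k=5, n_folds=10):
    """Forward selection: repeatedly add the subject that most improves
    the cross-validated R^2 on the remaining subjects."""
    kept = []
    for _ in range(k):
        candidates = [j for j in range(scores.shape[1]) if j not in kept]
        best = max(candidates, key=lambda j: cv_r2(scores, kept + [j], n_folds))
        kept.append(best)
    return kept

scores = make_synthetic_scores()
kept = greedy_select(scores, k=5)
print("kept subjects:", kept)
print("cross-validated R^2 on the rest:", round(cv_r2(scores, kept), 3))
```

Because the same cross-validation folds are used both to choose the subset and to score it, the printed value is optimistic. An honest estimate would run the selection inside each training fold, or score the chosen subset on models that were never used for selection.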
Opening excerpt (first ~120 words)
Every time a new model comes out, somebody runs it on MMLU (57 subjects), MTEB (56 tasks), HELM, the Open LLM Leaderboard, AlpacaEval, LiveBench, BigCodeBench, WildBench, Arena-Hard, MT-Bench, and a dozen others. That’s days of GPU time and a lot of human babysitting. But if you’ve ever stared at a leaderboard for ten minutes, you already know the dirty secret: the columns are wildly correlated. If a model is good at one math benchmark, it’s good at all of them.

So how much of this can we just skip? A lot, as it turns out. On MMLU, 5 subjects out of 57 predict the remaining 52 with \(R^2 \approx 0.91\), across 5,452 models, with 10-fold cross-validation. The eigenspectrum of the score covariance tells the same story: two components capture 90% of the variance on MMLU, six on MTEB.
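The eigenspectrum claim can be checked on any models-by-subjects score matrix. The sketch below is an illustration, not the article's code: it counts how many principal components of the score covariance are needed to reach a variance threshold. The function name `components_for_variance` and the synthetic three-factor matrix are assumptions. Real leaderboard scores would replace `scores`.

```python
import numpy as np

def components_for_variance(scores, threshold=0.90):
    """Return (k, explained), where k is the smallest number of principal
    components whose eigenvalues sum to at least `threshold` of the total
    variance, and `explained` is the cumulative explained-variance curve."""
    cov = np.cov(scores, rowvar=False)           # subjects-by-subjects covariance
    eigvals = np.linalg.eigvalsh(cov)[::-1]      # eigenvalues, largest first
    eigvals = np.clip(eigvals, 0.0, None)        # drop tiny negative round-off
    explained = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.searchsorted(explained, threshold)) + 1
    return k, explained

# Synthetic stand-in (assumption): 3 latent factors, 57 subjects, mild noise.
rng = np.random.default_rng(0)
skills = rng.normal(size=(1000, 3))
loadings = rng.normal(size=(3, 57))
scores = skills @ loadings + 0.1 * rng.normal(size=(1000, 57))

k, explained = components_for_variance(scores, threshold=0.90)
print("components needed for 90% of the variance:", k)
print("cumulative variance, first 5 components:", np.round(explained[:5], 3))
```

A small `k` relative to the number of columns means the columns are strongly correlated, which is why a few of them can stand in for the rest.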
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at Smola.