Signals

AI groups rush to redesign model testing and create new benchmarks

November 11, 2024 - LLM Benchmarking
1 min read

New AI models are rapidly saturating existing benchmarks, highlighting the need for new ones.

Companies evaluate AI models using teams of staff and outside researchers. These evaluations rely on standardised tests, known as benchmarks, that assess a model's abilities and compare its performance against other groups' systems or earlier versions of the same model. However, recent advances in AI technology mean many of the newest models score close to or above 90 per cent accuracy on existing tests, highlighting the need for new benchmarks. “The pace of the industry is extremely fast. We are now starting to saturate our ability to measure some of these systems [and as an industry] it is becoming more and more difficult to evaluate [them],” said Ahmad Al-Dahle, generative AI lead at Meta.

See also: LiveCodeBench


Related

LiveCodeBench: Holistic and Contamination-Free Evaluation of Large Language Models for Code

It appears LLMs have developed quite the appetite for LeetCode-style puzzles, becoming remarkably good at solving them. Given that most LLM coding benchmarks consist of similar puzzle-based problems, this overfitting is not that surprising. Unfortunately for actual software developers, who spend surprisingly little time reversing binary trees in production, this expertise is somewhat less useful.

Enter LiveCodeBench: a benchmark that continuously harvests fresh programming problems, ensuring LLMs can’t simply memorize their way to algorithmic enlightenment while leaving real-world programming challenges in the dark.

LiveCodeBench is a holistic and contamination-free evaluation benchmark of LLMs for code that continuously collects new problems over time. Beyond mere code generation, LiveCodeBench also covers broader code-related capabilities such as self-repair, code execution, and test output prediction.
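
The contamination-free part hinges on a simple idea: only evaluate a model on problems published after its training-data cutoff, so the answers cannot already sit in its training set. The sketch below illustrates that filtering step under stated assumptions; the `Problem` fields, the sample data, and the `contamination_free_subset` helper are hypothetical and not LiveCodeBench's actual API.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Problem:
    title: str
    release_date: date  # date the problem was first published on the contest platform
    difficulty: str


# Hypothetical problem pool; LiveCodeBench sources problems from competitive
# programming platforms as new contests are released.
PROBLEMS = [
    Problem("two-sum-variant", date(2023, 6, 1), "easy"),
    Problem("interval-scheduling", date(2024, 3, 15), "medium"),
    Problem("grid-shortest-path", date(2024, 9, 2), "hard"),
]


def contamination_free_subset(problems: list[Problem], model_cutoff: date) -> list[Problem]:
    """Keep only problems released after the model's training-data cutoff,
    so the model cannot have memorized them during pretraining."""
    return [p for p in problems if p.release_date > model_cutoff]


# Example: a model whose training data ends in April 2024 is evaluated only
# on problems published after that date.
eval_set = contamination_free_subset(PROBLEMS, model_cutoff=date(2024, 4, 30))
for p in eval_set:
    print(p.title, p.release_date)
```

Because the problem pool keeps growing, the same filter can be re-applied per model, giving each one an evaluation window that starts after its own cutoff.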

LLM Benchmarking
November 2, 2024