Commentary

AI groups rush to redesign model testing and create new benchmarks

November 11, 2024 - LLM Benchmarking
1 min

The newest AI models are saturating existing benchmarks, underscoring the need for new ones.

Companies evaluate AI models using teams of staff and outside researchers. These standardised tests, known as benchmarks, assess models' abilities and compare their performance against rival groups' systems and earlier versions. However, recent advances in AI mean many of the newest models score close to or above 90 per cent accuracy on existing tests, leaving little room to tell systems apart. “The pace of the industry is extremely fast. We are now starting to saturate our ability to measure some of these systems [and as an industry] it is becoming more and more difficult to evaluate [them],” said Ahmad Al-Dahle, generative AI lead at Meta.
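To make the saturation problem concrete, here is a minimal Python sketch, not drawn from the article and using entirely hypothetical scores, of why near-90-per-cent accuracies undermine a benchmark: the gaps between top models shrink to roughly the size of the benchmark's own statistical noise, so the test can no longer reliably rank them.

```python
# Toy illustration of benchmark saturation. All model names and scores
# below are hypothetical, chosen only to show the effect.
import math

def accuracy_noise(acc: float, n_questions: int) -> float:
    """Standard error of an accuracy estimate from n_questions items."""
    return math.sqrt(acc * (1 - acc) / n_questions)

N = 1000  # hypothetical benchmark size

# An older model versus two near-saturated new models (hypothetical).
scores = {"model_2022": 0.72, "model_A": 0.91, "model_B": 0.93}

for name, acc in scores.items():
    print(f"{name}: {acc:.0%} ± {accuracy_noise(acc, N):.1%}")

# The 2-point gap between the two new models is about the same size as
# their combined standard errors (~1.7%), so the benchmark can barely
# distinguish them: it is "saturated" for ranking purposes.
gap = scores["model_B"] - scores["model_A"]
noise = accuracy_noise(scores["model_A"], N) + accuracy_noise(scores["model_B"], N)
print(f"gap = {gap:.1%}, combined noise ≈ {noise:.1%}")
```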

See also: LiveCodeBench