Signals

LiveCodeBench: Holistic and Contamination-Free Evaluation of Large Language Models for Code

November 2, 2024 - LLM Benchmarking
1 min

It appears LLMs have developed quite the appetite for LeetCode-style puzzles, becoming remarkably good at solving them. Given that most LLM coding benchmarks consist of similar puzzle-style problems, this overfitting is not that surprising. Unfortunately for actual software developers (who spend surprisingly little time inverting binary trees in production), this expertise is somewhat less useful.

Enter LiveCodeBench: a benchmark that continuously harvests fresh programming problems, ensuring LLMs can’t simply memorize their way to algorithmic enlightenment while leaving real-world programming challenges in the dark.
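The "contamination-free" part boils down to a simple idea: only score a model on problems published after its training cutoff, so it cannot have memorized them or their editorials. Here is a minimal sketch of that filtering step; the `Problem` record, dates, and field names are hypothetical, not LiveCodeBench's actual data format:

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Problem:
    # Hypothetical problem record, not LiveCodeBench's actual schema.
    title: str
    release_date: date  # when the problem first appeared on the contest platform
    tests: list[tuple[str, str]]  # (stdin, expected stdout) pairs


def contamination_free_subset(problems: list[Problem], model_cutoff: date) -> list[Problem]:
    """Keep only problems released after the model's training cutoff,
    so the model cannot have seen them during pretraining."""
    return [p for p in problems if p.release_date > model_cutoff]


# Example: a model trained up to 2023-09-01 is evaluated only on newer problems.
problems = [
    Problem("two-sum-variant", date(2023, 5, 14), [("1 2", "3")]),
    Problem("fresh-contest-problem", date(2024, 3, 2), [("5 7", "12")]),
]
eval_set = contamination_free_subset(problems, model_cutoff=date(2023, 9, 1))
print([p.title for p in eval_set])  # ['fresh-contest-problem']
```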

LiveCodeBench is a holistic and contamination-free benchmark for evaluating LLMs on code that continuously collects new problems over time. Notably, it also covers broader code-related capabilities, such as self-repair, code execution, and test output prediction, beyond mere code generation.
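
Those extra scenarios are largely reframings of the same problems. Test output prediction, for instance, asks the model to play interpreter: given a piece of code and an input, predict what it prints, and the prediction is checked against a real execution. A rough sketch of how such a check might look; the helper names and scoring setup here are illustrative, not the paper's actual harness:

```python
import subprocess
import sys


def run_solution(source: str, stdin: str) -> str:
    """Execute a candidate solution and capture its stdout as ground truth."""
    result = subprocess.run(
        [sys.executable, "-c", source],
        input=stdin,
        capture_output=True,
        text=True,
        timeout=10,
    )
    return result.stdout.strip()


def score_output_prediction(source: str, stdin: str, model_prediction: str) -> bool:
    # The model never runs the code; it only reads it and predicts the output,
    # which is then compared against the actual execution.
    return model_prediction.strip() == run_solution(source, stdin)


solution = "a, b = map(int, input().split())\nprint(a + b)"
print(score_output_prediction(solution, "5 7", "12"))  # True if the model predicted '12'
```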