MathDuels Introduces Self-Play Benchmark for Evaluating LLMs as Problem Creators and Solvers
Researchers have developed MathDuels, a self-play benchmark that assesses large language models (LLMs) as both problem authors and problem solvers. Traditional evaluations rely on fixed problem sets and have struggled to differentiate model performance. MathDuels generates math problems through a three-stage pipeline and uses a Rasch model to jointly estimate solver abilities and problem difficulties. Experiments with 19 advanced models indicate that the ability to create problems and the ability to solve them are partially decoupled, a distinction that conventional tests do not reveal. The benchmark's difficulty adapts as new models are introduced, and a public leaderboard tracks progress as models are released.
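For readers unfamiliar with the Rasch model mentioned in the summary, the following is a minimal sketch of how one might jointly estimate solver abilities and problem difficulties from pass/fail outcomes. It assumes the standard one-parameter form, in which the probability that solver i solves problem j is sigmoid(theta_i - b_j). The function name `fit_rasch`, the small L2 penalty, the gradient-ascent fitting loop, and the toy outcomes are all illustrative assumptions; the summary does not describe how MathDuels actually fits its model.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fit_rasch(outcomes, n_solvers, n_problems, steps=2000, lr=0.05, l2=0.01):
    """Hypothetical helper: penalized joint maximum likelihood for a Rasch model.

    outcomes: list of (solver_index, problem_index, solved) with solved in {0, 1}.
    Returns (theta, b): solver abilities and problem difficulties.
    """
    theta = [0.0] * n_solvers
    b = [0.0] * n_problems
    for _ in range(steps):
        # Gradient of the log-likelihood, with an assumed L2 penalty
        # that keeps estimates finite when a solver passes everything.
        g_theta = [-l2 * t for t in theta]
        g_b = [-l2 * d for d in b]
        for s, p, y in outcomes:
            resid = y - sigmoid(theta[s] - b[p])
            g_theta[s] += resid
            g_b[p] -= resid
        theta = [t + lr * g for t, g in zip(theta, g_theta)]
        b = [d + lr * g for d, g in zip(b, g_b)]
        # Only theta - b is identified, so shift both so that the mean
        # difficulty is zero; this leaves every predicted probability unchanged.
        mean_b = sum(b) / n_problems
        b = [d - mean_b for d in b]
        theta = [t - mean_b for t in theta]
    return theta, b

# Toy data (made up): 3 solvers attempt 3 problems.
toy_outcomes = [
    (0, 0, 1), (0, 1, 1), (0, 2, 1),  # solver 0 solves everything
    (1, 0, 1), (1, 1, 1), (1, 2, 0),  # solver 1 misses the last problem
    (2, 0, 1), (2, 1, 0), (2, 2, 0),  # solver 2 solves only the first
]
theta, b = fit_rasch(toy_outcomes, n_solvers=3, n_problems=3)
print("abilities:", [round(t, 2) for t in theta])
print("difficulties:", [round(d, 2) for d in b])
```

In this toy run the fitted abilities rank the solvers in the order of how many problems they solved, and the fitted difficulties rank the problems in the reverse order of how often they were solved. Because abilities and difficulties sit on one shared scale, the same fit can be repeated as new solvers or newly authored problems are added.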
More Articles From This Day
Google to Invest Up to $40 Billion in Anthropic
Google has announced an initial investment of $10 billion in Anthropic PBC, valuing the company at $350 billion. The tech giant is also considering an additional investment of $30 billion in the future. This deal was discussed by Bloomberg's Shirin Ghaffary with Ed Ludlow on 'Bloomberg Tech.'
