Introducing Benchmark Agent: A Revolutionary System for Autonomous Benchmark Construction
Researchers have developed Benchmark Agent, an autonomous system that constructs benchmarks for evaluating Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs). The system targets two problems: benchmark creation is labor-intensive, and model performance on existing benchmarks has saturated. Benchmark Agent automates the entire construction process, from analyzing user queries to ensuring data quality. It has been used to create 15 diverse benchmarks across a range of evaluation scenarios, including text and multimodal understanding. Experiments indicate that Benchmark Agent can produce high-quality benchmarks with minimal human input, and the resulting benchmarks show that current models struggle with specific domain-reasoning tasks. The framework and its findings will be made publicly available to the research community.
