Researchers have introduced Claw-Eval-Live, a live benchmark that assesses how well LLM agents manage evolving workflows across software and business services. Unlike traditional benchmarks built on fixed tasks, Claw-Eval-Live incorporates a refreshable signal layer that is updated from public workflow demand. The benchmark includes 105 controlled tasks and evaluates 13 advanced models; the top-performing model completes only 66.7% of the tasks. Challenges persist in HR, management, and multi-system workflows, indicating that reliable workflow automation is still a significant hurdle for current models. The study argues that effective evaluation requires dual grounding: in external demand and in verifiable agent actions.
Introducing Claw-Eval-Live: A Benchmark for Evaluating Workflow Agents in Real-Time
