Introducing Claw-Eval-Live: A Benchmark for Evaluating Workflow Agents in Real-Time

Researchers have introduced Claw-Eval-Live, a live benchmark that assesses how well LLM agents manage evolving workflows across software and business services. Unlike traditional benchmarks built on fixed tasks, Claw-Eval-Live includes a refreshable signal layer that updates based on public workflow demand. The benchmark comprises 105 controlled tasks and evaluates 13 advanced models; the top-performing model completes only 66.7% of the tasks. Challenges persist in HR, management, and multi-system workflows, indicating that reliable workflow automation remains a significant hurdle for current models. The study argues that effective evaluation requires dual grounding: in external demand and in verifiable agent actions.

AI, Military, NVIDIA, Microsoft

Pentagon Secures New Military AI Contracts with Nvidia, Microsoft, and Amazon

The Pentagon has signed new contracts with technology giants Nvidia, Microsoft, and Amazon, following a recent clash with Anthropic regarding the use of its AI model, Claude. These agreements underscore the ongoing collaboration between the military and leading tech firms in advancing artificial intelligence capabilities for defense applications.

Database, AI Agents, Postgres, Codex

Introducing Ghost: The First Database Built for AI Agents

Ghost, a new Postgres database platform built for AI agents, makes it easy to create, fork, inspect, query, manipulate, and delete databases. Described as 'agent-first,' Ghost runs on cloud infrastructure, which suits it to testing, prototyping, and disposable database environments, in contrast to traditional managed databases, which are typically designed for long-term use. Through its built-in MCP server, the platform gives AI tools such as Codex and Claude Code direct database management capabilities. The article also explains how to install Ghost and use it with AI coding agents, including command-line instructions for different operating systems.

Towards Data Science

Robotics, AI, Humanoid Technology, Meta

Meta Acquires Robotics AI Firm to Advance Humanoid Technology Development

Meta has acquired a robotics AI company as part of its ongoing effort to develop humanoid technology. The acquisition is intended to strengthen Meta's robotics capabilities and signals the company's commitment to artificial intelligence and robotics. It is expected to bolster Meta's research and development of more sophisticated humanoid robots, in line with its broader vision of integrating AI into a range of applications.

Bloomberg Technology

OpenAI, Startup Funding, Product Launch

OpenAI CFO Reports Strong Demand for Company Products

OpenAI Chief Financial Officer Sarah Friar stated that the company is successfully meeting its objectives and is experiencing a significant surge in demand for its products, describing it as a 'vertical wall of demand.' Bloomberg's Shirin Ghaffary discussed these remarks with Caroline Hyde and Ed Ludlow on the program 'Bloomberg Tech.'

Bloomberg Technology

AI, Data Management, Scientific Discovery, Energy

Scale AI Partners with DOE to Propel Genesis Mission for Scientific Discovery

Scale AI has formalized a partnership with the U.S. Department of Energy (DOE) through a Memorandum of Understanding (MOU) to support the Genesis Mission, an initiative aimed at enhancing scientific discovery using advanced AI and computing. The Genesis Mission seeks to create an integrated platform that leverages federal datasets for innovation and scientific advancement. The collaboration focuses on addressing the 'data bottleneck' by improving the usability and accessibility of vast datasets from the national labs, ensuring they are accurate, complete, and secure for researchers and AI applications. This initiative is pivotal for enhancing energy independence and advancing national security.

Scale AI Blog

LLM, Reinforcement Learning, NLP, Safety

Advancing Neuro-symbolic Causal Rule Synthesis and Verification for Safety-critical AI Systems

This paper addresses the limitations of rule-based systems in safety-critical domains by extending a neuro-symbolic causal framework that integrates first-order logic and deep reinforcement learning. The authors introduce a meta-level layer to enhance scalability and mitigate goal misspecification through a Goal/Rule Synthesizer and a Rule Verification Engine. The synthesis process utilizes large language models to decompose goals, consolidate semantics, and translate them into formal rules. The verification process checks for syntax, logical consistency, and safety before integrating the rules into a knowledge base. Evaluation in autonomous driving scenarios demonstrates the framework's ability to derive minimal rule sets based on human-specified goals, supporting a traceable synthesis grounded in legal and safety principles.

arXiv AI