Introducing DeepWeb-Bench: A New Benchmark for Challenging Deep Research in Language Models

arXiv AI · Sixiong Xie, Zhuofan Shi, Haiyang Shen et al. · Friday, May 22, 2026

DeepWeb-Bench is a new deep research benchmark that evaluates frontier language models on demanding tasks requiring extensive evidence collection, cross-source reconciliation, and long-horizon multi-step derivation. It introduces four capability families (Retrieval, Derivation, Reasoning, and Calibration) and provides a detailed source-provenance record for each reference answer. An evaluation of nine frontier models finds that retrieval issues account for only 12-14% of errors, while over 70% of failures stem from derivation and calibration. The findings also indicate that strong and weak models fail in distinct ways and that models specialize significantly. The public release includes data, evaluation rubrics, and code for further research.
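
To make the error breakdown concrete, here is a minimal Python sketch of how per-family error shares could be tallied once each failure has been labeled with one of the four capability families. The function name `error_shares` and the toy labels are assumptions made for illustration; they are not taken from the benchmark's released code, and the toy counts are not the paper's results.

```python
from collections import Counter

# Capability families named in the summary above.
FAMILIES = ("Retrieval", "Derivation", "Reasoning", "Calibration")


def error_shares(error_labels):
    """Return each family's share of the labeled errors (fractions summing to 1)."""
    counts = Counter(error_labels)
    unknown = set(counts) - set(FAMILIES)
    if unknown:
        raise ValueError(f"unknown family labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        # No errors at all: every family has a zero share.
        return {family: 0.0 for family in FAMILIES}
    return {family: counts[family] / total for family in FAMILIES}


# Toy labels invented for illustration only; NOT results from the benchmark.
toy_labels = (
    ["Retrieval"] * 1
    + ["Derivation"] * 4
    + ["Reasoning"] * 2
    + ["Calibration"] * 3
)
shares = error_shares(toy_labels)

for family in FAMILIES:
    print(f"{family}: {shares[family]:.0%}")
```

A tally like this only says where errors land, not why; the summary's point is that the retrieval share is small compared with the combined derivation and calibration share.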

More Articles From This Day

Google I/O 2026 Unveils Gemini 3.5 Flash and New Multimodal Models

At Google I/O 2026, Google announced the launch of Gemini 3.5 Flash, a new model that merges frontier intelligence with action, available via Google Antigravity and the Gemini API. Gemini 3.5 Flash outperforms previous models in coding and agentic benchmarks, significantly reducing development and auditing times while lowering costs. The event also introduced Gemini Omni, a versatile model capable of generating video and other outputs from various inputs, enhancing content creation with an intuitive understanding of physics. Both models represent advancements in multimodal capabilities and efficiency, with Gemini Omni Flash now accessible to Google AI Plus, Pro, and Ultra subscribers.

OpenAI Plans IPO Filing for Potential $1 Trillion Valuation

Anthropic Partners with Microsoft for $30 Billion Azure Compute Agreement

Amazon Plans $150 Billion Investment in Data Centers to Support AI Demand

TfL Raises Concerns Over Robotaxis Amid Ministerial Bid Invitations

Equilibrium Reasoners: A New Paradigm for Scalable Reasoning Through Learning Attractors