Introducing DeepWeb-Bench: A New Benchmark for Challenging Deep Research in Language Models
DeepWeb-Bench is a new deep research benchmark that evaluates frontier language models on demanding tasks requiring extensive evidence collection, cross-source reconciliation, and long-horizon multi-step derivation. It introduces four capability families (Retrieval, Derivation, Reasoning, and Calibration) and provides a detailed source-provenance record for each reference answer. An evaluation of nine frontier models finds that retrieval issues account for only 12-14% of errors, while over 70% of failures stem from derivation and calibration. The findings also indicate that strong and weak models fail in different ways and that individual models are markedly specialized. The public release includes the data, evaluation rubrics, and code for further research.
