Current AI Models Achieve Below 1% on ARC-AGI-3 Benchmark

Recent evaluations show that the latest AI models, including GPT-5.5 and Opus 4.7, score below 1% on the ARC-AGI-3 benchmark: GPT-5.5 reaches 0.43% and Opus 4.7 reaches 0.18%. The analysis identifies three main failure modes: a true local effect with a false world model, an incorrect level of abstraction from the training data, and a failure to reinforce the reward despite solving the level. Further insights can be found in the full analysis linked in the discussion.

Generative AI, OpenAI, Startup Funding, AI Safety

Elon Musk Claims Deception in OpenAI Trial, Warns of AI Threats

In the ongoing trial between Elon Musk and OpenAI, Musk accused CEO Sam Altman and President Greg Brockman of misleading him into funding the company, claiming he provided $38 million to support a nonprofit aimed at benefiting humanity. He expressed concerns that AI could pose existential threats, referencing his own AI company, xAI, which utilizes OpenAI's models. Musk is seeking to oust Altman and Brockman and revert OpenAI to its original nonprofit status. The trial's outcome could significantly impact OpenAI's anticipated IPO, while xAI is projected to go public as part of Musk's SpaceX with a target valuation of $1.75 trillion. Musk's testimony emphasized his commitment to AI safety, while OpenAI's legal team argued that his motives were competitive rather than altruistic.

AI Benchmarks, Model Auditing, Generative AI, Open Source

In-Depth Analysis of Frontier Model Failure Modes Revealed in ARC-AGI-3 Testing

A new analysis released on the ARC-AGI-3 blog examines the failure modes of frontier AI models, specifically OpenAI's GPT-5.5 and Anthropic's Opus 4.7, through their performance in 160 challenging environments. The study highlights the models' reasoning processes and identifies three common failure modes observed during testing. The environments were designed to isolate abstract reasoning without relying on cultural knowledge, requiring models to adapt to novel situations. The findings indicate that while the models can observe local effects, they struggle to integrate these observations into a coherent world model, leading to performance failures. The analysis package has been open-sourced.

ARC Prize

AI, Generative AI, Defense, NLP

Pentagon Signs AI Deployment Agreements with Nvidia, Microsoft, and AWS for Classified Networks

The U.S. Defense Department has finalized agreements with Nvidia, Microsoft, Amazon Web Services, and Reflection AI to integrate their AI technologies into classified networks for operational purposes. This move aims to enhance the military's capabilities in decision-making across various warfare domains, marking a step towards establishing an AI-first military force. The agreements follow a shift in vendor strategy by the Pentagon amidst ongoing legal disputes with Anthropic regarding the usage terms of its AI models. The Department aims to prevent vendor lock-in, ensuring access to a diverse range of AI capabilities while safeguarding national security. The technology will be deployed in high-security environments classified as Impact Level 6 and 7, and has already been accessed by over 1.3 million personnel through the GenAI.mil platform, which supports non-classified tasks such as research and data analysis.

TechCrunch AI

Speech Recognition, Open Source, Transformers, NLP

Cohere Launches Open-Source 2B Parameter Speech Recognition Model

Cohere has released cohere-transcribe-03-2026, a 2B-parameter open-source speech recognition model available on Hugging Face under an Apache 2.0 license. The model supports 14 languages and reports an offline throughput three times higher than competitors of similar size. It ranks first on the Hugging Face Open ASR Leaderboard for English and performs comparably across the other supported languages. Designed for production use, the model integrates with vLLM for efficient serving and features a multilingual tokenizer and an optimized data mix, trained on 0.5M hours of audio-transcript pairs. Cohere aims for the model to set a new standard in speech recognition technology.

Hacker News

Quantization, Vector Databases, EDEN, TurboQuant

2021 EDEN Quantization Algorithm Outperforms TurboQuant from 2026

A comparative study reveals that the EDEN quantization algorithm, first introduced in 2021, consistently outperforms TurboQuant, a method presented at ICLR 2026. The research, co-authored by Amit Portnoy and others, highlights that TurboQuant's mse variant is a degenerate case of EDEN, which employs rotation-based vector quantization more effectively. EDEN uses a deterministic quantizer with an analytically derived scale, yielding a significant reduction in mean squared error (MSE) compared to TurboQuant-mse across various dimensions and bit-widths. The findings demonstrate that EDEN maintains a measurable advantage in practical applications involving embeddings and KV caches.
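
To make the role of the scale concrete, here is a minimal sketch of 1-bit rotation-based quantization. It is not the published EDEN or TurboQuant code. The randomized Hadamard rotation, the function names, and the two scale choices (a fixed, data-independent scale versus a per-vector least-squares scale) are all assumptions chosen for illustration. It only shows the general point that, for the same rotation and the same sign codes, a least-squares scale cannot give a higher MSE than any fixed scale.

```python
import math
import random


def hadamard(v):
    """Normalized fast Walsh-Hadamard transform; len(v) must be a power of two."""
    v = list(v)
    n = len(v)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = v[j], v[j + h]
                v[j], v[j + h] = a + b, a - b
        h *= 2
    norm = 1.0 / math.sqrt(n)
    return [x * norm for x in v]


def rotate(x, signs):
    """Randomized Hadamard rotation: random sign flips, then the transform."""
    return hadamard([xi * si for xi, si in zip(x, signs)])


def unrotate(y, signs):
    """Inverse rotation: the normalized transform is its own inverse."""
    return [yi * si for yi, si in zip(hadamard(y), signs)]


def quantize_1bit(x, signs, scale_mode="least_squares"):
    """Rotate, keep one sign bit per coordinate, rescale, rotate back."""
    d = len(x)
    y = rotate(x, signs)
    codes = [1.0 if yi >= 0 else -1.0 for yi in y]
    if scale_mode == "least_squares":
        # Per-vector scale minimizing ||y - s * codes||^2: s = <y, codes> / d.
        s = sum(abs(yi) for yi in y) / d
    else:
        # Fixed scale: E|y_i| if rotated coordinates were N(0, ||x||^2 / d).
        x_norm = math.sqrt(sum(xi * xi for xi in x))
        s = math.sqrt(2.0 / math.pi) * x_norm / math.sqrt(d)
    return unrotate([s * c for c in codes], signs)


def mse(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(64)]
signs = [rng.choice((-1.0, 1.0)) for _ in range(64)]
print("least-squares scale MSE:", mse(x, quantize_1bit(x, signs, "least_squares")))
print("fixed scale MSE:        ", mse(x, quantize_1bit(x, signs, "fixed")))
```

Because the rotation is orthogonal, the error measured on the original vector equals the error in the rotated domain, which is why the scale can be chosen there.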

Towards Data Science

Cybersecurity, Fintech, AI, Vulnerabilities

Claude's Mythos AI Model Uncovers Vulnerabilities in Financial Software

Claude's Mythos AI model has been identified as a significant tool in revealing vulnerabilities within financial software, raising concerns about the safety of monetary assets from cyber attacks. This development highlights the increasing sophistication of AI technologies and their implications for cybersecurity in the finance sector. As financial institutions grapple with potential risks, the need for robust security measures becomes ever more critical.

Financial Times Tech