Stanford EDGAR Filings Dataset Transforms U.S. Corporate Disclosures into Efficient Pretraining Data

arXiv AI · Nick Bettencourt, Xiaowei Ding, Kay Giesecke · Thursday, June 18, 2026

The Stanford EDGAR Filings Dataset (SEFD) addresses the growing scarcity of clean long-context documents for training large language models (LLMs) by providing an open reconstruction of SEC filings in layout-faithful MultiMarkdown. The dataset covers a wide range of financial documents, including audited financial statements and market-moving event filings, and makes them accessible for long-context pretraining, financial reasoning, and compliance. SEFD-v1 consists of 152 billion tokens and has less than 0.1% overlap with Common Crawl-derived datasets. The authors also introduce two benchmarks: EDGAR-Forecast, which evaluates numerical forecasting, and EDGAR-OCR, which evaluates transcription of complex financial tables.

