The Stanford EDGAR Filings Dataset (SEFD) addresses the growing scarcity of clean long-context documents for training large language models (LLMs) by providing an open reconstruction of SEC filings in layout-faithful MultiMarkdown. The dataset covers a wide range of financial documents, including audited financial statements and market-moving event filings, and makes them usable for long-context pretraining, financial reasoning, and compliance work. SEFD-v1 contains 152 billion tokens and has less than 0.1% overlap with Common Crawl-derived datasets. Two accompanying benchmarks, EDGAR-Forecast and EDGAR-OCR, evaluate numerical forecasting and transcription of complex financial tables, respectively.
Stanford EDGAR Filings Dataset Transforms U.S. Corporate Disclosures into Efficient Pretraining Data
