Introducing LongMemEval-V2: A New Benchmark for Evaluating Long-Term Memory in Agents
Researchers have developed LongMemEval-V2 (LME-V2), a benchmark for assessing memory systems in agents that operate within specialized web environments. It measures how well agents internalize experience, using 451 curated questions that test five core memory abilities: static state recall, dynamic state tracking, workflow knowledge, environment gotchas, and premise awareness. LME-V2 frames evaluation as context gathering: a memory system draws on the agent's history trajectories to answer each question. The study presents two memory methods, AgentRunbook-R and AgentRunbook-C; the latter reaches 72.5% accuracy, surpassing previous RAG-based methods. The study also reports significant latency costs for coding agent methods, indicating that long-term memory systems still need further improvement.
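To make the context-gathering setup concrete, here is a minimal sketch of how such an evaluation loop might look. It is not the benchmark's actual harness: the `Question` record, the `keyword_memory` gatherer, the `evaluate` function, and the exact-match scoring rule are all hypothetical names and assumptions introduced for illustration only.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Question:
    """Hypothetical question record; the ability label is one of the five in LME-V2."""
    text: str
    answer: str
    ability: str


def keyword_memory(trajectories: List[str], question: str, k: int = 2) -> List[str]:
    """Toy context gatherer (an assumption, not a method from the study):
    rank trajectory snippets by word overlap with the question, keep the top k."""
    q_words = set(question.lower().split())
    ranked = sorted(
        trajectories,
        key=lambda t: len(q_words & set(t.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def evaluate(
    questions: List[Question],
    trajectories: List[str],
    gather: Callable[[List[str], str], List[str]],
    answer: Callable[[List[str], str], str],
) -> Dict[str, float]:
    """Return accuracy per memory ability.

    Scoring here is case-insensitive exact match, which is an assumption;
    the source does not say how answers are graded.
    """
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for q in questions:
        context = gather(trajectories, q.text)   # memory system gathers context
        prediction = answer(context, q.text)     # answerer reads only that context
        total[q.ability] += 1
        if prediction.strip().lower() == q.answer.strip().lower():
            correct[q.ability] += 1
    return {ability: correct[ability] / total[ability] for ability in total}
```

Separating `gather` from `answer` keeps the memory system swappable, so a retrieval-based method and an agent-based method could be compared on the same questions under the same scoring rule.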
arXiv AI · Di Wu, Zixiang Ji, Asmi Kawatkar et al.