Optimizing LLM-Based Multi-Agent Systems through Reinforcement Learning and Orchestration Traces
This paper examines how reinforcement learning (RL) can improve large language model (LLM) agents as they move from isolated tool users to coordinated teams. It studies RL for LLM-based multi-agent systems through orchestration traces: temporal interaction graphs whose events include sub-agent spawning, delegation, communication, tool use, aggregation, and stopping decisions. The study identifies three technical axes: reward design, spanning eight families; credit assignment, that is, which units the reward and credit signals are attached to; and orchestration learning, covering decisions about spawning, delegation, communication, aggregation, and stopping. It also highlights a significant gap between publicly reported deployment practices and academic evaluation frameworks. The authors make their dataset and accompanying resources available for further exploration.
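To make the notion of an orchestration trace more concrete, the following is a minimal sketch of one possible in-memory representation: a list of time-ordered events with dependency edges between them. The names `EventKind`, `TraceEvent`, `OrchestrationTrace`, and `build_example_trace` are hypothetical and chosen for illustration only; the abstract does not specify a schema, and the authors' actual format may differ.

```python
from dataclasses import dataclass, field
from enum import Enum


class EventKind(Enum):
    # Event types listed in the abstract; the enum itself is an assumption.
    SPAWN = "spawn"
    DELEGATE = "delegate"
    COMMUNICATE = "communicate"
    TOOL_USE = "tool_use"
    AGGREGATE = "aggregate"
    STOP = "stop"


@dataclass
class TraceEvent:
    event_id: int          # doubles as the temporal order of the event
    agent: str             # which agent produced the event
    kind: EventKind
    parents: tuple = ()    # ids of earlier events this event depends on


@dataclass
class OrchestrationTrace:
    events: list = field(default_factory=list)

    def add(self, agent, kind, parents=()):
        # Parents must already exist, so edges always point forward in time.
        for p in parents:
            if not 0 <= p < len(self.events):
                raise ValueError(f"unknown parent event {p}")
        event = TraceEvent(len(self.events), agent, kind, tuple(parents))
        self.events.append(event)
        return event.event_id

    def edges(self):
        # Directed edges (earlier event -> later event) of the graph.
        return [(p, e.event_id) for e in self.events for p in e.parents]

    def ancestors(self, event_id):
        # All events that the given event transitively depends on.
        seen = set()
        stack = list(self.events[event_id].parents)
        while stack:
            p = stack.pop()
            if p not in seen:
                seen.add(p)
                stack.extend(self.events[p].parents)
        return seen


def build_example_trace():
    # A toy episode: an orchestrator spawns a worker, delegates a task,
    # the worker uses a tool and reports back, then results are aggregated.
    trace = OrchestrationTrace()
    spawn = trace.add("orchestrator", EventKind.SPAWN)
    delegate = trace.add("orchestrator", EventKind.DELEGATE, [spawn])
    tool = trace.add("worker", EventKind.TOOL_USE, [delegate])
    message = trace.add("worker", EventKind.COMMUNICATE, [tool])
    aggregate = trace.add("orchestrator", EventKind.AGGREGATE, [message])
    trace.add("orchestrator", EventKind.STOP, [aggregate])
    return trace
```

Under this sketch, calling `ancestors` on the final stop event enumerates the earlier events that a terminal reward could in principle be attributed to. This is only one illustrative way to connect the trace to credit assignment, not a description of the paper's method.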