This research investigates how language models, specifically Qwen3-8B, internally assess the value of their current trajectory, defined as the likelihood of achieving their goals. Using synthetic in-context reinforcement learning data, the study constructs a 'value' axis that differentiates between several performance indicators, including verbalized confidence and the outcomes of rollouts with and without backtracking. The findings show that steering toward high value can suppress self-correction and verbosity, while steering toward low value encourages exploration. The study also shows that direct preference optimization (DPO) increases the internal value associated with rewarded behaviors, raising the model's confidence. Finally, the research applies the value axis in real-world settings: Qwen assigns low value to politically sensitive queries post-training, and supervised fine-tuning boosts internal confidence within the training domain.
Language Models Track Internal Value and Confidence in Goal Achievement
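The summary does not say how the value axis is actually built, so the following is only a minimal sketch of one common approach to this kind of analysis: a difference-of-means direction between activations from high-value and low-value contexts, followed by projection (to read off a value score) and additive steering. All function names are hypothetical, and the synthetic vectors stand in for real hidden states that would have to be extracted from a model such as Qwen3-8B.

```python
import math
import random


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def value_axis(high_acts, low_acts):
    """Unit vector pointing from the low-value mean to the high-value mean.

    Assumption: a difference-of-means probe, not necessarily the study's method.
    """
    diff = [h - l for h, l in zip(mean_vector(high_acts), mean_vector(low_acts))]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]


def value_score(act, axis):
    """Scalar projection of one activation vector onto the value axis."""
    return sum(a * u for a, u in zip(act, axis))


def steer(act, axis, alpha):
    """Shift an activation along the axis; alpha > 0 raises the value score."""
    return [a + alpha * u for a, u in zip(act, axis)]


# Synthetic stand-ins for hidden states (assumption: real activations
# would come from a chosen layer of the model).
random.seed(0)
DIM = 8


def synthetic(offset, n=200):
    # Only coordinate 0 carries the high/low signal; the rest is noise.
    return [
        [random.gauss(offset if i == 0 else 0.0, 1.0) for i in range(DIM)]
        for _ in range(n)
    ]


high = synthetic(+2.0)
low = synthetic(-2.0)
axis = value_axis(high, low)

mean_high = sum(value_score(a, axis) for a in high) / len(high)
mean_low = sum(value_score(a, axis) for a in low) / len(low)
print(f"mean score, high-value contexts: {mean_high:.2f}")
print(f"mean score, low-value contexts:  {mean_low:.2f}")
```

Because the axis is normalized, `steer(act, axis, alpha)` changes the value score by exactly `alpha`, which makes the steering strength easy to interpret in this toy setting.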
