Researchers have proposed SafeSteer, a method for improving the safety alignment of Large Language Models (LLMs) without compromising their general capabilities. The study targets the alignment tax, the loss of general capability that can occur when LLMs are aligned with human values. SafeSteer employs on-policy distillation focused on safety tokens, using a safety teacher created through activation steering. The method includes a safety token selection algorithm that minimizes the reverse KL penalty during training. According to the reported experiments, SafeSteer outperforms existing methods, achieving strong safety performance on seven benchmarks with minimal degradation on five general capability benchmarks, while requiring significantly fewer harmful samples for alignment. More details are available on the project's webpage.
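To make the idea of distillation restricted to safety tokens more concrete, here is a minimal sketch of a reverse KL loss that is averaged only over selected token positions. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy logits, and the boolean `safety_mask` are all hypothetical, the mask is taken as given rather than produced by SafeSteer's selection algorithm, and the teacher logits stand in for whatever the activation-steered teacher would output on sequences sampled from the student.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """Reverse KL, i.e. KL(student || teacher), at a single token position."""
    log_s = log_softmax(student_logits)
    log_t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(log_s, log_t))


def masked_reverse_kl_loss(student_seq, teacher_seq, safety_mask):
    """Average reverse KL over the positions flagged in safety_mask.

    Assumption: safety_mask is a hypothetical per-position boolean list;
    how SafeSteer actually selects safety tokens is not described here.
    """
    per_token = [reverse_kl(s, t) for s, t in zip(student_seq, teacher_seq)]
    selected = [kl for kl, keep in zip(per_token, safety_mask) if keep]
    # With no selected positions there is nothing to penalize.
    return sum(selected) / len(selected) if selected else 0.0


# Toy example: two token positions, vocabulary of size three.
student_seq = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
teacher_seq = [[2.0, 0.0, 0.0], [0.0, 0.0, 3.0]]

# Only the second position is treated as a safety token.
loss = masked_reverse_kl_loss(student_seq, teacher_seq, [False, True])
```

In this toy setup the student and teacher agree on the first position, so the loss comes entirely from the second, masked-in position. Positions outside the mask contribute nothing, which is one plausible reading of how localizing the penalty could leave the model's behavior on other tokens undisturbed.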
SafeSteer Introduces Localized On-Policy Distillation for Enhanced Safety Alignment in LLMs
