Teaching Claude "Why"

Jonathan Kutasov^*, Adam Jermyn

May 8, 2026

Julius Steen, Minh Le, Samuel R. Bowman, Samuel Marks, Jan Leike, Amanda Askell, Chris Olah

Evan Hubinger, Sara Price

^*Correspondence to jonk@anthropic.com

Introduction

Last year, we released a case study on agentic misalignment. This research showed that AI models across the industry sometimes took egregiously misaligned actions when placed in (fictional) ethical dilemmas—for example, blackmailing engineers to avoid being shut down.

At the time of this research, Claude 4 was Anthropic's frontier model family. It was also the first model family for which we ran a live alignment assessment during training, and agentic misalignment was one of several issues that surfaced (others include increased susceptibility to jailbreaks and harmful system prompting). So after Claude 4, it was clear we needed to improve our safety training. However, it was not initially clear what was driving the failures, or which kinds of interventions would generalize beyond the specific scenarios we had caught.

Since then, we have significantly updated our safety training, using the methods discussed in this post as well as a number of more prosaic improvements to our training data, RL environments, and training rewards. Together, these changes have substantially improved the alignment of Claude models since Claude Opus 4.5. This post uses agentic misalignment as a case study to highlight techniques we found to be surprisingly effective, largely because of how well they generalize. For example:

自那时起，我们使用本文中讨论的方法以及对训练数据、RL环境和训练奖励进行的一些更常规的改进，大幅更新了我们的安全训练，显著提升了自Claude Opus 4.5以来C...