Teaching Claude Why

Jonathan Kutasov^*, Adam Jermyn

May 8, 2026

Julius Steen, Minh Le, Samuel R. Bowman, Samuel Marks, Jan Leike, Amanda Askell, Chris Olah

Evan Hubinger, Sara Price

^*Correspondence to jonk@anthropic.com

Introduction

Last year, we released a case study on agentic misalignment. This research showed that AI models across the industry sometimes took egregiously misaligned actions when placed in (fictional) ethical dilemmas—for example, blackmailing engineers to avoid being shut down.

At the time of this research, Claude 4 was Anthropic's frontier model family. It was also the first model family for which we ran a live alignment assessment during training, and agentic misalignment was one of several issues that surfaced (others included increased susceptibility to jailbreaks and harmful system prompting). After Claude 4, it was clear that we needed to improve our safety training. However, it was not initially clear what was driving the failures, or which kinds of interventions would generalize beyond the specific scenarios we had caught.
