Teaching Claude Why
Jonathan Kutasov*, Adam Jermyn
Jonathan Kutasov*, Adam Jermyn
May 8, 2026
May 8, 2026
Julius Steen, Minh Le, Samuel R. Bowman, Samuel Marks, Jan Leike, Amanda Askell, Chris Olah
Julius Steen, Minh Le, Samuel R. Bowman, Samuel Marks, Jan Leike, Amanda Askell, Chris Olah
Evan Hubinger, Sara Price
Evan Hubinger, Sara Price
*Correspondence to jonk@anthropic.com
*Correspondence to jonk@anthropic.com
Introduction
Introduction
Last year, we released a case study on agentic misalignment. This research showed that AI models across the industry sometimes took egregiously misaligned actions when placed in (fictional) ethical dilemmas—for example, blackmailing engineers to avoid being shut down.
Last year, we released a case study on agentic misalignment. This research showed that AI models across the industry sometimes took egregiously misaligned actions when placed in (fictional) ethical dilemmas—for example, blackmailing engineers to avoid being shut down.
At the time of this research, Claude 4 was Anthropic's frontier model family. It was also the first model family for which we ran a live alignment assessment during training, and agentic misalignment was one of several issues that surfaced (others include increased susceptibility to jailbreaks and harmful system prompting). So after Claude 4, it was clear we needed to improve our safety training. However, it was not initially clear what was driving the failures, or which kinds of interventions would generalize beyond the specific scenarios we had caught.
At the time of this research, Claude 4 was Anthropic's frontier model family. It was also the first model family for which we ran a live alignment assessment during training, and agentic misalignment was one of several issues that surfaced (others include increased susceptibility to jailbreaks and harmful system prompting). So after Claude 4, it was clear we needed to improve our safety training. However, it was not initially clear what was driving the failures, or which kinds of interventions would generalize beyond the s...