Teaching Claude Why

Jonathan Kutasov*, Adam Jermyn

Jonathan Kutasov*, Adam Jermyn

May 8, 2026

May 8, 2026

Julius Steen, Minh Le, Samuel R. Bowman, Samuel Marks, Jan Leike, Amanda Askell, Chris Olah

Julius Steen, Minh Le, Samuel R. Bowman, Samuel Marks, Jan Leike, Amanda Askell, Chris Olah

Evan Hubinger, Sara Price

Evan Hubinger, Sara Price

*Correspondence to jonk@anthropic.com

*Correspondence to jonk@anthropic.com


Introduction

Introduction

Last year, we released a case study on agentic misalignment. This research showed that AI models across the industry sometimes took egregiously misaligned actions when placed in (fictional) ethical dilemmas—for example, blackmailing engineers to avoid being shut down.

Last year, we released a case study on agentic misalignment. This research showed that AI models across the industry sometimes took egregiously misaligned actions when placed in (fictional) ethical dilemmas—for example, blackmailing engineers to avoid being shut down.

At the time of this research, Claude 4 was Anthropic's frontier model family. It was also the first model family for which we ran a live alignment assessment during training, and agentic misalignment was one of several issues that surfaced (others include increased susceptibility to jailbreaks and harmful system prompting). So after Claude 4, it was clear we needed to improve our safety training. However, it was not initially clear what was driving the failures, or which kinds of interventions would generalize beyond the specific scenarios we had caught.

At the time of this research, Claude 4 was Anthropic's frontier model family. It was also the first model family for which we ran a live alignment assessment during training, and agentic misalignment was one of several issues that surfaced (others include increased susceptibility to jailbreaks and harmful system prompting). So after Claude 4, it was clear we needed to improve our safety training. However, it was not initially clear what was driving the failures, or which kinds of interventions would generalize beyond the s...

开通本站会员,查看完整译文。

trang chủ - Wiki
Copyright © 2011-2026 iteam. Current version is 2.155.2. UTC+08:00, 2026-05-21 18:11
浙ICP备14020137号-1 $bản đồ khách truy cập$