Many-shot Jailbreaking
Abstract
We study a long-context jailbreaking technique that is effective on most large language models, including those developed by Anthropic and many of our peers. Many-shot jailbreaking exploits the long context windows of current LLMs by providing hundreds of example dialogues where an AI complies with harmful requests.
Summary
Long-context jailbreaking via many examples follows power law scaling and is hard to eliminate.
Note: My contribution to this work was limited to running evals on open models (e.g., Llama 2, Mistral) and helping Cem respond to reviewers.
The Vulnerability
Many-shot jailbreaking exploits the long context windows of current LLMs. The attacker inputs a prompt beginning with hundreds of faux dialogues where a supposed AI complies with harmful requests. This overrides the LLM’s safety training.

Power Law Scaling
This is usually ineffective when there are only a small number of dialogues in the prompt. But as the number of dialogues (“shots”) increases, so do the chances of a harmful response.

The effectiveness of many-shot jailbreaking follows simple scaling laws as a function of the number of shots. This turns out to be a more general finding: learning from demonstrations—harmful or not—often follows the same power law scaling.

Mitigation Challenges
Many-shot jailbreaking might be hard to eliminate. Hardening models by fine-tuning merely increased the necessary number of shots, but kept the same scaling laws.
We had more success with prompt modification. In one case, this reduced MSJ’s effectiveness from 61% to 2%.
Key Insight
This research shows that increasing the context window of LLMs is a double-edged sword: it makes the models more useful, but also makes them more vulnerable to adversarial attacks.
For more details, see the Anthropic blog post.
