Many-shot Jailbreaking

Cem Anil, Esin Durmus, Nina Panickssery, Mrinank Sharma, Joe Benton, Sandipan Kundu, Joshua Batson, Meg Tong, Jesse Mu, Daniel Ford, Rylan Schaeffer, Ethan Perez, Roger Grosse, David Duvenaud

Advances in Neural Information Processing Systems (NeurIPS), Accepted

December 2024

Abstract

We study a long-context jailbreaking technique that is effective on most large language models, including those developed by Anthropic and many of our peers. Many-shot jailbreaking exploits the long context windows of current LLMs by providing hundreds of example dialogues where an AI complies with harmful requests.

Summary

Long-context jailbreaking via many examples follows power law scaling and is hard to eliminate.

Note: My contribution to this work was limited to running evals on open models (e.g., Llama 2, Mistral) and helping Cem respond to reviewers.

The Vulnerability

Many-shot jailbreaking exploits the long context windows of current LLMs. The attacker's prompt begins with hundreds of faux dialogues in which an AI assistant complies with harmful requests; with enough of these in-context examples, the model's safety training can be overridden.

[Figure: Many-shot jailbreaking concept]

Power Law Scaling

The attack is usually ineffective when only a few example dialogues are included in the prompt. But as the number of dialogues ("shots") increases, so does the chance of eliciting a harmful response.

[Figure: Attack effectiveness vs. number of shots]

The effectiveness of many-shot jailbreaking follows a simple power law in the number of shots. This turns out to be a more general finding: in-context learning from demonstrations, harmful or not, often follows the same power-law scaling.

[Figure: Scaling laws for in-context learning]

Mitigation Challenges

Many-shot jailbreaking may prove hard to eliminate. Hardening models through fine-tuning merely increased the number of shots required for a harmful response; the underlying scaling law stayed the same.

We had more success with prompt modification. In one case, this reduced the attack's success rate from 61% to 2%.

Key Insight

This research shows that increasing the context window of LLMs is a double-edged sword: it makes the models more useful, but also makes them more vulnerable to adversarial attacks.


For more details, see the Anthropic blog post.