Rylan Schaeffer


No, of course I can! Refusal Mechanisms Can Be Exploited Using Harmless Fine-Tuning Data

Joshua Kazdan, Lisa Yu, Rylan Schaeffer, Chris Cundy, Sanmi Koyejo, Krishnamurthy Dvijotham

arXiv preprint (under review)

February 2025

Summary

Refusal mechanisms in LLMs can be exploited through harmless fine-tuning data.