Rylan Schaeffer

No, Of Course I Can! Refusal Mechanisms Can Be Exploited Using Harmless Data

Joshua Kazdan, Lisa Yu, Rylan Schaeffer, Chris Cundy, Sanmi Koyejo, Krishnamurthy Dj Dvijotham

Accepted at the ICLR 2025 Workshop on Building Trust in Language Models and Applications

April 2025

Summary

Workshop version: refusal mechanisms can be exploited through harmless fine-tuning data.