Deceptive Alignment Monitoring
Abstract
We propose methods for monitoring AI systems for signs of deceptive alignment, where a model appears aligned during training but pursues different objectives during deployment.
Summary
Monitoring for deceptive alignment in AI systems; awarded a Blue Sky Oral at the ICML AdvML Workshop.
Problem
Deceptive alignment is a failure mode in which an AI system appears aligned during training but pursues different objectives once deployed. Detecting such deception is crucial for AI safety, because a deceptively aligned model can, by definition, pass standard training-time evaluations.
Our Approach
We propose monitoring methods that combine insights from adversarial machine learning with alignment research to detect signs of deceptive alignment in deployed systems.
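To make the monitoring idea concrete, here is a minimal illustrative sketch, not the method from this work: one simple deception signal is behavioral inconsistency, where a model's action distribution on deployment-like inputs diverges sharply from its distribution on matched training-like inputs. The function names, distributions, and threshold below are all hypothetical.

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """KL(P || Q) for two discrete distributions over the same actions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def flag_behavioral_shift(train_dist, deploy_dist, threshold=0.1):
    """Flag a model whose behavior shifts between contexts.

    A large divergence between action distributions on training-like and
    deployment-like inputs is treated as a (weak) deception signal; it is
    only one indicator and cannot prove deceptive alignment by itself.
    """
    score = kl_divergence(deploy_dist, train_dist)
    return score, score > threshold

# Hypothetical action distributions over the same three actions.
consistent_score, consistent_flag = flag_behavioral_shift(
    [0.7, 0.2, 0.1], [0.68, 0.22, 0.1])   # near-identical behavior
shifted_score, shifted_flag = flag_behavioral_shift(
    [0.7, 0.2, 0.1], [0.1, 0.2, 0.7])     # behavior inverts at deployment
```

A consistency check like this is deliberately crude: a deceptive model that anticipates monitoring could behave identically on both input sets, which is why such signals would need to be combined with other probes.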
See the full research page for more details.
