Deceptive Alignment Monitoring
Abstract
We propose methods for monitoring AI systems for signs of deceptive alignment, where a model appears aligned during training but pursues different objectives during deployment.
Summary
Monitoring for deceptive alignment in AI systems; awarded a Blue Sky Oral at the ICML AdvML Workshop.
Problem
Deceptive alignment is a failure mode in which an AI system appears aligned during training but pursues different objectives once deployed. Detecting such deception is crucial for AI safety, because a deceptively aligned model can, by definition, pass standard training-time evaluations.
Our Approach
We propose monitoring methods that combine insights from adversarial machine learning with alignment research to detect signs of deceptive alignment in deployed systems.
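To make the monitoring idea concrete, here is a minimal illustrative sketch, not the method from this work: one simple deception signal is behavioral inconsistency, where a model's action distribution on deployment-like inputs diverges sharply from its distribution on matched training-like inputs. The function names, distributions, and threshold below are all hypothetical.

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """KL(P || Q) for two discrete distributions over the same actions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def flag_behavioral_shift(train_dist, deploy_dist, threshold=0.1):
    """Flag a model whose behavior shifts between contexts.

    A large divergence between action distributions on training-like and
    deployment-like inputs is treated as a (weak) deception signal; it is
    only one indicator and cannot prove deceptive alignment by itself.
    """
    score = kl_divergence(deploy_dist, train_dist)
    return score, score > threshold

# Hypothetical action distributions over the same three actions.
consistent_score, consistent_flag = flag_behavioral_shift(
    [0.7, 0.2, 0.1], [0.68, 0.22, 0.1])   # near-identical behavior
shifted_score, shifted_flag = flag_behavioral_shift(
    [0.7, 0.2, 0.1], [0.1, 0.2, 0.7])     # behavior inverts at deployment
```

A consistency check like this is deliberately crude: a deceptive model that anticipates monitoring could behave identically on both input sets, which is why such signals would need to be combined with other probes.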
See the full research page for more details.
