FACADE: A Framework for Adversarial Circuit Anomaly Detection and Evaluation

Dhruv Pai, Andres Carranza, Rylan Schaeffer, Arnuv Tandon, Sanmi Koyejo

Accepted at the ICML 2023 Workshop: Adversarial Machine Learning Frontiers

July 2023

AI Safety, Adversarial ML, Anomaly Detection, Mechanistic Interpretability

Abstract

We introduce FACADE, a framework for detecting adversarial anomalies in neural network circuits using mechanistic interpretability techniques.

Summary

FACADE is a framework that uses mechanistic interpretability to detect adversarial anomalies in neural network circuits.

Problem

Adversarial examples can manipulate neural network circuits in unexpected ways. Detecting such manipulations requires understanding the internal workings of these circuits.

FACADE Framework

We introduce FACADE (Framework for Adversarial Circuit Anomaly Detection and Evaluation), which leverages mechanistic interpretability techniques to identify anomalous circuit behavior that may indicate adversarial manipulation.
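
To make the idea of "anomalous circuit behavior" concrete, the sketch below shows one generic way such a detector could look. It is not FACADE's actual method, which this page does not describe. It assumes that the activations of a circuit have already been extracted as fixed-length vectors, and it swaps in a standard technique: fit a Gaussian to activations from clean inputs, then score new inputs by Mahalanobis distance. All function names, the synthetic data, and the 0.99 quantile threshold are hypothetical choices for illustration.

```python
import numpy as np


def fit_reference(clean_acts: np.ndarray, ridge: float = 1e-3):
    """Fit a Gaussian reference (mean, inverse covariance) to clean activations.

    clean_acts has shape (n_samples, n_features). The ridge term is an
    assumption added to keep the covariance matrix invertible.
    """
    mean = clean_acts.mean(axis=0)
    cov = np.cov(clean_acts, rowvar=False)
    cov = cov + ridge * np.eye(cov.shape[0])
    return mean, np.linalg.inv(cov)


def anomaly_scores(acts: np.ndarray, mean: np.ndarray, inv_cov: np.ndarray) -> np.ndarray:
    """Mahalanobis distance of each activation vector from the clean reference."""
    centered = acts - mean
    # Quadratic form x^T S^{-1} x computed row by row.
    return np.sqrt(np.einsum("ij,jk,ik->i", centered, inv_cov, centered))


# Synthetic stand-ins for circuit activations (assumption: 8 features per input).
rng = np.random.default_rng(0)
clean = rng.normal(size=(500, 8))            # activations on clean inputs
suspect = rng.normal(size=(50, 8)) + 4.0     # activations shifted away from the clean distribution

mean, inv_cov = fit_reference(clean)
clean_scores = anomaly_scores(clean, mean, inv_cov)

# Hypothetical decision rule: flag anything beyond the 99th percentile of clean scores.
threshold = float(np.quantile(clean_scores, 0.99))
flagged = anomaly_scores(suspect, mean, inv_cov) > threshold
```

Here `flagged` is a boolean array marking which suspect inputs fall outside the clean reference. A Gaussian reference is only a baseline; it will miss manipulations that keep activations inside the clean distribution, which is part of why circuit-level understanding matters.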