FACADE: A Framework for Adversarial Circuit Anomaly Detection and Evaluation

Dhruv Pai, Andres Carranza, Rylan Schaeffer, Arnuv Tandon, Sanmi Koyejo

Accepted at the ICML 2023 Workshop: Adversarial Machine Learning Frontiers

July 2023

Abstract

We introduce FACADE, a framework for detecting adversarial anomalies in neural network circuits using mechanistic interpretability techniques.

Summary

Framework for detecting adversarial anomalies in neural network circuits using mechanistic interpretability.

Problem

Adversarial examples can manipulate neural network circuits in unexpected ways. Detecting such manipulations requires understanding the internal workings of these circuits.

FACADE Framework

We introduce FACADE (Framework for Adversarial Circuit Anomaly Detection and Evaluation), which leverages mechanistic interpretability techniques to identify anomalous circuit behavior that may indicate adversarial manipulation.
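To make "anomalous circuit behavior" concrete, here is a minimal sketch of one generic way to flag unusual internal activations. It is not the FACADE method, which this page does not describe. The sketch assumes activations from some chosen circuit have already been extracted as fixed-length vectors; it fits a Gaussian to activations from clean inputs and scores new activations by squared Mahalanobis distance. All function names, the synthetic data, the ridge term, and the 99% threshold are illustrative assumptions.

```python
import numpy as np


def fit_clean_statistics(clean_acts, ridge=1e-3):
    """Fit a mean and regularized inverse covariance on clean activations.

    clean_acts: array of shape (n_samples, n_features). Hypothetical input:
    activation vectors read from a circuit on unperturbed examples.
    """
    mean = clean_acts.mean(axis=0)
    centered = clean_acts - mean
    cov = centered.T @ centered / (len(clean_acts) - 1)
    # The ridge term keeps the covariance invertible (assumed hyperparameter).
    cov += ridge * np.eye(cov.shape[0])
    return mean, np.linalg.inv(cov)


def anomaly_scores(acts, mean, inv_cov):
    """Squared Mahalanobis distance of each row of acts from the clean mean."""
    diff = acts - mean
    return np.einsum("ij,jk,ik->i", diff, inv_cov, diff)


# Synthetic stand-ins for real activations (assumption: no real model here).
rng = np.random.default_rng(0)
clean = rng.normal(size=(500, 8))
shifted = rng.normal(size=(50, 8)) + 4.0  # mimics a large activation shift

mean, inv_cov = fit_clean_statistics(clean)
# Flag anything scoring above the 99th percentile of clean scores.
threshold = np.quantile(anomaly_scores(clean, mean, inv_cov), 0.99)
flagged = anomaly_scores(shifted, mean, inv_cov) > threshold
```

In practice the choice of which activations to monitor would come from the interpretability analysis itself, and a real adversarial perturbation may shift activations far less cleanly than this synthetic offset does.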

