FACADE: A Framework for Adversarial Circuit Anomaly Detection and Evaluation
Abstract
We introduce FACADE, a framework for detecting adversarial anomalies in neural network circuits using mechanistic interpretability techniques.
Summary
FACADE is a framework that uses mechanistic interpretability to detect adversarial anomalies in neural network circuits.
Problem
Adversarial examples can manipulate neural network circuits in unexpected ways. Detecting such manipulations requires understanding the internal workings of these circuits.
FACADE Framework
We introduce FACADE (Framework for Adversarial Circuit Anomaly Detection and Evaluation), which leverages mechanistic interpretability techniques to identify anomalous circuit behavior that may indicate adversarial manipulation.
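This summary does not specify how FACADE scores anomalies, so the following is only an illustrative sketch of the general idea of flagging unusual internal activations, not FACADE's actual method. It assumes a circuit's activations can be read out as a fixed-length list of floats, and it swaps in a simple per-unit z-score against a clean reference set. The names `fit_baseline`, `anomaly_score`, `is_anomalous`, and the threshold of 3.0 are all hypothetical.

```python
from statistics import mean, pstdev


def fit_baseline(clean_activations):
    """Compute per-unit mean and standard deviation over clean activations.

    clean_activations: list of equal-length lists of floats, one per clean
    input (an assumed readout format, not one defined by FACADE).
    """
    columns = list(zip(*clean_activations))  # transpose: one tuple per unit
    means = [mean(col) for col in columns]
    # Floor the standard deviation so constant units do not divide by zero.
    stds = [max(pstdev(col), 1e-8) for col in columns]
    return means, stds


def anomaly_score(activation, baseline):
    """Return the largest absolute z-score across units."""
    means, stds = baseline
    return max(abs(a - m) / s for a, m, s in zip(activation, means, stds))


def is_anomalous(activation, baseline, threshold=3.0):
    """Flag an activation whose score exceeds the (hypothetical) threshold."""
    return anomaly_score(activation, baseline) > threshold


# Toy data standing in for activations recorded on clean inputs.
clean = [[0.9, 0.1], [1.1, -0.1], [1.0, 0.0], [1.0, 0.0]]
baseline = fit_baseline(clean)

typical = [1.0, 0.0]   # close to the clean mean on both units
shifted = [1.0, 5.0]   # second unit far outside the clean range
print(is_anomalous(typical, baseline), is_anomalous(shifted, baseline))
```

A per-unit z-score treats units independently and would miss anomalies that only show up in correlations between units; it is used here only because it is easy to inspect.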
