
International Conference on Learning Representations (ICLR 2026) Accepted · January 2026
Three scaling laws for predicting generative evaluation performance. Key finding: gold reference likelihoods are stable across 5 orders of magnitude.
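For readers unfamiliar with the mechanics, a scaling-law fit of this kind is usually a linear regression in log-log space. The sketch below is a minimal illustration; the power-law form, variable names, and numbers are assumptions for demonstration, not values from the paper.

```python
import numpy as np

# Assumed setup: fit a power law score = a * compute^b by linear regression
# in log-log space. All numbers below are synthetic placeholders.
compute = np.array([1e18, 1e19, 1e20, 1e21, 1e22])  # pretraining FLOPs
score = np.array([2.9, 2.3, 1.9, 1.6, 1.35])        # e.g., loss on gold references

b, log_a = np.polyfit(np.log(compute), np.log(score), 1)
a = np.exp(log_a)
print(f"fit: score ~ {a:.3g} * compute^{b:.3f}")

# Extrapolate the fitted law one order of magnitude further.
print("predicted score at 1e23 FLOPs:", a * 1e23 ** b)
```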
arXiv preprint Under Review · January 2026
Quantifying how test set contamination affects generative evaluation metrics.
NeurIPS 2025 Workshop on Evaluating the Evolving LLM Lifecycle (NeurIPS Workshop 2025) Accepted · December 2025
Explaining the paradox of when test set contamination matters and when it doesn't.
arXiv preprint Under Review · October 2025
Hijacking chain-of-thought reasoning in large language models.
arXiv preprint Under Review · October 2025
Efficient methods for predicting pass@k scaling behavior in large language models.
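For context, pass@k is standardly estimated with the unbiased estimator from the Codex paper: sample n completions per problem, count the c that pass, and compute pass@k = 1 - C(n-c, k)/C(n, k). A minimal sketch of that standard estimator (the example numbers are arbitrary):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c correct,
    using the numerically stable product form of 1 - C(n-c,k)/C(n,k)."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples: every size-k draw passes
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples on one problem, 7 of them correct.
print(pass_at_k(n=200, c=7, k=10))  # ~0.31
```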
arXiv preprint Under Review · October 2025
Understanding why representation-space adversarial attacks fail to transfer while data-space attacks succeed.
arXiv preprint Under Review · September 2025
Evaluating the robustness of the Chinchilla compute-optimal scaling laws.
International Conference on Machine Learning (ICML 2025) Accepted (Oral Presentation) · July 2025
Understanding the origins of power law scaling in large language model inference-time compute.
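One way to see the claimed mechanism: each problem's failure rate decays exponentially in the number of attempts k, but averaging over a heavy-tailed distribution of per-problem success rates produces aggregate power-law decay. The Beta prior and sample sizes below are illustrative assumptions, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumption for illustration: per-problem single-attempt success rates with
# polynomial density near zero, here Beta(0.3, 3).
p = rng.beta(0.3, 3.0, size=100_000)

ks = np.logspace(1, 4, 20)  # attempts per problem
# Per problem, failure after k attempts is (1 - p)^k (exponential in k),
# but the benchmark average over heavy-tailed p decays like a power law.
fail = np.array([np.mean((1.0 - p) ** k) for k in ks])

slope, _ = np.polyfit(np.log(ks), np.log(fail), 1)
print(f"log-log slope of aggregate failure rate: {slope:.2f}")  # ~ -0.3
```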
arXiv preprint Under Review · June 2025
Proposing that ML conferences should have a dedicated track for refutations and critiques.
arXiv preprint Under Review · June 2025
Critical analysis of min-p sampling and its claimed benefits for language model generation.
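For reference, the min-p rule keeps only tokens whose probability is at least a fraction min_p of the top token's probability, then renormalizes and samples. A minimal sketch of that published rule (a generic reimplementation, not the paper's code):

```python
import numpy as np

def min_p_sample(logits: np.ndarray, min_p: float = 0.1,
                 rng: np.random.Generator | None = None) -> int:
    """Sample a token id under min-p truncation: discard tokens whose
    probability is below min_p * max(probability), renormalize, sample."""
    rng = rng if rng is not None else np.random.default_rng()
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    probs[probs < min_p * probs.max()] = 0.0
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

print(min_p_sample(np.array([2.0, 1.5, 0.1, -1.0, -3.0]), min_p=0.1))
```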
ICLR 2025 Workshop on Building Trust in Language Models and Applications (ICLR Workshop 2025) Accepted · April 2025
Workshop version: refusal mechanisms can be exploited through harmless fine-tuning data.
arXiv preprint Under Review · March 2025
Clarifying misconceptions about model collapse in the literature.
Technical Report Accepted · March 2025
Technical report for Gemini 2.5, Google's frontier multimodal AI model.
arXiv preprint Under Review · February 2025
Refusal mechanisms in LLMs can be exploited through harmless fine-tuning data.
arXiv preprint Under Review · February 2025
Predicting human evaluations of language models from NLP benchmark scores.
Nature Accepted · January 2025
Brain-wide neural representations of prior information during mouse decision-making from the International Brain Laboratory.
Nature Accepted · January 2025
Brain-wide map of neural activity during complex behaviour from the International Brain Laboratory.
Advances in Neural Information Processing Systems (NeurIPS 2024) Accepted · December 2024
Long-context jailbreaking via many examples follows power law scaling and is hard to eliminate.
arXiv preprint Under Review · December 2024
Extending best-of-N jailbreaking attacks to audio language models.
arXiv preprint Under Review · December 2024
Polysemanticity may arise from incidental causes rather than superposition.
arXiv preprint Under Review · December 2024
Best-of-N sampling as a jailbreaking technique for large language models.
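The power of best-of-N attacks comes down to simple arithmetic: if one augmented attempt succeeds with probability p and attempts are roughly independent, then at least one of N attempts succeeds with probability 1 - (1 - p)^N. Illustrative numbers only, not measurements from the paper:

```python
# Attack success rate after N independent attempts, each succeeding with
# probability p (toy numbers for illustration).
for p in (0.001, 0.01):
    for n in (100, 1_000, 10_000):
        print(f"p={p}, N={n}: ASR = {1 - (1 - p) ** n:.4f}")
```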
arXiv preprint Under Review · December 2024
Analyzing limitations of existing jailbreak defenses and proposing a transcript-classifier approach.
UniReps: 2nd Edition of the Workshop on Unifying Representations in Neural Models (UniReps 2024) Accepted · December 2024
Investigating whether maximizing neural regression scores actually teaches us about the brain.
arXiv preprint Under Review · December 2024
Incidental polysemanticity poses challenges for mechanistic interpretability.
arXiv preprint Under Review · October 2024
Compression-based data selection that outperforms embedding-based methods while being faster and simpler.
arXiv preprint Under Review · October 2024
Clarifying and unifying the literature on the perils and promises of synthetic data.
arXiv preprint Under Review · July 2024
Survey of open problems in technical AI governance.
arXiv preprint Under Review · July 2024
Image-based jailbreaks don't transfer well between vision-language models.
arXiv preprint Under Review · June 2024
Assessing data leakage and memorization patterns in frontier AI models.
arXiv preprint Under Review · June 2024
Learning energy functions through in-context learning.
arXiv preprint Under Review · June 2024
Quantifying and understanding variance in LLM evaluation benchmarks.
arXiv preprint Under Review · June 2024
Understanding Maximum Manifold Capacity Representations from information theory, double descent, and scaling law perspectives.
arXiv preprint Under Review · June 2024
Why predicting downstream capabilities from scale has remained elusive: the sequence of transformations used to compute accuracy decorrelates performance from scale.
arXiv preprint Under Review · April 2024
Model collapse is avoidable: accumulating synthetic data across iterations prevents degradation, unlike replacement.
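The accumulate-versus-replace distinction can be reproduced in a toy version of the iterative-fitting setting the model-collapse literature studies; the Gaussian model, sample sizes, and generation count below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, generations = 100, 200
real = rng.normal(0.0, 1.0, size=n)

replace, accumulate = real.copy(), real.copy()
for _ in range(generations):
    # Replace: each generation refits a Gaussian to the newest synthetic
    # data only, so estimation error compounds and variance decays.
    replace = rng.normal(replace.mean(), replace.std(), size=n)
    # Accumulate: synthetic data joins a growing pool that still contains
    # the original real data, so the fit stays anchored.
    accumulate = np.concatenate(
        [accumulate, rng.normal(accumulate.mean(), accumulate.std(), size=n)]
    )

print("replace std:   ", replace.std())     # far below the true value of 1
print("accumulate std:", accumulate.std())  # close to 1
```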
arXiv preprint Under Review · February 2024
Connecting associative memory models with probabilistic modeling frameworks.
arXiv preprint Under Review · Best Paper Award at ICLR 2024 Data Problems for Foundation Models Workshop · January 2024
Deliberately contaminating pretraining data reveals surprising U-shaped effects and highlights flaws in current contamination detection.
Advances in Neural Information Processing Systems (NeurIPS 2023) Accepted · December 2023
Self-supervised learning on spatial tasks generates multi-modular grid cell-like representations.
Advances in Neural Information Processing Systems (NeurIPS 2023) Accepted (Outstanding Paper) · December 2023
Emergent abilities in LLMs may be a mirage created by metric choice, not fundamental model behavior changes.
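The core argument is easy to reproduce with toy numbers: let per-token accuracy improve smoothly with scale, and an exact-match metric over an L-token answer (per-token accuracy raised to the power L) still looks like a sharp emergent jump. The accuracy schedule below is invented for illustration:

```python
import numpy as np

params = np.logspace(7, 11, 9)                    # model sizes
per_token = np.linspace(0.50, 0.95, len(params))  # smooth improvement with scale
L = 10
exact_match = per_token ** L                      # all L tokens must be right

for n, pt, em in zip(params, per_token, exact_match):
    print(f"{n:9.0e} params | per-token {pt:.3f} | exact-match {em:.4f}")
# Per-token accuracy rises gradually; exact-match sits near zero and then
# jumps, even though nothing discontinuous happened to the model.
```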
Advances in Neural Information Processing Systems (Datasets & Benchmarks Track) (NeurIPS 2023) Accepted · December 2023
Comprehensive trustworthiness assessment benchmark for GPT models.
arXiv preprint Under Review · December 2023
Separating genuine grid cell phenomena from artifacts in deep learning models.
arXiv preprint Under Review · November 2023
Testing the assumptions underlying unified theories of grid cell origins.
arXiv preprint Accepted · September 2023
Satirical paper showing that pretraining on the test set yields perfect benchmark scores.
ICML 2023 Workshop: Knowledge and Logical Reasoning in the Era of Data-driven Learning (ICML Workshop 2023) Accepted · July 2023
Logically invalid chain-of-thought prompts can be as effective as valid ones, which raises the question of what this tells us about LLM reasoning.
ICML 2023 Workshop: Adversarial Machine Learning Frontiers (ICML AdvML Workshop 2023) Accepted · July 2023
Framework for detecting adversarial anomalies in neural network circuits using mechanistic interpretability.
ICML 2023 Workshop: Adversarial Machine Learning Frontiers (ICML AdvML Workshop 2023) Accepted (Blue Sky Oral) · July 2023
Monitoring for deceptive alignment in AI systems.
International Conference on Machine Learning (ICML 2023) Accepted · July 2023
Adding noise to network inputs causes activations to become sparse, a discovery with implications for both neuroscience and deep learning.
arXiv preprint / NeurIPS 2023 Workshops (ATTRIB, M3L) Under Review · March 2023
Identifying and ablating the sources of double descent using only linear regression and SVD: the simplest possible explanation.
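A minimal version of the phenomenon, assuming a Gaussian-design linear regression solved with the minimum-norm (SVD pseudoinverse) solution; dimension, noise level, and sample counts are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 40
w_true = rng.normal(size=d)

def test_mse(n_train: int, n_test: int = 2_000, noise: float = 0.1) -> float:
    X = rng.normal(size=(n_train, d))
    y = X @ w_true + noise * rng.normal(size=n_train)
    w_hat = np.linalg.pinv(X) @ y   # minimum-norm least squares via SVD
    X_test = rng.normal(size=(n_test, d))
    return float(np.mean((X_test @ (w_hat - w_true)) ** 2))

for n in (10, 20, 35, 40, 45, 60, 200):
    print(f"n_train={n:3d}  test MSE = {test_mse(n):.3f}")
# Error spikes near the interpolation threshold n_train == d, where the
# smallest nonzero singular values of X inflate the solution, and falls
# on either side: double descent from linear regression alone.
```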
Advances in Neural Information Processing Systems (NeurIPS 2022) Accepted · December 2022
Deep learning models of the brain don't automatically provide scientific insight without careful analysis.
Conference on Lifelong Learning Agents (CoLLAs 2022) Accepted · August 2022
Streaming inference algorithms for infinite non-stationary clustering: handling evolving cluster structures online.
International Conference on Machine Learning (ICML 2022) Accepted · July 2022
Streaming inference algorithms for infinite feature models (Indian Buffet Process).
NeurIPS 2021 Workshop: Metacognition in the Age of AI (NeurIPS Workshop 2021) Accepted · December 2021
A simple modification to Actor-Critic that enables RL agents to detect and correct their own mistakes through metacognitive interaction.
Uncertainty in Artificial Intelligence (UAI 2021) Accepted · July 2021
Efficient online inference algorithms for nonparametric mixture models.
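A generic sketch of what streaming inference for such a model can look like: one-pass MAP assignment in a Dirichlet-process Gaussian mixture, where each point either joins its best existing cluster or opens a new one. This is an illustrative construction with made-up hyperparameters, not the paper's algorithm:

```python
import numpy as np

def stream_dp_mixture(xs, alpha=1.0, obs_var=0.1, prior_var=1.0):
    """One-pass MAP clustering under a 1-D DP Gaussian mixture:
    CRP prior (cluster counts vs. alpha) plus Gaussian likelihood."""
    means, counts = [], []
    for x in xs:
        scores = [np.log(c) - 0.5 * (x - m) ** 2 / obs_var
                  for m, c in zip(means, counts)]
        # Score for opening a new cluster, using the prior-predictive variance.
        scores.append(np.log(alpha) - 0.5 * x ** 2 / (obs_var + prior_var))
        k = int(np.argmax(scores))
        if k == len(means):
            means.append(float(x)); counts.append(1)
        else:
            counts[k] += 1
            means[k] += (x - means[k]) / counts[k]  # running-mean update
    return means, counts

rng = np.random.default_rng(0)
xs = np.concatenate([rng.normal(-2, 0.3, 50), rng.normal(2, 0.3, 50)])
rng.shuffle(xs)
print(stream_dp_mixture(xs))
```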
Computational and Systems Neuroscience (COSYNE 2021) Accepted · February 2021
Neural network model of memory engram formation and function in the amygdala.
Advances in Neural Information Processing Systems (NeurIPS 2020) Accepted · December 2020
Reverse-engineering RNN solutions to understand hierarchical inference in mice.