Rylan Schaeffer

Publications

Pretraining Scaling Laws for Generative Evaluations of Language Models

Rylan Schaeffer, Noam Levi, Brando Miranda, Sanmi Koyejo

International Conference on Learning Representations (ICLR 2026) Accepted · January 2026

Three scaling laws for predicting generative evaluation performance. Key finding: gold reference likelihoods are stable across 5 orders of magnitude.

Scaling Laws Language Models Generative Evaluation Reasoning

Quantifying the Effect of Test Set Contamination on Generative Evaluations

Rylan Schaeffer, Joshua Kazdan, Baber Abbasi, Ken Ziyu Liu, Brando Miranda, Ahmed Ahmed, Abhay Puri, Niloofar Mireshghallah, Sanmi Koyejo

arXiv preprint (arXiv) Under Review · January 2026

Quantifying how test set contamination affects generative evaluation metrics.

Language Models Data Contamination Generative Evaluation Benchmarks

The Contamination Paradox: Why Test Set Leakage Can Be Both Potent and Negligible

Rylan Schaeffer, Ken Liu, Brando Miranda, Ahmed M Ahmed, Niloofar Mireshghallah, Sanmi Koyejo

NeurIPS 2025 Workshop on Evaluating the Evolving LLM Lifecycle (NeurIPS Workshop 2025) Accepted · December 2025

Explaining the paradox of when test set contamination matters and when it doesn't.

Language Models Data Contamination Evaluation Benchmarks

Chain-of-Thought Hijacking

Jianli Zhao, Tingchen Fu, Rylan Schaeffer, Mrinank Sharma, Fazl Barez

arXiv preprint (arXiv) Under Review · October 2025

A jailbreak attack that hijacks chain-of-thought reasoning in large language models.

Language Models AI Safety Chain-of-Thought Adversarial Attacks

Efficient Prediction of Pass@k Scaling in Language Models

Joshua Kazdan, Rylan Schaeffer, Youssef Allouah, Colin Sullivan, Kyssen Yu, Noam Levi, Sanmi Koyejo

arXiv preprint (arXiv) Under Review · October 2025

Efficient methods for predicting pass@k scaling behavior in large language models.

Language Models Scaling Laws Pass-at-k Generative Evaluation

Understanding Adversarial Transfer: Why Representation-Space Attacks Fail Where Data-Space Attacks Succeed

Isha Gupta, Rylan Schaeffer, Joshua Kazdan, Ken Ziyu Liu, Sanmi Koyejo

arXiv preprint (arXiv) Under Review · October 2025

Understanding why representation-space adversarial attacks fail to transfer while data-space attacks succeed.

Adversarial Attacks Transfer Learning Robustness Deep Learning

Evaluating the Robustness of Chinchilla Compute-Optimal Scaling

Rylan Schaeffer, Noam Levi, Andreas Kirsch, Theo Guenais, Brando Miranda, Elyas Obbad, Sanmi Koyejo

arXiv preprint (arXiv) Under Review · September 2025

Evaluating how robust the Chinchilla compute-optimal scaling laws are.

Scaling Laws Language Models Compute-Optimal Training

How Do Large Language Monkeys Get Their Power (Laws)?

Rylan Schaeffer, Joshua Kazdan, John Hughes, Jordan Juravsky, Sara Price, Aengus Lynch, Erik Jones, Robert Kirk, Azalia Mirhoseini, Sanmi Koyejo

International Conference on Machine Learning (ICML 2025) Accepted (Oral Presentation) · July 2025

Understanding the origins of power law scaling in large language model inference-time compute.

Language Models Scaling Laws Power Laws Inference-time Compute

Position: Machine Learning Conferences Should Establish a 'Refutations and Critiques' Track

Rylan Schaeffer, Joshua Kazdan, Yegor Denisov-Blanch, Brando Miranda, Matthias Gerstgrasser, Susan Zhang, Andreas Haupt, Isha Gupta, Elyas Obbad, Jesse Dodge

arXiv preprint (arXiv) Under Review · June 2025

Proposing that ML conferences should have a dedicated track for refutations and critiques.

Machine Learning Scientific Publishing Position Paper Peer Review

Turning Down the Heat: A Critical Analysis of Min-p Sampling in Language Models

Rylan Schaeffer, Joshua Kazdan, Yegor Denisov-Blanch

arXiv preprint (arXiv) Under Review · June 2025

Critical analysis of min-p sampling and its claimed benefits for language model generation.

Language Models Sampling Decoding

No, Of Course I Can! Refusal Mechanisms Can Be Exploited Using Harmless Data

Joshua Kazdan, Lisa Yu, Rylan Schaeffer, Chris Cundy, Sanmi Koyejo, Krishnamurthy Dj Dvijotham

ICLR 2025 Workshop on Building Trust in Language Models and Applications (ICLR Workshop 2025) Accepted · April 2025

Workshop version: refusal mechanisms can be exploited through harmless fine-tuning data.

Language Models AI Safety Fine-tuning Refusal Mechanisms

Position: Model collapse does not mean what you think

Rylan Schaeffer, Joshua Kazdan, Alvan Caleb Arulandu, Sanmi Koyejo

arXiv preprint (arXiv) Under Review · March 2025

Clarifying misconceptions about model collapse in the literature.

Model Collapse Synthetic Data Language Models Position Paper

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Google Gemini Team, Rylan Schaeffer

Technical Report (Technical Report) Accepted · March 2025

Technical report for Gemini 2.5, Google's frontier multimodal AI model.

Language Models Multimodal Reasoning Agents

No, of course I can! Refusal Mechanisms Can Be Exploited Using Harmless Fine-Tuning Data

Joshua Kazdan, Lisa Yu, Rylan Schaeffer, Chris Cundy, Sanmi Koyejo, Krishnamurthy Dvijotham

arXiv preprint (arXiv) Under Review · February 2025

Refusal mechanisms in LLMs can be exploited through harmless fine-tuning data.

Language Models AI Safety Fine-tuning Refusal Mechanisms

Correlating and Predicting Human Evaluations of Language Models from Natural Language Processing Benchmarks

Rylan Schaeffer, Punit Singh Koura, Binh Tang, Ranjan Subramanian, Aaditya K Singh, Todor Mihaylov, Prajjwal Bhargava, Lovish Madaan, Niladri S Chatterji, Vedanuj Goswami

arXiv preprint (arXiv) Under Review · February 2025

Predicting human evaluations of language models from NLP benchmark scores.

Language Models Evaluation Human Evaluation Benchmarks

Brain-wide Representations of Prior Information in Mouse Decision-making

Charles Findling, Felix Hubert, International Brain Laboratory, Luigi Acerbi, Brandon Benson, Julius Benson, Daniel Birman, Niccolo Bonacchi, E Kelly Buchanan, Sebastian Bruijns, Rylan Schaeffer

Nature (Nature) Accepted · January 2025

Brain-wide neural representations of prior information during mouse decision-making from the International Brain Laboratory.

Neuroscience Decision Making Neural Representations

A Brain-wide Map of Neural Activity during Complex Behaviour

International Brain Laboratory, Dora Angelaki, Brandon Benson, Julius Benson, Daniel Birman, Niccolo Bonacchi, Kcenia Bougrova, Sebastian A Bruijns, Matteo Carandini, Joana A Catarino, Rylan Schaeffer

Nature (Nature) Accepted · January 2025

Brain-wide map of neural activity during complex behaviour from the International Brain Laboratory.

Neuroscience Brain Mapping Decision Making

Many-shot Jailbreaking

Cem Anil, Esin Durmus, Nina Panickssery, Mrinank Sharma, Joe Benton, Sandipan Kundu, Joshua Batson, Meg Tong, Jesse Mu, Daniel Ford, Rylan Schaeffer, Ethan Perez, Roger Grosse, David Duvenaud

Advances in Neural Information Processing Systems (NeurIPS 2024) Accepted · December 2024

Long-context jailbreaking via many examples follows power law scaling and is hard to eliminate.

Language Models AI Safety Jailbreaking Long Context Adversarial Attacks

Attacking Audio Language Models with Best-of-N Jailbreaking

John Hughes, Sara Price, Aengus Lynch, Rylan Schaeffer, Fazl Barez, Sanmi Koyejo, Henry Sleight, Ethan Perez, Mrinank Sharma

arXiv preprint (arXiv) Under Review · December 2024

Extending best-of-N jailbreaking attacks to audio language models.

Audio Language Models AI Safety Jailbreaking Adversarial Attacks

What Causes Polysemanticity? An Alternative Origin Story of Mixed Selectivity from Incidental Causes

Victor Lecomte, Kushal Thaman, Rylan Schaeffer, Naomi Bashkansky, Trevor Chow, Sanmi Koyejo

arXiv preprint (arXiv) Under Review · December 2024

Polysemanticity may arise from incidental causes rather than superposition.

Mechanistic Interpretability Polysemanticity Neural Networks Representation Learning

Best-of-N Jailbreaking

John Hughes, Sara Price, Aengus Lynch, Rylan Schaeffer, Fazl Barez, Sanmi Koyejo, Henry Sleight, Erik Jones, Ethan Perez, Mrinank Sharma

arXiv preprint (arXiv) Under Review · December 2024

Best-of-N sampling as a jailbreaking technique for large language models.

Language Models AI Safety Jailbreaking Inference-time Compute

Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach

Tony T Wang, John Hughes, Henry Sleight, Rylan Schaeffer, Rajashree Agrawal, Fazl Barez, Mrinank Sharma, Jesse Mu, Nir Shavit, Ethan Perez

arXiv preprint (arXiv) Under Review · December 2024

Analyzing limitations of existing jailbreak defenses and proposing a transcript-classifier approach.

Language Models AI Safety Jailbreaking Defense Mechanisms

Does Maximizing Neural Regression Scores Teach Us About The Brain?

Rylan Schaeffer, Mikail Khona, Sarthak Chandra, Mitchell Ostrow, Brando Miranda, Sanmi Koyejo

UniReps: 2nd Edition of the Workshop on Unifying Representations in Neural Models (UniReps 2024) Accepted · December 2024

Investigating whether maximizing neural regression scores actually teaches us about the brain.

Neuroscience Neural Networks Representation Learning Evaluation

Incidental Polysemanticity: A New Obstacle for Mechanistic Interpretability

Victor Lecomte, Kushal Thaman, Rylan Schaeffer, Naomi Bashkansky, Trevor Chow, Sanmi Koyejo

arXiv preprint (arXiv) Under Review · December 2024

Incidental polysemanticity poses challenges for mechanistic interpretability.

Mechanistic Interpretability Polysemanticity Neural Networks AI Safety

ZIP-FIT: Embedding-Free Data Selection via Compression-Based Alignment

Elyas Obbad, Iddah Mlauzi, Brando Miranda, Rylan Schaeffer, Kamal Obbad, Suhana Bedi, Sanmi Koyejo

arXiv preprint (arXiv) Under Review · October 2024

Compression-based data selection that outperforms embedding-based methods while being faster and simpler.

Data Selection Language Models Compression Fine-tuning

Collapse or thrive? Perils and promises of synthetic data in a self-generating world

Joshua Kazdan, Rylan Schaeffer, Apratim Dey, Matthias Gerstgrasser, Rafael Rafailov, David L Donoho, Sanmi Koyejo

arXiv preprint (arXiv) Under Review · October 2024

Clarifying and unifying the literature on perils and promises of synthetic data.

Model Collapse Synthetic Data Language Models Survey

Open problems in technical AI governance

Anka Reuel, Ben Bucknall, Stephen Casper, Tim Fist, Lisa Soder, Onni Aarne, Lewis Hammond, Lujain Ibrahim, Alan Chan, Peter Wills, Rylan Schaeffer

arXiv preprint (arXiv) Under Review · July 2024

Survey of open problems in technical AI governance.

AI Governance AI Safety Policy Survey

Failures to Find Transferable Image Jailbreaks Between Vision-Language Models

Rylan Schaeffer, Dan Valentine, Luke Bailey, James Chua, Cristobal Eyzaguirre, Zane Durante, Joe Benton, Brando Miranda, Henry Sleight, John Hughes

arXiv preprint (arXiv) Under Review · July 2024

Image-based jailbreaks don't transfer well between vision-language models.

Vision-Language Models AI Safety Jailbreaking Adversarial Attacks Transfer Learning

Uncovering latent memories: Assessing data leakage and memorization patterns in frontier AI models

Sunny Duan, Mikail Khona, Abhiram Iyer, Rylan Schaeffer, Ila R Fiete

arXiv preprint (arXiv) Under Review · June 2024

Assessing data leakage and memorization patterns in frontier AI models.

Language Models Memorization Data Leakage Privacy

In-Context Learning of Energy Functions

Rylan Schaeffer, Mikail Khona, Sanmi Koyejo

arXiv preprint (arXiv) Under Review · June 2024

Learning energy functions through in-context learning.

In-Context Learning Energy-Based Models Language Models Theory

Quantifying variance in evaluation benchmarks

Lovish Madaan, Aaditya K Singh, Rylan Schaeffer, Andrew Poulton, Sanmi Koyejo, Pontus Stenetorp, Sharan Narang, Dieuwke Hupkes

arXiv preprint (arXiv) Under Review · June 2024

Quantifying and understanding variance in LLM evaluation benchmarks.

Language Models Evaluation Benchmarks Statistics

Towards an Improved Understanding and Utilization of Maximum Manifold Capacity Representations

Rylan Schaeffer, Victor Lecomte, Dhruv Bhandarkar Pai, Andres Carranza, Berivan Isik, Alyssa Unell, Mikail Khona, Thomas Yerxa, Yann LeCun, SueYeon Chung, Sanmi Koyejo

arXiv preprint (arXiv) Under Review · June 2024

Understanding Maximum Manifold Capacity Representations from information theory, double descent, and scaling law perspectives.

Representation Learning Self-Supervised Learning Theory

Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive?

Rylan Schaeffer, Hailey Schoelkopf, Brando Miranda, Gabriel Mukobi, Varun Madan, Adam Ibrahim, Herbie Bradley, Stella Biderman, Sanmi Koyejo

arXiv preprint (arXiv) Under Review · June 2024

Why predicting downstream capabilities from scale has remained elusive: the sequence of transformations used to compute benchmark accuracy progressively decorrelates performance from scale.

Language Models Scaling Laws Evaluation Benchmarks

Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data

Matthias Gerstgrasser, Rylan Schaeffer, Apratim Dey, Rafael Rafailov, Tomasz Korbak, Rajashree Agrawal, Henry Sleight, John Hughes, Dhruv Pai, Andrey Gromov, Daniel A. Roberts, Diyi Yang, David L. Donoho, Sanmi Koyejo

arXiv preprint (arXiv) Under Review · April 2024

Model collapse is avoidable: accumulating real and synthetic data across generations prevents degradation, unlike replacing data.

Model Collapse Synthetic Data Language Models Generative Models

Bridging associative memory and probabilistic modeling

Rylan Schaeffer, Nika Zahedi, Mikail Khona, Dhruv Pai, Sang Truong, Yilun Du, Mitchell Ostrow, Sarthak Chandra, Andres Carranza, Ila Rani Fiete, Sanmi Koyejo

arXiv preprint (arXiv) Under Review · February 2024

Connecting associative memory models with probabilistic modeling frameworks.

Associative Memory Probabilistic Models Hopfield Networks Theory

Investigating Data Contamination for Pre-training Language Models

Minhao Jiang, Ken Ziyu Liu, Ming Zhong, Rylan Schaeffer, Siru Ouyang, Jiawei Han, Sanmi Koyejo

arXiv preprint (arXiv) Under Review · Best Paper Award at the ICLR 2024 Data Problems for Foundation Models Workshop · January 2024

Deliberately contaminating pretraining data reveals surprising U-shaped effects and highlights flaws in current contamination detection.

Language Models Data Contamination Pretraining Benchmarks

Self-Supervised Learning of Representations for Space Generates Multi-Modular Grid Cells

Rylan Schaeffer, Mikail Khona, Tzuhsuan Ma, Cristobal Eyzaguirre, Sanmi Koyejo, Ila Rani Fiete

Advances in Neural Information Processing Systems (NeurIPS 2023) Accepted · December 2023

Self-supervised learning on spatial tasks generates multi-modular grid cell-like representations.

Neuroscience Grid Cells Self-Supervised Learning Representation Learning

Are Emergent Abilities of Language Models a Mirage?

Rylan Schaeffer, Brando Miranda, Sanmi Koyejo

Advances in Neural Information Processing Systems (NeurIPS 2023) Accepted (Outstanding Paper) · December 2023

Emergent abilities in LLMs may be a mirage created by metric choice, not fundamental model behavior changes.

Language Models Emergent Abilities Scaling Laws Evaluation

DecodingTrust: A comprehensive assessment of trustworthiness in GPT models

Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer

Advances in Neural Information Processing Systems (Datasets & Benchmarks Track) (NeurIPS 2023) Accepted · December 2023

Comprehensive trustworthiness assessment benchmark for GPT models.

Language Models Trustworthiness AI Safety Benchmarks

Disentangling Fact from Grid Cell Fiction in Trained Deep Path Integrators

Rylan Schaeffer, Mikail Khona, Sanmi Koyejo, Ila Rani Fiete

arXiv preprint (arXiv) Under Review · December 2023

Separating genuine grid cell phenomena from artifacts in deep learning models.

Neuroscience Grid Cells Deep Learning Path Integration

Testing assumptions underlying a unified theory for the origin of grid cells

Rylan Schaeffer, Mikail Khona, Adrian Bertagnoli, Sanmi Koyejo, Ila Rani Fiete

arXiv preprint (arXiv) Under Review · November 2023

Testing the assumptions underlying unified theories of grid cell origins.

Neuroscience Grid Cells Theory Computational Neuroscience

Pretraining on the test set is all you need

Rylan Schaeffer

arXiv preprint (arXiv) Accepted · September 2023

Satirical paper showing that pretraining on the test set yields perfect benchmark scores.

Language Models Data Contamination Benchmarks Satire Evaluation

Invalid Logic, Equivalent Gains: The Bizarreness of Reasoning in Language Model Prompting

Rylan Schaeffer, Kateryna Pistunova, Samar Khanna, Sarthak Consul, Sanmi Koyejo

ICML 2023 Workshop: Knowledge and Logical Reasoning in the Era of Data-driven Learning (ICML Workshop 2023) Accepted · July 2023

Logically invalid chain-of-thought prompts can be as effective as valid ones, raising questions about what chain-of-thought reveals about LLM reasoning.

Language Models Chain-of-Thought Reasoning Prompting

FACADE: A Framework for Adversarial Circuit Anomaly Detection and Evaluation

Dhruv Pai, Andres Carranza, Rylan Schaeffer, Arnuv Tandon, Sanmi Koyejo

ICML 2023 Workshop: Adversarial Machine Learning Frontiers (ICML AdvML Workshop 2023) Accepted · July 2023

Framework for detecting adversarial anomalies in neural network circuits using mechanistic interpretability.

AI Safety Adversarial ML Anomaly Detection Mechanistic Interpretability

Deceptive Alignment Monitoring

Andres Carranza, Dhruv Pai, Rylan Schaeffer, Arnuv Tandon, Sanmi Koyejo

ICML 2023 Workshop: Adversarial Machine Learning Frontiers (ICML AdvML Workshop 2023) Accepted (Blue Sky Oral) · July 2023

Monitoring for deceptive alignment in AI systems.

AI Safety Deceptive Alignment Monitoring Adversarial ML

Emergence of Sparse Representations from Noise

Trenton Bricken, Rylan Schaeffer, Bruno Olshausen, Gabriel Kreiman

International Conference on Machine Learning (ICML 2023) Accepted · July 2023

Adding noise to network inputs causes activations to become sparse, with implications for both neuroscience and deep learning.

Sparse Coding Representation Learning Neural Networks Theory Neuroscience

Double Descent Demystified: Identifying, Interpreting & Ablating the Sources of a Deep Learning Puzzle

Rylan Schaeffer, Mikail Khona, Zachary Robertson, Akhilan Boopathy, Kateryna Pistunova, Jason W Rocks, Ila Rani Fiete, Sanmi Koyejo

arXiv preprint / NeurIPS 2023 Workshops (ATTRIB, M3L) Under Review · March 2023

Identifying and ablating the sources of double descent using only linear regression and the SVD: the simplest possible explanation.

Deep Learning Double Descent Generalization Theory Linear Algebra

No free lunch from deep learning in neuroscience: A case study through models of the entorhinal-hippocampal circuit

Rylan Schaeffer, Mikail Khona, Ila Fiete

Advances in Neural Information Processing Systems (NeurIPS 2022) Accepted · December 2022

Deep learning models of the brain do not automatically provide scientific insight; careful analysis is required.

Neuroscience Deep Learning Grid Cells NeuroAI

Streaming Inference for Infinite Non-Stationary Clustering

Rylan Schaeffer, Gabrielle Kaili-May Liu, Yilun Du, Scott Linderman, Ila Rani Fiete

Conference on Lifelong Learning Agents (CoLLAs 2022) Accepted · August 2022

Streaming inference algorithms for infinite non-stationary clustering, handling evolving cluster structures online.

Bayesian Nonparametrics Streaming Inference Clustering Online Learning

Streaming Inference for Infinite Feature Models

Rylan Schaeffer, Yilun Du, Gabrielle K Liu, Ila Fiete

International Conference on Machine Learning (ICML 2022) Accepted · July 2022

Streaming inference algorithms for infinite feature models (Indian Buffet Process).

Bayesian Nonparametrics Streaming Inference Feature Learning

An Algorithmic Theory of Metacognition in Minds and Machines

Rylan Schaeffer

NeurIPS 2021 Workshop: Metacognition in the Age of AI (NeurIPS Workshop 2021) Accepted · December 2021

A simple modification to Actor-Critic that enables RL agents to detect and correct their own mistakes through metacognitive interaction.

Metacognition Cognitive Science Reinforcement Learning Theory

Efficient Online Inference for Nonparametric Mixture Models

Rylan Schaeffer, Blake Bordelon, Mikail Khona, Weiwei Pan, Ila Rani Fiete

Uncertainty in Artificial Intelligence (UAI 2021) Accepted · July 2021

Efficient online inference algorithms for nonparametric mixture models.

Bayesian Nonparametrics Online Learning Mixture Models Clustering

Neural Network Model of Amygdalar Memory Engram Formation and Function

Rylan Schaeffer, Nimrod Shaham, Gabriel Kreiman, Haim Sompolinsky

Computational and Systems Neuroscience (COSYNE 2021) Accepted · February 2021

Neural network model of memory engram formation and function in the amygdala.

Neuroscience Memory Neural Networks Computational Modeling

Reverse-engineering recurrent neural network solutions to a hierarchical inference task for mice

Rylan Schaeffer, Mikail Khona, Leenoy Meshulam, International Brain Laboratory, Ila Rani Fiete

Advances in Neural Information Processing Systems (NeurIPS 2020) Accepted · December 2020

Reverse-engineering RNN solutions to understand hierarchical inference in mice.

Neuroscience Recurrent Neural Networks Interpretability