Pretraining on the test set is all you need

Rylan Schaeffer

arXiv preprint Accepted

September 2023

Abstract

We introduce a novel technique to achieve state-of-the-art performance on any benchmark: simply pretrain on the test set. Our method achieves 100% accuracy on all benchmarks tested.

Summary

Satirical paper showing that pretraining on the test set yields perfect benchmark scores.

This satirical paper highlights the problem of data contamination in large language model evaluation by taking it to its logical extreme: deliberately pretraining on the test set.

The “Method”:

We propose a revolutionary new technique: simply include the test set in pretraining data. Our approach achieves:

100% accuracy on MMLU
100% accuracy on GSM8K
100% accuracy on every other benchmark
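
Why memorization alone is enough for a perfect score can be shown with a toy sketch. This is not the paper's experiment. The lookup-table `MemorizingModel`, the `accuracy` helper, and the two tiny question sets are all hypothetical and exist only for illustration.

```python
from typing import Dict, List, Tuple

Example = Tuple[str, str]  # (question, answer)


class MemorizingModel:
    """Toy stand-in for a contaminated model: it only stores what it has seen."""

    def __init__(self) -> None:
        self._memory: Dict[str, str] = {}

    def pretrain(self, corpus: List[Example]) -> None:
        # "Training" here is nothing more than storing question/answer pairs.
        for question, answer in corpus:
            self._memory[question] = answer

    def predict(self, question: str) -> str:
        # Unseen questions get an empty answer, which never matches a label.
        return self._memory.get(question, "")


def accuracy(model: MemorizingModel, dataset: List[Example]) -> float:
    """Fraction of examples whose predicted answer matches exactly."""
    if not dataset:
        raise ValueError("dataset must not be empty")
    correct = sum(model.predict(q) == a for q, a in dataset)
    return correct / len(dataset)


# Hypothetical data: a "benchmark" test set and a fresh, unseen set.
test_set = [("2 + 2 = ?", "4"), ("Capital of France?", "Paris")]
fresh_set = [("3 + 5 = ?", "8"), ("Capital of Japan?", "Tokyo")]

model = MemorizingModel()
model.pretrain(test_set)  # the "method": train on the test set itself

contaminated_score = accuracy(model, test_set)
fresh_score = accuracy(model, fresh_set)
```

In this toy setup the score on the memorized test set is perfect, while the score on the fresh set collapses. That gap is the distinction the satire is pointing at.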

The Serious Point:

This paper draws attention to the growing problem of benchmark contamination in LLM evaluation:

Test sets may inadvertently appear in web-scraped pretraining corpora
Models may memorize benchmark examples rather than learning general capabilities
The field needs better evaluation practices that are robust to contamination
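
One common way to look for the first kind of contamination is to check whether word n-grams from a test example also appear in the pretraining corpus. The sketch below is a minimal illustration of that idea, not a method described in the paper. The function names, whitespace tokenization, the default `n=8`, and the `threshold=0.5` cutoff are all assumptions.

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int) -> Set[NGram]:
    """Return the set of lowercase, whitespace-tokenized word n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def flag_contaminated(
    test_examples: List[str],
    corpus: Iterable[str],
    n: int = 8,
    threshold: float = 0.5,
) -> List[int]:
    """Return indices of test examples whose n-grams largely appear in the corpus.

    An example is flagged when at least `threshold` of its n-grams occur
    somewhere in the corpus. Examples shorter than n tokens are never flagged.
    """
    corpus_grams: Set[NGram] = set()
    for document in corpus:
        corpus_grams |= ngrams(document, n)

    flagged: List[int] = []
    for index, example in enumerate(test_examples):
        example_grams = ngrams(example, n)
        if not example_grams:
            continue
        overlap = len(example_grams & corpus_grams) / len(example_grams)
        if overlap >= threshold:
            flagged.append(index)
    return flagged


# Hypothetical usage with a short n so the toy strings are long enough.
corpus = ["a forum post quoting the question what is the capital of france verbatim"]
test_examples = [
    "What is the capital of France",
    "How many legs does a spider have",
]
flagged = flag_contaminated(test_examples, corpus, n=3)
```

Exact n-gram matching only catches verbatim or near-verbatim leakage. Paraphrased or translated test items would pass this check, so it is better read as a lower bound on contamination than as proof of a clean benchmark.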

Impact: This work sparked important discussions about evaluation integrity and the need for held-out, uncontaminated test sets.