Quantifying the Effect of Test Set Contamination on Generative Evaluations

Rylan Schaeffer, Joshua Kazdan, Baber Abbasi, Ken Ziyu Liu, Brando Miranda, Ahmed Ahmed, Abhay Puri, Niloofar Mireshghallah, Sanmi Koyejo

arXiv preprint. Under Review. January 2026.

Topics: Language Models, Data Contamination, Generative Evaluation, Benchmarks

Summary: Quantifying how test set contamination affects generative evaluation metrics.