Investigating Data Contamination for Pre-training Language Models

Minhao Jiang, Ken Ziyu Liu, Ming Zhong, Rylan Schaeffer, Siru Ouyang, Jiawei Han, Sanmi Koyejo

arXiv preprint, under review. Best Paper Award at the ICLR 2024 Data Problems for Foundation Models Workshop.

January 2024


Abstract

We analyze the effects of data contamination in the pre-training stage of language models (LMs) by deliberately introducing contamination into pre-training corpora in two ways: text contamination and ground-truth contamination. We find that ground-truth contamination can significantly improve model performance, while text contamination alone shows no such improvement.

Summary

Deliberately contaminating pre-training data reveals surprising U-shaped effects and highlights flaws in current contamination detection methods.

Note: My contribution to this work was limited to (1) proposing the main question and (2) proposing specific experiments, e.g., ratcheting up the amount of data contamination. I did not contribute to implementing the experiments.

Methodology

To quantify how data contamination affects LM performance in downstream tasks, we deliberately introduce contamination into pre-training corpora in two ways:

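Text contamination: the input text of the evaluation samples appears in the pre-training corpus.
Ground-truth contamination: the prompts asked on those inputs and the desired outputs appear in the pre-training corpus.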
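The two contamination modes can be sketched in a few lines of Python. This is a minimal illustration and not the paper's implementation: the `EvalExample` container, the function names, and the newline-joined format are hypothetical, and it assumes that text contamination leaks only the input text while ground-truth contamination leaks the input together with the prompt and the desired answer.

```python
from dataclasses import dataclass


@dataclass
class EvalExample:
    """Hypothetical container for one evaluation sample."""
    text: str    # input text, e.g. an article to summarize
    prompt: str  # instruction asked about the input
    answer: str  # ground-truth output


def text_contamination(example: EvalExample) -> str:
    """Leak only the input text of the evaluation sample."""
    return example.text


def ground_truth_contamination(example: EvalExample) -> str:
    """Leak the input text together with the prompt and the desired answer.

    The newline-joined layout is an assumption made for illustration.
    """
    return f"{example.text}\n{example.prompt}\n{example.answer}"


def contaminate(corpus: list[str], examples: list[EvalExample], mode: str) -> list[str]:
    """Return a new corpus with one contamination document appended per example."""
    builders = {
        "text": text_contamination,
        "ground_truth": ground_truth_contamination,
    }
    build = builders[mode]  # raises KeyError for an unknown mode
    return corpus + [build(ex) for ex in examples]
```

For example, `contaminate(clean_corpus, eval_examples, "ground_truth")` would yield the corpus used for the ground-truth condition, while passing `"text"` yields the text-only condition. Training one model per condition, plus one on the clean corpus, is what allows the downstream scores to be compared.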

Key Finding 1: Ground Truth Matters

Ground-truth contamination can significantly improve the model’s performance, which highlights the importance of accounting for ground truth in contamination analysis. Text contamination alone does not show such an improvement.

Contamination effects

Key Finding 2: U-Shaped Effects

Perhaps surprisingly, more repetitions aren’t always better! The effect of data contamination on downstream performance can be U-shaped in the number of times the evaluation data is repeated in the pre-training corpus.
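A repetition sweep of this kind could be set up roughly as follows. This is a sketch under stated assumptions: the function names, the shuffling step, and the default repetition factors are illustrative choices, not values taken from the paper.

```python
import random


def inject_repeated(corpus, eval_docs, repetitions, seed=0):
    """Return a shuffled copy of `corpus` with every eval document added `repetitions` times."""
    if repetitions < 0:
        raise ValueError("repetitions must be non-negative")
    mixed = list(corpus) + [doc for doc in eval_docs for _ in range(repetitions)]
    # Shuffle with a fixed seed so the leaked copies are spread through the corpus
    # reproducibly instead of sitting together at the end.
    random.Random(seed).shuffle(mixed)
    return mixed


def repetition_sweep(corpus, eval_docs, factors=(0, 1, 5, 10, 20)):
    """Build one corpus per repetition factor; the default factors are hypothetical."""
    return {k: inject_repeated(corpus, eval_docs, k) for k in factors}
```

Pre-training one model per entry of `repetition_sweep(...)` and plotting the evaluation score against the repetition factor is the kind of experiment in which a U-shaped curve would become visible. A factor of 0 gives the uncontaminated baseline.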

U-shaped effects

Key Finding 3: Detection Methods Have Drawbacks

We find that the n-gram-based methods commonly used in existing studies to detect contamination have significant drawbacks.
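For readers unfamiliar with this family of detectors, a generic n-gram overlap check looks roughly like the sketch below. It is not the exact procedure from any particular LLM report: whitespace tokenization, the function names, and the defaults `n=8` and `threshold=0.7` are assumptions chosen for illustration.

```python
def ngrams(text, n):
    """Return the set of word-level n-grams in `text` (lowercased, whitespace-tokenized)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_index(corpus, n):
    """Collect every n-gram that occurs anywhere in the pre-training corpus."""
    index = set()
    for doc in corpus:
        index |= ngrams(doc, n)
    return index


def overlap_ratio(sample, index, n):
    """Fraction of the sample's n-grams that also occur in the corpus index."""
    grams = ngrams(sample, n)
    if not grams:
        return 0.0  # sample shorter than n tokens: nothing to match
    return len(grams & index) / len(grams)


def is_contaminated(sample, index, n=8, threshold=0.7):
    """Flag a sample when its overlap ratio reaches the (hypothetical) threshold."""
    return overlap_ratio(sample, index, n) >= threshold
```

Because the check relies on exact token matches, its verdict depends heavily on the choice of `n`, the threshold, and the tokenization. A lightly reworded copy of an evaluation sample can share no n-grams with the corpus and pass unflagged, which is one generally recognized limitation of this approach and not necessarily the specific drawback analyzed in the paper.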

Detection method issues

Implications

We critically analyze how contamination is assessed in existing LLM reports and find that the evaluation practices used there are insufficient.

Analysis of existing practices

See the full research page for more details.