Investigating Data Contamination for Pre-training Language Models
Abstract
We analyze the effects of data contamination in the pre-training stage of LMs by deliberately introducing contamination into pre-training corpora in two ways: text contamination and ground-truth contamination. We find that ground-truth contamination can significantly improve model performance, while text contamination alone does not show such enhancement.
Summary
Deliberately contaminating pretraining data reveals surprising U-shaped effects and highlights flaws in current contamination detection.
Note: My contribution to this work was limited to (1) proposing the main question and (2) proposing specific experiments, e.g., ratcheting up the amount of data contamination. I did not contribute to implementing the experiments.
Methodology
To quantify how data contamination affects LM performance in downstream tasks, we deliberately introduce contamination into pre-training corpora in two ways:
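- Text contamination: the input text of evaluation samples appears in the pre-training corpus.
- Ground-truth contamination: the prompts asked on the input and the desired outputs appear in the pre-training corpus.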
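The two contamination modes can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the experiments: the `EvalSample` fields, the `contaminate` function, and the `repeats` parameter are hypothetical names, and the sketch assumes that text contamination leaks only the evaluation input while ground-truth contamination leaks the input together with its prompt and answer.

```python
import random
from dataclasses import dataclass


@dataclass
class EvalSample:
    text: str    # input passage or question from the evaluation set
    prompt: str  # instruction asked about the input
    answer: str  # desired output (the ground truth)


def contaminate(corpus, samples, mode, repeats=1, seed=0):
    """Return a shuffled copy of `corpus` with evaluation data mixed in.

    mode="text":         leak only the evaluation input text (assumed definition).
    mode="ground_truth": leak input, prompt, and answer together (assumed definition).
    repeats:             how many copies of each leaked document to add.
    """
    if mode == "text":
        leaked = [s.text for s in samples]
    elif mode == "ground_truth":
        leaked = [f"{s.text}\n{s.prompt}\n{s.answer}" for s in samples]
    else:
        raise ValueError(f"unknown contamination mode: {mode!r}")

    mixed = list(corpus) + leaked * repeats
    random.Random(seed).shuffle(mixed)  # deterministic shuffle for reproducibility
    return mixed
```

Sweeping `repeats` over a range of values (for example 1, 5, 10, 20) and pre-training one model per setting is one way to ratchet up the amount of contamination while keeping everything else fixed.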
Key Finding 1: Ground Truth Matters
Ground-truth contamination can significantly improve the model's performance, which highlights the importance of considering ground truths in contamination analysis. Text contamination alone does not show such enhancement.

Key Finding 2: U-Shaped Effects
Perhaps surprisingly, more repetitions aren’t always better! The effect of data contamination can be U-shaped in the number of times that eval data is repeated in the pre-training corpus.

Key Finding 3: Detection Methods Have Drawbacks
We find that the n-gram-based methods commonly used in existing studies to detect contamination have significant drawbacks.
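To make the idea of n-gram-based detection concrete, here is a minimal sketch of an overlap check. The function names, the whitespace tokenizer, and the default `n=8` are assumptions chosen for illustration; real reports use their own tokenizers and thresholds. The usage example shows one plausible weakness (a light paraphrase scores zero overlap), which may or may not be among the specific drawbacks analyzed in the full work.

```python
def ngrams(text, n):
    """Return the set of word-level n-grams in `text` (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(sample, corpus_docs, n=8):
    """Fraction of the sample's n-grams that also occur somewhere in the corpus."""
    sample_grams = ngrams(sample, n)
    if not sample_grams:
        return 0.0  # sample shorter than n tokens: nothing to compare
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(sample_grams & corpus_grams) / len(sample_grams)


sample = "the quick brown fox jumps over the lazy dog"

# A verbatim copy embedded in a larger document is fully detected.
print(overlap_ratio(sample, ["intro " + sample + " outro"], n=4))  # 1.0

# A light paraphrase shares no 4-gram with the sample and goes undetected.
print(overlap_ratio(sample, ["a fast brown fox leaps over a sleepy dog"], n=4))  # 0.0
```

A sample is typically flagged as contaminated when this ratio exceeds some threshold, so the result depends heavily on the choice of `n`, the tokenizer, and the threshold.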

Implications
We critically analyze how existing LLM reports assess contamination and argue that the evaluation practices they rely on are insufficient.

See the full research page for more details.
