ZIP-FIT: Embedding-Free Data Selection via Compression-Based Alignment

Elyas Obbad, Iddah Mlauzi, Brando Miranda, Rylan Schaeffer, Kamal Obbad, Suhana Bedi, Sanmi Koyejo

arXiv preprint, under review

October 2024

Abstract

Data selection is crucial for optimizing language model performance on specific tasks. We introduce ZIP-FIT, a data selection framework that uses gzip compression to directly measure alignment between potential training data and the target task distribution.
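
To make the idea of compression-based alignment concrete, here is a minimal sketch in Python. It assumes that alignment is scored as one minus the normalized compression distance (NCD) between a candidate and each target example, averaged over the target set; the abstract above does not spell out the exact scoring rule, so treat that choice as an assumption. The helper names (compressed_size, ncd, alignment, select_top_k) are hypothetical and used only for illustration.

```python
import gzip


def compressed_size(text: str) -> int:
    """Length in bytes of the gzip-compressed UTF-8 encoding of text."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(a: str, b: str) -> float:
    """Normalized compression distance; lower means the strings share more structure."""
    ca, cb = compressed_size(a), compressed_size(b)
    cab = compressed_size(a + b)
    return (cab - min(ca, cb)) / max(ca, cb)


def alignment(candidate: str, target_examples: list[str]) -> float:
    """Assumed scoring rule: mean of (1 - NCD) against every target example."""
    scores = [1.0 - ncd(candidate, t) for t in target_examples]
    return sum(scores) / len(scores)


def select_top_k(candidates: list[str], target_examples: list[str], k: int) -> list[str]:
    """Return the k candidates with the highest alignment to the target examples."""
    ranked = sorted(candidates, key=lambda c: alignment(c, target_examples), reverse=True)
    return ranked[:k]


# Toy data, invented for illustration only.
target = ["Theorem: for all natural numbers n, n + 0 = n. Proof: by induction on n."]
pool = [
    "Recipe: whisk two eggs with sugar, then fold in flour and bake for 30 minutes.",
    "Theorem: for all natural numbers n, n + 0 = n. Proof: by induction on n.",
]
best = select_top_k(pool, target, k=1)
```

No embedding model or GPU is involved in this sketch: the only cost is running gzip on each candidate, each target example, and their concatenations, which is where the "faster and simpler" framing comes from.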

Summary

ZIP-FIT selects training data by compression-based alignment and is reported to outperform embedding-based methods while being faster and simpler.