ZIP-FIT: Embedding-Free Data Selection via Compression-Based Alignment
Abstract
Data selection is crucial for optimizing language model performance on specific tasks. We introduce ZIP-FIT, a data selection framework that uses gzip compression to directly measure alignment between potential training data and the target task distribution.
Summary
Compression-based data selection that outperforms embedding-based methods while being faster and simpler.
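To make the idea of compression-based alignment concrete, the following is a minimal sketch and not the authors' implementation. It assumes alignment is scored with normalized compression distance (NCD), a standard compression-based similarity measure; the text above does not specify the exact formula. The function names `compressed_size`, `ncd`, `alignment_score`, and `select_top_k` are hypothetical and introduced only for illustration.

```python
import gzip
from statistics import mean


def compressed_size(text: str) -> int:
    """Length in bytes of the gzip-compressed UTF-8 encoding of text."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(a: str, b: str) -> float:
    """Normalized compression distance (assumed metric, see lead-in).

    Lower values mean the two strings share more compressible structure.
    """
    size_a, size_b = compressed_size(a), compressed_size(b)
    size_ab = compressed_size(a + b)
    return (size_ab - min(size_a, size_b)) / max(size_a, size_b)


def alignment_score(candidate: str, target_examples: list[str]) -> float:
    """Mean similarity (1 - NCD) of one candidate to the target examples."""
    return mean(1.0 - ncd(candidate, t) for t in target_examples)


def select_top_k(candidates: list[str], target_examples: list[str], k: int) -> list[str]:
    """Return the k candidates with the highest alignment score."""
    ranked = sorted(
        candidates,
        key=lambda c: alignment_score(c, target_examples),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    # Toy data: one target-task example and two candidate training strings.
    target = ["def add(x, y):\n    return x + y\n\ndef sub(x, y):\n    return x - y\n"]
    pool = [
        "Preheat the oven to 200 degrees and whisk the eggs with sugar until pale and fluffy.\n",
        "def mul(x, y):\n    return x * y\n\ndef div(x, y):\n    return x / y\n",
    ]
    for text in select_top_k(pool, target, k=2):
        print(round(alignment_score(text, target), 3), repr(text[:30]))
```

In this sketch, a candidate that shares substrings with the target examples compresses well when concatenated with them, so it receives a higher score and is ranked first. No embeddings or model forward passes are involved, which is the sense in which such a selector can be simple and fast.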
