Rylan Schaeffer



ZIP-FIT: Embedding-Free Data Selection via Compression-Based Alignment

Elyas Obbad, Iddah Mlauzi, Brando Miranda, Rylan Schaeffer, Kamal Obbad, Suhana Bedi, Sanmi Koyejo

arXiv preprint (under review)

October 2024

Abstract

Data selection is crucial for optimizing language model performance on specific tasks. We introduce ZIP-FIT, a data selection framework that uses gzip compression to directly measure alignment between potential training data and the target task distribution.
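To make the idea of compression-based alignment concrete, here is a minimal sketch, not the authors' implementation. It assumes alignment is scored as one minus the normalized compression distance (NCD), a standard compression-based similarity measure; the abstract only states that gzip is used, so the exact scoring rule here is an assumption. The function names (`compressed_size`, `ncd`, `alignment`, `select_top_k`) are hypothetical and chosen for illustration.

```python
import gzip
from typing import List, Sequence


def compressed_size(text: str) -> int:
    """Length in bytes of the gzip-compressed UTF-8 encoding of `text`."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(a: str, b: str) -> float:
    """Normalized compression distance between two strings.

    Lower values mean the strings share more structure that gzip can exploit.
    Assumption: NCD is used here as an illustrative alignment measure.
    """
    size_a = compressed_size(a)
    size_b = compressed_size(b)
    size_ab = compressed_size(a + b)
    return (size_ab - min(size_a, size_b)) / max(size_a, size_b)


def alignment(candidate: str, target_examples: Sequence[str]) -> float:
    """Mean similarity (1 - NCD) between a candidate and the target examples."""
    scores = [1.0 - ncd(candidate, target) for target in target_examples]
    return sum(scores) / len(scores)


def select_top_k(
    candidates: Sequence[str], target_examples: Sequence[str], k: int
) -> List[str]:
    """Return the k candidates with the highest alignment to the target task."""
    ranked = sorted(
        candidates,
        key=lambda text: alignment(text, target_examples),
        reverse=True,
    )
    return ranked[:k]
```

Calling `select_top_k(candidate_pool, target_examples, k)` with a list of candidate training texts and a small sample from the target task returns the `k` highest-scoring candidates. No embedding model is involved; the only dependency is the standard-library `gzip` module. Scores on very short strings are noisy because of gzip's fixed header overhead.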

Summary

ZIP-FIT is a compression-based data selection method that outperforms embedding-based methods while being faster and simpler.