DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
Authors: Sanh, Debut, Chaumond, Wolf (HuggingFace)
Venue: EMC² workshop (Energy Efficient Machine Learning and Cognitive Computing) @ NeurIPS 2019.
PDF: https://arxiv.org/pdf/1910.01108.pdf
Background
The number of parameters in language models has been increasing rapidly

Idea
Distill a pretrained Transformer (BERT) into a smaller Transformer that is 60% of the original's size.
DistilBERT is 60% faster at inference time. The student is trained with a triple loss (see the sketch after this list):
- Supervised training loss i.e. masked language modeling loss
- Cross entropy between student and teacher, with temperature parameter in softmax
- Cosine embedding loss between student and teacher hidden representations
- In addition, the student is initialized by taking one out of every two teacher layers (part of the recipe, not a loss term)
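As a concrete reference, here is a minimal PyTorch-style sketch of the triple loss and the layer-copying initialization. It assumes hypothetical `student`/`teacher` modules that return `(logits, hidden_states)` for the same masked batch; the temperature and loss weights are placeholder values, not the paper's exact settings, and the distillation term is written as KL divergence over temperature-softened distributions (equivalent to the softened cross entropy up to a constant in the teacher).

```python
import torch
import torch.nn.functional as F

def triple_loss(student, teacher, input_ids, attention_mask, mlm_labels,
                T=2.0, w_mlm=1.0, w_ce=1.0, w_cos=1.0):
    """MLM loss + softened student/teacher cross entropy + cosine embedding loss."""
    s_logits, s_hidden = student(input_ids, attention_mask)
    with torch.no_grad():                                   # the teacher is frozen
        t_logits, t_hidden = teacher(input_ids, attention_mask)

    # 1) Supervised MLM loss (labels are -100 everywhere except masked positions).
    loss_mlm = F.cross_entropy(s_logits.view(-1, s_logits.size(-1)),
                               mlm_labels.view(-1), ignore_index=-100)

    # 2) Distillation term: temperature-softened teacher vs. student distributions,
    #    scaled by T^2 so gradient magnitudes stay comparable across temperatures.
    loss_ce = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                       F.softmax(t_logits / T, dim=-1),
                       reduction="batchmean") * (T ** 2)

    # 3) Cosine embedding loss aligning student and teacher hidden states.
    target = s_hidden.new_ones(s_hidden.size(0) * s_hidden.size(1))
    loss_cos = F.cosine_embedding_loss(s_hidden.view(-1, s_hidden.size(-1)),
                                       t_hidden.view(-1, t_hidden.size(-1)),
                                       target)

    return w_mlm * loss_mlm + w_ce * loss_ce + w_cos * loss_cos

def init_student_from_teacher(student_layers, teacher_layers):
    """Initialize the student by copying one out of every two teacher layers."""
    for i, layer in enumerate(student_layers):
        layer.load_state_dict(teacher_layers[2 * i].state_dict())
```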
Results
On the GLUE benchmark, DistilBERT reaches 97% of BERT's performance

DistilBERT also performs well on downstream tasks and is faster

Ablation (changing or removing one element of the training setup at a time) shows each component's contribution:
- Random weight initialization (instead of copying teacher layers) hurt most
- Removing the student-teacher cross entropy (distillation) loss
- Removing the student-teacher cosine loss on hidden representations
- Removing the masked language modeling loss (i.e. the task loss used to train the teacher) had the smallest impact

Notes
- TODO: Investigate what the following means: “We applied best practices for training BERT model recently proposed in Liu et al. [2019]. As such, DistilBERT is distilled on very large batches leveraging gradient accumulation (up to 4K examples per batch) using dynamic masking and without the next sentence prediction objective.”
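A rough sketch of what two of those practices mean in training code (Liu et al. [2019] is the RoBERTa paper): gradient accumulation sums gradients over many small batches before a single optimizer step, emulating a ~4K-example batch, and dynamic masking re-samples the masked positions every time a sequence is seen instead of fixing them once at preprocessing. The `model(inputs, labels=...)` interface returning a scalar loss, the mask token id, and the batch sizes below are hypothetical placeholders.

```python
import torch

def dynamic_mask(input_ids, mask_token_id=103, mlm_prob=0.15):
    """Re-sample masked positions on the fly (dynamic masking).
    Real BERT-style masking also replaces some tokens randomly or keeps them;
    only the simplest case is shown here."""
    labels = input_ids.clone()
    mask = torch.rand(input_ids.shape, device=input_ids.device) < mlm_prob
    labels[~mask] = -100                       # loss is computed only on masked positions
    masked_inputs = input_ids.clone()
    masked_inputs[mask] = mask_token_id
    return masked_inputs, labels

def train_epoch(model, optimizer, dataloader, micro_batch=32, target_batch=4096):
    """Emulate a ~4K-example batch by accumulating gradients over micro-batches."""
    accum_steps = target_batch // micro_batch  # e.g. 128 micro-batches of 32 examples
    optimizer.zero_grad()
    for step, batch in enumerate(dataloader):
        inputs, labels = dynamic_mask(batch["input_ids"])
        loss = model(inputs, labels=labels) / accum_steps  # scale so accumulated grads average
        loss.backward()                                    # gradients add up across micro-batches
        if (step + 1) % accum_steps == 0:
            optimizer.step()                               # one update per ~4K examples
            optimizer.zero_grad()
```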