DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
Authors: Sanh, Debut, Chaumond, Wolf (HuggingFace)
Venue: EMC² workshop (Energy Efficient Machine Learning and Cognitive Computing) @ NeurIPS 2019.
PDF: https://arxiv.org/pdf/1910.01108.pdf
Background
The number of parameters in language models has been increasing rapidly

Idea
Distill a pretrained Transformer (BERT) into a smaller Transformer that is 60% of the original's size.
DistilBERT is 60% faster at inference time. The student is trained with a triple loss (see the sketch after this list):
- Supervised training loss i.e. masked language modeling loss
- Cross entropy between student and teacher, with temperature parameter in softmax
- Cosine embedding loss between student and teacher hidden representations
- In addition, the student is initialized by taking one out of every two teacher layers (part of the recipe, not a loss term)
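As a concrete reference, here is a minimal PyTorch-style sketch of the triple loss and the layer-copying initialization. It assumes hypothetical `student`/`teacher` modules that return `(logits, hidden_states)` for the same masked batch; the temperature and loss weights are placeholder values, not the paper's exact settings, and the distillation term is written as KL divergence over temperature-softened distributions (equivalent to the softened cross entropy up to a constant in the teacher).

```python
import torch
import torch.nn.functional as F

def triple_loss(student, teacher, input_ids, attention_mask, mlm_labels,
                T=2.0, w_mlm=1.0, w_ce=1.0, w_cos=1.0):
    """MLM loss + softened student/teacher cross entropy + cosine embedding loss."""
    s_logits, s_hidden = student(input_ids, attention_mask)
    with torch.no_grad():                                   # the teacher is frozen
        t_logits, t_hidden = teacher(input_ids, attention_mask)

    # 1) Supervised MLM loss (labels are -100 everywhere except masked positions).
    loss_mlm = F.cross_entropy(s_logits.view(-1, s_logits.size(-1)),
                               mlm_labels.view(-1), ignore_index=-100)

    # 2) Distillation term: temperature-softened teacher vs. student distributions,
    #    scaled by T^2 so gradient magnitudes stay comparable across temperatures.
    loss_ce = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                       F.softmax(t_logits / T, dim=-1),
                       reduction="batchmean") * (T ** 2)

    # 3) Cosine embedding loss aligning student and teacher hidden states.
    target = s_hidden.new_ones(s_hidden.size(0) * s_hidden.size(1))
    loss_cos = F.cosine_embedding_loss(s_hidden.view(-1, s_hidden.size(-1)),
                                       t_hidden.view(-1, t_hidden.size(-1)),
                                       target)

    return w_mlm * loss_mlm + w_ce * loss_ce + w_cos * loss_cos

def init_student_from_teacher(student_layers, teacher_layers):
    """Initialize the student by copying one out of every two teacher layers."""
    for i, layer in enumerate(student_layers):
        layer.load_state_dict(teacher_layers[2 * i].state_dict())
```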
Results
On the GLUE benchmark, DistilBERT reaches 97% of BERT's performance

DistilBERT also performs well on downstream tasks and is faster

Ablation (changing or removing one element of the training setup at a time) shows each component's contribution:
- Random weight initialization (instead of copying teacher layers) hurt most
- Removing the student-teacher cross entropy (distillation) loss
- Removing the student-teacher cosine loss on hidden representations
- Removing the masked language modeling loss (i.e. the task loss used to train the teacher) had the smallest impact

Notes
- TODO: Investigate what the following means: “We applied best practices for training BERT model recently proposed in Liu et al. [2019]. As such, DistilBERT is distilled on very large batches leveraging gradient accumulation (up to 4K examples per batch) using dynamic masking and without the next sentence prediction objective.”
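A rough sketch of what two of those practices mean in training code (Liu et al. [2019] is the RoBERTa paper): gradient accumulation sums gradients over many small batches before a single optimizer step, emulating a ~4K-example batch, and dynamic masking re-samples the masked positions every time a sequence is seen instead of fixing them once at preprocessing. The `model(inputs, labels=...)` interface returning a scalar loss, the mask token id, and the batch sizes below are hypothetical placeholders.

```python
import torch

def dynamic_mask(input_ids, mask_token_id=103, mlm_prob=0.15):
    """Re-sample masked positions on the fly (dynamic masking).
    Real BERT-style masking also replaces some tokens randomly or keeps them;
    only the simplest case is shown here."""
    labels = input_ids.clone()
    mask = torch.rand(input_ids.shape, device=input_ids.device) < mlm_prob
    labels[~mask] = -100                       # loss is computed only on masked positions
    masked_inputs = input_ids.clone()
    masked_inputs[mask] = mask_token_id
    return masked_inputs, labels

def train_epoch(model, optimizer, dataloader, micro_batch=32, target_batch=4096):
    """Emulate a ~4K-example batch by accumulating gradients over micro-batches."""
    accum_steps = target_batch // micro_batch  # e.g. 128 micro-batches of 32 examples
    optimizer.zero_grad()
    for step, batch in enumerate(dataloader):
        inputs, labels = dynamic_mask(batch["input_ids"])
        loss = model(inputs, labels=labels) / accum_steps  # scale so accumulated grads average
        loss.backward()                                    # gradients add up across micro-batches
        if (step + 1) % accum_steps == 0:
            optimizer.step()                               # one update per ~4K examples
            optimizer.zero_grad()
```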