DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

Authors: Sanh, Debut, Chaumond, Wolf (HuggingFace)

Venue: EMC² Workshop (Energy Efficient Machine Learning and Cognitive Computing) @ NeurIPS 2019.

PDF: https://arxiv.org/pdf/1910.01108.pdf

Background

The number of parameters in language models has been increasing rapidly, making large models costly to run on-device or under constrained computational budgets.

Idea

Distill a pretrained transformer (BERT) into a smaller student transformer that is roughly 60% of the original's size and 60% faster at inference time. The student (DistilBERT) is trained with a triple loss (a PyTorch sketch follows the list):

  1. Supervised training loss, i.e. the masked language modeling loss
  2. Cross entropy between the student's and teacher's output distributions, with a temperature parameter in the softmax
  3. Cosine embedding loss between the student's and teacher's hidden representations
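
A minimal PyTorch sketch of how these three terms might be combined. The function name, tensor shapes, temperature value, and `alpha_*` weights are illustrative assumptions rather than the paper's exact choices, and the distillation term is written as a KL divergence, which differs from the teacher-student cross entropy only by a constant with respect to the student:

```python
import torch
import torch.nn.functional as F

def distillation_triple_loss(
    student_logits,   # (batch * seq_len, vocab) student MLM logits
    teacher_logits,   # (batch * seq_len, vocab) frozen-teacher MLM logits
    student_hidden,   # (batch * seq_len, dim) student final hidden states
    teacher_hidden,   # (batch * seq_len, dim) teacher final hidden states
    mlm_labels,       # (batch * seq_len,) masked-token targets, -100 elsewhere
    temperature=2.0,  # assumed value; the paper treats this as a hyperparameter
    alpha_mlm=1.0, alpha_ce=1.0, alpha_cos=1.0,  # assumed loss weights
):
    """Weighted sum of DistilBERT's three training losses (sketch)."""
    # 1. Supervised masked language modeling loss on the student's predictions.
    loss_mlm = F.cross_entropy(student_logits, mlm_labels, ignore_index=-100)

    # 2. Distillation loss: the student matches the teacher's softened output
    #    distribution, with a temperature applied inside both softmaxes.
    t = temperature
    loss_ce = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t ** 2)

    # 3. Cosine embedding loss aligning student and teacher hidden states.
    ones = torch.ones(student_hidden.size(0), device=student_hidden.device)
    loss_cos = F.cosine_embedding_loss(student_hidden, teacher_hidden, ones)

    return alpha_mlm * loss_mlm + alpha_ce * loss_ce + alpha_cos * loss_cos
```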

Results

On the GLUE benchmark, DistilBERT retains 97% of BERT's performance

DistilBERT also performs well on downstream tasks (e.g. IMDb sentiment classification and SQuAD question answering) and is faster at inference
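
The speed claim is easy to check with the HuggingFace `transformers` library; the checkpoint names below are the released `bert-base-uncased` and `distilbert-base-uncased` models, and absolute timings will of course vary with hardware and sequence length:

```python
import time
import torch
from transformers import AutoModel, AutoTokenizer  # assumes transformers is installed

def time_forward_pass(model_name, text, n_repeats=20):
    """Rough wall-clock time of a single-sentence forward pass (CPU, no grad)."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).eval()
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(n_repeats):
            model(**inputs)
    return (time.perf_counter() - start) / n_repeats

text = "DistilBERT is a distilled version of BERT."
for name in ["bert-base-uncased", "distilbert-base-uncased"]:
    print(name, f"{time_forward_pass(name, text):.3f} s per forward pass")
```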

Ablation shows which changes have the biggest (negative) impact:

  1. Initializing the student with random weights instead of the teacher's weights (hurt most)
  2. Removing the student-teacher cross entropy (distillation) loss
  3. Removing the student-teacher hidden-representation cosine loss
  4. Removing the masked language modeling loss (i.e. the task loss used to train the teacher)

Notes