9 September 2022
Paper Summary - "Memorizing Transformers"
by Rylan Schaeffer
The below are my notes on Wu, Rabe, Hutchins, and Szegedy's 2022 paper,
Memorizing Transformers.
Summary
- Goal: Language models that can read and memorize new data at inference
time, acquiring new knowledge immediately
- Demonstrate that a kNN lookup into a non-differentiable memory of (key, value) pairs improves language modeling across benchmarks and tasks
- The Memorizing Transformer can use newly defined information at test time
Method
- High level: increase the size of the attention context using a k-nearest-neighbor (kNN) lookup into an external memory (a minimal code sketch follows this list)
- Input text is tokenized, then embedded
- One transformer layer near the top is a kNN-augmented attention layer
- Uses standard dense self-attention on the local context, i.e., the input subsequence
- Also does an approximate kNN search into an external memory
- After each training step, the (key, value) pairs from the local context are appended to the
end of the external memory, and the oldest pairs are dropped once the memory is full
- Gradients are not backpropagated through the external memory
- Results of kNN attention and local attention are combined using a learned gate
- Keys and queries are normalized so that keys written to memory earlier in training remain comparable in magnitude to newer keys
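
Below is a minimal single-head NumPy sketch of the mechanism described above. All names here (ExternalMemory, knn_augmented_attention, the fixed scalar gate) are my own stand-ins rather than the paper's code; the lookup is exact top-k by dot product where the paper uses approximate kNN, the causal mask is omitted, and in the real model the gate is a learned per-head parameter.

```python
# Sketch of kNN-augmented attention; not the authors' implementation.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class ExternalMemory:
    """Fixed-size FIFO memory of (key, value) pairs; no gradients flow through it."""
    def __init__(self, capacity, d_head):
        self.capacity = capacity
        self.keys = np.zeros((0, d_head))
        self.values = np.zeros((0, d_head))

    def append(self, keys, values):
        # Append the current segment's (key, value) pairs and drop the oldest
        # pairs once the memory exceeds its capacity.
        self.keys = np.concatenate([self.keys, keys])[-self.capacity:]
        self.values = np.concatenate([self.values, values])[-self.capacity:]

    def knn_lookup(self, queries, k):
        # Exact top-k by dot-product score; the paper uses approximate kNN search.
        scores = queries @ self.keys.T                          # (n_q, n_mem)
        top_idx = np.argsort(-scores, axis=-1)[:, :k]           # (n_q, k)
        top_scores = np.take_along_axis(scores, top_idx, axis=-1)
        top_values = self.values[top_idx]                       # (n_q, k, d_head)
        return top_scores, top_values

def normalize(x, eps=1e-6):
    # Normalize keys and queries so pairs stored earlier stay comparable
    # in magnitude to newer ones.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def knn_augmented_attention(queries, keys, values, memory, k=32, gate=0.5):
    """Single-head sketch: combine local attention with kNN attention over memory."""
    queries, keys = normalize(queries), normalize(keys)
    d = queries.shape[-1]

    # Standard dense self-attention over the local context (causal mask omitted).
    local_scores = queries @ keys.T / np.sqrt(d)
    local_out = softmax(local_scores) @ values                  # (n_q, d_head)

    # Attention over the top-k retrieved (key, value) pairs from external memory.
    if memory.keys.shape[0] > 0:
        mem_scores, mem_values = memory.knn_lookup(queries, k)
        mem_weights = softmax(mem_scores / np.sqrt(d))          # (n_q, k)
        mem_out = np.einsum('qk,qkd->qd', mem_weights, mem_values)
    else:
        mem_out = np.zeros_like(local_out)

    # A learned sigmoid gate (fixed scalar stand-in here) mixes the two results.
    out = gate * mem_out + (1.0 - gate) * local_out

    # After processing the segment, its (key, value) pairs are added to memory.
    memory.append(keys, values)
    return out

# Example usage with random data (d_head = 64, segment length 128):
rng = np.random.default_rng(0)
mem = ExternalMemory(capacity=2048, d_head=64)
q = rng.standard_normal((128, 64))
k_ = rng.standard_normal((128, 64))
v = rng.standard_normal((128, 64))
out = knn_augmented_attention(q, k_, v, mem, k=32)   # shape (128, 64)
```

In the actual architecture this happens inside a single attention layer near the top of the stack, once per head, and gradients are not backpropagated into the stored keys and values.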
Results
- Datasets:
- arXiv Math
- GitHub
- Isabelle (formal theorem-proving language)
- Colossal Cleaned Common Crawl (C4)
- PG-19 (English-language books), good for long-range natural language text modeling
- Adding the kNN memory improved the token-level perplexity
- Increasing the memory size yields larger perplexity improvements
- The model most often retrieved rare words (e.g., proper names, references, citations)
tags: machine-learning - neuro-ai - memory