9 September 2022
Paper Summary - "Memorizing Transformers"
by Rylan Schaeffer
Below are my notes on Wu, Rabe, Hutchins, and Szegedy's 2022 paper Memorizing Transformers.
Summary
  - Goal: Language models that can read and memorize new data at inference 
time, acquiring new knowledge immediately
 
  - Demonstrate that a kNN lookup into a non-differentiable memory of (key, value) pairs improves language modeling across benchmarks and tasks
 
  - The Memorizing Transformer can use newly defined information (e.g. definitions and theorems introduced earlier in a document) at test time
 
Method
  - High level: increase the effective size of the attention context with a k-nearest-neighbor (kNN) lookup into an external memory
 
  - Input text is tokenized, then embedded
 
  - One transformer layer near the top of the stack is a kNN-augmented attention layer (sketched in the first code block after this list)
    
      - Uses standard dense self-attention on the local context, i.e. the input subsequence
 
      - Also does an approximate kNN search into an external memory
 
    
   
  - After each training step, the (key, value) pairs from the local context are appended to the end of the external memory, and the oldest pairs are dropped once the memory reaches capacity (see the second code block after this list)
 
  - Gradients are not backpropagated through the external memory
 
  - The outputs of kNN attention and local attention are combined using a learnt per-head gate
 
  - Keys and queries are normalized so that keys written to memory earlier in training stay comparable in magnitude to newer ones
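
To make the layer concrete, here is a minimal NumPy sketch of the kNN-augmented attention described above: local causal attention over the input subsequence, an exact top-k lookup standing in for the paper's approximate kNN search, normalized keys and queries, and a learnt gate blending the two results. The names (knn_augmented_attention, memory_keys, memory_values, gate_logit, top_k) and the use of L2 normalization are my own illustrative choices, not the paper's code.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def l2_normalize(x, eps=1e-6):
    # Normalizing keys/queries keeps older memory entries comparable in
    # magnitude to newer ones (shown here as L2 normalization; an assumption
    # about the exact form used in the paper).
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def knn_augmented_attention(queries, keys, values,
                            memory_keys, memory_values,
                            gate_logit, top_k=32):
    """Single-head sketch.

    queries, keys, values: (seq_len, d) projections of the local context.
    memory_keys, memory_values: (mem_size, d) external memory (no gradients).
    gate_logit: scalar learnt gate parameter (the paper gates per head).
    """
    seq_len, d = queries.shape
    queries = l2_normalize(queries)
    keys = l2_normalize(keys)
    memory_keys = l2_normalize(memory_keys)

    # Standard dense self-attention over the local context, causally masked.
    local_scores = queries @ keys.T / np.sqrt(d)               # (seq_len, seq_len)
    causal_mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    local_scores[causal_mask] = -1e9
    local_out = softmax(local_scores) @ values                 # (seq_len, d)

    # kNN attention: retrieve the top-k (key, value) pairs per query from the
    # external memory and attend over just those. Exact top-k here; the paper
    # uses an approximate nearest-neighbor search to scale to large memories.
    mem_scores = queries @ memory_keys.T / np.sqrt(d)          # (seq_len, mem_size)
    topk_idx = np.argpartition(-mem_scores, top_k - 1, axis=-1)[:, :top_k]
    topk_scores = np.take_along_axis(mem_scores, topk_idx, axis=-1)
    topk_values = memory_values[topk_idx]                      # (seq_len, top_k, d)
    mem_out = np.einsum('sk,skd->sd', softmax(topk_scores), topk_values)

    # Learnt gate blends memory attention with local attention.
    gate = 1.0 / (1.0 + np.exp(-gate_logit))                   # sigmoid
    return gate * mem_out + (1.0 - gate) * local_out
```

When the gate saturates near zero this reduces to ordinary local attention, so each head can learn how much to rely on the memory.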
 
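And a sketch of the external-memory update performed after each training step, assuming the memory is a fixed-capacity FIFO buffer of (key, value) pairs; the ExternalMemory class and its names are my own.

```python
import numpy as np

class ExternalMemory:
    """Fixed-capacity FIFO buffer of (key, value) pairs; not differentiated."""

    def __init__(self, capacity, d):
        self.capacity = capacity
        self.keys = np.zeros((0, d))
        self.values = np.zeros((0, d))

    def append(self, new_keys, new_values):
        # Append the local context's (key, value) pairs and drop the oldest
        # entries once capacity is exceeded. No gradients flow through this.
        self.keys = np.concatenate([self.keys, new_keys])[-self.capacity:]
        self.values = np.concatenate([self.values, new_values])[-self.capacity:]

# Usage after each training step on a subsequence:
#   memory.append(local_keys, local_values)
```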

Results
  - Datasets:
    
      - arXiv Math
 
      - GitHub
 
      - Isabelle (a language for formal mathematics)
 
      - Colossal Clean Crawled Corpus (C4)
 
      - PG-19 (English-language books), a good benchmark for long-range natural-language modeling
 
    
   
  - Adding the kNN memory improved the token-level perplexity
 

  - Increasing the size of the external memory increases the benefit it provides
 

  - The model most often retrieved rare words (e.g. proper names, references, citations)
 
  tags: machine-learning - neuro-ai - memory