Rylan Schaeffer


Normalization in Deep Learning

Layer Normalization

Ba, Kiros, and Hinton (2016) introduced Layer Normalization. Let $a \in \mathbb{R}^D$ be a $D$-dimensional vector of activations. Layer Normalization first computes the mean and standard deviation across the $D$ dimensions:

$$\mu = \frac{1}{D} \sum_{d=1}^{D} a_d \qquad\qquad \sigma = \sqrt{\frac{1}{D} \sum_{d=1}^{D} (a_d - \mu)^2 + \epsilon}$$

where $\epsilon > 0$ is a small constant to avoid division by zero. Layer Normalization then normalizes the activations:

$$\frac{a - \mu}{\sigma}$$

One can optionally introduce learnable parameters $\gamma, \beta \in \mathbb{R}^D$ to scale and shift the normalized activations:

$$\gamma \odot \frac{a - \mu}{\sigma} + \beta$$

where $\odot$ denotes elementwise multiplication.
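For concreteness, here is a minimal NumPy sketch of the computation above. The function name and the default value of `eps` are my own choices, not taken from the paper.

```python
import numpy as np

def layer_norm(a, gamma=None, beta=None, eps=1e-5):
    """Normalize a 1-D activation vector across its D dimensions."""
    mu = a.mean()                                   # mean over the D dimensions
    sigma = np.sqrt(((a - mu) ** 2).mean() + eps)   # std, with eps to avoid division by zero
    normed = (a - mu) / sigma
    if gamma is not None:
        normed = gamma * normed                     # optional learnable elementwise scale
    if beta is not None:
        normed = normed + beta                      # optional learnable elementwise shift
    return normed
```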

Root Mean Square Layer Normalization

Zhang and Sennrich (NeurIPS 2019) introduced Root Mean Square (RMS) Layer Normalization. Rather than centering, RMS Layer Norm normalizes the activations by their root mean square. Let $a \in \mathbb{R}^D$ be a vector of activations. RMS Layer Norm first calculates the root mean square:

$$\text{RMS}(a) = \sqrt{\frac{1}{D} \sum_{d=1}^{D} a_d^2}$$

RMS Layer Norm then normalizes the activations:

$$\frac{a}{\text{RMS}(a)}$$

One can optionally introduce learnable parameters $\gamma \in \mathbb{R}^D$ to scale the normalized activations:

$$\gamma \odot \frac{a}{\text{RMS}(a)}$$

This normalization forces the vectors to lie on a hypersphere of radius $\sqrt{D}$: since $\text{RMS}(a) = \|a\|_2 / \sqrt{D}$, we have $\left\| \frac{a}{\text{RMS}(a)} \right\|_2 = \sqrt{D}$.
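A minimal NumPy sketch mirroring the formulas above. The function name is my own, and practical implementations typically also add a small epsilon inside the square root for numerical stability, which the formula above omits.

```python
import numpy as np

def rms_norm(a, gamma=None):
    """Normalize a 1-D activation vector by its root mean square."""
    rms = np.sqrt((a ** 2).mean())     # root mean square of the activations
    normed = a / rms
    if gamma is not None:
        normed = gamma * normed        # optional learnable elementwise scale
    return normed

# The unscaled normalized vector has Euclidean norm sqrt(D):
a = np.random.randn(16)
print(np.linalg.norm(rms_norm(a)))    # ~= sqrt(16) = 4
```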

Batch Normalization

Instance Normalization