“Be a better friend to yourself”
Ba, Kiros, and Hinton (2016) introduced Layer Normalization. Let $a \in \mathbb{R}^D$ be a $D$-dimensional vector of activations. Layer Normalization first computes the mean and standard deviation across the $D$ dimensions:
$$\mu = \frac{1}{D}\sum_{d=1}^{D} a_d, \qquad \sigma = \sqrt{\frac{1}{D}\sum_{d=1}^{D} (a_d - \mu)^2 + \epsilon}$$

where $\epsilon > 0$ is a small constant to avoid division by zero. Then, Layer Normalization normalizes the activations:

$$\frac{a - \mu}{\sigma}$$

One can optionally introduce learnable parameters $\gamma, \beta \in \mathbb{R}^D$ to scale and shift the normalized activations:

$$\gamma \odot \frac{a - \mu}{\sigma} + \beta$$

where $\odot$ denotes elementwise multiplication.
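To make the two steps concrete, here is a minimal NumPy sketch of Layer Normalization for a single activation vector; the function name `layer_norm` and the default `eps=1e-5` are illustrative choices, not taken from the paper.

```python
import numpy as np

def layer_norm(a, gamma=None, beta=None, eps=1e-5):
    """Normalize a 1-D activation vector across its D dimensions."""
    mu = a.mean()                                   # mean over the D dimensions
    sigma = np.sqrt(((a - mu) ** 2).mean() + eps)   # std, with eps for numerical stability
    a_hat = (a - mu) / sigma                        # centered and rescaled activations
    if gamma is not None:                           # optional learnable scale
        a_hat = gamma * a_hat
    if beta is not None:                            # optional learnable shift
        a_hat = a_hat + beta
    return a_hat

# Example: the normalized activations have (approximately) zero mean and unit variance.
a = np.random.randn(8) * 3.0 + 5.0
h = layer_norm(a)
print(h.mean(), h.std())   # ~0.0, ~1.0
```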
Zhang and Sennrich (NeurIPS 2019) introduced Root Mean Square (RMS) layer normalization. Rather than centering, RMS Layer Norm normalizes the activations by their root mean square. Let $a \in \mathbb{R}^D$ be a vector of activations. RMS Layer Norm first calculates the root mean square:
$$\mathrm{RMS}(a) = \sqrt{\frac{1}{D}\sum_{d=1}^{D} a_d^2}$$

RMS Layer Norm then normalizes the activations:

$$\frac{a}{\mathrm{RMS}(a)}$$

One can optionally introduce learnable parameters $\gamma \in \mathbb{R}^D$ to scale the normalized activations:

$$\gamma \odot \frac{a}{\mathrm{RMS}(a)}$$

Since $\lVert a / \mathrm{RMS}(a) \rVert_2 = \sqrt{D}$, this forces the normalized activations to lie on a $\sqrt{D}$-scaled hypersphere.
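A matching NumPy sketch of RMS Layer Norm under the same assumptions (`rms_norm` is an illustrative name; the formula above has no $\epsilon$, though practical implementations usually add one). The final lines check the $\sqrt{D}$ claim numerically.

```python
import numpy as np

def rms_norm(a, gamma=None):
    """Normalize a 1-D activation vector by its root mean square (no centering)."""
    rms = np.sqrt((a ** 2).mean())   # RMS(a) = sqrt of the mean squared activation
    a_hat = a / rms                  # rescaled activations
    if gamma is not None:            # optional learnable scale
        a_hat = gamma * a_hat
    return a_hat

# The normalized vector has Euclidean norm sqrt(D): ||a / RMS(a)||_2 = sqrt(D).
D = 8
a = np.random.randn(D)
h = rms_norm(a)
print(np.linalg.norm(h), np.sqrt(D))   # both ~2.828
```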