$\DeclareMathOperator*{\argmax}{argmax}$ $\DeclareMathOperator*{\argmin}{argmin}$ $\DeclareMathOperator{\defeq}{\stackrel{def}{=}}$ $\DeclareMathOperator{\Tr}{Tr}$ $\DeclareMathOperator{\rank}{rank}$ $\DeclareMathOperator{\sign}{sign}$

Rylan Schaeffer > Learning > Neural Networks

Learning Algorithms


Backpropagation ("backprop" for short) TODO

Backprop as Chain Derivative


Backprop as Constrained Optimization

Backprop is most commonly derived using chain derivatives. However, backprop can also be derived as the solution to a constrained optimization problem (LeCun 1988). The idea is to see backprop as an algorithm for selecting a set of vectors $\{x_i\}_{i=1}^{L+1}$, one per layer of the network, that minimize a loss function subject to a set of consistency equations: $$ x^l = f(W^l x^{l-1} + b^l)$$ or equivalently, written in index form: $$ x_i^l = f \Big( \sum_{j=1}^{N^{l-1}} W_{ij}^l x_j^{l-1} + b_i^l \Big)$$

we have $N$ datapoints $\{x_i, y_i\}_{i=1}^N$ and a mean-squared error loss function.

Weight Perturbation

Weight Perturbation is an algorithm for learning with a global error signal that uses additive noise to estimate the gradient of the error signal with respect to each parameter (Jabri & Flower 1992). Consider training a feedforward network $y(\cdot)$ using mean squared error on training data $\{(x_i, y_i)\}_{i=1}^N$: $$ E_{tr} = \frac{1}{2} || y_i - y(x_i) ||_2^2 $$ $$x_i^l = f \Big(\sum_j W_{ij}^l x_j^{l-1} + b_i^l \Big) $$ Through a relatively simple algorithm, we can estimate $\frac{\partial E_{tr}}{\partial W_{ij}^l}$ and $\frac{\partial E_{tr}}{\partial b_{i}^l}$ simply by adding a small amount of independent noise to each parameter. To understand why this works, consider adding noise $\Omega_{ij}^l, \gamma_i^l \sim N(0, \sigma^2)$ to each parameter $W_{ij}^l, b_i^l$. The network's activity and outputs will consequently be slightly pertubed: $$ \tilde{E}_{tr} = \frac{1}{2} || y_i - \tilde{y}(x_i) ||_2^2$$ $$\tilde{x}_i^l = f \Big(\sum_j (W_{ij}^l + \Omega_{ij}^l) \tilde{x}_j^{l-1} + (b_i^l + \gamma_i^l) \Big) $$

If the variance of the additive noise $\sigma^2$ is small, we can use a linear approximation of the training error $E_{tr}$ to quantify how the additive noise will affect the training error. $$\tilde{E}_{tr} \approx E_{tr} + \sum_{l, i, j} \frac{\partial E}{\partial W_{ij}^l} \Omega_{ij}^l + \sum_{l, ij} \frac{\partial E}{\partial b_i^l} \gamma_i^l$$ Rearranging slightly, we can write that the difference between the pertubed and unperturbed training error is: $$\tilde{E}_{tr} - E_{tr} \approx \sum_{l, i, j} \frac{\partial E}{\partial W_{ij}^l} \Omega_{ij}^l + \sum_{l, ij} \frac{\partial E}{\partial b_i^l} \gamma_i^l$$ Since each noise term was sampled independently from a distribution with mean 0, we see that if we multiply both sides by a noise term $\Omega_{ab}^c$ and take the expectation with respect to that noise term, the independence and zero mean will annihilate all terms except the one containing the gradient with respect to the corresponding parameter $W_{ab}^c$: $$\begin{align*} \tilde{E}_{tr} - E_{tr} &\approx \sum_{l, i, j} \frac{\partial E}{\partial W_{ij}^l} \Omega_{ij}^l + \sum_{l, ij} \frac{\partial E}{\partial b_i^l} \gamma_i^l\\ (\tilde{E}_{tr} - E_{tr}) \Omega_{ab}^c &\approx \sum_{l, i, j} \frac{\partial E}{\partial W_{ij}^l} \Omega_{ij}^l \Omega_{ab}^c + \sum_{l, ij} \frac{\partial E}{\partial b_i^l} \gamma_i^l \Omega_{ab}^c\\ \langle (\tilde{E}_{tr} - E_{tr}) \Omega_{ab}^c \rangle_{\Omega_{ab}^c} &\approx \langle \sum_{l, i, j} \frac{\partial E}{\partial W_{ij}^l} \Omega_{ij}^l \Omega_{ab}^c + \sum_{l, ij} \frac{\partial E}{\partial b_i^l} \gamma_i^l \Omega_{ab}^c \rangle_{\Omega_{ab}^c}\\ &\approx \sum_{l, i, j} \frac{\partial E}{\partial W_{ij}^l} \langle \Omega_{ij}^l \Omega_{ab}^c \rangle_{\Omega_{ab}^c} + \sum_{l, ij} \frac{\partial E}{\partial b_i^l} \gamma_i^l \langle \Omega_{ab}^c \rangle_{\Omega_{ab}^c}\\ &\approx \frac{\partial E}{\partial W_{ab}^c} \sigma^2\\ \frac{1}{\sigma^2} \langle (\tilde{E}_{tr} - E_{tr}) \Omega_{ab}^c \rangle_{\Omega_{ab}^c} &\approx \frac{\partial E}{\partial W_{ab}^c} \end{align*}$$

We can further simplify by noting that since $E_{tr}$ is independent of $\Omega_{ab}^c$, $$ \langle E_{tr} \Omega_{ab}^c \rangle_{\Omega_{ab}^c} = E_{tr} \langle \Omega_{ab}^c \rangle_{\Omega_{ab}^c} = 0$$ and therefore $$\frac{1}{\sigma^2} \langle (\tilde{E}_{tr} - E_{tr}) \Omega_{ab}^c \rangle_{\Omega_{ab}^c} = \frac{1}{\sigma^2} \langle \tilde{E}_{tr} \Omega_{ab}^c \rangle_{\Omega_{ab}^c} \approx \frac{\partial E}{\partial W_{ab}^c}$$ This same argument applies for $\gamma_a^c$ as well, yielding: $$ \frac{1}{\sigma^2} \langle \tilde{E}_{tr} \gamma_{a}^c \rangle_{\gamma_{a}^c} \approx \frac{\partial E}{\partial b_{a}^c}$$

What we've shown is that the covariance between the pertubation of the weight and the pertubed error will (in expectation) point in the direction of the gradient. No chain derivatives necessary! Then, Weight Pertubation approximates gradient descent using the following update equations: $$ \begin{align*} \Delta W_{ij}^l &= - \eta \tilde{E}_{tr} \Omega_{ij}^l \\ \Delta b_i^l &= - \eta \tilde{E}_{tr} \gamma_i^l \end{align*} $$

Problems with Weight Pertubation

Despite obtaining an unbiased estimate of the gradient of the loss, Weight Pertubation is impractical because (like all methods using Monte Carlo estimators), its suffers from extremely high variance.

Node Perturbation

Node Pertubation (Williams 1992) improves on Weight Pertubation by identifying that the gradient of the global error can be estimated by adding noise into the bias alone. Consider adding noise $\gamma_i^l$: $$\tilde{x}_i^l = f \Big(\sum_j W_{ij}^l \tilde{x}_j^{l-1} + (b_i^l + \gamma_i^l) \Big) $$ We can use the same Weight Perturbation approach to estimate $\frac{\partial E}{\partial b_a^c}$, but how can we estimate $\frac{\partial E}{\partial W_{ab}^c}$?

Define the unperturbed, pre-activation value: $$\nu_i^l = \sum_j W_{ij}^l x_j^{l-1} + (b_i^l + \gamma_i^l)$$ Per the chain rule, $$\frac{\partial E}{\partial W_{ab}^c} = \frac{\partial E}{\partial \nu_a^c} \frac{\partial \nu_a^c}{\partial W_{ab}^c} = \frac{\partial E}{\partial \nu_a^c} x_b^c$$ We can estimate $\frac{\partial E}{\partial \nu_a^c}$ using a similar trick as WP. To first order, the difference between the perturbed and unperturbed error is $$\tilde{E}_{tr} - E_{tr} \approx \sum_{l, i} \frac{\partial E}{\partial \nu_i^l} \gamma_i^l$$ Multiply both sides by $\gamma_a^c$ and take the expectation with respect to $\gamma_a^c$: $$\langle (\tilde{E}_{tr} - E_{tr}) \gamma_a^c \rangle_{\gamma_a^c} = \frac{\partial E}{\partial \nu_a^c} \sigma^2$$ Thus, $$ \begin{align*} \frac{\partial E}{\partial W_{ab}^c} &= \frac{\partial E}{\partial \nu_a^c} x_b^c\\ &= \frac{1}{\sigma^2}\langle (\tilde{E}_{tr} - E_{tr}) \gamma_a^c \rangle_{\gamma_a^c} x_b^c \end{align*} $$

Since $tilde{x} \approx x$ to first order in $\gamma$, NP approximates gradient descent using the following update equations: $$ \begin{align*} \Delta W_{ij}^l &= - \eta \tilde{E}_{tr} \gamma_i^l x_j^l \\ \Delta b_i^l &= - \eta \tilde{E}_{tr} \gamma_i^l \end{align*} $$

Comparison of Backprop, Weight Perturbation & Node Perturbation

( Werfel, Xie and Seung 2004) offer a comparison of the convergence of Backpropagation (BP), Weight Pertuubation (WP) and Node Perturbation (NP) in a simple linear model as a function of the input dimension $N$ and output dimension $M$. They consider the following student-teacher setup, where $W^*$ is the teacher's weight matrix and where $W$ is the student's weight matrix, both in $\mathbb{R}^{M \times N}$: $$x \in \mathbb{R}^N \sim \mathcal{N}(0, I), \qquad y^* \in \mathbb{R}^M = W^* x, \qquad y = W x , \qquad L(W) = ||y - y^*||_2^2$$

Activation Functions

Logistic Sigmoid

The logistic sigmoid function $\sigma(x) = \frac{1}{1+e^{-x}} = \frac{e^x}{e^x + 1}$.

Hyperbolic Tangent

The hyperbolic tangent function $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} = \frac{e^{2x} - 1}{e^{2x} + 1}$.

Relationship between Logistic Sigmoid and Hyperbolic Tangent

The logistic sigmoid and the hyperbolic tangent functions can be expressed as transformation of each other. Specifically, $$ \begin{align*} \tanh(x) &\defeq \frac{e^{2x} - 1}{e^{2x} + 1} = \frac{2e^{2x}}{e^{2x} + 1} - \frac{e^{2x} + 1}{e^{2x} + 1} = 2 \frac{e^{2x}}{e^{2x}} \frac{1}{1 + e^{-2x}} - 1 = 2 \sigma(2x) - 1 \end{align*} $$

And therefore equivalently: $$ \begin{align*} \sigma(x) = \frac{\tanh(x/2) + 1}{2} \end{align*} $$

The consequence of this relationship is that the choice of activation function between the two doesn't much matter since each is capable of representing the one. That is, if a single scalar output from a single layer of a network using $\sigma()$ is $y(x, w) = w_0 + \sum w_i \sigma(x)$, then we can construct an equivalent network using $\tanh$ by defining $y(x, w') = w_0' + \sum w_i' \tanh(x/2)$ where $w_0' = w_0 + \sum w_i / 2$ and $w_i' = w_i / 2$ for $i \neq 0$. Then $y(x, w) = y(x, w')$.

Linear Models

Scalar Linear Neuron

One of the simplest models of a neuron we might consider is a single "neuron" outputting a weighted combination of input features:

$$ \begin{align*} y = w^T x \end{align*} $$

where $y \in \mathbb{R}$ and $w, x \in \mathbb{R}^d$. This model, as a supervised learning problem under the Mean-Squared Error loss function, has been extensively studied. In such a setting, the neuron's goal is to learn a mapping $x \rightarrow y$ from a dataset of N x-y pairs, $\{(x_n, y_n) \}_{n=1}^N$, by minimizing the training MSE $L^{tr}(w) \defeq \frac{1}{N} \sum_{n=1}^N (y_n - w^T x_n)^2$. Does this model have a unique fixed point $w^*$ that minimizes the loss? We take the gradient and set equal to zero.

$$ \begin{align*} L^{tr}(w) &\defeq \frac{1}{N} \sum_{n=1}^N (y_n - w^T x_n)^2 = \sum_{n=1}^N Tr(y_n - w^T x_n)^2\\\\ \nabla_w L(w) = 0 &= \nabla_w \frac{1}{N} \sum_{n=1}^N (y_n^2 - 2 w^T x_n y_n + w^T x_n x_n^T w)\\ 0 &= -\sum_{n=1}^N y_n x_n + \sum_{n=1}^N x_n x_n^T w^*\\ \sum_{n=1}^N x_n x_n^T w^* &= \sum_{n=1}^N y_n x_n \end{align*} $$

One important observation in that $\sum_{n=1}^N x_n x_n^T$ is the empiric input-input covariance matrix and $\sum_{n=1}^N y_n x_n$ is the empiric output-input covariance "matrix" (although here the covariance is a vector since $y$ is a scalar), and the optimal weight vector $w^*$ relates these two covariance matrices. Another important observation is that $C \defeq \sum_{n=1}^N x_n x_n^T$ is the sum of $N$ rank-$D$ matrices, meaning that $\rank(C) \leq min(N, D)$. The third important observation is that the Hessian $H_{ij} \defeq \frac{\partial L^{tr}(w)}{\partial w_i \partial w_j}= C_{ij}$. More on this in a bit.

The first question to ask is what behavior we should expect from such a model? Two regimes emerge, one where $N < D$ and the other where $N \geq D$. In the first regime, where the number of observations $N$ is less than the dimensionality of the data $D$, $C$ is a $D \times D$ matrix of at most rank $N$, meaning that $C$ is not invertible and therefore infinitely many solution exist. The easiest way to see this is to note that we have $N$ equations and $D$ unknowns, meaning we have fewer equations than unknowns. Consequently, in this regime, the training error $L^{tr}(w)$ is guaranteed to be 0. For those unconvinced, we can prove this claim via construction:

PCA Networks

Previously, we considered the linear neuron learning via supervised labels $y$. Here, we'll instead consider a linear neuron receiving data $x \in \mathbb{R}^d$ and learning a linear readout that produces a low dimension representation $y \in \mathbb{R}^k$, where $k < d$: $$y = w^T x $$ Rather than defining an objective function and then showing the emergent update rule, we'll instead consider an update rule inspired by Hebbian learning principles: if the model's output is correlated with a feature in the input space, the corresponding weight should be strengthened. This learning rule will be shown to a common linear unsupervised learning technique, principal component analysis on the data. The Hebbian update rule is: $$w_{t+1} = w_{t} + \eta y_t x_t $$

However, this naive Hebbian learning rule is unstable! We can see this by considering the continuous time differential equation and showing that the magnitude of $w$ diverges. $$w_{t+1} - w_t = \eta y_t x_t \approx \frac{\Delta w}{\Delta t} = \frac{\eta}{\Delta t} y_t x_t \approx \tau \frac{dw}{dt} = y(t) x(t) $$ But if we replace $y_t = w_t^T x_t$, we see the the change in $w$ scales proportional to the norm of $x(t)$: $$ \begin{align*} \tau \frac{dw}{dt} &= y(t) x(t) \\ &= w(t)^T x(t) x(t)\\ &= x(t) w(t)^T x(t) \\ \tau \frac{d}{dt} ||w_t||_2^2 &= w(t)^T x(t) w(t)^T x(t)\\ &= ||w(t) x(t)||_2^2 \end{align*} $$ This implies that $w(t)$ will diverge to infinity since the change is always positive, regardless of inputs. We can also show the direction that $w$ diverges in by projecting the weight vector onto the data covariance matrix $C = x(t) x(t)^T$: $$ \begin{align*} \tau \frac{dw}{dt} &= x(t) x(t)^T w(t)\\ &= C w(t)\\ &= C \sum_i w_i(t) \vec{\lambda}_i\\ \tau \frac{dw_i}{dt} &= \lambda_i w_i(t) \end{align*} $$ Solving the dynamics yields

To fix this, we propose a slight modification of the Hebbian learning rule called Oja's rule:

Oja's Rule

Sanger's Rule

Deep Linear Neural Networks

A deep linear neural network is a multi-layered neural network with the non-linear "activations" removed. That is, it is a composition of linear transformations (typically matrices). For instance, $\hat{y} = W_{32} W_{21} x$ is a two-layer linear network. Despite lacking the ability to learn non-linear functions, this class of network is attractive because it can be tractably analyzed and because its behaves similarly to its non-linear cousins under mild conditions.

One question we might immediately ask is for a two-layer linear neural network trained under mean squared error (MSE) on dataset $\{(x_i, y_i)\}_1^N$, what are the coupled differential equations that describe the network's learning dynamics? Let $x \in \mathbb{R}^i$ be the input, $y \in \mathbb{R}^o$ be the output, and $L = \frac{1}{N} \sum_{i=1}^N (y - W_{32} W_{21} x)^T (y - W_{32} W_{21} x)$ be the objective function. Define the input-input correlation matrix $\Sigma_{11} = \frac{1}{N}\sum_{i=1}^N x x^T$ and the input-output correlation matrix $\Sigma_{31} = \frac{1}{N}\sum_{i=1}^N y x^T$. We first derive $-\frac{\partial L}{\partial W_{32}}$:

$$ \begin{align*} -\frac{\partial L}{\partial W_{32}} &= -\frac{1}{\partial W_{32}} \frac{1}{N} \sum_{i=1}^N (y - W_{32} W_{21} x)^T (y - W_{32} W_{21} x) \nonumber \\ &= -\frac{1}{\partial W_{32}} \frac{1}{N} \sum_{i=1}^N tr(y^T y - 2 y^T W_{32} W_{21} x + x^T W_{21}^T W_{32}^T W_{32} W_{21} x) \nonumber \\ &= \frac{1}{N} \sum_{i=1}^N 2 \frac{1}{\partial W_{32}} tr(y^T W_{32} W_{21} x) - \frac{1}{\partial W_{32}}tr(x^T W_{21}^T W_{32}^T W_{32} W_{21} x) \nonumber \\ &= \frac{1}{N} \sum_{i=1}^N 2 y x^T W_{21}^T - 2 W_{32} W_{21} x x^T W_{21}^T \nonumber \\ &= 2 \Sigma_{31} W_{21}^T - 2 W_{32} W_{21} \Sigma_{11} W_{21}^T \nonumber \\ -\frac{\partial L}{\partial W_{32}} &= 2 (\Sigma_{31} - W_{32} W_{21} \Sigma_{11}) W_{21}^T \end{align*} $$

We similarly derive $-\frac{\partial L}{\partial W_{21}}$: $$ \begin{align*} -\frac{\partial L}{\partial W_{21}} = 2 \Sigma_{32}^T (\Sigma_{31} - W_{32} W_{21} \Sigma_{11}) \end{align*} $$

Together, these coupled nonlinear differential equations define the learning dynamics of a linear neural network. One key aspect is that the learning dynamics are non-linear, meaning that deep linear networks are more sophisticated than shallow (i.e. single-layer) networks. These non-linear learning dynamics arise from the choice of error function: MSE produces a loss function that is quartic with respect to the weights, resulting in a gradient that is cubic with respect to the weights.

Nonlinear Models


One of the earliest and simplest models of a nonlinear neural network is a binary classifier called the Perceptron, the non-linear equivalent of the single linear neuron:

$\begin{align*} y = \sign(w^T x) \end{align*}$

An amazing property of the perceptron was that for a given classification problem, if the