Backpropagation (“backprop” for short) is the standard algorithm for computing the gradient of a loss function with respect to the weights of a neural network.

Backprop is most commonly derived by repeated application of the chain rule. However, it can also be derived as the solution to a constrained optimization problem, as shown by LeCun (1988). The idea is to view backprop as an algorithm for selecting a set of state vectors $\{x^l\}_{l=0}^{L}$, one per layer of the network (with $x^0$ the input), that minimize a loss function subject to a set of consistency constraints:
\[x^l = f(W^l x^{l-1} + b^l)\]

or equivalently, written in index form:
\[x_i^l = f \Big( \sum_{j=1}^{N^{l-1}} W_{ij}^l x_j^{l-1} + b_i^l \Big)\]
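
To make the constraints concrete, here is a minimal NumPy sketch of a forward pass that builds the state vectors $x^l$ layer by layer, so that each consistency equation holds by construction. The layer sizes, the random parameters, and the choice of $\tanh$ for $f$ are illustrative assumptions, not part of LeCun's formulation.

```python
import numpy as np

def forward_states(x0, weights, biases, f=np.tanh):
    """Build the state vectors x^0, ..., x^L from the input x^0.

    Each x^l satisfies the consistency constraint
        x^l = f(W^l x^{l-1} + b^l)
    by construction; tanh for f is an illustrative choice.
    """
    states = [x0]
    for W, b in zip(weights, biases):
        states.append(f(W @ states[-1] + b))
    return states

# Illustrative network: N^0 = 4, N^1 = 5, N^2 = 3, N^3 = 2.
rng = np.random.default_rng(0)
sizes = [4, 5, 3, 2]
weights = [rng.standard_normal((n, m)) for m, n in zip(sizes, sizes[1:])]
biases = [rng.standard_normal(n) for n in sizes[1:]]

xs = forward_states(rng.standard_normal(sizes[0]), weights, biases)
for l, x in enumerate(xs):
    print(f"x^{l} has shape {x.shape}")
```

Viewed this way, the $x^l$ are free variables of the optimization, and the forward pass is simply the unique choice of those variables that satisfies all the constraints.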