Backpropagation (“backprop” for short) TODO
Backprop is most commonly derived using the chain rule. However, it can also be derived as the solution to a constrained optimization problem, as shown by LeCun 1988. The idea is to view backprop as an algorithm for selecting a set of vectors $\{x_i\}_{i=1}^{L+1}$, one per layer of the network, that minimize a loss function subject to a set of consistency equations:
\[x^l = f(W^l x^{l-1} + b^l)\]
or equivalently, written in index form:
\[x_i^l = f \Big( \sum_{j=1}^{N^{l-1}} W_{ij}^l x_j^{l-1} + b_i^l \Big)\]
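As a rough illustration of the consistency equations, the NumPy sketch below computes one layer in both the matrix form and the index form and checks that they agree. The layer widths, the choice of `tanh` for $f$, and the helper names `layer_matrix` and `layer_index` are assumptions made for this example only; they are not taken from LeCun's derivation.

```python
import numpy as np

def layer_matrix(W, b, x_prev, f=np.tanh):
    """Matrix form: x^l = f(W^l x^{l-1} + b^l)."""
    return f(W @ x_prev + b)

def layer_index(W, b, x_prev, f=np.tanh):
    """Index form: x_i^l = f(sum_j W_ij^l x_j^{l-1} + b_i^l)."""
    n_out, n_in = W.shape
    x = np.empty(n_out)
    for i in range(n_out):
        pre = b[i]
        for j in range(n_in):
            pre += W[i, j] * x_prev[j]
        x[i] = f(pre)
    return x

rng = np.random.default_rng(0)
sizes = [3, 4, 2]  # hypothetical layer widths, chosen only for illustration
Ws = [rng.normal(size=(sizes[l + 1], sizes[l])) for l in range(len(sizes) - 1)]
bs = [rng.normal(size=sizes[l + 1]) for l in range(len(sizes) - 1)]

# Forward pass: build one vector per layer so that every
# consistency equation holds by construction.
xs = [rng.normal(size=sizes[0])]
for W, b in zip(Ws, bs):
    xs.append(layer_matrix(W, b, xs[-1]))

# Residual of each constraint, evaluated with the index form.
residuals = [
    np.max(np.abs(xs[l + 1] - layer_index(Ws[l], bs[l], xs[l])))
    for l in range(len(Ws))
]
print(residuals)
```

A set of vectors produced by an ordinary forward pass satisfies every constraint, so each residual should be zero up to floating-point rounding. In the constrained view, the vectors are treated as free variables and these equations are what tie them back to the network.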