Computing gradients is done via the chain rule. Getting the matrix calculus dimensions correct is often the tricky part. One general rule is: if a function’s input has shape (A, B, C, ...) and its output has shape (X, Y, Z, ...), then the gradient of the output with respect to the input should have shape (X, Y, Z, ..., A, B, C, ...). The reason why is that, in general, one needs to specify how each output element varies as each input element varies.
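For instance (writing \(f\) for the function and \(x\) for its input, names introduced here just for illustration), with an input of shape (A, B) and an output of shape (X, Y) the gradient is the four-dimensional array of all pairwise partial derivatives:
\[\left[\frac{\partial f}{\partial x}\right]_{i, j, a, b} = \frac{\partial f(x)_{i, j}}{\partial x_{a, b}}, \qquad 1 \le i \le X,\ 1 \le j \le Y,\ 1 \le a \le A,\ 1 \le b \le B.\]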
Notes:
Motivating Example: \(L(y, \hat{y}) = \frac{1}{2} (y - \hat{y})^2\) with \(y, \hat{y} \in \mathbb{R}\).
Goal: Compute \(\frac{d}{d\hat{y}} L(y, \hat{y})\)
Shape Analysis: The output is a scalar, and the gradient is with respect to another scalar, so we expect the gradient’s shape to be (output shape, input shape) = (1, 1).
Code:
import numpy as np

B, D = 256, 1
y = np.random.randn(B, D)
yhat = np.random.randn(B, D)
loss = 0.5 * np.square(y - yhat)  # Shape: (B, 1)
dloss_dyhat = -np.einsum(
    "i,bj->bij",
    np.ones(1),
    y - yhat)  # Shape: (B, 1, 1)
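For reference, the derivative the code computes (and the source of the minus sign in the einsum) is
\[\frac{d}{d\hat{y}} \frac{1}{2} (y - \hat{y})^2 = -(y - \hat{y}).\]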
Motivating Example: \(L(y, \hat{y}) = \frac{1}{2} \lVert y - \hat{y} \rVert_2^2\) with \(y, \hat{y} \in \mathbb{R}^n\).
Goal: Compute \(\nabla_{\hat{y}} L(y, \hat{y})\)
Shape Analysis: \(L(y, \hat{y})\) is a scalar, and \(\hat{y}\) is an \(n\)-dimensional vector, so we expect the gradient’s shape to be (output shape, input shape) = (1, n).
Code:
B, D = 256, 10
y = np.random.randn(B, D)
yhat = np.random.randn(B, D)
loss = 0.5 * np.sum(np.square(y - yhat), axis=1, keepdims=True)  # Shape: (B, 1)
dloss_dyhat = -np.einsum(
    "i,bj->bij",
    np.ones(1),
    y - yhat)  # Shape: (B, 1, D)
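Componentwise, the gradient being assembled is
\[\frac{\partial L}{\partial \hat{y}_j} = -(y_j - \hat{y}_j),\]
so, laid out with shape (1, n), it is the row vector \(-(y - \hat{y})^\top\).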
Motivating Example: \(\hat{y} = W x\) with \(W \in \mathbb{R}^{n \times m}\) and \(x \in \mathbb{R}^m\).
Goal: Compute \(\frac{\partial \hat{y}}{\partial W}\)
Shape Analysis: \(\hat{y}\) is an \(n\)-dimensional vector, and \(W\) is an \(n \times m\) matrix, so we expect the gradient’s shape to be (output shape, input shape) = (n, n, m).
Code:
B, output_dim, input_dim = 256, 10, 20
W = np.random.randn(output_dim, input_dim)
x = np.random.randn(B, input_dim)
yhat = np.einsum("ij,bj->bi", W, x) # Shape: (B, output_dim)
dyhat_dW = np.einsum(
    "ij,bk->bijk",
    np.eye(output_dim),
    x)  # Shape: (B, output_dim, output_dim, input_dim)
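The identity matrix in the einsum encodes the Kronecker delta in the elementwise derivative:
\[\frac{\partial \hat{y}_i}{\partial W_{jk}} = \delta_{ij} \, x_k.\]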
Motivating Example: \(L(y, W) = \frac{1}{2} \lVert y - W x \rVert_2^2\) with \(W \in \mathbb{R}^{n \times m}\) and \(x \in \mathbb{R}^m\).
Goal: Compute \(\nabla_W L\)
Shape Analysis: \(L\) is a scalar, and \(W\) is an \(n \times m\) matrix, so we expect the gradient’s shape to be (output shape, input shape) = (1, n, m).
Code:
B, output_dim, input_dim = 256, 10, 20
y = np.random.randn(B, output_dim)
W = np.random.randn(output_dim, input_dim)
x = np.random.randn(B, input_dim)
diff = y - np.einsum("rc,bc->br", W, x) # Shape: (B, output_dim)
loss = 0.5 * np.square(
    np.linalg.norm(diff, axis=1, keepdims=True)
)  # Shape: (B, 1)
dloss_dyhat = -np.einsum(
    "i,bj->bij",
    np.ones(1),
    diff)  # Shape: (B, 1, output_dim)
dyhat_dW = np.einsum(
    "ij,bk->bijk",
    np.eye(output_dim),
    x)  # Shape: (B, output_dim, output_dim, input_dim)
dloss_dW = np.mean(
    np.einsum(
        "bij,bjkl->bkl",
        dloss_dyhat,
        dyhat_dW),  # Shape: (B, output_dim, input_dim)
    axis=0,
    keepdims=True)  # Shape: (1, output_dim, input_dim)
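Written out, the chain rule that the final einsum contracts is
\[\frac{\partial L}{\partial W_{jk}} = \sum_i \frac{\partial L}{\partial \hat{y}_i} \frac{\partial \hat{y}_i}{\partial W_{jk}} = -\big(y_j - (W x)_j\big) \, x_k.\]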
Note: All we did was apply the chain rule. The key is to get the dimensions right.
Motivating Example: \(a = \sigma(h)\), the elementwise sigmoid, with \(h \in \mathbb{R}^n\).
Goal: Compute \(\frac{\partial a}{\partial h}\)
Shape Analysis: \(a\) and \(h\) are both \(n\)-dimensional vectors, so we expect the gradient’s shape to be (output shape, input shape) = (n, n).
Code:
B, D = 256, 10
h = np.random.randn(B, D)
sigmoid = lambda x: 1 / (1 + np.exp(-x))
a = sigmoid(h) # Shape: (B, D)
dsigmoid_dargument = lambda x: sigmoid(x) * (1 - sigmoid(x))
da_dh = np.einsum(
    "bi,ij->bij",
    dsigmoid_dargument(h),
    np.eye(D),
)  # Shape: (B, D, D)
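Because the sigmoid acts elementwise, its Jacobian is diagonal, which is why the code multiplies by an identity matrix:
\[\frac{\partial a_i}{\partial h_j} = \sigma(h_i)\big(1 - \sigma(h_i)\big) \, \delta_{ij}.\]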
Motivating Example: \(q = \mathrm{softmax}(v)\) with \(v \in \mathbb{R}^n\), i.e. \(q_i = \frac{e^{v_i}}{\sum_j e^{v_j}}\).
Goal: Compute \(\frac{\partial q}{\partial v}\)
Shape Analysis: \(q\) and \(v\) are both \(n\)-dimensional vectors, so we expect the gradient’s shape to be (output shape, input shape) = (n, n).
Code:
B, D = 256, 10
v = np.random.randn(B, D)
exp_v = np.exp(v) # Shape: (B, D)
q = exp_v / np.sum(exp_v, axis=1, keepdims=True) # Shape: (B, D)
dq_dv = np.subtract(
    np.einsum(
        "bi,ij->bij", q, np.eye(D)
    ),  # Shape: (B, D, D)
    np.einsum(
        "bi,bj->bij", q, q
    )  # Shape: (B, D, D)
)
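The two einsum terms are exactly the two pieces of the softmax Jacobian:
\[\frac{\partial q_i}{\partial v_j} = q_i \big(\delta_{ij} - q_j\big) = q_i \delta_{ij} - q_i q_j.\]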
Motivating Example: \(L(p, \hat{p}) = \sum_i p_i \log \hat{p}_i\) with \(p, \hat{p} \in \mathbb{R}^n\) probability vectors.
Goal: Compute \(\nabla_{\hat{p}} L(p, \hat{p})\)
Shape Analysis: \(L\) is a scalar, and \(\hat{p}\) is an \(n\)-dimensional vector, so we expect the gradient’s shape to be (output shape, input shape) = (1, n).
Code:
B, N = 256, 10
p = np.random.rand(B, N)
p = p / np.sum(p, axis=1, keepdims=True)  # Shape: (B, N)
phat = np.random.rand(B, N)
phat = phat / np.sum(phat, axis=1, keepdims=True)  # Shape: (B, N)
loss = np.einsum("bi,bi->b", p, np.log(phat))  # Shape: (B,)
dloss_dphat = np.einsum(  # Shape: (B, 1, N)
    "i,bj->bij",
    np.ones(1),
    p / phat,
)
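Differentiating the loss elementwise gives the ratio that appears in the code:
\[\frac{\partial L}{\partial \hat{p}_j} = \frac{p_j}{\hat{p}_j}.\]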
We consider a deep affine network with \(L\) layers:
\[\hat{y} = W_L W_{L-1} \cdots W_1 x\]
where \(W_i \in \mathbb{R}^{n_i \times n_{i-1}}\) and \(x \in \mathbb{R}^{n_0}\).
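As a minimal sketch of the forward pass (the batch size and layer widths \(n_0, \ldots, n_L\) below are assumed values chosen for illustration), using the same einsum convention as above:
import numpy as np

B = 256
dims = [20, 16, 12, 10]  # n_0, n_1, ..., n_L (assumed values), so L = 3
Ws = [np.random.randn(dims[i + 1], dims[i]) for i in range(len(dims) - 1)]
x = np.random.randn(B, dims[0])
yhat = x
for W in Ws:  # applies W_1, then W_2, ..., finally W_L
    yhat = np.einsum("ij,bj->bi", W, yhat)  # Shape: (B, n_i)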