
Gradients are computed via the chain rule.
Getting the matrix calculus dimensions correct is often the tricky part.
One general rule is: if a function’s input has shape
(A, B, C, ...) and its output has shape (X, Y, Z, ...), then the gradient of the output
with respect to the input should have shape (X, Y, Z, ..., A, B, C, ...). The reason is that, in general,
the gradient must specify how each output element varies as each input element varies.
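As a quick illustration of this shape rule (a minimal sketch with an arbitrary linear map, not one of the examples below): the Jacobian of a function from \(\mathbb{R}^3\) to \(\mathbb{R}^2\) has shape (2, 3).
import numpy as np

# Hypothetical example: f maps R^3 -> R^2, so its Jacobian has shape (2, 3).
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])  # Shape: (2, 3)
f = lambda x: A @ x              # Input shape: (3,), output shape: (2,)

x = np.random.randn(3)
eps = 1e-6
jacobian = np.stack([
    (f(x + eps * e) - f(x)) / eps  # Finite-difference column for each input element
    for e in np.eye(3)
], axis=1)                         # Shape: (output shape, input shape) = (2, 3)
assert jacobian.shape == (2, 3)
assert np.allclose(jacobian, A, atol=1e-4)  # For a linear map, the Jacobian is A itself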
Notes:
Motivating Example: \(L(y, \hat{y}) = \frac{1}{2} (y - \hat{y})^2\) with \(y, \hat{y} \in \mathbb{R}\).
Goal: Compute \(\frac{d}{d\hat{y}} L(y, \hat{y})\)
Shape Analysis: The output is a scalar, and the gradient is with respect to another scalar, so we expect
the gradient’s shape to be (output shape, input shape) = (1, 1).
Code:
B, D = 256, 1
y = np.random.randn(B, D)
yhat = np.random.randn(B, D)
loss = 0.5 * np.square(y - yhat) # Shape: (B, 1)
dloss_dyhat = -np.einsum(
"i,bj->bij",
np.ones(1),
y - yhat) # Shape: (B, 1, 1)
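Sanity check (a sketch, not part of the example; it reuses y, yhat, and dloss_dyhat from the code above and compares against a central finite difference):
eps = 1e-6
numerical = (0.5 * np.square(y - (yhat + eps)) -
             0.5 * np.square(y - (yhat - eps))) / (2 * eps)  # Shape: (B, 1), approximates dL/dyhat
assert np.allclose(dloss_dyhat.reshape(B, 1), numerical, atol=1e-5)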
Motivating Example: \(L(y, \hat{y}) = \frac{1}{2} \lVert y - \hat{y} \rVert_2^2\) with \(y, \hat{y} \in \mathbb{R}^n\).
Goal: Compute \(\nabla_{\hat{y}} L(y, \hat{y})\)
Shape Analysis: \(L(y, \hat{y})\) is a scalar, and \(\hat{y}\) is an \(n\)-dimensional vector, so we expect the gradient’s
shape to be (output shape, input shape) = (1, n).
Code:
B, D = 256, 10
y = np.random.randn(B, D)
yhat = np.random.randn(B, D)
loss = 0.5 * np.square(np.linalg.norm(y - yhat, axis=1, keepdims=True)) # Shape: (B, 1)
dloss_dyhat = -np.einsum(
"i,bj->bij",
np.ones(1),
y - yhat) # Shape: (B, 1, D)
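Sanity check (a sketch, reusing y, yhat, loss, and dloss_dyhat from above; the perturbed coordinate j is an arbitrary choice):
eps, j = 1e-6, 3  # perturb an arbitrary coordinate of yhat
yhat_pert = yhat.copy()
yhat_pert[:, j] += eps
loss_pert = 0.5 * np.square(np.linalg.norm(y - yhat_pert, axis=1, keepdims=True))  # Shape: (B, 1)
numerical = (loss_pert - loss) / eps  # Shape: (B, 1), approximates dL/dyhat_j
assert np.allclose(dloss_dyhat[:, 0, j], numerical[:, 0], atol=1e-4)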
Motivating Example: \(\hat{y} = W x\) with \(W \in \mathbb{R}^{n \times m}\) and \(x \in \mathbb{R}^m\).
Goal: Compute \(\frac{\partial \hat{y}}{\partial W}\)
Shape Analysis: \(\hat{y}\) is an \(n\)-dimensional vector, and \(W\) is an \(n \times m\) matrix, so we expect the gradient’s
shape to be (output shape, input shape) = (n, n, m).
Code:
B, output_dim, input_dim = 256, 10, 20
W = np.random.randn(output_dim, input_dim)
x = np.random.randn(B, input_dim)
yhat = np.einsum("ij,bj->bi", W, x) # Shape: (B, output_dim)
dyhat_dW = np.einsum(
"ij,bk->bijk",
np.eye(output_dim),
x) # Shape: (B, output_dim, output_dim, input_dim)
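Sanity check (a sketch, reusing W, x, yhat, and dyhat_dW from above; the perturbed entry (j, k) is an arbitrary choice):
eps, j, k = 1e-6, 2, 5  # perturb an arbitrary entry of W
W_pert = W.copy()
W_pert[j, k] += eps
yhat_pert = np.einsum("ij,bj->bi", W_pert, x)  # Shape: (B, output_dim)
numerical = (yhat_pert - yhat) / eps  # Shape: (B, output_dim), approximates dyhat/dW_jk
assert np.allclose(dyhat_dW[:, :, j, k], numerical, atol=1e-4)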
Motivating Example: \(L(y, W, x) = \frac{1}{2} \lVert y - W x \rVert_2^2\) with \(y \in \mathbb{R}^n\), \(W \in \mathbb{R}^{n \times m}\), and \(x \in \mathbb{R}^m\).
Goal: Compute \(\nabla_{W} L\)
Shape Analysis: \(L\) is a scalar, and \(W\) is an \(n \times m\) matrix, so we expect the gradient’s
shape to be (output shape, input shape) = (1, n, m).
Code:
B, output_dim, input_dim = 256, 10, 20
y = np.random.randn(B, output_dim)
W = np.random.randn(output_dim, input_dim)
x = np.random.randn(B, input_dim)
diff = y - np.einsum("rc,bc->br", W, x) # Shape: (B, output_dim)
loss = 0.5 * np.square(
np.linalg.norm(diff, axis=1, keepdims=True)
) # Shape: (B, 1)
dloss_dyhat = -np.einsum(
"i,bj->bij",
np.ones(1),
diff) # Shape: (B, 1, output_dim)
dyhat_dW = np.einsum(
"ij,bk->bijk",
np.eye(output_dim),
x) # Shape: (B, output_dim, output_dim, input_dim)
dloss_dW = np.mean(
np.einsum(
"bic,bcjk->bjk",
dloss_dyhat,
dyhat_dW), # Shape: (B, output_dim, input_dim)
axis=0,
keepdims=True) # Shape: (1, output_dim, input_dim)
Note: All we did was apply the chain rule. The key is to get the dimensions right.
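Sanity check (a sketch, reusing diff, x, and dloss_dW from above): the closed-form gradient of \(\frac{1}{2} \lVert y - W x \rVert_2^2\) with respect to \(W\) is \(-(y - W x) x^\top\), averaged over the batch here.
closed_form = -np.mean(
    np.einsum("bi,bj->bij", diff, x),  # Shape: (B, output_dim, input_dim)
    axis=0,
    keepdims=True)                     # Shape: (1, output_dim, input_dim)
assert np.allclose(dloss_dW, closed_form)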
Motivating Example: \(a = \sigma(h)\), the elementwise sigmoid, with \(h \in \mathbb{R}^n\).
Goal: Compute \(\frac{\partial a}{\partial h}\)
Shape Analysis: \(a\) and \(h\) are both \(n\)-dimensional vectors, so we expect the gradient’s
shape to be (output shape, input shape) = (n, n).
Code:
B, D = 256, 10
h = np.random.randn(B, D)
sigmoid = lambda x: 1 / (1 + np.exp(-x))
a = sigmoid(h) # Shape: (B, D)
dsigmoid_dargument = lambda x: sigmoid(x) * (1 - sigmoid(x))
da_dh = np.einsum(
"bi,ij->bij",
dsigmoid_dargument(h),
np.eye(D),
) # Shape: (B, D, D)
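Sanity check (a sketch, reusing h, a, sigmoid, and da_dh from above; the perturbed coordinate j is an arbitrary choice):
eps, j = 1e-6, 4  # perturb an arbitrary coordinate of h
h_pert = h.copy()
h_pert[:, j] += eps
numerical = (sigmoid(h_pert) - a) / eps  # Shape: (B, D), approximates da/dh_j
assert np.allclose(da_dh[:, :, j], numerical, atol=1e-4)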
Motivating Example: \(q = \mathrm{softmax}(v)\) with \(v \in \mathbb{R}^n\).
Goal: Compute \(\frac{\partial q}{\partial v}\)
Shape Analysis: \(q\) and \(v\) are both \(n\)-dimensional vectors, so we expect the gradient’s
shape to be (output shape, input shape) = (n, n).
Code:
B, D = 256, 10
v = np.random.randn(B, D)
exp_v = np.exp(v) # Shape: (B, D)
q = exp_v / np.sum(exp_v, axis=1, keepdims=True) # Shape: (B, D)
dq_dv = np.subtract(
np.einsum( # Shape: (B, D, D)
"bi,ij->bij", q,np.eye(D)
),
np.einsum(
"bi,bj->bij", q, q
) # Shape: (B, D, D)
)
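Sanity check (a sketch, reusing v, q, and dq_dv from above; the perturbed coordinate j is an arbitrary choice):
softmax = lambda z: np.exp(z) / np.sum(np.exp(z), axis=1, keepdims=True)
eps, j = 1e-6, 2  # perturb an arbitrary coordinate of v
v_pert = v.copy()
v_pert[:, j] += eps
numerical = (softmax(v_pert) - q) / eps  # Shape: (B, D), approximates dq/dv_j
assert np.allclose(dq_dv[:, :, j], numerical, atol=1e-4)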
Motivating Example: \(L(p, \hat{p}) = \sum_i p_i \log \hat{p}_i\) with \(p, \hat{p} \in \mathbb{R}^n\) probability vectors.
Goal: Compute \(\nabla_{\hat{p}} L\)
Shape Analysis: \(L\) is a scalar, and \(\hat{p}\) is an \(n\)-dimensional vector, so we expect the gradient’s
shape to be (output shape, input shape) = (1, n).
Code:
B, N = 256, 10
p = np.random.rand(B, N)
p = p / np.sum(p, axis=1, keepdims=True) # Shape: (B, N)
phat = np.random.rand(B, N)
phat = phat / np.sum(phat, axis=1, keepdims=True) # Shape: (B, N)
loss = np.einsum("bi,bi->b", p, np.log(phat)) # Shape: (B,)
dloss_dphat = np.einsum( # Shape: (B, 1, N)
"i,bj->bij",
np.ones(1),
p / phat,
)
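Sanity check (a sketch, reusing p, phat, and dloss_dphat from above and comparing against a central finite difference; the perturbed coordinate j is an arbitrary choice):
eps, j = 1e-7, 3  # perturb an arbitrary coordinate of phat
e_j = np.zeros(N)
e_j[j] = 1.0
loss_plus = np.einsum("bi,bi->b", p, np.log(phat + eps * e_j))   # Shape: (B,)
loss_minus = np.einsum("bi,bi->b", p, np.log(phat - eps * e_j))  # Shape: (B,)
numerical = (loss_plus - loss_minus) / (2 * eps)  # Shape: (B,), approximates dL/dphat_j
assert np.allclose(dloss_dphat[:, 0, j], numerical, rtol=1e-4, atol=1e-6)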
We consider a deep affine network with \(L\) layers:
\[\hat{y} = W_L W_{L-1} \cdots W_1 x\] where \(W_i \in \mathbb{R}^{n_i \times n_{i-1}}\) and \(x \in \mathbb{R}^{n_0}\).
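A minimal sketch of this network in code (the layer sizes below are arbitrary choices for illustration): by the shape rule, the Jacobian of \(\hat{y}\) with respect to \(x\) is \(W_L W_{L-1} \cdots W_1\), with shape \((n_L, n_0)\).
dims = [20, 15, 10, 5]  # n_0, n_1, n_2, n_3 (L = 3 layers), arbitrary choices
Ws = [np.random.randn(dims[i + 1], dims[i]) for i in range(len(dims) - 1)]
x = np.random.randn(dims[0])

yhat = x
for W in Ws:         # yhat = W_L ... W_1 x
    yhat = W @ yhat  # Final shape: (n_L,)

dyhat_dx = np.linalg.multi_dot(Ws[::-1])  # W_L ... W_1, Shape: (n_L, n_0)
assert dyhat_dx.shape == (dims[-1], dims[0])
assert np.allclose(dyhat_dx @ x, yhat)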