$\DeclareMathOperator*{\argmax}{argmax}$ $\DeclareMathOperator*{\argmin}{argmin}$ $\DeclareMathOperator*{\Tr}{Tr}$ $\DeclareMathOperator{\defeq}{\stackrel{def}{=}}$



Second-Order Gradient-Based Optimization

Newton's Method

Newton's Method: Idea

Suppose we wish to minimize a scalar-valued function $f(x)$ where $x \in \mathbb{R}^N$. We want to identify the perturbation $\delta$ to $x$ that most decreases $f$. We phrase this question as an optimization problem: $$\argmin_{\delta} f(x + \delta)$$ If we perform a second-order Taylor series expansion about the current point, we see that: $$f(x + \delta) = f(x) + \nabla_x f(x)^T \delta + \frac{1}{2} \delta^T H(x) \delta + O(\|\delta\|^3)$$ where $H(x)_{ij} = \frac{\partial^2 f(x)}{\partial x_i \partial x_j}$ is the Hessian. To find the value of $\delta$ that minimizes the quadratic approximation, we differentiate with respect to $\delta$ and set the result equal to 0: $$0 = \nabla_{\delta} \Big( f(x) + \nabla_x f(x)^T \delta + \frac{1}{2} \delta^T H(x) \delta \Big) = \nabla_x f(x) + H(x) \delta$$ Solving for $\delta$ yields: $$\delta = - H(x)^{-1} \nabla_x f(x)$$ Thus, Newton's Method takes the following step: $$x_{k+1} = x_k - H(x_k)^{-1} \nabla_x f(x_k)$$
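As a concrete illustration, here is a minimal sketch of this update in Python. The function name `newtons_method`, the quadratic test problem, and the step count are illustrative assumptions, not part of the original notes. On a convex quadratic, the second-order Taylor expansion is exact, so Newton's Method converges in a single step.

```python
import numpy as np

def newtons_method(grad_f, hess_f, x0, num_steps=20):
    """Minimize f by repeatedly solving H(x_k) delta = -grad f(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_steps):
        g = grad_f(x)
        H = hess_f(x)
        # Solve the linear system rather than forming H^{-1} explicitly.
        delta = np.linalg.solve(H, -g)
        x = x + delta
    return x

# Convex quadratic f(x) = 0.5 x^T A x - b^T x, whose minimizer solves A x = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad_f = lambda x: A @ x - b
hess_f = lambda x: A

x_star = newtons_method(grad_f, hess_f, x0=np.zeros(2))
print(x_star)                 # Newton's step is exact here: one iteration suffices
print(np.linalg.solve(A, b))  # closed-form minimizer for comparison
```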

Newton's Method: Attracted to Saddle Points

One interesting property of Newton's Method is that it is attracted to saddle points, a point made by Dauphin et al. 2014. To see why, let's explore how Newton's Method behaves near critical points, i.e., points $x^*$ where $\nabla_x f(x^*) = 0$. Starting with a second-order Taylor series expansion around a critical point $x^*$, where the first-order term vanishes: $$f(x^* + \delta) \approx f(x^*) + \frac{1}{2} \delta^T H(x^*) \delta$$ we write the update in the eigenbasis of the Hessian. Let $\lambda_i, e_i$ be the $i$-th eigenvalue, eigenvector pair of the Hessian $H \defeq \nabla_x^2 f(x^*)$, and decompose the displacement as $\delta = \sum_i c_i e_i$. Near $x^*$, the gradient is $\nabla_x f(x^* + \delta) \approx H \delta = \sum_i \lambda_i c_i e_i$, so a gradient descent step moves by $-\eta \sum_i \lambda_i c_i e_i$: toward $x^*$ along directions with $\lambda_i > 0$ and away from $x^*$ along directions with $\lambda_i < 0$, which is what lets it escape saddle points. Newton's Method instead rescales each component by $1/\lambda_i$, giving a step of $-H^{-1} \nabla_x f(x^* + \delta) \approx -\sum_i c_i e_i = -\delta$. The sign flip along negative-curvature directions means Newton's Method moves toward the critical point along every eigendirection, regardless of whether that critical point is a minimum, a maximum, or a saddle.
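To make this concrete, the following minimal sketch (the test function $f(x, y) = x^2 - y^2$, the step size, and the iteration count are illustrative assumptions) compares Newton's Method and gradient descent started from the same point near the saddle at the origin: Newton's Method jumps straight onto the saddle, while gradient descent drifts away along the negative-curvature direction.

```python
import numpy as np

# Saddle example: f(x, y) = x^2 - y^2 has a saddle point at the origin.
grad = lambda p: np.array([2.0 * p[0], -2.0 * p[1]])
hess = lambda p: np.array([[2.0, 0.0], [0.0, -2.0]])

p_newton = np.array([1.0, 1.0])
p_gd = np.array([1.0, 1.0])
eta = 0.1  # gradient descent step size

for _ in range(25):
    p_newton = p_newton - np.linalg.solve(hess(p_newton), grad(p_newton))
    p_gd = p_gd - eta * grad(p_gd)

print("Newton:", p_newton)  # lands on the saddle at (0, 0)
print("GD:    ", p_gd)      # escapes along the negative-curvature direction y
```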

Newton's Method: Convergence Analysis