Rylan Schaeffer


Newton’s Method

Idea

Suppose we wish to minimize a scalar-valued function $f(x)$, where $x \in \mathbb{R}^N$, using an iterative method. We want to identify the perturbation $\delta$ to $x$ that minimizes $f(x + \delta)$. We phrase this question as an optimization problem:

\[\argmin_{\delta} f(x + \delta)\]

If we perform a second-order Taylor series expansion about the current point, we see that:

\[f(x + \delta) = f(x) + \nabla_x f(x)^T \delta + \frac{1}{2} \delta^T H(x) \delta + O(\|\delta\|^3)\]

where $H(x)_{ij} = \frac{\partial^2 f(x)}{\partial x_i \partial x_j}$ is the Hessian. To find the value of $\delta$ that minimizes $f(x + \delta)$, we differentiate with respect to $\delta$ and set the result equal to 0:

\[0 = \nabla_{\delta} f(x + \delta) \Rightarrow 0 = \nabla_x f(x) + H(x) \delta\]

Solving for $\delta$ yields:

\[\delta = - H(x)^{-1} \nabla_x f(x)\]
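
For instance (a standard sanity check, not part of the original derivation), for a convex quadratic $f(x) = \frac{1}{2} x^T A x - b^T x$ with $A$ symmetric positive definite, we have $\nabla_x f(x) = Ax - b$ and $H(x) = A$, so

\[\delta = -A^{-1}(Ax - b) \quad \Rightarrow \quad x + \delta = A^{-1} b,\]

which is the exact minimizer, reached in a single step from any starting point.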

Thus, Newton’s Method takes the following step:

\[x_{k+1} = x_k - H(x_k)^{-1} \nabla_x f(x_k)\]
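
As a concrete illustration, here is a minimal NumPy sketch of this update (the quadratic test function, its gradient, and its Hessian below are illustrative choices, not from the post; the linear solve avoids forming $H^{-1}$ explicitly):

```python
import numpy as np

def newtons_method(grad, hess, x0, num_steps=20, tol=1e-10):
    """Minimal sketch of Newton's Method: x_{k+1} = x_k - H(x_k)^{-1} grad(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_steps):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        # Solve H(x) delta = -grad(x) instead of explicitly inverting the Hessian.
        delta = np.linalg.solve(hess(x), -g)
        x = x + delta
    return x

# Illustrative convex quadratic f(x) = 0.5 x^T A x - b^T x, minimized at A^{-1} b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad = lambda x: A @ x - b
hess = lambda x: A

x_star = newtons_method(grad, hess, x0=np.array([5.0, 5.0]))
print(x_star, np.linalg.solve(A, b))  # Newton converges in one step on a quadratic.
```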

Properties

Attraction to Saddle Points

One interesting property of Newton’s Method is that it is attracted to saddle points, a point made by Dauphin et al. 2014. To see why, let’s explore how Newton’s Method behaves around fixed points. Starting with a second-order Taylor series expansion around a fixed point $x^*$, where $\nabla_x f(x^*) = 0$ and the first-order term therefore vanishes:

\[f(x^* + \delta) \approx f(x^*) + \frac{1}{2} \delta^T H(x^*) \delta\]

we write the expansion in the eigenbasis of the Hessian. Let $\lambda_i, e_i$ be the $i$-th eigenvalue, eigenvector pair of the Hessian $H \defeq \nabla_x^2 f(x^*)$ and write $\delta = \sum_i \delta_i e_i$, so that

\[f(x^* + \delta) \approx f(x^*) + \frac{1}{2} \sum_i \lambda_i \delta_i^2\]

A gradient descent step moves by $-\alpha \sum_i \lambda_i \delta_i e_i$: along directions of negative curvature ($\lambda_i < 0$), it moves away from $x^*$ and escapes the saddle. Newton’s Method instead rescales each component by $1 / \lambda_i$, giving a step of $-\sum_i \delta_i e_i$, which moves toward $x^*$ in every eigendirection regardless of the sign of $\lambda_i$. Directions of negative curvature therefore pull Newton’s Method toward saddle points rather than pushing it away.
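
To see this attraction numerically, here is a small sketch comparing Newton’s Method and gradient descent on $f(x, y) = x^2 - y^2$, which has a saddle point at the origin (the test function, starting point, and learning rate are illustrative assumptions, not from the post):

```python
import numpy as np

# Illustrative test function with a saddle point at the origin:
# f(x, y) = x^2 - y^2, with eigenvalues lambda_1 = 2 > 0 and lambda_2 = -2 < 0.
grad = lambda p: np.array([2.0 * p[0], -2.0 * p[1]])
hess = lambda p: np.array([[2.0, 0.0], [0.0, -2.0]])

p_newton = np.array([1.0, 0.5])
p_gd = np.array([1.0, 0.5])
lr = 0.1

for _ in range(25):
    # Newton step: rescales each eigendirection by 1/lambda_i, so it moves
    # toward the saddle along the negative-curvature direction as well.
    p_newton = p_newton - np.linalg.solve(hess(p_newton), grad(p_newton))
    # Gradient descent step: moves away from the saddle along the y-axis.
    p_gd = p_gd - lr * grad(p_gd)

print("Newton ends at:", p_newton)        # ~ (0, 0), the saddle point
print("Gradient descent ends at:", p_gd)  # y-coordinate has grown away from 0
```

Because the function is exactly quadratic, the Newton step lands on the saddle point in a single iteration, while gradient descent shrinks the $x$-coordinate but grows the $y$-coordinate, escaping along the negative-curvature direction.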

Convergence Analysis

TODO