Finish reviewing Newton's method

parent a1a1393056 · commit a89e219b0b

1 changed file with 26 additions and 9 deletions

#### Review

Let’s say we’re trying to minimize a smooth function $f$ on an affine search space with underlying vector space $V$. We can use the affine structure to take a second-order approximation of $f$ near any point $p$:

```math
f(p + v) \in f^{(0)}_p + f^{(1)}_p(v) + \tfrac{1}{2} f^{(2)}_p(v, v) + \mathfrak{m}^3
```

Here, $v$ is a $V$-valued variable, each $f^{(k)}_p$ is a symmetric $k$-linear form on $V$, and $\mathfrak{m}$ is the ideal of smooth functions on $V$ that vanish to first order at the origin. The form $f^{(k)}_p$ is called the *$k$th derivative* of $f$ at $p$, and it turns out that $f^{(2)}_p(v, \_\!\_)$ is the derivative of $f^{(1)}_{p}(v)$ with respect to $p$. Most people writing about Newton’s method use the term *Hessian* to refer to either the second derivative or an operator that represents it.
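
Since the remainder lies in $\mathfrak{m}^3$, the error of the quadratic model should shrink roughly like $t^3$ when we scale the displacement by $t$. Here’s a minimal numpy sketch of that check; the function $f$, the point $p$, and the direction $v$ are arbitrary choices for illustration.

```python
import numpy as np

# An arbitrary smooth function on R^2, with its first derivative represented
# as a gradient vector and its second derivative as a symmetric matrix.
def f(x):
    return np.sin(x[0]) * np.exp(x[1])

def grad(x):
    return np.array([np.cos(x[0]) * np.exp(x[1]),
                     np.sin(x[0]) * np.exp(x[1])])

def hess(x):
    return np.array([[-np.sin(x[0]) * np.exp(x[1]), np.cos(x[0]) * np.exp(x[1])],
                     [ np.cos(x[0]) * np.exp(x[1]),  np.sin(x[0]) * np.exp(x[1])]])

p = np.array([0.3, -0.7])
v = np.array([1.0, 2.0])

for t in [1e-1, 1e-2, 1e-3]:
    tv = t * v
    model = f(p) + grad(p) @ tv + 0.5 * tv @ hess(p) @ tv
    print(t, abs(f(p + tv) - model))  # error drops ~1000x per step: third order
```
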
When the second derivative is positive-definite, the second-order approximation has a unique minimum. Newton’s method is based on the hope that this minimum is near a local minimum of $f$. It works by repeatedly stepping to the minimum of the second-order approximation and then taking a new second-order approximation at that point. The minimum of the second-order approximation is nicely characterized as the place where the derivative of the second-order approximation vanishes. The *Newton step*—the vector $v \in V$ that takes us to the minimum of the second-order approximation—is therefore the solution of the following equation:

```math
f^{(1)}_p(\_\!\_) + f^{(2)}_p(v, \_\!\_) = 0.
```

For computations, we’ll choose a basis $\mathbb{R}^n \to V$ and express $f^{(1)}_p$ and $f^{(2)}_p$ in terms of the standard inner product $\langle \_\!\_, \_\!\_ \rangle$ on $\mathbb{R}^n$, writing

```math
\begin{align*}
f^{(1)}_p(w) & = \langle F^{(1)}_p, w \rangle \\
f^{(2)}_p(v, w) & = \langle F^{(2)}_p v, w \rangle
\end{align*}
```

with a vector $F^{(1)}_p \in \mathbb{R}^n$ and a symmetric operator $F^{(2)}_p \in \operatorname{End}(\mathbb{R}^n)$. The equation that defines the Newton step can then be written

```math
\begin{align*}
\langle F^{(1)}_p, \_\!\_ \rangle + \langle F^{(2)}_p v, \_\!\_ \rangle & = 0 \\
\langle F^{(1)}_p + F^{(2)}_p v, \_\!\_ \rangle & = 0 \\
F^{(1)}_p + F^{(2)}_p v & = 0
\end{align*}
```

using the non-degeneracy of the inner product. When the bilinear form $f^{(2)}_p$ is positive-definite, the operator $F^{(2)}_p$ is positive-definite too, so we can solve this equation by taking the Cholesky decomposition of $F^{(2)}_p$.
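
As a sketch of how a single step looks in code (the helper `newton_step` and the quadratic test function are illustrative choices, not from the text), scipy’s Cholesky routines solve the Newton-step equation directly:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def newton_step(F1, F2):
    # Solve F1 + F2 @ v = 0 for v; cho_factor fails if F2 is not positive-definite.
    return cho_solve(cho_factor(F2), -F1)

# A convex quadratic f with constant second derivative F2 has first derivative
# F2 @ p at the point p, so a single Newton step lands exactly on the minimum.
F2 = np.array([[2.0, 1.0],
               [1.0, 4.0]])
p = np.array([3.0, -1.0])
F1 = F2 @ p
print(p + newton_step(F1, F2))  # ~ [0. 0.]
```
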
If $f$ is convex, its second derivative is positive-definite everywhere, so the Newton step is always well-defined. However, non-convex loss functions show up in many interesting problems, including ours. For these problems, we need to decide how to step at a point where the second derivative is indefinite. One approach is to *regularize* the equation that defines the Newton step by making some modification of the second derivative that turns it into a positive-definite bilinear form. We’ll discuss some regularization methods below.
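
To see the failure mode concretely: at a point where the second derivative is indefinite, the Cholesky factorization simply fails, so the unregularized Newton step can’t even be computed. A minimal sketch, with an arbitrary indefinite matrix (eigenvalues $3$ and $-1$):

```python
import numpy as np
from scipy.linalg import cho_factor, LinAlgError

F2_indef = np.array([[1.0, 2.0],
                     [2.0, 1.0]])  # indefinite: such a point is a saddle

try:
    cho_factor(F2_indef)
except LinAlgError:
    print("second derivative is indefinite: the Newton step is undefined here")
```
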
#### Uniform regularization