Correct and finish the discussion of uniform regularization
parent fd4c33c113
commit fb9765f7fd
1 changed file with 7 additions and 34 deletions
@@ -63,47 +63,20 @@ If $f$ is convex, its second derivative is positive-definite everywhere, so the
#### Uniform regularization

Given an inner product $(\_\!\_, \_\!\_)$ on $V$, we can make the modified second derivative $f^{(2)}_p(\_\!\_, \_\!\_) + \lambda (\_\!\_, \_\!\_)$ positive-definite by choosing a large enough coefficient $\lambda$. We can say precisely what it means for $\lambda$ to be large enough by expressing $f^{(2)}_p$ as $(\_\!\_, \tilde{F}^{(2)}_p\_\!\_)$ and taking the lowest eigenvalue $\lambda_\text{min}$ of $\tilde{F}^{(2)}_p$. The modified second derivative is positive-definite when $\lambda_\text{min} + \lambda$ is positive. We typically choose $\lambda$ in a way that depends continuously on $\tilde{F}^{(2)}_p$ and keeps $\lambda_\text{min} + \lambda$ bounded away from zero.
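
As a concrete illustration, here is a minimal sketch of one such rule, assuming $V = \mathbb{R}^n$ with the standard dot product (so that $\tilde{F}^{(2)}_p$ is just the symmetric Hessian matrix) and a hypothetical positivity threshold `eps`: setting $\lambda = \max(0, \epsilon - \lambda_\text{min})$ varies continuously with $\tilde{F}^{(2)}_p$ and keeps $\lambda_\text{min} + \lambda \ge \epsilon$.

```python
import numpy as np

def uniform_regularization_coefficient(hessian, eps=1e-4):
    """Choose a coefficient lambda that keeps lambda_min + lambda >= eps.

    `hessian` stands in for the operator F^(2)_p, written as a symmetric
    matrix in an orthonormal basis for the chosen inner product, and `eps`
    is a hypothetical positivity threshold.
    """
    lambda_min = np.linalg.eigvalsh(hessian).min()  # lowest eigenvalue
    return max(0.0, eps - lambda_min)
```

When the lowest eigenvalue is already at least `eps`, this rule returns zero and leaves the second derivative unmodified.
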
Uniform regularization can be seen as interpolating between Newton’s method and gradient descent. To see why, consider the regularized Newton step $v$ defined by the equation

```math
f^{(1)}_p(\_\!\_) + f^{(2)}_p(v, \_\!\_) + \lambda (v, \_\!\_) = 0.
```

Expressing $f^{(1)}_p(\_\!\_)$ as $(\tilde{F}^{(1)}_p, \_\!\_)$ and following the reasoning from [above](#Review), we can rewrite this equation as

```math
\begin{align*}
\tilde{F}^{(1)}_p + \big[ \tilde{F}^{(2)}_p + \lambda \big] v & = 0 \\
-\big[ \tilde{F}^{(2)}_p + \lambda \big]^{-1}\,\tilde{F}^{(1)}_p & = v.
\end{align*}
```
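
In coordinates, this is just a shifted linear solve. The following sketch assumes, as above, that $V = \mathbb{R}^n$ with the standard dot product, so that $\tilde{F}^{(1)}_p$ is the gradient vector and $\tilde{F}^{(2)}_p$ the symmetric Hessian matrix; the function name is illustrative.

```python
import numpy as np

def regularized_newton_step(grad, hessian, lam):
    """Solve (F^(2)_p + lam * I) v = -F^(1)_p for the regularized Newton step.

    `grad` plays the role of F^(1)_p and `hessian` that of F^(2)_p; `lam`
    is the regularization coefficient, chosen for example by the rule
    sketched above.
    """
    identity = np.eye(grad.shape[0])
    return np.linalg.solve(hessian + lam * identity, -grad)
```
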
The operator $\big[ \tilde{F}^{(2)}_p + \lambda \big]^{-1}$ stretches $V$ along the eigenspaces of $\tilde{F}^{(2)}_p$, which are the principal curvature directions of $f$ at $p$ with respect to the inner product $(\_\!\_, \_\!\_)$. Projectively, this stretching pulls lines in $V$ away from the eigenspaces with higher eigenvalues, where $f$ curves more strongly upward or less strongly downward, and towards the eigenspaces with lower eigenvalues, where $f$ curves less strongly upward or more strongly downward. Thus, heuristically, applying $\big[ \tilde{F}^{(2)}_p + \lambda \big]^{-1}$ to the gradient descent direction $-\tilde{F}^{(1)}_p$ trades some of our downward velocity for downward acceleration. When we set $\lambda$ to zero, which is allowed when $\tilde{F}^{(2)}_p$ is already positive-definite, this trade is perfectly calibrated to point us toward the minimum of the second-order approximation, and the regularized Newton step reduces to the classic Newton step. When we let $\lambda$ grow, the projective action of $\big[ \tilde{F}^{(2)}_p + \lambda \big]^{-1}$ approaches the identity, and the regularized Newton step approaches $\lambda^{-1}$ times the gradient descent step.
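
A small numerical check of these two limits, reusing the hypothetical `regularized_newton_step` sketch from above on an illustrative two-dimensional example:

```python
import numpy as np

# An illustrative positive-definite Hessian and gradient at a point p.
hessian = np.array([[4.0, 1.0],
                    [1.0, 3.0]])
grad = np.array([1.0, -2.0])

# With lambda = 0 (allowed here because the Hessian is positive-definite),
# the regularized step coincides with the classic Newton step.
newton_step = np.linalg.solve(hessian, -grad)
assert np.allclose(regularized_newton_step(grad, hessian, 0.0), newton_step)

# As lambda grows, the regularized step approaches 1/lambda times the
# gradient descent step -F^(1)_p.
lam = 1e6
assert np.allclose(lam * regularized_newton_step(grad, hessian, lam),
                   -grad, atol=1e-3)
```
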
- Kenji Ueda and Nobuo Yamashita. [“Convergence Properties of the Regularized Newton Method for the Unconstrained Nonconvex Optimization”](https://doi.org/10.1007/s00245-009-9094-9) *Applied Mathematics and Optimization* 62, 2010.