Make corrections and improve citations

Vectornaut 2025-11-04 21:47:00 +00:00
parent fb9765f7fd
commit c64fdfa7ae

@@ -8,13 +8,12 @@ Let's say we're doing continuous-time gradient descent on a Riemannian manif
Suppose that the loss function is _Morse_: it's smooth, and all its critical points are isolated and non-degenerate. Then every gradient descent path starts in an end of the search space or at a critical point and ends at a critical point. By the stable manifold theorem, the ascending and descending manifolds of a critical point have the same dimensions as the positive and negative eigenspaces of the Hessian, respectively. In particular, the ascending manifold of a local minimum has codimension zero.
Suppose that the loss function is _Morse–Smale_: it's Morse, and all its ascending manifolds are transverse to all its descending manifolds. Then every gradient descent path that starts at a critical point must end at a lower-index critical point [Ch]. It should follow, in light of the discussion above, that almost every gradient descent path ends at a local minimum. The only exceptions are the paths that lie in the ascending manifolds of saddle points, which have positive codimension.
The picture for discrete-time gradient descent is similar [Re].
- **[Ch]** Guoning Chen. “Lecture 8: Morse–Smale complex and applications.” [COSC 6397: Feature Detection in Data Analysis.](https://www2.cs.uh.edu/~chengu/Teaching/Spring2013/Analysis_Spring2013.html) University of Houston, Spring 2013.
- **[Re]** Benjamin Recht. [“Saddles Again.”](https://www.offconvex.org/2016/03/24/saddles-again/) _Off the Convex Path_, 2016.
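
A minimal numerical sketch of this picture (the function, step size, and sample count here are illustrative choices, not taken from the references): for the Morse–Smale function $f(x, y) = \tfrac{1}{4}(x^2 - 1)^2 + \tfrac{1}{2} y^2$, the only critical points are minima at $(\pm 1, 0)$ and a saddle at the origin, whose ascending manifold is the measure-zero line $x = 0$. Discrete-time gradient descent from random starting points should therefore essentially always end up at one of the two minima.

```python
import numpy as np

def grad(p):
    # Gradient of f(x, y) = (x^2 - 1)^2 / 4 + y^2 / 2.
    x, y = p
    return np.array([x * (x**2 - 1), y])

def descend(p, step=0.05, iters=2000):
    # Plain discrete-time gradient descent with a fixed step size.
    for _ in range(iters):
        p = p - step * grad(p)
    return p

rng = np.random.default_rng(0)
endpoints = [descend(rng.uniform(-2, 2, size=2)) for _ in range(1000)]

# Count how many runs end near each critical point.
near = lambda p, q: np.linalg.norm(p - np.array(q)) < 1e-3
print(sum(near(p, (1, 0)) for p in endpoints),    # minimum at (1, 0)
      sum(near(p, (-1, 0)) for p in endpoints),   # minimum at (-1, 0)
      sum(near(p, (0, 0)) for p in endpoints))    # saddle: essentially never
```

The only starting points that converge to the saddle are the ones with $x = 0$ exactly, which random sampling essentially never produces.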
#### Slogging past saddle points
@@ -36,7 +35,7 @@ f(p + v) \in f^{(0)}_p + f^{(1)}_p(v) + \tfrac{1}{2} f^{(2)}_p(v, v) + \mathfrak
```
Here, $v$ is a $V$-valued variable, each $f^{(k)}_p$ is a symmetric $k$-linear form on $V$, and $\mathfrak{m}$ is the ideal of smooth functions on $V$ that vanish to first order at the origin. The form $f^{(k)}_p$ is called the *$k$th derivative* of $f$ at $p$, and it turns out that $f^{(2)}_p(v, \_\!\_)$ is the derivative of $f^{(1)}_{p}(v)$ with respect to $p$. Most people writing about Newton's method use the term *Hessian* to refer to either the second derivative or an operator that represents it.
When the second derivative is positive-definite, the second-order approximation has a unique minimum. Newton's method is based on the hope that this minimum is near a local minimum of $f$. It works by repeatedly stepping toward the minimum of the second-order approximation and then taking a new second-order approximation at the point where you end up [NW, §2.2]. The minimum of the second-order approximation is nicely characterized as the place where the derivative of the second-order approximation vanishes. The *Newton step*—the vector $v \in V$ that takes us to the minimum of the second-order approximation—is therefore the solution of the following equation:
```math
f^{(1)}_p(\_\!\_) + f^{(2)}_p(v, \_\!\_) = 0.
```
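
Concretely, if we pick a basis for $V$ and represent the first derivative by a gradient vector `g` and the second derivative by a Hessian matrix `H`, the defining equation above reads $g + Hv = 0$, so the Newton step is $v = -H^{-1} g$. Here's a minimal sketch; the test function and starting point are made up for illustration.

```python
import numpy as np

def grad(p):
    # Gradient of the convex test function f(x, y) = exp(x) + exp(-x) + y^2.
    x, y = p
    return np.array([np.exp(x) - np.exp(-x), 2 * y])

def hessian(p):
    # Hessian of the same function; positive-definite everywhere.
    x, y = p
    return np.array([[np.exp(x) + np.exp(-x), 0.0],
                     [0.0,                    2.0]])

p = np.array([1.5, -0.8])
for _ in range(6):
    # Newton step: solve H v = -g rather than forming H^{-1} explicitly.
    v = np.linalg.solve(hessian(p), -grad(p))
    p = p + v
print(p)  # converges rapidly to the minimum at (0, 0)
```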
@@ -59,11 +58,11 @@ using the non-degeneracy of the inner product. When the bilinear form $f^{(2)}_p
If $f$ is convex, its second derivative is positive-definite everywhere, so the Newton step is always well-defined. However, non-convex loss functions show up in many interesting problems, including ours. For these problems, we need to decide how to step at a point where the second derivative is indefinite. One approach is to *regularize* the equation that defines the Newton step by making some explicit or implicit modification of the second derivative that turns it into a positive-definite bilinear form. We'll discuss some regularization methods below.
- **[NW]** Jorge Nocedal and Stephen J. Wright. *Numerical Optimization,* second edition. Springer, 2006.
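
As a toy illustration of the problem (an example chosen here for simplicity, not taken from [NW]): take $V = \mathbb{R}^2$ with the standard inner product and $f(x, y) = \tfrac{1}{2}(x^2 - y^2)$. The second derivative at every point is the indefinite form with matrix $\operatorname{diag}(1, -1)$, and solving the Newton-step equation at $p = (x, y)$ gives

```math
v = -\begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix}^{-1} \begin{bmatrix} x \\ -y \end{bmatrix} = \begin{bmatrix} -x \\ -y \end{bmatrix},
```

which steps from $(x, y)$ straight to the origin, a saddle point rather than a minimum. Regularization is meant to rule out exactly this kind of step.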
#### Uniform regularization
Given an inner product $(\_\!\_, \_\!\_)$ on $V$, we can make the modified second derivative $f^{(2)}_p(v, \_\!\_) + \lambda (\_\!\_, \_\!\_)$ positive-definite by choosing a large enough coefficient $\lambda$. We can say precisely what it means for $\lambda$ to be large enough by expressing $f^{(2)}_p$ as $(\_\!\_, \tilde{F}^{(2)}_p\_\!\_)$ and taking the lowest eigenvalue $\lambda_{\text{min}}$ of $\tilde{F}^{(2)}_p$. The modified second derivative is positive-definite when $\lambda_\text{min} + \lambda$ is positive. We typically choose $\lambda$ in a way that depends continuously on $\tilde{F}^{(2)}_p$ and keeps $\lambda_\text{min} + \lambda$ from getting too small. A simple but effective way of choosing $\lambda$ is described in [UY, §2].
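
Here is a minimal sketch of one such choice, written in matrix form (a generic eigenvalue-shift rule for illustration, not necessarily the exact rule from [UY]): shift just enough to push the lowest eigenvalue of the modified form up to some floor $\varepsilon > 0$.

```python
import numpy as np

def uniform_regularization(H, eps=1e-3):
    # Smallest shift lambda >= 0 such that every eigenvalue of
    # H + lambda * I is at least eps. Depends continuously on H,
    # and keeps lambda_min + lambda bounded below by eps.
    lambda_min = np.linalg.eigvalsh(H)[0]  # eigenvalues in ascending order
    return max(0.0, eps - lambda_min)

# Indefinite Hessian matrix, as we might see near a saddle point.
H = np.array([[1.0,  0.0],
              [0.0, -2.0]])
g = np.array([0.3, 0.1])

lam = uniform_regularization(H)               # 2.001 for this H
v = np.linalg.solve(H + lam * np.eye(2), -g)  # regularized Newton step
print(lam, v)
```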
Uniform regularization can be seen as interpolating between Newton's method and gradient descent. To see why, consider the regularized Newton step $v$ defined by the equation
```math
@@ -78,13 +77,13 @@ Expressing $f^{(1)}_p(\_\!\_)$ as $(\tilde{F}^{(1)}_p, \_\!\_)$ and following th
```
The operator $\big[ \tilde{F}^{(2)}_p + \lambda \big]^{-1}$ stretches $V$ along the eigenspaces of $\tilde{F}^{(2)}_p$, which are the principal curvature directions of $f$ at $p$ with respect to the inner product $(\_\!\_, \_\!\_)$. Projectively, this stretching pulls lines in $V$ away from the eigenspaces with higher eigenvalues, where $f$ curves more strongly upward or less strongly downward, and towards the eigenspaces with lower eigenvalues, where $f$ curves less strongly upward or more strongly downward. Thus, heuristically, applying $\big[ \tilde{F}^{(2)}_p + \lambda \big]^{-1}$ to the gradient descent direction $-\tilde{F}^{(1)}_p$ trades some of our downward velocity for downward acceleration. When we set $\lambda$ to zero, which is allowed when $\tilde{F}^{(2)}_p$ is already positive-definite, this trade is perfectly calibrated to point us toward the minimum of the second-order approximation, and the regularized Newton step reduces to the classic Newton step. When we let $\lambda$ grow, the projective action of $\big[ \tilde{F}^{(2)}_p + \lambda \big]^{-1}$ approaches the identity, and the regularized Newton step approaches $\lambda^{-1}$ times the gradient descent step.
- **[UY]** Kenji Ueda and Nobuo Yamashita. [“Convergence Properties of the Regularized Newton Method for the Unconstrained Nonconvex Optimization.”](https://doi.org/10.1007/s00245-009-9094-9) *Applied Mathematics and Optimization* 62, 2010.
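
Both limits are easy to check numerically. In the sketch below (matrix and gradient chosen arbitrarily for illustration), the regularized step with $\lambda = 0$ agrees with the classic Newton step, and for large $\lambda$ the step approaches $\lambda^{-1}$ times the gradient descent step $-g$.

```python
import numpy as np

# Arbitrary positive-definite "Hessian" and gradient, for illustration only.
H = np.array([[4.0, 1.0],
              [1.0, 2.0]])
g = np.array([1.0, -0.5])

def regularized_step(lam):
    # Solve (H + lam * I) v = -g.
    return np.linalg.solve(H + lam * np.eye(2), -g)

# lambda = 0 recovers the classic Newton step.
print(regularized_step(0.0), np.linalg.solve(H, -g))

# As lambda grows, lambda * v approaches the gradient descent direction -g.
for lam in [1e2, 1e4, 1e6]:
    print(lam, lam * regularized_step(lam), -g)
```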
#### Positive-definite truncation
_To be added_
- **[PMR]** Santiago Paternain, Aryan Mokhtari, and Alejandro Ribeiro. [“A Newton-Based Method for Nonconvex Optimization with Fast Evasion of Saddle Points.”](https://doi.org/10.1137/17M1150116) *SIAM Journal on Optimization* 29(1), 2019.
#### Modified Cholesky decomposition