Make corrections and improve citations

Vectornaut 2025-11-04 21:47:00 +00:00
parent fb9765f7fd
commit c64fdfa7ae

@@ -8,13 +8,12 @@ Let's say we're doing continuous-time gradient descent on a Riemannian manif
Suppose that the loss function is _Morse_: it's smooth, and all its critical points are isolated and non-degenerate. Then every gradient descent path starts in an end of the search space or at a critical point and ends at a critical point. By the stable manifold theorem, the ascending and descending manifolds of a critical point have the same dimensions as the positive and negative eigenspaces of the Hessian, respectively. In particular, the ascending manifold of a local minimum has codimension zero.
Suppose that the loss function is _Morse–Smale_: it's Morse, and all its ascending manifolds are transverse to all its descending manifolds. Then every gradient descent path that starts at a critical point must end at a lower-index critical point [Ch]. It should follow, in light of the discussion above, that almost every gradient descent path ends at a local minimum. The only exceptions are the paths that lie in the ascending manifolds of saddle points, which have positive codimension.
The picture for discrete-time gradient descent is similar [Re].
- **[Ch]** Guoning Chen. “Lecture 8: Morse–Smale complex and applications.” [COSC 6397: Feature Detection in Data Analysis.](https://www2.cs.uh.edu/~chengu/Teaching/Spring2013/Analysis_Spring2013.html) University of Houston, Spring 2013.
- **[Re]** Benjamin Recht. [“Saddles Again.”](https://www.offconvex.org/2016/03/24/saddles-again/) _Off the Convex Path_, 2016.
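
A minimal numerical sketch of this picture (the function, step size, and sample count here are illustrative choices, not taken from the references): for the Morse–Smale function $f(x, y) = \tfrac{1}{4}(x^2 - 1)^2 + \tfrac{1}{2} y^2$, the only critical points are minima at $(\pm 1, 0)$ and a saddle at the origin, whose ascending manifold is the measure-zero line $x = 0$. Discrete-time gradient descent from random starting points should therefore essentially always end up at one of the two minima.

```python
import numpy as np

def grad(p):
    # Gradient of f(x, y) = (x^2 - 1)^2 / 4 + y^2 / 2.
    x, y = p
    return np.array([x * (x**2 - 1), y])

def descend(p, step=0.05, iters=2000):
    # Plain discrete-time gradient descent with a fixed step size.
    for _ in range(iters):
        p = p - step * grad(p)
    return p

rng = np.random.default_rng(0)
endpoints = [descend(rng.uniform(-2, 2, size=2)) for _ in range(1000)]

# Count how many runs end near each critical point.
near = lambda p, q: np.linalg.norm(p - np.array(q)) < 1e-3
print(sum(near(p, (1, 0)) for p in endpoints),    # minimum at (1, 0)
      sum(near(p, (-1, 0)) for p in endpoints),   # minimum at (-1, 0)
      sum(near(p, (0, 0)) for p in endpoints))    # saddle: essentially never
```

The only starting points that converge to the saddle are the ones with $x = 0$ exactly, which random sampling essentially never produces.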
#### Slogging past saddle points
@@ -36,7 +35,7 @@ f(p + v) \in f^{(0)}_p + f^{(1)}_p(v) + \tfrac{1}{2} f^{(2)}_p(v, v) + \mathfrak
```
Here, $v$ is a $V$-valued variable, each $f^{(k)}_p$ is a symmetric $k$-linear form on $V$, and $\mathfrak{m}$ is the ideal of smooth functions on $V$ that vanish to first order at the origin. The form $f^{(k)}_p$ is called the *$k$th derivative* of $f$ at $p$, and it turns out that $f^{(2)}_p(v, \_\!\_)$ is the derivative of $f^{(1)}_{p}(v)$ with respect to $p$. Most people writing about Newton's method use the term *Hessian* to refer to either the second derivative or an operator that represents it.
When the second derivative is positive-definite, the second-order approximation has a unique minimum. Newton's method is based on the hope that this minimum is near a local minimum of $f$. It works by repeatedly stepping toward the minimum of the second-order approximation and then taking a new second-order approximation at the point where you end up [NW, §2.2]. The minimum of the second-order approximation is nicely characterized as the place where the derivative of the second-order approximation vanishes. The *Newton step*—the vector $v \in V$ that takes us to the minimum of the second-order approximation—is therefore the solution of the following equation:
```math
f^{(1)}_p(\_\!\_) + f^{(2)}_p(v, \_\!\_) = 0.
```
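
Concretely, if we pick a basis for $V$ and represent the first derivative by a gradient vector `g` and the second derivative by a Hessian matrix `H`, the defining equation above reads $g + Hv = 0$, so the Newton step is $v = -H^{-1} g$. Here's a minimal sketch; the test function and starting point are made up for illustration.

```python
import numpy as np

def grad(p):
    # Gradient of the convex test function f(x, y) = exp(x) + exp(-x) + y^2.
    x, y = p
    return np.array([np.exp(x) - np.exp(-x), 2 * y])

def hessian(p):
    # Hessian of the same function; positive-definite everywhere.
    x, y = p
    return np.array([[np.exp(x) + np.exp(-x), 0.0],
                     [0.0,                    2.0]])

p = np.array([1.5, -0.8])
for _ in range(6):
    # Newton step: solve H v = -g rather than forming H^{-1} explicitly.
    v = np.linalg.solve(hessian(p), -grad(p))
    p = p + v
print(p)  # converges rapidly to the minimum at (0, 0)
```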
@@ -59,11 +58,11 @@ using the non-degeneracy of the inner product. When the bilinear form $f^{(2)}_p
If $f$ is convex, its second derivative is positive-definite everywhere, so the Newton step is always well-defined. However, non-convex loss functions show up in many interesting problems, including ours. For these problems, we need to decide how to step at a point where the second derivative is indefinite. One approach is to *regularize* the equation that defines the Newton step by making some explicit or implicit modification of the second derivative that turns it into a positive-definite bilinear form. We'll discuss some regularization methods below.
- **[NW]** Jorge Nocedal and Stephen J. Wright. *Numerical Optimization,* second edition. Springer, 2006.
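
As a toy illustration of the problem (an example chosen here for simplicity, not taken from [NW]): take $V = \mathbb{R}^2$ with the standard inner product and $f(x, y) = \tfrac{1}{2}(x^2 - y^2)$. The second derivative at every point is the indefinite form with matrix $\operatorname{diag}(1, -1)$, and solving the Newton-step equation at $p = (x, y)$ gives

```math
v = -\begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix}^{-1} \begin{bmatrix} x \\ -y \end{bmatrix} = \begin{bmatrix} -x \\ -y \end{bmatrix},
```

which steps from $(x, y)$ straight to the origin, a saddle point rather than a minimum. Regularization is meant to rule out exactly this kind of step.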
#### Uniform regularization
Given an inner product $(\_\!\_, \_\!\_)$ on $V$, we can make the modified second derivative $f^{(2)}_p(v, \_\!\_) + \lambda (\_\!\_, \_\!\_)$ positive-definite by choosing a large enough coefficient $\lambda$. We can say precisely what it means for $\lambda$ to be large enough by expressing $f^{(2)}_p$ as $(\_\!\_, \tilde{F}^{(2)}_p\_\!\_)$ and taking the lowest eigenvalue $\lambda_{\text{min}}$ of $\tilde{F}^{(2)}_p$. The modified second derivative is positive-definite when $\lambda_\text{min} + \lambda$ is positive. We typically choose $\lambda$ in a way that depends continuously on $\tilde{F}^{(2)}_p$ and keeps $\lambda_\text{min} + \lambda$ from getting too small. A simple but effective way of choosing $\lambda$ is described in [UY, §2].
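
Here is a minimal sketch of one such choice, written in matrix form (a generic eigenvalue-shift rule for illustration, not necessarily the exact rule from [UY]): shift just enough to push the lowest eigenvalue of the modified form up to some floor $\varepsilon > 0$.

```python
import numpy as np

def uniform_regularization(H, eps=1e-3):
    # Smallest shift lambda >= 0 such that every eigenvalue of
    # H + lambda * I is at least eps. Depends continuously on H,
    # and keeps lambda_min + lambda bounded below by eps.
    lambda_min = np.linalg.eigvalsh(H)[0]  # eigenvalues in ascending order
    return max(0.0, eps - lambda_min)

# Indefinite Hessian matrix, as we might see near a saddle point.
H = np.array([[1.0,  0.0],
              [0.0, -2.0]])
g = np.array([0.3, 0.1])

lam = uniform_regularization(H)               # 2.001 for this H
v = np.linalg.solve(H + lam * np.eye(2), -g)  # regularized Newton step
print(lam, v)
```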
Uniform regularization can be seen as interpolating between Newton's method and gradient descent. To see why, consider the regularized Newton step $v$ defined by the equation
```math
@@ -78,13 +77,13 @@ Expressing $f^{(1)}_p(\_\!\_)$ as $(\tilde{F}^{(1)}_p, \_\!\_)$ and following th
```
The operator $\big[ \tilde{F}^{(2)}_p + \lambda \big]^{-1}$ stretches $V$ along the eigenspaces of $\tilde{F}^{(2)}_p$, which are the principal curvature directions of $f$ at $p$ with respect to the inner product $(\_\!\_, \_\!\_)$. Projectively, this stretching pulls lines in $V$ away from the eigenspaces with higher eigenvalues, where $f$ curves more strongly upward or less strongly downward, and towards the eigenspaces with lower eigenvalues, where $f$ curves less strongly upward or more strongly downward. Thus, heuristically, applying $\big[ \tilde{F}^{(2)}_p + \lambda \big]^{-1}$ to the gradient descent direction $-\tilde{F}^{(1)}_p$ trades some of our downward velocity for downward acceleration. When we set $\lambda$ to zero, which is allowed when $\tilde{F}^{(2)}_p$ is already positive-definite, this trade is perfectly calibrated to point us toward the minimum of the second-order approximation, and the regularized Newton step reduces to the classic Newton step. When we let $\lambda$ grow, the projective action of $\big[ \tilde{F}^{(2)}_p + \lambda \big]^{-1}$ approaches the identity, and the regularized Newton step approaches $\lambda^{-1}$ times the gradient descent step.
- **[UY]** Kenji Ueda and Nobuo Yamashita. [“Convergence Properties of the Regularized Newton Method for the Unconstrained Nonconvex Optimization.”](https://doi.org/10.1007/s00245-009-9094-9) *Applied Mathematics and Optimization* 62, 2010.
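
Both limits are easy to check numerically. In the sketch below (matrix and gradient chosen arbitrarily for illustration), the regularized step with $\lambda = 0$ agrees with the classic Newton step, and for large $\lambda$ the step approaches $\lambda^{-1}$ times the gradient descent step $-g$.

```python
import numpy as np

# Arbitrary positive-definite "Hessian" and gradient, for illustration only.
H = np.array([[4.0, 1.0],
              [1.0, 2.0]])
g = np.array([1.0, -0.5])

def regularized_step(lam):
    # Solve (H + lam * I) v = -g.
    return np.linalg.solve(H + lam * np.eye(2), -g)

# lambda = 0 recovers the classic Newton step.
print(regularized_step(0.0), np.linalg.solve(H, -g))

# As lambda grows, lambda * v approaches the gradient descent direction -g.
for lam in [1e2, 1e4, 1e6]:
    print(lam, lam * regularized_step(lam), -g)
```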
#### Positive-definite truncation
_To be added_
- **[PMR]** Santiago Paternain, Aryan Mokhtari, and Alejandro Ribeiro. [“A Newton-Based Method for Nonconvex Optimization with Fast Evasion of Saddle Points.”](https://doi.org/10.1137/17M1150116) *SIAM Journal on Optimization* 29(1), 2019.
#### Modified Cholesky decomposition