diff --git a/Numerical-optimization.md b/Numerical-optimization.md
index 2b08368..ff7389c 100644
--- a/Numerical-optimization.md
+++ b/Numerical-optimization.md
@@ -8,13 +8,12 @@ Let’s say we’re doing continuous-time gradient descent on a Riemannian manif
 
 Suppose that the loss function is _Morse_: it’s smooth, and all its critical points are isolated and non-degenerate. Then every gradient descent path starts in an end of the search space or at a critical point and ends at a critical point. By the stable manifold theorem, the ascending and descending manifolds of a critical point have the same dimensions as the positive and negative eigenspaces of the Hessian, respectively. In particular, the ascending manifold of a local minimum has codimension zero.
 
-Suppose that the loss function is _Morse–Smale_: it’s Morse, and all its ascending manifolds are transverse to all its descending manifolds. Then every gradient descent path that starts at a critical point must end at a lower-index critical point. It should follow, in light of the discussion above, that almost every gradient descent path ends at a local minimum. The only exceptions are the paths that lie in the ascending manifolds of saddle points, which have positive codimension.
+Suppose that the loss function is _Morse–Smale_: it’s Morse, and all its ascending manifolds are transverse to all its descending manifolds. Then every gradient descent path that starts at a critical point must end at a lower-index critical point [Ch]. It should follow, in light of the discussion above, that almost every gradient descent path ends at a local minimum. The only exceptions are the paths that lie in the ascending manifolds of saddle points, which have positive codimension.
 
-- Guoning Chen. “Lecture 8: Morse–Smale complex and applications.” [COSC 6397: Feature Detection in Data Analysis.](https://www2.cs.uh.edu/~chengu/Teaching/Spring2013/Analysis_Spring2013.html) University of Houston, Spring 2013.
+The picture for discrete-time gradient descent is similar [Re].
 
-The picture for discrete-time gradient descent is similar.
-
-- Recht, Benjamin. [“Saddles Again.”](https://www.offconvex.org/2016/03/24/saddles-again/) _Off the Convex Path_, 2016.
+- **[Ch]** Guoning Chen. “Lecture 8: Morse–Smale complex and applications.” [COSC 6397: Feature Detection in Data Analysis.](https://www2.cs.uh.edu/~chengu/Teaching/Spring2013/Analysis_Spring2013.html) University of Houston, Spring 2013.
+- **[Re]** Benjamin Recht. [“Saddles Again.”](https://www.offconvex.org/2016/03/24/saddles-again/) _Off the Convex Path_, 2016.
 
 #### Slogging past saddle points
 
@@ -36,7 +35,7 @@ f(p + v) \in f^{(0)}_p + f^{(1)}_p(v) + \tfrac{1}{2} f^{(2)}_p(v, v) + \mathfrak
 ```
 Here, $v$ is a $V$-valued variable, each $f^{(k)}_p$ is a symmetric $k$-linear form on $V$, and $\mathfrak{m}$ is the ideal of smooth functions on $V$ that vanish to first order at the origin. The form $f^{(k)}_p$ is called the *$k$th derivative* of $f$ at $p$, and it turns out that $f^{(2)}(v, \_\!\_)$ is the derivative of $f^{(1)}_{p}(v)$ with respect to $p$. Most people writing about Newton’s method use the term *Hessian* to refer to either the second derivative or an operator that represents it.
 
-When the second derivative is positive-definite, the second-order approximation has a unique minimum. Newton’s method is based on the hope that this minimum is near a local minimum of $f$. It works by repeatedly stepping to the minimum of the second-order approximation and then taking a new second-order approximation at that point. The minimum of the second-order approximation is nicely characterized as the place where the derivative of the second-order approximation vanishes. The *Newton step*—the vector $v \in V$ that takes us to the minimum of the second-order approximation—is therefore the solution of the following equation:
+When the second derivative is positive-definite, the second-order approximation has a unique minimum. Newton’s method is based on the hope that this minimum is near a local minimum of $f$. It works by repeatedly stepping toward the minimum of the second-order approximation and then taking a new second-order approximation at the point where we end up [NW, §2.2]. The minimum of the second-order approximation is nicely characterized as the place where the derivative of the second-order approximation vanishes. The *Newton step*—the vector $v \in V$ that takes us to the minimum of the second-order approximation—is therefore the solution of the following equation:
 ```math
 f^{(1)}_p(\_\!\_) + f^{(2)}_p(v, \_\!\_) = 0.
 ```
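For concreteness, here is a minimal sketch of one Newton iteration, assuming we have picked a basis in which the first derivative is represented by a gradient vector `g` and the second derivative by a symmetric matrix `H`; in that representation the equation above reads $Hv = -g$. The test function is an arbitrary illustrative choice.

```python
import numpy as np

def newton_step(g, H):
    """Solve f^(1)_p(__) + f^(2)_p(v, __) = 0 for v; in matrix form, H v = -g."""
    return np.linalg.solve(H, -g)

# One Newton iteration on an illustrative function f(x, y) = x**4 + y**2.
def gradient(p):
    x, y = p
    return np.array([4 * x**3, 2 * y])

def hessian(p):
    x, _ = p
    return np.array([[12 * x**2, 0.0],
                     [0.0,        2.0]])

p = np.array([1.0, 1.0])
v = newton_step(gradient(p), hessian(p))  # step to the minimum of the quadratic model
p = p + v                                 # next iterate; repeat until convergence
```

Everything that follows is about what to do when `H` fails to be positive-definite and this linear solve stops being trustworthy.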
@@ -59,11 +58,11 @@ using the non-degeneracy of the inner product. When the bilinear form $f^{(2)}_p
 
 If $f$ is convex, its second derivative is positive-definite everywhere, so the Newton step is always well-defined. However, non-convex loss functions show up in many interesting problems, including ours. For these problems, we need to decide how to step at a point where the second derivative is indefinite. One approach is to *regularize* the equation that defines the Newton step by making some explicit or implicit modification of the second derivative that turns it into a positive-definite bilinear form. We’ll discuss some regularization methods below.
 
-- Jorge Nocedal and Stephen J. Wright. *Numerical Optimization,* second edition. Springer, 2006.
+- **[NW]** Jorge Nocedal and Stephen J. Wright. *Numerical Optimization,* second edition. Springer, 2006.
 
 #### Uniform regularization
 
-Given an inner product $(\_\!\_, \_\!\_)$ on $V$, we can make the modified second derivative $f^{(2)}_p(v, \_\!\_) + \lambda (\_\!\_, \_\!\_)$ positive-definite by choosing a large enough coefficient $\lambda$. We can say precisely what it means for $\lambda$ to be large enough by expressing $f^{(2)}_p$ as $(\_\!\_, \tilde{F}^{(2)}_p\_\!\_)$ and taking the lowest eigenvalue $\lambda_{\text{min}}$ of $\tilde{F}^{(2)}_p$. The modified second derivative is positive-definite when $\lambda_\text{min} + \lambda$ is positive. We typically choose $\lambda$ in a way that depends continuously on $\tilde{F}^{(2)}_p$ and keeps $\lambda_\text{min} + \lambda$ bounded away from zero.
+Given an inner product $(\_\!\_, \_\!\_)$ on $V$, we can make the modified second derivative $f^{(2)}_p(\_\!\_, \_\!\_) + \lambda (\_\!\_, \_\!\_)$ positive-definite by choosing a large enough coefficient $\lambda$. We can say precisely what it means for $\lambda$ to be large enough by expressing $f^{(2)}_p$ as $(\_\!\_, \tilde{F}^{(2)}_p\_\!\_)$ and taking the lowest eigenvalue $\lambda_{\text{min}}$ of $\tilde{F}^{(2)}_p$. The modified second derivative is positive-definite when $\lambda_\text{min} + \lambda$ is positive. We typically choose $\lambda$ in a way that depends continuously on $\tilde{F}^{(2)}_p$ and keeps $\lambda_\text{min} + \lambda$ from getting too small. A simple but effective way of choosing $\lambda$ is described in [UY, §2].
 
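For concreteness again, here is a minimal sketch in the same matrix representation, with `g` representing the first derivative and `H` representing $\tilde{F}^{(2)}_p$. The shift rule below, which raises the smallest eigenvalue of `H + lam * I` to a fixed margin, is only an illustrative choice, not the rule from [UY, §2].

```python
import numpy as np

def regularized_newton_step(g, H, margin=1e-3):
    """Solve (H + lam * I) v = -g, with lam just large enough that the
    smallest eigenvalue of H + lam * I is at least `margin`."""
    lam_min = np.linalg.eigvalsh(H).min()   # lowest principal curvature
    lam = max(0.0, margin - lam_min)        # no shift needed if H is already safely positive-definite
    return np.linalg.solve(H + lam * np.eye(len(g)), -g)

# With an indefinite H the pure Newton direction need not point downhill,
# but the shifted system is positive-definite, so (for g != 0) the
# regularized step is always a descent direction.
g = np.array([1.0, -2.0])
H = np.array([[2.0, 0.0],
              [0.0, -1.0]])                 # eigenvalues 2 and -1
v = regularized_newton_step(g, H)
assert g @ v < 0
```

As $\lambda$ grows, the solution `v` approaches $-g/\lambda$, which is the interpolation with gradient descent described below.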
 Uniform regularization can be seen as interpolating between Newton’s method and gradient descent. To see why, consider the regularized Newton step $v$ defined by the equation
 ```math
@@ -78,13 +77,13 @@ Expressing $f^{(1)}_p(\_\!\_)$ as $(\tilde{F}^{(1)}_p, \_\!\_)$ and following the same path of reasoning as before, we can express the regularized Newton step as
 ```
 The operator $\big[ \tilde{F}^{(2)}_p + \lambda \big]^{-1}$ stretches $V$ along the eigenspaces of $\tilde{F}^{(2)}_p$, which are the principal curvature directions of $f$ at $p$ with respect to the inner product $(\_\!\_, \_\!\_)$. Projectively, this stretching pulls lines in $V$ away from the eigenspaces with higher eigenvalues, where $f$ curves more strongly upward or less strongly downward, and towards the eigenspaces with lower eigenvalues, where $f$ curves less strongly upward or more strongly downward. Thus, heuristically, applying $\big[ \tilde{F}^{(2)}_p + \lambda \big]^{-1}$ to the gradient descent direction $-\tilde{F}^{(1)}$ trades some of our downward velocity for downward acceleration. When we set $\lambda$ to zero, which is allowed when $\tilde{F}^{(2)}_p$ is already positive-definite, this trade is perfectly calibrated to point us toward the minimum of the second-order approximation, and the regularized Newton step reduces to the classic Newton step. When we let $\lambda$ grow, the projective action of $\big[ \tilde{F}^{(2)}_p + \lambda \big]^{-1}$ approaches the identity, and the regularized Newton step approaches $\lambda^{-1}$ times the gradient descent step.
 
-- Kenji Ueda and Nobuo Yamashita. [“Convergence Properties of the Regularized Newton Method for the Unconstrained Nonconvex Optimization”](https://doi.org/10.1007/s00245-009-9094-9) *Applied Mathematics and Optimization* 62, 2010.
+- **[UY]** Kenji Ueda and Nobuo Yamashita. [“Convergence Properties of the Regularized Newton Method for the Unconstrained Nonconvex Optimization.”](https://doi.org/10.1007/s00245-009-9094-9) *Applied Mathematics and Optimization* 62, 2010.
 
 #### Positive-definite truncation
 
 _To be added_
 
-- Santiago Paternain, Aryan Mokhtari, and Alejandro Ribeiro. [“A Newton-Based Method for Nonconvex Optimization with Fast Evasion of Saddle Points.”](https://doi.org/10.1137/17M1150116) *SIAM Journal on Optimization* 29(1), 2019.
+- **[PMR]** Santiago Paternain, Aryan Mokhtari, and Alejandro Ribeiro. [“A Newton-Based Method for Nonconvex Optimization with Fast Evasion of Saddle Points.”](https://doi.org/10.1137/17M1150116) *SIAM Journal on Optimization* 29(1), 2019.
 
 #### Modified Cholesky decomposition
 
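The regularization strategies above all need to detect indefiniteness somehow. A cheap way to do that is to let a Cholesky factorization fail: add a multiple of the identity and retry until the factorization succeeds, then reuse the triangular factor to solve for the step. The sketch below illustrates that retry scheme, in the spirit of the Hessian-modification methods discussed in [NW]; the constants and the doubling schedule are arbitrary choices, and the factorizations usually called modified Cholesky decompositions refine the same idea by adjusting the matrix during the factorization itself.

```python
import numpy as np

def cholesky_with_added_identity(H, beta=1e-3, max_tries=60):
    """Return (L, tau) with L @ L.T = H + tau * I positive-definite.

    Try tau = 0 first; if the factorization fails, start from a small
    shift and keep doubling it until Cholesky succeeds.
    """
    tau = 0.0 if np.all(np.diag(H) > 0) else beta
    for _ in range(max_tries):
        try:
            L = np.linalg.cholesky(H + tau * np.eye(H.shape[0]))
            return L, tau
        except np.linalg.LinAlgError:
            tau = max(2 * tau, beta)
    raise RuntimeError("could not make H + tau * I positive-definite")

def modified_newton_step(g, H):
    """Newton step computed from the shifted factorization."""
    L, _ = cholesky_with_added_identity(H)
    # Two solves with the triangular factor: L (L.T v) = -g.
    y = np.linalg.solve(L, -g)
    return np.linalg.solve(L.T, y)

# Small demonstration with an indefinite matrix.
H = np.array([[1.0, 2.0],
              [2.0, 1.0]])   # eigenvalues 3 and -1
g = np.array([1.0, 1.0])
v = modified_newton_step(g, H)
```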