Challenges
Getting bogged down near saddle points
A global picture of gradient descent
Let’s say we’re doing continuous-time gradient descent on a Riemannian manifold search space. Suppose, for simplicity, that the loss function is proper and bounded below. This implies that it grows without bound along any path that runs off into an end of the search space.
Suppose that the loss function is Morse: it’s smooth, and all its critical points are non-degenerate (and therefore isolated). Then every gradient descent path starts either in an end of the search space or at a critical point, and ends at a critical point. By the stable manifold theorem, the ascending manifold of a critical point (the points whose gradient descent paths end at it) has the same dimension as the positive eigenspace of the Hessian there, and the descending manifold (the points whose gradient descent paths start at it) has the same dimension as the negative eigenspace. In particular, since the Hessian at a local minimum is positive definite, the ascending manifold of a local minimum has codimension zero.
Suppose that the loss function is Morse–Smale: it’s Morse, and all its ascending manifolds are transverse to all its descending manifolds. Then every gradient descent path that starts at a critical point must end at a critical point of strictly lower index. In light of the discussion above, it should follow that almost every gradient descent path ends at a local minimum: the only exceptions are the paths that lie in the ascending manifolds of saddle points and local maxima, and those manifolds have positive codimension.
- Guoning Chen. “Lecture 8: Morse–Smale complex and applications.” COSC 6397: Feature Detection in Data Analysis. University of Houston, Spring 2013.
The picture for discrete-time gradient descent is similar: with a small enough constant step size, gradient descent from a random starting point should converge to a saddle point with probability zero.
- Benjamin Recht. “Saddles Again.” Off the Convex Path, 2016.
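To see this picture concretely, here’s a small numerical sketch in discrete time. The two-well loss f(x, y) = (x^2 - 1)^2 + y^2 is my own toy example, not taken from the sources above; it has local minima at (1, 0) and (-1, 0) and a saddle at the origin, whose ascending manifold is the measure-zero line x = 0.

```python
import numpy as np

def grad(p):
    # gradient of the two-well loss f(x, y) = (x^2 - 1)^2 + y^2
    x, y = p
    return np.array([4 * x * (x**2 - 1), 2 * y])

rng = np.random.default_rng(0)
step = 0.01
endpoints = []
for _ in range(1000):
    p = rng.uniform(-2, 2, size=2)   # random starting point
    for _ in range(5000):            # plain fixed-step gradient descent
        p = p - step * grad(p)
    endpoints.append(np.sign(p[0]))  # which basin we ended up in: -1.0 or +1.0

# Every run lands at one of the two minima; only a start with x exactly zero
# would converge to the saddle at the origin.
print(np.unique(endpoints, return_counts=True))
```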
Slogging past saddle points
We saw above that the vanishingly rare gradient descent paths that lead to saddle points are the only things that can stop us from finding a local minimum eventually. If we want to find a local minimum quickly, however, the gradient descent paths that pass near saddle points are almost as bad. To see why, consider a gradient descent path \gamma that hits a saddle point and another gradient descent path \gamma' which is close to \gamma at time zero. For each \epsilon > 0, let T_\epsilon be the first time in [0, \infty) when the distance between \gamma and \gamma' grows to at least \epsilon. By bringing \gamma'_0 closer to \gamma_0, we should be able to make T_\epsilon arbitrarily large for any fixed \epsilon. This shows that even if \gamma' eventually passes the saddle point and goes on to reach a local minimum, it can spend arbitrarily long on approach to the saddle point before that happens.
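Here’s the same toy loss as above (still my own example, in discrete time for simplicity), used to check this slow-down: descent paths that start close to the saddle’s ascending manifold x = 0 take longer and longer to clear the saddle as the starting distance shrinks.

```python
import numpy as np

def grad(p):
    # gradient of f(x, y) = (x^2 - 1)^2 + y^2; saddle at the origin
    x, y = p
    return np.array([4 * x * (x**2 - 1), 2 * y])

def steps_to_pass_saddle(x0, step=0.01, max_steps=1_000_000):
    """Descent steps taken before |x| exceeds 0.5, i.e. before the path
    clears the saddle and heads toward one of the minima at (1, 0) or (-1, 0)."""
    p = np.array([x0, 1.0])  # starts at distance x0 from the ascending manifold x = 0
    for t in range(max_steps):
        if abs(p[0]) > 0.5:
            return t
        p = p - step * grad(p)
    return max_steps

# Halving the starting distance to the ascending manifold adds a roughly
# constant number of steps, so the time spent near the saddle grows without bound.
for x0 in [1e-2, 1e-4, 1e-8, 1e-16]:
    print(x0, steps_to_pass_saddle(x0))
```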
Affected optimization methods
Uniform regularization can be seen as an interpolation between Newton’s method and gradient descent: it kicks in when the lowest eigenvalue of the Hessian drops below zero, and it pulls the search direction closer to the gradient descent direction as the lowest eigenvalue gets more negative. Since the Hessian is indefinite near a saddle point, Newton’s method with uniform regularization should act at least roughly like gradient descent there. This suggests that it could get bogged down near saddle points in the same way.
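To make the interpolation explicit, here’s a minimal sketch; the example Hessian, gradient, and shift values are illustrative choices, not anything fixed by the text above. Uniform regularization here means adding a multiple \lambda of the identity to the Hessian before solving the Newton system: \lambda = 0 recovers Newton’s method, and as \lambda grows the direction approaches a small multiple of the negative gradient.

```python
import numpy as np

def search_direction(g, H, lam):
    """Uniformly regularized Newton direction: -(H + lam*I)^{-1} g."""
    return -np.linalg.solve(H + lam * np.eye(len(g)), g)

# Indefinite Hessian, as near a saddle point (eigenvalues 2 and -4).
H = np.array([[2.0,  0.0],
              [0.0, -4.0]])
g = np.array([0.5, 0.3])

# lam = 0 is the plain Newton direction, which points the wrong way (toward
# the model's saddle) along the negative-curvature axis; lam must exceed 4
# to make H + lam*I positive definite; for large lam the direction is
# approximately -g / lam, a short gradient descent step.
for lam in [0.0, 5.0, 50.0, 500.0]:
    d = search_direction(g, H, lam)
    print(lam, d, d / np.linalg.norm(d))
```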
Methods
Newton’s method
To be added
Uniform regularization
To be added
Positive-definite truncation
To be added