2 Numerical optimization

Challenges

Getting bogged down near saddle points

A global picture of gradient descent

Let's say we're doing continuous-time gradient descent on a search space that's a Riemannian manifold. Suppose, for simplicity, that the loss function is proper and bounded below. This implies that it grows without bound along any path that runs off into an end of the search space.
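
Concretely, writing f for the loss function, a continuous-time gradient descent path can be taken to be a solution \gamma of the downward gradient flow equation

\dot{\gamma}(t) = -\operatorname{grad} f(\gamma(t)),

where \operatorname{grad} f is the gradient of f with respect to the Riemannian metric.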

Suppose that the loss function is Morse: it's smooth, and all its critical points are isolated and non-degenerate. Then every gradient descent path starts in an end of the search space or at a critical point and ends at a critical point. By the stable manifold theorem, the ascending and descending manifolds of a critical point have the same dimensions as the positive and negative eigenspaces of the Hessian, respectively. In particular, the ascending manifold of a local minimum has codimension zero.
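
In symbols: if the search space has dimension n and p is a critical point whose Hessian has k negative eigenvalues (its index) and n - k positive ones, the stable manifold theorem gives

\dim W^{\uparrow}(p) = n - k, \qquad \dim W^{\downarrow}(p) = k,

where W^{\uparrow}(p) and W^{\downarrow}(p) are our notation here for the ascending and descending manifolds of p.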

Suppose that the loss function is Morse–Smale: it's Morse, and all its ascending manifolds are transverse to all its descending manifolds. Then every gradient descent path that starts at a critical point must end at a lower-index critical point. It should follow, in light of the discussion above, that almost every gradient descent path ends at a local minimum. The only exceptions are the paths that lie in the ascending manifolds of saddle points, which have positive codimension.

The picture for discrete-time gradient descent is similar.
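
For reference, one standard discrete-time analogue of the flow above is Riemannian gradient descent with step size \eta > 0,

x_{k+1} = \exp_{x_k}\!\big(-\eta \operatorname{grad} f(x_k)\big),

where \exp is the Riemannian exponential map (any retraction can stand in for it). For small \eta, the trajectories of this update shadow the continuous-time paths, which is why the same saddle-point behavior shows up.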

Slogging past saddle points

We saw above that the vanishingly rare gradient descent paths that lead to saddle points are the only things that can stop us from finding a local minimum eventually. If we want to find a local minimum quickly, however, the gradient descent paths that pass near saddle points are almost as bad. To see why, consider a gradient descent path \gamma that converges to a saddle point and another gradient descent path \gamma' that is close to \gamma at time zero. For each \epsilon > 0, let T_\epsilon be the first time in [0, \infty) when the distance between \gamma and \gamma' grows to at least \epsilon. By bringing \gamma'_0 closer to \gamma_0, we should be able to make T_\epsilon arbitrarily large for any fixed \epsilon. This shows that even if \gamma' eventually passes the saddle point and goes on to reach a local minimum, it can spend arbitrarily long approaching the saddle point before that happens.
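
For a concrete illustration (a model example, not a loss function from this project), take the Euclidean plane with loss f(x, y) = (x^2 - y^2)/2, which has a saddle at the origin. The gradient flow is \dot{x} = -x, \dot{y} = y, so the path starting at (x_0, 0) converges to the saddle, while a nearby path starting at (x_0, \delta) has y(t) = \delta e^t. The two paths first drift \epsilon apart at time

T_\epsilon = \log(\epsilon / |\delta|),

which grows without bound as \delta \to 0. The nearby path does eventually escape the saddle, but only after a wait that blows up as its starting point approaches the ascending manifold.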

Affected optimization methods

Uniform regularization can be seen as an interpolation between Newton's method and gradient descent: it kicks in when the lowest eigenvalue of the Hessian drops below zero, and it brings the search direction closer to the gradient descent direction as the lowest eigenvalue gets more negative. Since the Hessian is indefinite near a saddle point, Newton's method with uniform regularization should act at least roughly like gradient descent there. This suggests that it could get bogged down near saddle points in the same way.
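
To make the interpolation concrete, here is a minimal numpy sketch of a uniformly regularized Newton step. The shift policy in example_shift is a hypothetical illustration, not taken from this project's code.

```python
import numpy as np

def regularized_newton_step(grad, hess, shift):
    """Solve (H + shift * I) d = -g for the search direction d.

    With shift = 0 this is a pure Newton step; as shift grows, d approaches
    -grad / shift, i.e. a scaled-down gradient descent direction.
    """
    n = grad.shape[0]
    return np.linalg.solve(hess + shift * np.eye(n), -grad)

def example_shift(hess, floor=1e-3):
    """Illustrative shift choice (hypothetical): push the lowest eigenvalue
    of the shifted Hessian up to at least `floor`, so the shift grows as the
    lowest eigenvalue gets more negative."""
    lowest = np.linalg.eigvalsh(hess)[0]  # eigenvalues are returned in ascending order
    return max(0.0, floor - lowest)
```

Near a saddle point the lowest eigenvalue is negative, so the shift is large and the step leans toward the gradient descent direction; this is exactly the regime where the slowdown described above can bite.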

Methods

Newton's method

To be added

Uniform regularization

To be added

Positive-definite truncation

To be added