Numerical optimization

Challenges

Getting bogged down near saddle points

A global picture of gradient descent

Let's say we're doing continuous-time gradient descent on a Riemannian manifold search space. Suppose, for simplicity, that the loss function is proper and bounded below. This implies that it grows without bound along any path that runs off into an end of the search space.

Suppose that the loss function is Morse: it's smooth, and all its critical points are isolated and non-degenerate. Then every gradient descent path starts in an end of the search space or at a critical point and ends at a critical point. By the stable manifold theorem, the ascending and descending manifolds of a critical point have the same dimensions as the positive and negative eigenspaces of the Hessian, respectively. In particular, the ascending manifold of a local minimum has codimension zero.

Suppose that the loss function is Morse–Smale: it's Morse, and all its ascending manifolds are transverse to all its descending manifolds. Then every gradient descent path that starts at a critical point must end at a lower-index critical point [Ch]. It should follow, in light of the discussion above, that almost every gradient descent path ends at a local minimum. The only exceptions are the paths that lie in the ascending manifolds of saddle points, which have positive codimension.

The picture for discrete-time gradient descent is similar [Re].

Slogging past saddle points

We saw above that the vanishingly rare gradient descent paths that lead to saddle points are the only things that can stop us from finding a local minimum eventually. If we want to find a local minimum quickly, however, the gradient descent paths that pass near saddle points are almost as bad. To see why, consider a gradient descent path \gamma that hits a saddle point and another gradient descent path \gamma' which is close to \gamma at time zero. For each \epsilon > 0, let T_\epsilon be the first time in [0, \infty) when the distance between \gamma and \gamma' grows to at least \epsilon. By bringing \gamma'_0 closer to \gamma_0, we should be able to make T_\epsilon arbitrarily large for any fixed \epsilon. This shows that even if \gamma' eventually passes the saddle point and goes on to reach a local minimum, it can spend arbitrarily long on approach to the saddle point before that happens.
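To see this play out numerically, here's a minimal, self-contained sketch (a model problem, not the project's actual loss) that runs explicit gradient descent on f(x, y) = x^2 - y^2, which has a saddle at the origin whose ascending manifold is the x-axis. The closer the starting point is to the x-axis, the more steps the path spends near the saddle before moving on.

```rust
// Model problem, not from the project: explicit gradient descent on
// f(x, y) = x^2 - y^2, whose only critical point is a saddle at the origin
// with the x-axis as its ascending manifold. Paths that start closer to the
// x-axis take longer to clear the saddle region.
fn main() {
    let h = 0.01; // step size
    for &eps in &[1e-2, 1e-4, 1e-6, 1e-8_f64] {
        let (mut x, mut y) = (1.0_f64, eps);
        let mut steps = 0;
        // run until the path has clearly moved past the saddle
        while y.abs() < 1.0 && steps < 100_000 {
            let (gx, gy) = (2.0 * x, -2.0 * y); // gradient of f
            x -= h * gx;
            y -= h * gy;
            steps += 1;
        }
        println!("y(0) = {:e}: cleared the saddle region after {} steps", eps, steps);
    }
}
```

In this model, the step count grows like \log(1 / y(0)), so it can be made as large as we like by starting closer to the saddle's ascending manifold.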

Affected optimization methods

Uniform regularization can be seen as an interpolation between Newton's method and gradient descent: it kicks in when the lowest eigenvalue of the Hessian drops below zero, and it brings the search direction closer to the gradient descent direction as the lowest eigenvalue gets more negative. Since the Hessian is indefinite near a saddle point, Newton's method with uniform regularization should act at least roughly like gradient descent near a saddle point. This suggests that it could get bogged down near saddle points in the same way.

Flightiness

To be added

Gratuitous symmetry breaking

In some cases, a configuration has a symmetry that must be broken to satisfy a newly imposed constraint. This happens, for example, when a point at the origin is constrained to lie on a sphere centered at the origin. In other cases, however, a configuration has a symmetry which is logically and aesthetically compatible with a newly imposed constraint. This happens, for example, when three equal-curvature spheres that intersect each other at 90° angles are constrained to instead intersect each other at 60° angles. A user might reasonably expect the realization process to preserve this symmetry, at least approximately. When it doesn't, we'll say the symmetry has been gratuitously broken.

The idea of gratuitous symmetry breaking can be extended to approximate symmetries. For example, suppose three spheres of roughly the same curvature intersect each other at roughly 90° angles. If we constrain these spheres to intersect each other at 60° angles, we might reasonably expect their curvatures to stay roughly the same as each other. If the curvatures instead come out wildly different from each other, we can say that the approximate symmetry of the initial configuration has been gratuitously broken.

Methods

Newton's method

Review

Let's say we're trying to minimize a smooth function f on an affine search space with underlying vector space V. We can use the affine structure to take a second-order approximation of f near any point p:

f(p + v) \in f^{(0)}_p + f^{(1)}_p(v) + \tfrac{1}{2} f^{(2)}_p(v, v) + \mathfrak{m}^3

Here, v is a $V$-valued variable, each f^{(k)}_p is a symmetric $k$-linear form on V, and \mathfrak{m} is the ideal of smooth functions on V that vanish at the origin. The form f^{(k)}_p is called the $k$th derivative of f at p, and it turns out that f^{(2)}_p(v, \_\!\_) is the derivative of f^{(1)}_p(v) with respect to p. Most people writing about Newton's method use the term Hessian to refer to either the second derivative or an operator that represents it.
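For example, take V = \mathbb{R}^2, f(x, y) = x^2 y, and p = (1, 1). Then

\begin{align*}
f^{(0)}_p & = 1 \\
f^{(1)}_p(v) & = 2 v_1 + v_2 \\
f^{(2)}_p(v, v) & = 2 v_1^2 + 4 v_1 v_2.
\end{align*}

This second derivative happens to be indefinite: the matrix that represents it in the standard basis has determinant -4.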

When the second derivative is positive-definite, the second-order approximation has a unique minimum. Newton's method is based on the hope that this minimum is near a local minimum of f. It works by repeatedly stepping toward the minimum of the second-order approximation and then taking a new second-order approximation at the point where you end up [NW, §2.2]. The minimum of the second-order approximation is nicely characterized as the place where the derivative of the second-order approximation vanishes. The Newton step, meaning the vector v \in V that takes us to the minimum of the second-order approximation, is therefore the solution of the following equation:

f^{(1)}_p(\_\!\_) + f^{(2)}_p(v, \_\!\_) = 0.

For computations, we'll choose a basis \mathbb{R}^n \to V and express f^{(1)}_p and f^{(2)}_p in terms of the standard inner product \langle \_\!\_, \_\!\_ \rangle on \mathbb{R}^n, writing

\begin{align*}
f^{(1)}_p(w) & = \langle F^{(1)}_p, w \rangle \\
f^{(2)}_p(v, w) & = \langle F^{(2)}_p v, w \rangle
\end{align*}

with a vector F^{(1)}_p \in \mathbb{R}^n and a symmetric operator F^{(2)}_p \in \operatorname{End}(\mathbb{R}^n). The equation that defines the Newton step can then be written

\begin{align*}
\langle F^{(1)}_p, \_\!\_ \rangle + \langle F^{(2)}_p v, \_\!\_ \rangle & = 0 \\
\langle F^{(1)}_p + F^{(2)}_p v, \_\!\_ \rangle & = 0 \\
F^{(1)}_p + F^{(2)}_p v & = 0
\end{align*}

using the non-degeneracy of the inner product. When the bilinear form f^{(2)}_p is positive-definite, the operator F^{(2)}_p is positive-definite too, so we can solve this equation by taking the Cholesky decomposition of F^{(2)}_p.
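As a sketch of this computation, assuming the nalgebra crate for the linear algebra (the function name and setup here are illustrative, not taken from the project's code):

```rust
// Sketch, assuming the nalgebra crate: solve the Newton step equation
// F1 + F2 v = 0 by Cholesky decomposition. Returns None when F2 isn't
// positive-definite, in which case some regularization is needed (see below).
use nalgebra::{DMatrix, DVector};

fn newton_step(f1: &DVector<f64>, f2: &DMatrix<f64>) -> Option<DVector<f64>> {
    // `cholesky` consumes its argument and fails when the matrix isn't
    // positive-definite, so we clone and propagate the failure as None
    let chol = f2.clone().cholesky()?;
    // solve F2 v = -F1
    Some(chol.solve(&(-f1)))
}
```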

If f is strongly convex, its second derivative is positive-definite everywhere, so the Newton step is always well-defined. However, non-convex loss functions show up in many interesting problems, including ours. For these problems, we need to decide how to step at a point where the second derivative is indefinite. One approach is to regularize the equation that defines the Newton step by making some explicit or implicit modification of the second derivative that turns it into a positive-definite bilinear form. We'll discuss some regularization methods below.

  • [NW] Jorge Nocedal and Stephen J. Wright. Numerical Optimization, second edition. Springer, 2006.

Uniform regularization

Given an inner product (\_\!\_, \_\!\_) on V, we can make the modified second derivative f^{(2)}_p(v, \_\!\_) + \lambda (\_\!\_, \_\!\_) positive-definite by choosing a large enough coefficient \lambda. We can say precisely what it means for \lambda to be large enough by expressing f^{(2)}_p as (\_\!\_, \tilde{F}^{(2)}_p\_\!\_) and taking the lowest eigenvalue \lambda_{\text{min}} of \tilde{F}^{(2)}_p. The modified second derivative is positive-definite when \lambda_\text{min} + \lambda is positive. We typically choose \lambda in a way that depends continuously on \tilde{F}^{(2)}_p and keeps \lambda_\text{min} + \lambda from getting too small. A simple but effective way of choosing \lambda is described in [UY, §2].
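Here's a sketch of one rule in this spirit, assuming the nalgebra crate; it shifts by just enough to keep \lambda_\text{min} + \lambda at or above a floor \delta > 0. This is an illustrative rule, not necessarily the one from [UY, §2].

```rust
// Sketch, assuming the nalgebra crate: choose lambda so that
// lambda_min + lambda is at least a floor delta > 0. An illustrative rule,
// not necessarily the one from [UY, §2].
use nalgebra::DMatrix;

fn uniform_regularization_coeff(f2_tilde: &DMatrix<f64>, delta: f64) -> f64 {
    // `symmetric_eigen` consumes its argument, so we work on a clone
    let lambda_min = f2_tilde.clone().symmetric_eigen().eigenvalues.min();
    (delta - lambda_min).max(0.0)
}
```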

Uniform regularization can be seen as interpolating between Newton's method and gradient descent. To see why, consider the regularized Newton step v defined by the equation

f^{(1)}_p(\_\!\_) + f^{(2)}_p(v, \_\!\_) + \lambda (v, \_\!\_) = 0.

Expressing f^{(1)}_p(\_\!\_) as (\tilde{F}^{(1)}_p, \_\!\_) and following the reasoning from above, we can rewrite this equation as

\begin{align*}
\tilde{F}^{(1)}_p + \big[ \tilde{F}^{(2)}_p + \lambda \big] v & = 0 \\
-\big[ \tilde{F}^{(2)}_p + \lambda \big]^{-1}\,\tilde{F}^{(1)}_p & = v.
\end{align*}

The operator \big[ \tilde{F}^{(2)}_p + \lambda \big]^{-1} stretches V along the eigenspaces of \tilde{F}^{(2)}_p, which are the principal curvature directions of f at p with respect to the inner product (\_\!\_, \_\!\_). Projectively, this stretching pulls lines in V away from the eigenspaces with higher eigenvalues, where f curves more strongly upward or less strongly downward, and towards the eigenspaces with lower eigenvalues, where f curves less strongly upward or more strongly downward. Thus, heuristically, applying \big[ \tilde{F}^{(2)}_p + \lambda \big]^{-1} to the gradient descent direction -\tilde{F}^{(1)}_p trades some of our downward velocity for downward acceleration. When we set \lambda to zero, which is allowed when \tilde{F}^{(2)}_p is already positive-definite, this trade is perfectly calibrated to point us toward the minimum of the second-order approximation, and the regularized Newton step reduces to the classic Newton step. When we let \lambda grow, the projective action of \big[ \tilde{F}^{(2)}_p + \lambda \big]^{-1} approaches the identity, and the regularized Newton step approaches \lambda^{-1} times the gradient descent step.
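Here's a sketch of the regularized step itself, again assuming the nalgebra crate and a \lambda chosen as above. Setting \lambda to zero recovers the classic Newton step, and a large \lambda gives roughly \lambda^{-1} times the gradient descent step.

```rust
// Sketch, assuming the nalgebra crate: the uniformly regularized Newton step
//   v = -[F2 + lambda I]^{-1} F1.
use nalgebra::{DMatrix, DVector};

fn regularized_newton_step(
    f1: &DVector<f64>,
    f2: &DMatrix<f64>,
    lambda: f64,
) -> Option<DVector<f64>> {
    let n = f2.nrows();
    let shifted = f2 + DMatrix::<f64>::identity(n, n) * lambda;
    // returns None only if lambda wasn't chosen large enough
    Some(shifted.cholesky()?.solve(&(-f1)))
}
```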

Positive-definite truncation

To be added

Modified Cholesky decomposition

Recall from above that once we express the first and second derivatives of f as a vector F^{(1)}_p \in \mathbb{R}^n and a matrix F^{(2)}_p \in \operatorname{End}(\mathbb{R}^n) with respect to a computational basis \mathbb{R}^n \to V, we can find the Newton step v at p by solving the equation F^{(1)}_p + F^{(2)}_p v = 0. More abstractly, if we express the first and second derivatives of f as a vector \tilde{F}^{(1)}_p \in V and an operator \tilde{F}^{(2)}_p \in \operatorname{End}(V) with respect to a chosen inner product (\_\!\_, \_\!\_) on V, as discussed above, we can find the Newton step by solving the equation \tilde{F}^{(1)}_p + \tilde{F}^{(2)}_p v = 0. When f^{(2)}_p is positive-definite, taking the Cholesky decomposition of the operator representing it provides an efficient and numerically stable solution method. Since the Newton step doesn't depend on the choice of inner product, we typically use the inner product given by the computational basis and take the Cholesky decomposition of the matrix F^{(2)}_p.

When f^{(2)}_p isn't guaranteed to be positive-definite, we can use a modified Cholesky decomposition to solve an implicitly regularized Newton step equation

\tilde{F}^{(1)}_p + \big[\tilde{F}^{(2)}_p + E\big] v = 0,

where the modification E \in \operatorname{End}(V) is symmetric with respect to (\_\!\_, \_\!\_) and balances some notion of smallness against the requirement that \tilde{F}^{(2)}_p + E be safely positive-definite [CH, §1][NW, §3.4]. The hallmark of a modified Cholesky decomposition is that it directly produces a Cholesky decomposition of a pivoted version of \tilde{F}^{(2)}_p + E, with no need to produce E as an intermediate step, and its speed is comparable to that of the ordinary Cholesky decomposition [CH, §1].
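For contrast, here's a sketch of the crudest explicit relative of these methods, the "Cholesky with added multiple of the identity" strategy from [NW, §3.4]: retry the ordinary Cholesky decomposition with a growing diagonal shift until it succeeds. Unlike a true modified Cholesky decomposition, it produces the modification E = \tau\,I explicitly and does no pivoting. The sketch assumes the nalgebra crate.

```rust
// Sketch, assuming the nalgebra crate: "Cholesky with added multiple of the
// identity" in the spirit of [NW, §3.4]. This is *not* a modified Cholesky
// decomposition; the modification E = tau * I is explicit and unpivoted.
use nalgebra::{DMatrix, DVector};

fn shifted_newton_step(f1: &DVector<f64>, f2: &DMatrix<f64>, beta: f64) -> DVector<f64> {
    let n = f2.nrows();
    // start with no shift if the diagonal is already positive
    let min_diag = f2.diagonal().min();
    let mut tau = if min_diag > 0.0 { 0.0 } else { beta - min_diag };
    loop {
        let shifted = f2 + DMatrix::<f64>::identity(n, n) * tau;
        if let Some(chol) = shifted.cholesky() {
            // solve [F2 + tau I] v = -F1
            return chol.solve(&(-f1));
        }
        // increase the shift and try again
        tau = (2.0 * tau).max(beta);
    }
}
```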

The modcholesky crate implements the modified Cholesky decompositions from [GMW], [SE90], and [SE99]. In our application, they tend to exhibit flightiness and gratuitous symmetry breaking. The latter might be caused by pivoting, which differentiates between the elements of an arbitrary orthonormal basis in a way that isn't directly related to the problem we're trying to solve.