Linear representation of elements and constraints
Elements as vectors
Take a 5d vector space $V$ with a bilinear form $(\_\!\_, \_\!\_)$ of signature $++++-$, which we'll call the Lorentz form. In inversive coordinates, points and generalized spheres are represented, respectively, by timelike and spacelike vectors in $V$. If we normalize these vectors to pseudo-length $\pm 1$, and choose a vector on the lightlike 1d subspace representing the point at infinity, a lot of the constraints we care about can be expressed by fixing the Lorentz products between vectors.
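
For example, under the usual sign conventions for inversive coordinates (an assumption on our part; this write-up doesn't pin them down), a single Lorentz product between two unit spacelike vectors $a$ and $b$ representing spheres already captures familiar relations:

$$\begin{align*}
(a, b) & = 0 && \text{the spheres intersect orthogonally,} \\
|(a, b)| & = 1 && \text{the spheres are tangent.}
\end{align*}$$

Fixing that single Gram entry therefore encodes the corresponding constraint.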
Constraints as Gram matrix entries
The vectors $a_1, \ldots, a_n \in V$ representing the elements of our assembly can be encoded in the linear map $A \colon \mathbb{R}^n \to V$ with $Ae_j = a_j$ for each of the standard basis vectors $e_1, \ldots, e_n$. We can then express constraints by fixing elements of the Gram matrix $G = A^\dagger A$, where $\dagger$ is the adjoint with respect to the inner product $\langle\_\!\_, \_\!\_\rangle$ on $\mathbb{R}^n$ and the Lorentz form. Since the inner product is non-degenerate, the rank of $G$ matches the dimension of the span of $a_1, \ldots, a_n$ whenever the Lorentz form restricts non-degenerately to that span.
The symmetric bilinear form $\langle\_\!\_, G\_\!\_\rangle$ is the pullback of the Lorentz form along $A$:

$$\begin{align*}
\langle\_\!\_, G\_\!\_\rangle & = \langle\_\!\_, A^\dagger A\_\!\_\rangle \\
& = (A\_\!\_, A\_\!\_).
\end{align*}$$
Coordinate computations
When we choose a basis for $V$, represented as an identification $V \cong \mathbb{R}^5$, the Lorentz form gets encoded in the matrix $Q \colon \mathbb{R}^5 \to \mathbb{R}^5$ defined by the relation $(\_\!\_, \_\!\_) = \langle\_\!\_, Q\_\!\_\rangle$. Using the fact that $A^\dagger = A^\top Q$, where $^\top$ is the adjoint with respect to the inner product $\langle\_\!\_, \_\!\_\rangle$ on both $\mathbb{R}^n$ and $V \cong \mathbb{R}^5$, we can express the Gram matrix $G = A^\top Q A$ in terms of positive-signature matrix operations. This expression is convenient for programming, because linear algebra packages tend to be written for positive signature.
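
As a concrete illustration, here is a minimal sketch of this computation (assuming the nalgebra crate and a basis in which the Lorentz form is $\operatorname{diag}(1, 1, 1, 1, -1)$; the actual engine may store $Q$ differently):

```rust
use nalgebra::{DMatrix, DVector};

// Lorentz form in a basis where it is diagonal with signature ++++-.
// (Assumption: the chosen basis diagonalizes the form.)
fn lorentz_form() -> DMatrix<f64> {
    DMatrix::from_diagonal(&DVector::from_vec(vec![1.0, 1.0, 1.0, 1.0, -1.0]))
}

// Gram matrix G = AᵀQA of an assembly whose elements are the columns of A.
fn gram_matrix(a: &DMatrix<f64>) -> DMatrix<f64> {
    let q = lorentz_form();
    a.transpose() * q * a
}
```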
Reconstructing an assembly from its Gram matrix
Reconstruction as low-rank factorization
Overview
We saw above that when we represent assemblies as linear maps $A \colon \mathbb{R}^n \to V$, many constraints can be expressed by fixing entries of the Gram matrix $A^\dagger A$, which can be written more conveniently as $A^\top Q A$ when we choose an identification $V \cong \mathbb{R}^5$. This is an example of a low-rank factorization problem. In this section, we'll express the problem in a way that's useful for numerical computations.

We'll find a non-negative function $f$ on the space of linear maps $\mathbb{R}^n \to V$ which vanishes at the matrices representing assemblies that satisfy the constraints. Finding a global minimum of the loss function $f$ will be equivalent to finding an assembly that satisfies the constraints. The loss function will be a quadratic polynomial in matrix entries, and its first and second derivatives will be straightforward to compute using a typical linear algebra package. This makes it easy to search for minima using Newton's method with backtracking.
The Frobenius product
For any dimensions $n$ and $m$, the inner products on $\mathbb{R}^n$ and $\mathbb{R}^m$ induce an inner product $\langle\!\langle X, Y \rangle\!\rangle = \operatorname{tr}(X^\top Y)$ on the space of matrices $\mathbb{R}^n \to \mathbb{R}^m$. We'll call this inner product the Frobenius product, because the associated norm, defined by $\|X\|^2 = \langle\!\langle X, X \rangle\!\rangle$, is called the Frobenius norm. The entry-wise dot product of $X$ and $Y$ gives $\langle\!\langle X, Y \rangle\!\rangle$, so the sum of the squares of the entries of $X$ gives $\|X\|^2$. More generally, in any orthonormal basis $v_1, \ldots, v_n$ for $\mathbb{R}^n$ we have

$$\langle\!\langle X, Y \rangle\!\rangle = \sum_{k \in [n]} \langle X v_k, Y v_k \rangle.$$
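
As a quick sanity check on the entry-wise description, here is a sketch (assuming nalgebra) in which both expressions compute the same number:

```rust
use nalgebra::DMatrix;

// The Frobenius product ⟨⟨X, Y⟩⟩ = tr(XᵀY) agrees with the entry-wise dot product.
fn frobenius_product(x: &DMatrix<f64>, y: &DMatrix<f64>) -> f64 {
    (x.transpose() * y).trace()
}

fn main() {
    let x = DMatrix::from_row_slice(2, 3, &[1.0, 2.0, 3.0, 4.0, 5.0, 6.0]);
    let y = DMatrix::from_row_slice(2, 3, &[6.0, 5.0, 4.0, 3.0, 2.0, 1.0]);
    assert!((frobenius_product(&x, &y) - x.dot(&y)).abs() < 1e-12);
}
```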
The loss function
Consider the matrices $\mathbb{R}^n \to \mathbb{R}^n$ whose entries vanish at the indices where the Gram matrix is unconstrained. They form a subspace $C$ of the matrix space $\operatorname{End}(\mathbb{R}^n)$. Let $\mathcal{P} \colon \operatorname{End}(\mathbb{R}^n) \to C$ be the orthogonal projection with respect to the Frobenius product. The constrained entries of the Gram matrix can be expressed uniquely as a matrix $G \in C$, and the linear maps $A \colon \mathbb{R}^n \to V$ that satisfy the constraints form the zero set of the non-negative function

$$f = \|G - \mathcal{P}(A^\top Q A)\|^2$$

on $\operatorname{Hom}(\mathbb{R}^n, V)$. Finding a global minimum of the loss function $f$ is thus equivalent to finding an assembly that satisfies the constraints.
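
Here is a minimal sketch of the loss (assuming nalgebra; the constraint pattern is stored in a hypothetical boolean mask, which is just one way the subspace $C$ could be represented):

```rust
use nalgebra::DMatrix;

// Orthogonal projection onto C: keep the constrained entries, zero the rest.
// `constrained` is a hypothetical n × n mask, true at the constrained indices.
fn proj(constrained: &[Vec<bool>], x: &DMatrix<f64>) -> DMatrix<f64> {
    DMatrix::from_fn(x.nrows(), x.ncols(), |i, j| {
        if constrained[i][j] { x[(i, j)] } else { 0.0 }
    })
}

// Loss f = ‖G − P(AᵀQA)‖², where G holds the constrained Gram entries.
fn loss(
    constrained: &[Vec<bool>],
    g: &DMatrix<f64>,
    q: &DMatrix<f64>,
    a: &DMatrix<f64>,
) -> f64 {
    let delta = g - proj(constrained, &(a.transpose() * q * a));
    delta.norm_squared()
}
```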
The first derivative of the loss function
Write the loss function as

$$\begin{align*}
f & = \|\Delta\|^2 \\
& = \operatorname{tr}(\Delta^\top \Delta),
\end{align*}$$

where $\Delta = G - \mathcal{P}(A^\top Q A)$. Differentiate both sides and simplify the result using the transpose-invariance of the trace:

$$\begin{align*}
df & = \operatorname{tr}(d\Delta^\top \Delta) + \operatorname{tr}(\Delta^\top d\Delta) \\
& = 2\operatorname{tr}(\Delta^\top d\Delta).
\end{align*}$$
To compute $d\Delta$, it will be helpful to write the projection operator $\mathcal{P}$ more explicitly. Let $\mathcal{C}$ be the set of indices where the Gram matrix is constrained. We can express $C$ as the span of the standard basis matrices $\{E_{ij}\}_{(i, j) \in \mathcal{C}}$. Observing that $E_{ij} X^\top E_{ij} = X_{ij} E_{ij}$ for any matrix $X$, we can do orthogonal projection onto $C$ using standard basis matrices:

$$\mathcal{P}(X) = \sum_{(i, j) \in \mathcal{C}} E_{ij} X^\top E_{ij}.$$
It follows that

$$\begin{align*}
d\mathcal{P}(X) & = \sum_{(i, j) \in \mathcal{C}} E_{ij}\,dX^\top E_{ij} \\
& = \mathcal{P}(dX).
\end{align*}$$
Since the subspace $C$ is transpose-invariant (the constrained indices come in transpose pairs, because the Gram matrix is symmetric), we also have

$$\mathcal{P}(X^\top) = \mathcal{P}(X)^\top.$$
We can now see that

$$\begin{align*}
d\Delta & = -\mathcal{P}(dA^\top Q A + A^\top Q\,dA) \\
& = -\big[\mathcal{P}(A^\top Q\,dA)^\top + \mathcal{P}(A^\top Q\,dA)\big].
\end{align*}$$
Plugging this into our formula for $df$, and recalling that $\Delta$ is symmetric, we get

$$\begin{align*}
df & = 2\operatorname{tr}\big({-\Delta^\top} \big[\mathcal{P}(A^\top Q\,dA)^\top + \mathcal{P}(A^\top Q\,dA)\big]\big) \\
& = -2\operatorname{tr}\left(\Delta\,\mathcal{P}(A^\top Q\,dA)^\top + \Delta^\top \mathcal{P}(A^\top Q\,dA)\right) \\
& = -2\left[\operatorname{tr}\big(\Delta\,\mathcal{P}(A^\top Q\,dA)^\top\big) + \operatorname{tr}\big(\Delta^\top \mathcal{P}(A^\top Q\,dA)\big)\right] \\
& = -4 \operatorname{tr}\big(\Delta^\top \mathcal{P}(A^\top Q\,dA)\big),
\end{align*}$$

using the transpose-invariance and cyclic property of the trace in the final step. Writing the projection in terms of standard basis matrices, we learn that

$$\begin{align*}
df & = -4 \operatorname{tr}\left(\Delta^\top \left[ \sum_{(i, j) \in \mathcal{C}} E_{ij} (A^\top Q\,dA)^\top E_{ij} \right] \right) \\
& = -4 \operatorname{tr}\left(\sum_{(i, j) \in \mathcal{C}} \Delta^\top E_{ij}\,dA^\top Q A\,E_{ij}\right) \\
& = -4 \operatorname{tr}\left(dA^\top Q A \left[ \sum_{(i, j) \in \mathcal{C}} E_{ij}\,\Delta^\top E_{ij} \right]\right) \\
& = -4 \operatorname{tr}\big(dA^\top Q A\,\mathcal{P}(\Delta)\big) \\
& = \langle\!\langle dA,\, -4 Q A\,\mathcal{P}(\Delta) \rangle\!\rangle.
\end{align*}$$
From here, we get a nice matrix expression for the negative gradient of the loss function:

$$-\operatorname{grad}(f) = 4 Q A\,\mathcal{P}(\Delta).$$

This matrix is stored as neg_grad in the Rust and Julia implementations of the realization routine.
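
Continuing the sketch of the loss function (same assumptions, reusing the hypothetical proj helper; this is an illustration, not the engine's actual code), the negative gradient is a one-line addition:

```rust
use nalgebra::DMatrix;

// Negative gradient -grad(f) = 4 Q A P(Δ), with Δ = G − P(AᵀQA).
// Reuses the hypothetical `proj` helper from the loss sketch above.
fn neg_grad(
    constrained: &[Vec<bool>],
    g: &DMatrix<f64>,
    q: &DMatrix<f64>,
    a: &DMatrix<f64>,
) -> DMatrix<f64> {
    let delta = g - proj(constrained, &(a.transpose() * q * a));
    q * a * proj(constrained, &delta) * 4.0
}
```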
The second derivative of the loss function
Recalling that

$$-d\Delta = \mathcal{P}(dA^\top Q A + A^\top Q\,dA),$$

we can express the derivative of $\operatorname{grad}(f)$ as

$$\begin{align*}
d\operatorname{grad}(f) & = -4 Q\,dA\,\mathcal{P}(\Delta) - 4 Q A\,\mathcal{P}(d\Delta) \\
& = 4 Q\big[{-dA}\,\mathcal{P}(\Delta) + A\,\mathcal{P}(-d\Delta)\big].
\end{align*}$$
In the Rust and Julia implementations of the realization routine, we express $d\operatorname{grad}(f)$ as a matrix in the standard basis for $\operatorname{Hom}(\mathbb{R}^n, V)$. We apply the derivative $d\operatorname{grad}(f)$ to each standard basis matrix $E_{ij}$ by setting the value of the matrix-valued 1-form $dA$ to $E_{ij}$.
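
Here is a sketch of how that assembly could look (same assumptions and hypothetical helpers as before; the column-major flattening of $E_{ij}$ is an arbitrary choice made for illustration):

```rust
use nalgebra::{DMatrix, DVector};

// Assemble H(f) = d grad(f) as a (5n) × (5n) matrix by applying d grad(f) to
// each standard basis matrix E_ij of Hom(ℝⁿ, V). Reuses the hypothetical
// `proj` helper from the loss sketch; delta = G − P(AᵀQA).
fn hessian(
    constrained: &[Vec<bool>],
    q: &DMatrix<f64>,
    a: &DMatrix<f64>,
    delta: &DMatrix<f64>,
) -> DMatrix<f64> {
    let (dim, n) = (a.nrows(), a.ncols()); // dim = 5
    let mut h = DMatrix::<f64>::zeros(dim * n, dim * n);
    for col in 0..dim * n {
        // dA = E_ij, with entries flattened column-major: col = j*dim + i.
        let mut da = DMatrix::<f64>::zeros(dim, n);
        da[(col % dim, col / dim)] = 1.0;
        // -dΔ = P(dAᵀ Q A + Aᵀ Q dA)
        let neg_d_delta = proj(
            constrained,
            &(da.transpose() * q * a + a.transpose() * q * &da),
        );
        // d grad(f) = 4 Q [ -dA P(Δ) + A P(-dΔ) ]
        let d_grad = q * (a * neg_d_delta - &da * proj(constrained, delta)) * 4.0;
        h.set_column(col, &DVector::from_column_slice(d_grad.as_slice()));
    }
    h
}
```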
Finding minima
We minimize the loss function using a cheap imitation of Ueda and Yamashita's regularized Newton's method with backtracking.
- Kenji Ueda and Nobuo Yamashita. "Convergence Properties of the Regularized Newton Method for the Unconstrained Nonconvex Optimization," 2009.
The minimization routine is implemented in engine.rs. (In the old Julia prototype of the engine, it's in Engine.jl.) It works like this, with a simplified code sketch of the whole loop after the list.
- Do Newton steps, as described below, until the loss gets tolerably close to zero. Fail out if we reach the maximum allowed number of descent steps.
  - Find $-\operatorname{grad}(f)$, as described in "The first derivative of the loss function."
  - Find the Hessian $H(f) := d\operatorname{grad}(f)$, as described in "The second derivative of the loss function."
    - Recall that we express $H(f)$ as a matrix in the standard basis for $\operatorname{Hom}(\mathbb{R}^n, V)$.
  - If the Hessian isn't positive-definite, make it positive definite by adding $-c \lambda_\text{min}$ times the identity, where $\lambda_\text{min}$ is its lowest eigenvalue and $c > 1$ is a parameter of the minimization routine. In other words, find the regularized Hessian $$H_\text{reg}(f) := H(f) + \begin{cases} 0 & \lambda_\text{min} > 0 \\ -c \lambda_\text{min} I & \lambda_\text{min} \le 0. \end{cases}$$
    - The parameter $c$ is passed to realize_gram as the argument reg_scale.
    - When $\lambda_\text{min}$ is exactly zero, our regularization doesn't do anything, so $H_\text{reg}(f)$ isn't actually positive-definite. Ueda and Yamashita add an extra regularization term that's proportional to a power of $\|\operatorname{grad}(f)\|$, which takes care of this problem.
  - Project the negative gradient and the regularized Hessian onto the orthogonal complement of the frozen subspace of $\operatorname{Hom}(\mathbb{R}^n, V)$.
    - For this write-up, we'll write the projection as $\mathcal{Q}$.
  - Find the base step $s_\text{base} \in \operatorname{Hom}(\mathbb{R}^n, V)$, which is defined by two properties: satisfying the equation $-\mathcal{Q} \operatorname{grad}(f) = H_\text{reg}(f)\,s_\text{base}$ and being orthogonal to the frozen subspace.
    - When we say in the code that we're "projecting" the regularized Hessian, we're really turning it into an operator that can be used to express both properties.
  - Backtrack by reducing the step size, as described below, until we find a step that reduces the loss at a good fraction of the maximum possible rate. Fail out if we reach the maximum allowed number of backtracking steps.
    - Find the change in loss that we would get from the step $s$ under consideration. At the beginning of the loop, $s$ is set to $s_\text{base}$.
      - The definition of the derivative tells us that by making $s$ small enough, we should be able to bring the change in loss as close as we want to $\langle -\operatorname{grad}(f), s \rangle$.
    - If the change in loss is more negative than $\alpha \langle -\operatorname{grad}(f), s \rangle$, where $\alpha \in (0, 1)$ is a parameter of the minimization routine, we're done: take the step $s$.
      - The parameter $\alpha$ is passed to realize_gram as the argument min_efficiency.
    - Otherwise, multiply the step by the back-off parameter $\beta \in (0, 1)$.
      - This parameter is passed to realize_gram as the argument backoff.
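
Putting the pieces together, here is a heavily simplified sketch of the descent loop (assuming nalgebra and reusing the hypothetical proj, loss, and hessian helpers from the earlier sketches; it leaves out the frozen-subspace projection $\mathcal{Q}$ and uses a plain Armijo-style acceptance test, so it is not a faithful transcription of realize_gram):

```rust
use nalgebra::{DMatrix, DVector};

// Regularized Newton steps with backtracking. `reg_scale`, `min_efficiency`,
// and `backoff` play the roles of c, α, and β from the write-up.
fn realize_sketch(
    constrained: &[Vec<bool>],
    g: &DMatrix<f64>,
    q: &DMatrix<f64>,
    mut a: DMatrix<f64>,
    reg_scale: f64,      // c > 1
    min_efficiency: f64, // α ∈ (0, 1)
    backoff: f64,        // β ∈ (0, 1)
    tol: f64,
    max_descent_steps: usize,
    max_backoff_steps: usize,
) -> Option<DMatrix<f64>> {
    let (dim, n) = (a.nrows(), a.ncols());
    for _ in 0..max_descent_steps {
        let base_loss = loss(constrained, g, q, &a);
        if base_loss < tol {
            return Some(a);
        }
        let delta = g - proj(constrained, &(a.transpose() * q * &a));
        let neg_grad = q * &a * proj(constrained, &delta) * 4.0;
        let mut h = hessian(constrained, q, &a, &delta);
        // Regularize: if the lowest eigenvalue λ_min is ≤ 0, add -c·λ_min on the diagonal.
        let min_eig = h.clone().symmetric_eigen().eigenvalues.min();
        if min_eig <= 0.0 {
            h += DMatrix::identity(dim * n, dim * n) * (-reg_scale * min_eig);
        }
        // Base step: solve H_reg(f) s = -grad(f), with entries flattened column-major.
        let rhs = DVector::from_column_slice(neg_grad.as_slice());
        let s_flat = h.lu().solve(&rhs)?;
        let mut s = DMatrix::from_column_slice(dim, n, s_flat.as_slice());
        // Backtrack until the loss drops at an acceptable fraction of the predicted rate.
        let mut ok = false;
        for _ in 0..max_backoff_steps {
            let predicted = neg_grad.dot(&s); // ⟨-grad(f), s⟩
            let change = loss(constrained, g, q, &(&a + &s)) - base_loss;
            if change <= -min_efficiency * predicted {
                ok = true;
                break;
            }
            s *= backoff;
        }
        if !ok {
            return None;
        }
        a += s;
    }
    None
}
```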
Reconstructing a rigid subassembly
Suppose we can find a set of vectors $\{a_k\}_{k \in K}$ whose Lorentz products are all known. Restricting the Gram matrix to $\mathbb{R}^K$ and projecting its output orthogonally onto $\mathbb{R}^K$ gives a submatrix $G_K \colon \mathbb{R}^K \to \mathbb{R}^K$ whose entries are all known. Suppose in addition that the set $\{a_k\}_{k \in K}$ spans $V$, implying that $G_K$ has rank five.
Write $G_K$ as $L_K U_K$, where the matrix $U_K$ is a row echelon form of $G_K$, and the matrix $L_K$ encodes the sequence of row operations that restores $G_K$. Since $G_K$ has rank five, the first five rows of $U_K$ are non-zero, and the rest are zero. Thus, the first five columns of $L_K$ are non-zero, and the rest are zero. Since $G_K$ is symmetric, we should have $L_K = U_K^\top D_K$, where $D_K$ is a diagonal matrix that has one negative entry and four positive entries at the pivot positions on the diagonal.
Let $A_K \colon \mathbb{R}^K \to V$ be the restriction of $A$. The first five rows of $U_K$ form the matrix of $A_K$ in some orthogonal basis for $V$. The matrix $D_K$ describes the Lorentz form in this basis. By looking at $D_K$, it should be possible to rewrite $U_K$ in our chosen basis for $V$.
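
To check that this recipe is consistent, write $\bar{U}_K$ for the first five rows of $U_K$ and $\bar{D}_K$ for the corresponding $5 \times 5$ diagonal block of $D_K$ (notation introduced only for this aside). Since the remaining rows of $U_K$ vanish,

$$G_K = L_K U_K = U_K^\top D_K U_K = \bar{U}_K^\top \bar{D}_K \bar{U}_K,$$

which says exactly that the columns of $\bar{U}_K$, read as vectors in a basis where the Lorentz form is $\bar{D}_K$, have Gram matrix $G_K$.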
Consistency constraints on the Gram matrix
Since $G$ has rank five, every six-by-six minor is zero. Some of the entries of $G$ are known, because they're specified by constraints. Let's treat some of the unknown entries as independent variables, and the rest as dependent variables. Whenever we find a six-by-six submatrix where one entry is dependent and the rest are known or independent, knowing that the corresponding minor is zero gives us a quadratic equation for the dependent variable. Treating that dependent variable as "1-knowable" may provide further six-by-six submatrices where one entry is dependent and the rest are known, independent, or 1-knowable. The resulting quadratic equation for the dependent variable makes that variable "2-knowable", and the process continues.
For $G$ to be realizable as a Gram matrix, it's necessary for each dependent variable to be real. Whenever we have a quadratic equation for a dependent variable, the discriminant of the equation gives us a consistency constraint inequality. The consistency constraint for an $n$-knowable variable gets more complicated as $n$ increases.
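
To spell out where the quadratic comes from: if the dependent entry $x$ occupies the symmetric pair of positions $(i, j)$ and $(j, i)$ in a six-by-six symmetric submatrix $M(x)$ whose other entries are known or independent, then expanding the determinant gives

$$\det M(x) = a x^2 + b x + c = 0,$$

where $a$ is, up to sign, the four-by-four minor obtained by deleting rows and columns $i$ and $j$, and $b$ and $c$ are polynomial in the other entries. Assuming $a \neq 0$, the dependent variable is real exactly when the discriminant satisfies $b^2 - 4 a c \ge 0$, and that inequality is the consistency constraint.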