The Singular Value Decomposition: the Swiss Army knife of data analysis applied to motion capture
The Singular Value Decomposition (SVD) is an incredibly useful tool with a staggering number of applications in seemingly unrelated fields. In this post I want to write about how the SVD is utilized to determine the orientation of a body segment from the skin markers attached to it. It's an interesting application of the SVD because it has a straightforward geometrical interpretation.
Say that we have attached 5 skin markers to the humerus and are tracking them via a motion capture (mocap) system. For each capture frame (determined by the capture frequency), the mocap system provides the 3D coordinates of each of the 5 markers. Suppose that at frame 0 we recorded the marker positions in the 3x5 matrix \(\boldsymbol{A}\), where each \(\boldsymbol{a}_i\) is a column vector representing the 3D position of the \(i\)th marker.
\[ \boldsymbol{A} = \begin{bmatrix} \boldsymbol{a_1} & \boldsymbol{a_2} & \boldsymbol{a_3} & \boldsymbol{a_4} & \boldsymbol{a_5} \end{bmatrix} \]
Suppose that at some later point in time, say frame 10, we measured the marker positions and stored them in the 3x5 matrix, \(\boldsymbol{B}\):
\[ \boldsymbol{B} = \begin{bmatrix} \boldsymbol{b_1} & \boldsymbol{b_2} & \boldsymbol{b_3} & \boldsymbol{b_4} & \boldsymbol{b_5} \end{bmatrix} \]
What we wish to find is a 3x1 translation vector, \(\boldsymbol{t}\), and a 3x3 rotation matrix, \(\boldsymbol{R}\), such that:
\[ \boldsymbol{B} = \boldsymbol{R} \cdot \boldsymbol{A} + \boldsymbol{t} \]
Here \(\boldsymbol{t}\) is understood to be added to each column of \(\boldsymbol{R} \cdot \boldsymbol{A}\).
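To make the later steps concrete, here is a minimal NumPy sketch that fabricates a marker set obeying this model. The marker coordinates, rotation angle, translation, and noise level are all arbitrary choices of mine, not values from any real capture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Marker positions at frame 0: a 3x5 matrix, one column per marker,
# spread over roughly 20 cm (plausible for a humerus cluster).
A = rng.uniform(-0.1, 0.1, size=(3, 5))

# A ground-truth rotation (30 degrees about the z-axis) and translation.
theta = np.deg2rad(30.0)
R_true = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta),  np.cos(theta), 0.0],
    [0.0,            0.0,           1.0],
])
t_true = np.array([[0.5], [0.2], [-0.1]])

# Marker positions at frame 10: the model above plus ~0.1 mm of
# measurement noise. t_true broadcasts across the 5 columns.
B = R_true @ A + t_true + rng.normal(scale=1e-4, size=(3, 5))
```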
In practice, an exact solution to the above equation does not exist. The markers slide around on the skin, and the measurement system introduces noise - making it impossible to find a proper orthogonal matrix \(\boldsymbol{R}\) that satisfies the equation exactly. What we can find are \(\boldsymbol{R}\) and \(\boldsymbol{t}\) that minimize the error below in a least-squares sense:
\[ \boldsymbol{e} = \boldsymbol{B} - (\boldsymbol{R} \cdot \boldsymbol{A} + \boldsymbol{t}) \]
I will skip the mathematical derivation of the \(\boldsymbol{R}\) and \(\boldsymbol{t}\) that minimize this error and only present the results. I will follow the solution of Umeyama because it explicitly utilizes the SVD. For clarity, the solution below omits the correction needed when the SVD yields an improper orthogonal matrix (a roto-reflection). I actually prefer the mathematical derivation of Horn et al., but that paper utilizes the SVD only implicitly. Regardless, to find the optimal \(\boldsymbol{R}\) that minimizes the error \(\boldsymbol{e}\) above, we first normalize the matrices \(\boldsymbol{A}\) and \(\boldsymbol{B}\) using the centroid of the marker cluster at each time point. To normalize \(\boldsymbol{A}\), first find the centroid of the marker cluster:
\[ \boldsymbol{\hat{a}} = (\sum_{i=1}^{5} \boldsymbol{a}_i)/5 \]
Then the normalized \(\boldsymbol{A}\) (and similarly \(\boldsymbol{B}\)) is defined as:
\[ \hat{\boldsymbol{A}} = \begin{bmatrix} \boldsymbol{a_1} - \boldsymbol{\hat{a}} & \boldsymbol{a_2} - \boldsymbol{\hat{a}} & \boldsymbol{a_3} - \boldsymbol{\hat{a}} & \boldsymbol{a_4} - \boldsymbol{\hat{a}} & \boldsymbol{a_5} - \boldsymbol{\hat{a}} \end{bmatrix} \]
\[ \hat{\boldsymbol{B}} = \begin{bmatrix} \boldsymbol{b_1} - \boldsymbol{\hat{b}} & \boldsymbol{b_2} - \boldsymbol{\hat{b}} & \boldsymbol{b_3} - \boldsymbol{\hat{b}} & \boldsymbol{b_4} - \boldsymbol{\hat{b}} & \boldsymbol{b_5} - \boldsymbol{\hat{b}} \end{bmatrix} \]
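Continuing the NumPy sketch, the centering amounts to subtracting the column mean from each matrix:

```python
# Centroids of each marker cluster, kept as 3x1 column vectors.
a_hat = A.mean(axis=1, keepdims=True)
b_hat = B.mean(axis=1, keepdims=True)

# Centered matrices: each column points from the centroid to a marker.
A_centered = A - a_hat
B_centered = B - b_hat
```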
The columns of \(\boldsymbol{\hat{A}}\) contain vectors that point from the centroid of the marker cluster to each of the markers at frame 0, and the columns of \(\boldsymbol{\hat{B}}\) contain the corresponding vectors at frame 10. Now compute the SVD as follows:
\[ \boldsymbol{\hat{B}} \cdot \boldsymbol{\hat{A}}^T = \boldsymbol{U} \cdot \boldsymbol{S} \cdot \boldsymbol{V}^T \]
The optimal \(\boldsymbol{R}\) can be found as follows:
\[ \boldsymbol{R} = \boldsymbol{U} \cdot \boldsymbol{V}^T \]
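In code, the estimation collapses to a few NumPy calls. Continuing the sketch (and, like the text, skipping the roto-reflection correction), the optimal translation then follows from the centroids once \(\boldsymbol{R}\) is known:

```python
# SVD of the 3x3 matrix formed from the centered marker matrices.
U, S, Vt = np.linalg.svd(B_centered @ A_centered.T)

# Optimal rotation. A full implementation would flip the sign of the
# last column of U when det(U @ Vt) < 0; that case is omitted here.
R = U @ Vt

# Optimal translation: the part of the motion that the rotation
# does not explain, recovered from the two centroids.
t = b_hat - R @ a_hat

print(np.allclose(R, R_true, atol=1e-2))  # True at this noise level
```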
I now want to provide a geometrical argument for why this should be the case by using some useful properties of the SVD. I will begin by rearranging the SVD, using the fact that \(\boldsymbol{U}\) and \(\boldsymbol{V}\) are orthogonal:
\[ \boldsymbol{U}^T \cdot \boldsymbol{\hat{B}} \cdot \boldsymbol{\hat{A}}^T \cdot \boldsymbol{V} = \boldsymbol{S} \]
Let \(\boldsymbol{u}_1\) be the first column of \(\boldsymbol{U}\) (first row of \(\boldsymbol{U}^T)\). Likewise, let \(\boldsymbol{v}_1\) be the first column of \(\boldsymbol{V}\). Let \(\sigma_1\) be the first singular value - the entry in the first row and column of \(\boldsymbol{S}\). Then, as shown on page 5 of Tomasi:
\[ \max_{\lVert \boldsymbol{x} \rVert = \lVert \boldsymbol{y} \rVert=1} \boldsymbol{y}^T \cdot \boldsymbol{\hat{B}} \cdot \boldsymbol{\hat{A}}^T \cdot \boldsymbol{x} = \boldsymbol{u}_1^T \cdot \boldsymbol{\hat{B}} \cdot \boldsymbol{\hat{A}}^T \cdot \boldsymbol{v}_1 = \sigma_1 \]
Now let's break this down. \(\boldsymbol{y}^T \cdot \boldsymbol{\hat{B}}\) results in a row vector. Each of the 5 components of this row vector is the projection of a marker position at frame 10 (measured from the centroid) onto the vector \(\boldsymbol{y}\). Likewise, \(\boldsymbol{\hat{A}}^T \cdot \boldsymbol{x}\) results in a column vector. Each of the 5 components of this column vector is the projection of a marker position at frame 0 (measured from the centroid) onto the vector \(\boldsymbol{x}\). The dot product of the \(\boldsymbol{y}^T \cdot \boldsymbol{\hat{B}}\) row vector and the \(\boldsymbol{\hat{A}}^T \cdot \boldsymbol{x}\) column vector multiplies the projection of each marker at frame 0 by the projection of that same marker at frame 10 and then sums over the markers. We get a single scalar value from this computation. As we vary \(\boldsymbol{x}\) and \(\boldsymbol{y}\), this value also varies. The equation above says that this value reaches its maximum when we let \(\boldsymbol{y} = \boldsymbol{u}_1\) and \(\boldsymbol{x} = \boldsymbol{v}_1\), and that maximum is \(\sigma_1\).
In other words, the SVD is finding a line running through the marker cluster at frame 0, and a line running through the marker cluster at frame 10, such that when we project the markers onto these lines, the sum of the products of corresponding projections is maximized. Note that this is NOT the same as finding a best-fit line through the marker cluster at frame 0 and a best-fit line through the marker cluster at frame 10.
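This maximization property is easy to verify numerically with the matrices from the running sketch:

```python
M = B_centered @ A_centered.T

# The maximum is attained at y = u_1, x = v_1, and equals sigma_1.
u1, v1 = U[:, 0], Vt[0, :]
print(np.isclose(u1 @ M @ v1, S[0]))  # True

# Random unit vectors never do better than sigma_1.
y = rng.normal(size=3); y /= np.linalg.norm(y)
x = rng.normal(size=3); x /= np.linalg.norm(x)
print(y @ M @ x <= S[0])  # True
```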
However, \(\boldsymbol{U}\) and \(\boldsymbol{V}\) each contain 3 singular vectors - what about the rest? As Guruswami points out on page 4, the same argument extends to the remaining singular vectors: the SVD is finding the best-fit 3-dimensional subspaces that maximize the projections of the marker cluster at frame 0 and frame 10 simultaneously.
Since \(\boldsymbol{V}\) corresponds to the best-fit 3-dimensional subspace at frame 0 and \(\boldsymbol{U}\) corresponds to the best-fit 3-dimensional subspace at frame 10, the rotation we seek is the one that maps the basis \(\boldsymbol{V}\) onto the basis \(\boldsymbol{U}\). Because \(\boldsymbol{V}\) is orthogonal, \(\boldsymbol{V}^{-1} = \boldsymbol{V}^T\), and:
\[ \begin{split} \boldsymbol{R} \cdot \boldsymbol{V} &= \boldsymbol{U} \\ \boldsymbol{R} &= \boldsymbol{U} \cdot \boldsymbol{V}^{-1} = \boldsymbol{U} \cdot \boldsymbol{V}^T \end{split} \]
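For reference, here is the whole procedure collected into a single function. This is a sketch under the same assumptions as above (no roto-reflection handling), and the name `rigid_transform` is mine, not from any library:

```python
def rigid_transform(A, B):
    """Least-squares rigid motion mapping marker set A onto B.

    A, B: 3xN matrices of corresponding marker positions.
    Returns (R, t) such that B is approximately R @ A + t.
    """
    a_hat = A.mean(axis=1, keepdims=True)
    b_hat = B.mean(axis=1, keepdims=True)
    U, _, Vt = np.linalg.svd((B - b_hat) @ (A - a_hat).T)
    R = U @ Vt
    return R, b_hat - R @ a_hat
```

Calling `rigid_transform(A, B)` on the synthetic markers recovers `R_true` and `t_true` up to the noise level.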