Performance of common biomechanics linear algebra operations in Numpy
My last two Ph.D. projects will involve quantifying soft tissue artefact and analyzing axial rotation of the humerus using a combination of skin-marker motion capture and bi-plane fluoroscopy (on human subjects). Both projects will require a considerable amount of linear algebra, mainly coordinate system transformations.
Before building more complex logic into my codebase, I wanted to compare the performance of different methods of computing common biomechanics linear algebra operations in Numpy. As is common in biomechanics and robotics, I use a 4x4 homogeneous matrix to represent a coordinate system or pose, and correspondingly 3D vectors are upgraded to homogeneous coordinates. I knew that numpy.einsum
could accommodate all linear algebra operations I am interested in performing, but I was curious to compare its performance against numpy.matmul
for operations that could be performed just by matrix multiplication. Thanks to Numpy's broadcasting algorithm a considerable number of operation can be performed using numpy.matmul
.
These are the operations that I decided to test:
- Transform the position of a marker for each of the n frames of a trial from one coordinate system to another. This is analogous to multiplying a collection of n vectors (stored in a nx4 matrix) by a matrix.
- Transform the pose of the humerus for each of the n frames of a trial from one coordinate system to another. This is analogous to multiplying a collection of n matrices (stored in a nx4x4 matrix) by a matrix.
- Express the pose of the humerus with respect to the pose of the torso for each of the n frames of a trial. This is analogous to multiplying a collection of n matrices by a collection of n matrices (stored in a nx4x4 matrix).
- Project the angular velocity of the humerus onto its long axis for each of the n frames of a trial. This is analogous to performing the dot product between a collection of n vectors and another collection of n vectors (stored in a nx4 matrix).
- Determine the angular velocity magnitude for each of the n frames of a trial. This is analogous to finding the L2-norm (Euclidean norm) of a collection of n vectors (stored in a nx4 matrix).
I set n to 1000 for all test - this is arbitrary but at 100 Hz this is equivalent to a 10 second capture. Each operation was implemented using at least two methods. For each operation, 1000 matrices and vectors with the appropriate dimensions and sizes were pre-instantiated using numpy.random.rand
. For each method within an operation, identical matrices and vectors were utilized. Each method was called 1000 times using the pre-instantiated matrices and vectors , the total run time measured using time.time()
, from which average computation time was calculated. And here are the results for average computation time. Finally, even though operation #4 and #5 make more sense with non-homogeneous coordinates (3D), I opted to keep these as homogeneous coordinate for the sake of consistency.
1. Multiply a collection of 1000 vectors (stored in a 1000x4 matrix) by a matrix.
# transform all vectors in vecs into frame mat
# assume that mat is 4x4 and vecs is nx4, and the output should be nx4
# implemented using matrix multiplication
def cs_transform_vecs_one_mat_matmul(mat, vecs):
# (mat @ vecs.T).T = vecs @ mat.T
return vecs @ mat.T
# implemented using einsum
def cs_transform_vecs_one_mat_einsum(mat, vecs):
return np.einsum('ij,kj->ki', mat, vecs)
numpy.matmul |
0.00598 ms |
numpy.einsum |
0.02394 ms |
RESULT: numpy.matmul
wins by a factor of 4!
2. Multiply a collection of 1000 matrices (stored in a 1000x4x4 matrix) by a matrix.
# transform all frames in mats into frame mat
# assume that mat is 4x4 and mats is nx4x4, and the output should be nx4x4
# implemented using matrix multiplication
def cs_transform_mats_one_mat_matmul(mat, mats):
return mat @ mats
# implemented using einsum
def cs_transform_mats_one_mat_einsum(mat, mats):
return np.einsum('jk,ikl->ijl', mat, mats)
numpy.matmul |
0.29124 ms |
numpy.einsum |
0.60229 ms |
RESULT: numpy.matmul
wins by a factor of 2!
3. Multiply a collection of 1000 matrices by a collection of 1000 matrices (stored in a 1000x4x4 matrix).
# transform all frames in mats2 using frames in mats1
# assume that mats1 is nx4x4 and mats2 is nx4x4, and the output should be nx4x4
# implemented using matrix multiplication
def cs_transform_mats_mats_matmul(mats1, mats2):
return mats1 @ mats2
# implemented using einsum
def cs_transform_mats_mats_einsum(mats1, mats2):
return np.einsum('ijk,ikl->ijl', mats1, mats2)
numpy.matmul |
0.29475 ms |
numpy.einsum |
0.70612 ms |
RESULT: numpy.matmul
wins by a factor of ~2.4!
4. Perform the dot product between a collection of 1000 vectors and another collection of 1000 vectors (stored in a 1000x4 matrix).
# dot product of vectors in vecs1 with corresponding vectors in vecs2
# assume that vecs1 is nx4 and vecs2 is nx4, and the output should be (n,)
# implemented using matrix multiplication
def extended_dot_matmul(vecs1, vecs2):
return np.squeeze(vecs1[:, np.newaxis, :] @ vecs2[:, :, np.newaxis])
# implemented using einsum
def extended_dot_einsum(vecs1, vecs2):
return np.einsum('ij,ij->i', vecs1, vecs2)
# implemented using sum multiply
def extended_dot_sum_multiply(vecs1, vecs2):
return np.sum(np.multiply(vecs1, vecs2), axis=1)
numpy.matmul |
0.04388 ms |
numpy.einsum |
0.01097 ms |
numpy.sum(numpy.multiply) |
0.02094 ms |
RESULT: numpy.einsum
wins by a factor of 4 over numpy.matmul
and by a factor of ~2 over numpy.sum(numpy.multiply)
!
5. Find the L2-norm (Euclidean norm) of a collection of 1000 vectors (stored in a 1000x4 matrix).
# norm of vectors in vecs1
# assume that vecs1 is nx4
# implemented using linear algebra module
def la_norm(vecs1):
return linalg.norm(vecs1, axis=1)
# implemented using einsum
def norm_einsum(vecs1, vecs2):
return np.sqrt(np.einsum('ij,ij->i', vecs1, vecs2))
numpy.linalg.norm |
0.01995 ms |
numpy.einsum |
0.00898 ms |
RESULT: numpy.einsum
wins by a factor of ~2.2!
My conclusion from these tests is that if numpy.matmul
can be utilized without making any changes to the input matrices, it is the fastest solution. If array dimensions are non-conformable (even when broadcasting is considered) then numpy.einsum
is the better choice.