Performance of common biomechanics linear algebra operations in Numpy

My last two Ph.D. projects will involve quantifying soft tissue artefact and analyzing axial rotation of the humerus using a combination of skin-marker motion capture and bi-plane fluoroscopy (on human subjects). Both projects will require a considerable amount of linear algebra, mainly coordinate system transformations.

Before building more complex logic into my codebase, I wanted to compare the performance of different methods of computing common biomechanics linear algebra operations in Numpy. As is common in biomechanics and robotics, I use a 4x4 homogeneous matrix to represent a coordinate system or pose, and correspondingly 3D vectors are upgraded to homogeneous coordinates. I knew that numpy.einsum could accommodate all linear algebra operations I am interested in performing, but I was curious to compare its performance against numpy.matmul for operations that could be performed just by matrix multiplication. Thanks to Numpy's broadcasting algorithm a considerable number of operation can be performed using numpy.matmul.

These are the operations that I decided to test:

  1. Transform the position of a marker for each of the n frames of a trial from one coordinate system to another. This is analogous to multiplying a collection of n vectors (stored in a nx4 matrix) by a matrix.
  2. Transform the pose of the humerus for each of the n frames of a trial from one coordinate system to another. This is analogous to multiplying a collection of n matrices (stored in a nx4x4 matrix) by a matrix.
  3. Express the pose of the humerus with respect to the pose of the torso for each of the n frames of a trial. This is analogous to multiplying a collection of n matrices by a collection of n matrices (stored in a nx4x4 matrix).
  4. Project the angular velocity of the humerus onto its long axis for each of the n frames of a trial. This is analogous to performing the dot product between a collection of n vectors and another collection of n vectors (stored in a nx4 matrix).
  5. Determine the angular velocity magnitude for each of the n frames of a trial. This is analogous to finding the L2-norm (Euclidean norm) of a collection of n vectors (stored in a nx4 matrix).

I set n to 1000 for all test - this is arbitrary but at 100 Hz this is equivalent to a 10 second capture. Each operation was implemented using at least two methods. For each operation, 1000 matrices and vectors with the appropriate dimensions and sizes were pre-instantiated using numpy.random.rand. For each method within an operation, identical matrices and vectors were utilized. Each method was called 1000 times using the pre-instantiated matrices and vectors , the total run time measured using time.time(), from which average computation time was calculated. And here are the results for average computation time. Finally, even though operation #4 and #5 make more sense with non-homogeneous coordinates (3D), I opted to keep these as homogeneous coordinate for the sake of consistency.

1. Multiply a collection of 1000 vectors (stored in a 1000x4 matrix) by a matrix.

# transform all vectors in vecs into frame mat
# assume that mat is 4x4 and vecs is nx4, and the output should be nx4

# implemented using matrix multiplication
def cs_transform_vecs_one_mat_matmul(mat, vecs):
    # (mat @ vecs.T).T = vecs @ mat.T
    return vecs @ mat.T


# implemented using einsum
def cs_transform_vecs_one_mat_einsum(mat, vecs):
    return np.einsum('ij,kj->ki', mat, vecs)
numpy.matmul 0.00598 ms
numpy.einsum 0.02394 ms

RESULT: numpy.matmul wins by a factor of 4!


2. Multiply a collection of 1000 matrices (stored in a 1000x4x4 matrix) by a matrix.

# transform all frames in mats into frame mat
# assume that mat is 4x4 and mats is nx4x4, and the output should be nx4x4

# implemented using matrix multiplication
def cs_transform_mats_one_mat_matmul(mat, mats):
    return mat @ mats


# implemented using einsum
def cs_transform_mats_one_mat_einsum(mat, mats):
    return np.einsum('jk,ikl->ijl', mat, mats)
numpy.matmul 0.29124 ms
numpy.einsum 0.60229 ms

RESULT: numpy.matmul wins by a factor of 2!


3. Multiply a collection of 1000 matrices by a collection of 1000 matrices (stored in a 1000x4x4 matrix).

# transform all frames in mats2 using frames in mats1
# assume that mats1 is nx4x4 and mats2 is nx4x4, and the output should be nx4x4

# implemented using matrix multiplication
def cs_transform_mats_mats_matmul(mats1, mats2):
    return mats1 @ mats2


# implemented using einsum
def cs_transform_mats_mats_einsum(mats1, mats2):
    return np.einsum('ijk,ikl->ijl', mats1, mats2)
numpy.matmul 0.29475 ms
numpy.einsum 0.70612 ms

RESULT: numpy.matmul wins by a factor of ~2.4!


4. Perform the dot product between a collection of 1000 vectors and another collection of 1000 vectors (stored in a 1000x4 matrix).

# dot product of vectors in vecs1 with corresponding vectors in vecs2
# assume that vecs1 is nx4 and vecs2 is nx4, and the output should be (n,)

# implemented using matrix multiplication
def extended_dot_matmul(vecs1, vecs2):
    return np.squeeze(vecs1[:, np.newaxis, :] @ vecs2[:, :, np.newaxis])


# implemented using einsum
def extended_dot_einsum(vecs1, vecs2):
    return np.einsum('ij,ij->i', vecs1, vecs2)


# implemented using sum multiply
def extended_dot_sum_multiply(vecs1, vecs2):
    return np.sum(np.multiply(vecs1, vecs2), axis=1)
numpy.matmul 0.04388 ms
numpy.einsum 0.01097 ms
numpy.sum(numpy.multiply) 0.02094 ms

RESULT: numpy.einsum wins by a factor of 4 over numpy.matmul and by a factor of ~2 over numpy.sum(numpy.multiply)!


5. Find the L2-norm (Euclidean norm) of a collection of 1000 vectors (stored in a 1000x4 matrix).

# norm of vectors in vecs1
# assume that vecs1 is nx4

# implemented using linear algebra module
def la_norm(vecs1):
    return linalg.norm(vecs1, axis=1)


# implemented using einsum
def norm_einsum(vecs1, vecs2):
    return np.sqrt(np.einsum('ij,ij->i', vecs1, vecs2))
numpy.linalg.norm 0.01995 ms
numpy.einsum 0.00898 ms

RESULT: numpy.einsum wins by a factor of ~2.2!

My conclusion from these tests is that if numpy.matmul can be utilized without making any changes to the input matrices, it is the fastest solution. If array dimensions are non-conformable (even when broadcasting is considered) then numpy.einsum is the better choice.