Lazy evaluation using Python descriptors (and decorators)

In Container Emulation in Python I showed how easy it is to emulate container behavior in Python in order to enable concise data access. In this post I want to talk about how this can be accomplished lazily.

The data that I am interested in analyzing consist of hundreds of trials of subjects performing different arm movements. After skin markers are attached to key landmarks on the arm, scapula and torso, the subject's movement is recorded at 100 Hz via a 10-camera system. I had previously built a class for accessing trial data, which I show below with slight modifications.

import pandas as pd
import numpy as np


class NestedContainer:
    """A class which allows access to items in the nested container."""

    def __init__(self, nested_container, get_item_method):
        self._nested_container = nested_container
        self._get_item_method = get_item_method

    def __getitem__(self, item):
        return self._get_item_method(self._nested_container, item)


def get_marker_method(csv_data, marker):
    """Return marker data, (n, 3) numpy array view."""
    return csv_data.loc[:, marker:(marker + '.2')].to_numpy()


class MarkerTrial:
    def __init__(self, trial_name, raw_file, smooth_file):
        self.trial_name = trial_name
        raw_data = pd.read_csv(raw_file, header=[0], dtype=np.float64)
        smooth_data = pd.read_csv(smooth_file, header=[0],
                                  dtype=np.float64)
        self.raw = NestedContainer(raw_data, get_marker_method)
        self.smooth = NestedContainer(smooth_data, get_marker_method)


trial = MarkerTrial('S010_SA_t01', 'S010_SA_t01_raw.csv',
                    'S010_SA_t01_smooth.csv')
raw_marker_data = trial.raw['RSHO']
smoothed_marker_data = trial.smooth['RSHO']

S010_SA_t01 is the trial name, and RSHO is the name of the right shoulder marker. Each marker's data is an n x 3 array, where n is the number of frames in the motion capture and 3 is the number of spatial dimensions.

It works well, so what's the problem? Oftentimes, I want to access data for a particular subset of the trials (say, trials of female subjects, trials of subjects under the age of 50, or arm elevation trials). The data for all subjects is housed in one directory with a subdirectory for each subject, and further subdirectories for each subject's trials. Every trial contains raw and smoothed (processed) data stored in separate CSV files (amongst other data). I write code that parses the directory and creates a pandas DataFrame (although in this post I will use a dictionary for illustrative purposes) that allows me to easily access each trial via its name, type of arm movement, patient age, etc.

trial_dict = {t_info.trial_name: MarkerTrial(t_info.trial_name, t_info.raw_file, 
                                             t_info.smooth_file)
              for t_info in process_directory(db_dir)}

Using the MarkerTrial class as defined above forces the data of every trial to be read from disk upon instantiation, regardless of whether I analyze that data or not. I would like the trial data to be read only when I specifically access it, that is, lazily.

There are a multitude of approaches to this problem, but I like the one below because it can be applied in a variety of scenarios. The approach relies on Python descriptors. The Python Descriptor HowTo Guide defines a descriptor as an object attribute with "binding behavior". Two types of descriptors exist: data descriptors, which define __set__ or __delete__, and non-data descriptors, which define only __get__. The excellent RealPython tutorial on descriptors provides a detailed explanation (replete with examples) of both types. Now that I have provided you with enough links to learn about Python descriptors on your own, I will show you how I use them. In practice, I use the Python package lazy, but it's just as easy to implement your own lazy attribute. The LazyProperty implementation below is taken from the RealPython tutorial. To understand the code below you will also need to be familiar with Python decorators.

import pandas as pd
import numpy as np


class LazyProperty:
    def __init__(self, function):
        self.function = function
        self.name = function.__name__

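    def __get__(self, obj, type=None) -> object:
        if obj is None:
            # Guard added here: accessed on the class (e.g. MarkerTrial.raw),
            # so there is no instance to cache on; return the descriptor.
            return self
        # Compute the value once and cache it in the instance __dict__.
        obj.__dict__[self.name] = self.function(obj)
        return obj.__dict__[self.name]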


class NestedContainer:
    """A class which allows access to items in the nested container."""

    def __init__(self, nested_container, get_item_method):
        self._nested_container = nested_container
        self._get_item_method = get_item_method

    def __getitem__(self, item):
        return self._get_item_method(self._nested_container, item)


def get_marker_method(csv_data, marker):
    """Return marker data, (n, 3) numpy array view."""
    return csv_data.loc[:, marker:(marker + '.2')].to_numpy()


class MarkerTrial:
    def __init__(self, trial_name, raw_data_file, smooth_data_file):
        self.trial_name = trial_name
        self.raw_file = raw_data_file
        self.smooth_file = smooth_data_file

    @LazyProperty
    def raw(self):
        raw_data = pd.read_csv(self.raw_file, header=[0], dtype=np.float64)
        return NestedContainer(raw_data, get_marker_method)

    @LazyProperty
    def smooth(self):
        smooth_data = pd.read_csv(self.smooth_file, header=[0], dtype=np.float64)
        return NestedContainer(smooth_data, get_marker_method)

Below, MarkerTrial is rewritten without the decorator syntax to make the mechanism easier to follow.

class MarkerTrial:
    def __init__(self, trial_name, raw_data_file, smooth_data_file):
        self.trial_name = trial_name
        self.raw_file = raw_data_file
        self.smooth_file = smooth_data_file

    def raw(self):
        raw_data = pd.read_csv(self.raw_file, header=[0], dtype=np.float64)
        return NestedContainer(raw_data, get_marker_method)

    def smooth(self):
        smooth_data = pd.read_csv(self.smooth_file, header=[0], dtype=np.float64)
        return NestedContainer(smooth_data, get_marker_method)
    
    raw = LazyProperty(raw)
    smooth = LazyProperty(smooth)

The LazyProperty class receives the method it decorates via its constructor. Note that the LazyProperty descriptor objects raw and smooth are constructed once, when the MarkerTrial class body is executed (that is, when the class is defined), and not every time a MarkerTrial instance is created. The code below shows when data access (reading a file from disk) actually occurs:

# no data has been read from disk at this point
trial = trial_dict['S010_SA_t01']
# the raw data is read from disk on the line below (first access to trial.raw)
rsho_raw_data = trial.raw['RSHO']
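
To make the timing concrete, here is a minimal, self-contained sketch. It uses a compact copy of the LazyProperty descriptor and a hypothetical fake_read_csv function that stands in for pd.read_csv and merely logs each call. The read_log list and the dummy marker values are assumptions made for illustration, not part of my actual code.

```python
class LazyProperty:
    """Non-data descriptor: compute on first access, then cache on the instance."""

    def __init__(self, function):
        self.function = function
        self.name = function.__name__

    def __get__(self, obj, type=None):
        obj.__dict__[self.name] = self.function(obj)
        return obj.__dict__[self.name]


read_log = []  # hypothetical: records every simulated "disk read"


def fake_read_csv(path):
    """Stand-in for pd.read_csv: log the call and return dummy marker data."""
    read_log.append(path)
    return {'RSHO': [[0.0, 0.0, 0.0]]}


class MarkerTrial:
    def __init__(self, trial_name, raw_file):
        self.trial_name = trial_name
        self.raw_file = raw_file

    @LazyProperty
    def raw(self):
        return fake_read_csv(self.raw_file)


trial = MarkerTrial('S010_SA_t01', 'S010_SA_t01_raw.csv')
reads_after_init = len(read_log)      # nothing is read at instantiation

first = trial.raw['RSHO']
reads_after_first = len(read_log)     # the first access triggers the read

second = trial.raw['RSHO']
reads_after_second = len(read_log)    # the second access reuses the cached value
```

In this sketch the read count goes from zero to one on the first access and stays at one afterwards, which is the behavior I want from the real MarkerTrial.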

LazyProperty is a non-data descriptor because it implements only __get__. The first access of trial.raw or trial.smooth triggers the binding behavior: Python calls the __get__ method of LazyProperty, passing trial as the obj parameter and MarkerTrial as the type parameter. This explains how data access is delayed until needed, but I still need to explain how caching works. As explained in the RealPython tutorial, the lookup chain is key to understanding caching. Specifically, for a non-data descriptor the following lookup sequence happens (other steps are omitted because they are not pertinent to this explanation):

  1. Python searches within the object's (trial) __dict__ for a key named after the attribute that you are accessing (say raw). If found, that value is returned.
  2. If the key lookup in Step 1 fails, Python calls the __get__ method of the non-data descriptor named raw on the class and returns its result.
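
The two steps above can be observed directly with a small sketch. The class names NonDataDescriptor, DataDescriptor and Demo are hypothetical and exist only for this illustration. The data descriptor is included for contrast: it belongs to the omitted steps, because a data descriptor is consulted before the instance __dict__.

```python
class NonDataDescriptor:
    """Defines only __get__, so an entry in the instance __dict__ shadows it."""

    def __get__(self, obj, objtype=None):
        return 'from non-data descriptor'


class DataDescriptor:
    """Defines __get__ and __set__, so it wins over the instance __dict__."""

    def __get__(self, obj, objtype=None):
        return 'from data descriptor'

    def __set__(self, obj, value):
        raise AttributeError('read-only')


class Demo:
    non_data = NonDataDescriptor()
    data = DataDescriptor()


demo = Demo()
before = demo.non_data  # Step 1 fails, so Step 2 calls __get__

# Write to the instance __dict__ directly, bypassing __set__.
demo.__dict__['non_data'] = 'from instance __dict__'
demo.__dict__['data'] = 'from instance __dict__'

after = demo.non_data   # Step 1 now succeeds; __get__ is not called
data_value = demo.data  # the data descriptor still takes precedence
```

This is why a lazy attribute of this kind has to be a non-data descriptor: if it also defined __set__, the cached entry in the instance __dict__ would never be consulted.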

The implementation of the __get__ method holds the key to caching. On the first lookup of trial.raw, a key named raw does not exist in the __dict__ of trial. Step 1 fails, so Python moves on to Step 2 and calls the __get__ method of the MarkerTrial.raw non-data descriptor. The __get__ method stores the result under the key raw in the __dict__ of trial, ensuring that Step 1 succeeds on every subsequent lookup. From then on, Python returns the value associated with the key raw in trial.__dict__ without calling __get__ or re-reading the file. One thing to keep in mind is that this technique does not work when the class defines __slots__ without including '__dict__', because its instances then have no __dict__.
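
The __slots__ limitation is easy to reproduce. The sketch below uses a compact copy of LazyProperty and two hypothetical classes, SlottedTrial and Trial, whose 'expensive result' string stands in for the real file read. For comparison it also shows functools.cached_property from the standard library (Python 3.8 and later), which likewise caches in the instance __dict__ and therefore has the same requirement.

```python
from functools import cached_property


class LazyProperty:
    def __init__(self, function):
        self.function = function
        self.name = function.__name__

    def __get__(self, obj, type=None):
        obj.__dict__[self.name] = self.function(obj)
        return obj.__dict__[self.name]


class SlottedTrial:
    __slots__ = ('trial_name',)  # instances have no __dict__

    def __init__(self, trial_name):
        self.trial_name = trial_name

    @LazyProperty
    def raw(self):
        return 'expensive result'


try:
    SlottedTrial('S010_SA_t01').raw
    slots_error = None
except AttributeError as exc:
    # __get__ fails because obj.__dict__ does not exist on a slotted instance.
    slots_error = exc


class Trial:
    def __init__(self, trial_name):
        self.trial_name = trial_name
        self.load_count = 0  # hypothetical counter, for illustration only

    @cached_property
    def raw(self):
        self.load_count += 1
        return 'expensive result'


trial = Trial('S010_SA_t01')
trial.raw
trial.raw
loads = trial.load_count  # the method body ran only once
```

If you need both __slots__ and lazy attributes, the cached value has to live somewhere other than the instance __dict__, which is outside the scope of this post.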

Understanding some of the intricacies of the Python data model has been a fun learning experience for me - and I hope this post has been helpful and/or informative for you.