Working with sparse attributed observations

Hey folks,

I am trying to make a PyMC3 model of latent variables based on many observations by noisy oracles. I want to leverage the information of which observations are from which oracles, as I expect each oracle to be somewhat internally consistent, but I can’t figure out how to do so without creating a very sparse set of observations with mostly null values.

More specifically, I have several dozen cookie recipes that I would like to infer the latent quality of. Each recipe has been implemented one or more times to create batches, and each batch has been tasted by 100 people. Each time a cookie is tasted by a person, it results in a noisy binary observation of whether the person approves of that cookie from that batch.

My problem is that while there are 30 recipes and 100+ observations per recipe, each person only provides an observation on 5 of the recipes, including 2 observations of the gold standard recipe.
This means that my observations are quite sparse - for any observation, only 5 out of 30 recipes have values. This is presumably well beyond the level at which it would be reasonable to impute missing values.

The obvious solution to this would be to consider every observation as independent and ignore attributions of observations to oracles. However, I believe that this will sacrifice a substantial amount of information, since I expect each oracle to have some internal “approval threshold” that determines which latent qualities are high enough to be observed as a 1 or 0.

My question then, is whether there is a way to express these observations to a PyMC3 model without sacrificing the observation attribution information.

Thank you for your help!

I got some traction on this problem and want to share what I’ve learned in case it’s useful for others, as well as to ask a more tactical follow-up about making my code more performant.

The approach I’ve taken is to leverage item response theory, which is typically used in the context of standardized testing to estimate the latent ability of each student and the latent difficulty of each test question. I reformulated my approach in these terms and adapted a Stan guide to implementing item response theory. I’m pleased with how my implementation has worked so far.

The key to avoiding the sparseness is to unpivot the observations and index everything against the observation index, where observation i is an observation made by person j about recipe k. To implement this, I just loop across observations, then index into person and recipe trackers to determine which distributions this observation is relevant to. A self-contained implementation of this is in the code below.

I’ve read in the PyMC3 docs that if I have a Python list of distributions like below instead of using the shape parameter, I’m probably sacrificing performance. My follow-up question then is whether there is a more efficient way to implement this list of distributions while still maintaining this indexing scheme (as my real data is much larger than this toy example) - specifically how to get rid of the for loop in the run_model function. Any pointers are much appreciated!

def generate_observations():
    TASTERS_PER_VARIANT = 10

    # These parameters suppose that 2 separate tastings have been run, each with
    # 3 recipes, where each taster observes 2 out of the 3 recipes in the test
    # and the third variant (index 2) is included in both tests.
    QUALITIES = [0.1, 0.3, 0.7, 0.5, 0.6]
    TASTER_THRESHOLD_BOUNDS = [0, 0.9]
    VARIANTS_IN_TEST = [[0, 1, 2], [2, 3, 4]]
    # each sublist of PERMUTATIONS indexes into a sublist of VARIANTS_IN_TEST
    PERMUTATIONS = [[0, 1], [1, 2], [0, 2]]
    TASTERS_PER_PERMUTATION = int(TASTERS_PER_VARIANT / 2)
    observations, taster_of_obs, formula_of_obs = [], [], []
    taster_idx = -1
    for test_set in VARIANTS_IN_TEST:
        for permutation in PERMUTATIONS:
            for _ in range(TASTERS_PER_PERMUTATION):
                taster_idx += 1
                taster_threshold = np.random.uniform(*TASTER_THRESHOLD_BOUNDS)
                for perm_idx in permutation:
                    formula_of_obs.append(test_set[perm_idx])
                    taster_of_obs.append(taster_idx)
                    observations.append(int(QUALITIES[test_set[perm_idx]] >= taster_threshold))
    return observations, taster_of_obs, formula_of_obs

def logit(threshold, candidate, steepness=1):
    return 1 / (1 + tt.exp(steepness * (threshold - candidate)))

def run_model(observations, taster_of_obs, formula_of_obs):
    LOGIT_STEEPNESS = 10
    n_tasters = np.max(taster_of_obs) + 1
    n_formulas = np.max(formula_of_obs) + 1
    n_obs = len(observations)
    print("n_obs", n_obs, "form", n_formulas, "tasters", n_tasters)
    with pm.Model() as tasting_model:
        formula_quality = pm.Normal("formula_quality", mu=0.5, sigma=0.3, shape=(n_formulas))
        taster_discerningness = pm.Normal("taster_discerningness", mu=0.5, sigma=0.3, shape=(n_tasters))

        top_boxes = [None] * n_obs
        ps = [None] * n_obs
        for obs_idx in range(n_obs):
            discerningness = taster_discerningness[taster_of_obs[obs_idx]]
            quality = formula_quality[formula_of_obs[obs_idx]]

            ps[obs_idx] = pm.Deterministic(f"p_{obs_idx}", logit(discerningness, quality, LOGIT_STEEPNESS))
            top_boxes[obs_idx] = pm.Binomial(f"top_box_{obs_idx}", n=1, p=ps[obs_idx], shape=1, observed=observations[obs_idx])
        trace = pm.sample(1000, tune=2000, target_accept=0.95)
    return trace

observations, taster_of_obs, formula_of_obs = generate_observations()
trace = run_model(observations, taster_of_obs, formula_of_obs)
pm.traceplot(trace)

I’ve now answered my own follow-up question. I was able to vectorize the code from my first answer using numpy indexing and as expected, it’s dramatically faster than the version with the python for loop, even with 10x as much generated data. The full vectorized code is below in case this is a helpful example for anybody.

def generate_observations():
    TASTERS_PER_VARIANT = 100

    # These parameters suppose that 2 separate tastings have been run, each with
    # 3 recipes, where each taster observes 2 out of the 3 recipes in the test
    # and the third variant (index 2) is included in both tests.
    QUALITIES = [0.2, 0.3, 0.7, 0.5, 0.6]
    TASTER_THRESHOLD_BOUNDS = [0, 0.9]
    VARIANTS_IN_TEST = [[0, 1, 2], [2, 3, 4]]
    # each sublist of PERMUTATIONS indexes into a sublist of VARIANTS_IN_TEST
    PERMUTATIONS = [[0, 1], [1, 2], [0, 2]]
    TASTERS_PER_PERMUTATION = int(TASTERS_PER_VARIANT / 2)
    observations, taster_of_obs, formula_of_obs = [], [], []
    taster_idx = -1
    for test_set in VARIANTS_IN_TEST:
        for permutation in PERMUTATIONS:
            for _ in range(TASTERS_PER_PERMUTATION):
                taster_idx += 1
                taster_threshold = np.random.uniform(*TASTER_THRESHOLD_BOUNDS)
                for perm_idx in permutation:
                    formula_of_obs.append(test_set[perm_idx])
                    taster_of_obs.append(taster_idx)
                    observations.append(int(QUALITIES[test_set[perm_idx]] >= taster_threshold))
    return np.array(observations), np.array(taster_of_obs), np.array(formula_of_obs)

def logit(threshold, candidate, steepness=1):
    return 1 / (1 + tt.exp(steepness * (threshold - candidate)))

def run_model(observations, taster_of_obs, formula_of_obs):
    LOGIT_STEEPNESS = 10
    n_tasters = np.max(taster_of_obs) + 1
    n_formulas = np.max(formula_of_obs) + 1
    n_obs = len(observations)
    obs_idcs = np.arange(n_obs)
    with pm.Model() as tasting_model:
        formula_quality = pm.Normal("formula_quality", mu=0.5, sigma=0.3, shape=(n_formulas))
        taster_discerningness = pm.Normal("taster_discerningness", mu=0.5, sigma=0.3, shape=(n_tasters))

        discerningness_idxd = taster_discerningness[taster_of_obs[obs_idcs]]
        quality_idxd = formula_quality[formula_of_obs[obs_idcs]]
        ps = logit(discerningness_idxd, quality_idxd, LOGIT_STEEPNESS)
        top_box = pm.Binomial("top_box", n=1, p=ps, shape=(n_obs), observed=observations)

        trace = pm.sample(1000, tune=2000, target_accept=0.95)
    return trace

observations, taster_of_obs, formula_of_obs = generate_observations()
trace = run_model(observations, taster_of_obs, formula_of_obs)
pm.plot_trace(trace)