How to properly index multilevel model with repeated measures data

I am trying to fit a multilevel model on some longitudinal data. In this data I have ~300k observations on ~ 1400 patients. The goal is to have a multilevel model where level 0 models a patients baseline risk and level 1 models additional risk based on the repeated measures variables.

Below is my model code and an graph image of the model:

with pm.Model(coords = coords) as multilevel_model:
    # Prior
    sigma_a = pm.HalfCauchy("sigma_a", 2)

    # Patient Baseline
    gamma = pm.Normal("gamma_0", 0, 1)
    gamma_gender = pm.Normal("gamma_gender", 0, 1, dims = ["gender", "patients"])
    gamma_unit = pm.Normal("gamma_unit", 0, 1, dims = ["unit", "patients"])
    gamma_care = pm.Normal("gamma_care", 0, 1, dims = ["care", "patients"])
    gamma_age = pm.Normal("b_age", 0, 1)

    mu_a = pm.Deterministic(
        "mu_a",
        var = gamma + 
        gamma_gender[gender_idx] + 
        gamma_unit[unit_idx] + 
        gamma_care[care_idx] + 
        gamma_age*age,
        dims = ["obs", "patients"]
    )

# Unexplained Patient Variation
    epsilon_a = pm.Normal("epsilon_a", 0, 1, dims = ["obs", "patients"])
    a = pm.Deterministic("a", var=mu_a + sigma_a * epsilon_a, dims = ["obs", "patients"])
    
    #  Common slopes priors
    b_time = pm.Normal("b_time", 0, 1)
    b_hematocrit = pm.Normal("b_hematocrit", 0, 1)
    b_min_bp = pm.Normal("b_min_bp", 0, 1)
    b_max_bp = pm.Normal("b_max_bp", 0, 1)
    b_min_temp = pm.Normal("b_min_temp", 0, 1)
    b_max_temp = pm.Normal("b_max_temp", 0, 1)
    b_min_rsp = pm.Normal("b_min_rsp", 0, 1)
    b_max_rsp = pm.Normal("b_max_rsp", 0, 1)
    b_min_pls = pm.Normal("b_min_pls", 0, 1)
    b_max_pls = pm.Normal("b_max_pls", 0, 1)
    b_min_sgr = pm.Normal("b_min_sgr", 0, 1)
    b_max_sgr = pm.Normal("b_max_sgr", 0, 1)
    b_min_pn = pm.Normal("b_min_pn", 0, 1)
    b_max_pn = pm.Normal("b_max_pn", 0, 1)
    
    # Expectation
    p = pm.Deterministic(
        "p", 
        pm.math.invlogit(
            pm.math.mean(a[:, patient_idx]) + 
            b_time*time +
            b_hematocrit*hematocrit +
            b_min_bp*min_BP +
            b_max_bp*max_BP +
            b_min_temp*min_temp +
            b_max_temp*max_temp +
            b_min_rsp*min_rsp +
            b_max_rsp*max_rsp +
            b_min_pls*min_pls + 
            b_max_pls*max_pls +
            b_min_sgr*min_sgr +
            b_max_sgr*max_sgr +
            b_min_pn*min_pn +
            b_max_pn*max_pn
        ),
        dims="obs"
    )
    likelihood = pm.Bernoulli("likelihood", p = p, observed = transferred, dims="obs")

I keep getting shape error messages when I try to sample this model.

Any tips/advice would really be appreciated.

The best way is to use the .eval() method. This allows you to eagerly compute a single variable in your model. You should check that all of these different components have the correct shapes with e.g. (b_time * time).shape.eval(), (b_hematocrit*hematocrit).shape.eval(), and so on.

Thank you for that suggestion @jessegrabowski I appreciate your help. If you don’t mind me asking you, does it make sense the way I had to create 2 dimensional parameters in level 0 to address the fact that there are more observations than patients (repeated measures) in my data?

I would suggest you do everything in long form, so instead of having a matrix of dimensions (observation, patient), you just have a vector of length n_obs * n_patients with basically a multi-index. I’m not sure if you have a balanced panel or not, but I find it’s easier to index into long-form data.

I’m also leery of the fact that you’re expanding the a variable with patient_idx just to take the mean of it. That doesn’t seem correct.

@jessegrabowski Yes, I am also quite concerned about taking the mean of a. It was the only way I could get it to not throw out an error of p being multidimensional. Do you have a reference I could check out on how to index into long-form data? The data I have is already in long-form each observation is a row. When I was implementing the pymc code I did initially try to only index my level 0 parameters like b_gender only by dims=“gender” but when I try to construct the a variable that has dims = “patients” it would complain that it could not broadcast variables.

edit:
I think I was able to make some progress thanks to your suggestions. I am now able to sample from the following model:

with pm.Model(coords = coords) as multilevel_model:
    # Prior
    sigma_a = pm.HalfCauchy("sigma_a", 2)

    # Patient Baseline
    gamma = pm.Normal("gamma_0", 0, 1)
    gamma_gender = pm.Normal("gamma_gender", 0, 1, dims = "gender")
    gamma_unit = pm.Normal("gamma_unit", 0, 1, dims = "unit")
    gamma_care = pm.Normal("gamma_care", 0, 1, dims = "care")
    gamma_age = pm.Normal("b_age", 0, 1)

    mu_a = pm.Deterministic(
        "mu_a",
        var = gamma + 
        gamma_gender[gender_idx] + 
        gamma_unit[unit_idx] + 
        gamma_care[care_idx] + 
        gamma_age*age
    )

    # Unexplained Patient Variation
    epsilon_a = pm.Normal("epsilon_a", 0, 1)
    a = pm.Deterministic("a", var=mu_a + sigma_a * epsilon_a)
    
    # Common slopes priors
    b_time = pm.Normal("b_time", 0, 1)
    b_hematocrit = pm.Normal("b_hematocrit", 0, 1)
    b_min_bp = pm.Normal("b_min_bp", 0, 1)
    b_max_bp = pm.Normal("b_max_bp", 0, 1)
    b_min_temp = pm.Normal("b_min_temp", 0, 1)
    b_max_temp = pm.Normal("b_max_temp", 0, 1)
    b_min_rsp = pm.Normal("b_min_rsp", 0, 1)
    b_max_rsp = pm.Normal("b_max_rsp", 0, 1)
    b_min_pls = pm.Normal("b_min_pls", 0, 1)
    b_max_pls = pm.Normal("b_max_pls", 0, 1)
    b_min_sgr = pm.Normal("b_min_sgr", 0, 1)
    b_max_sgr = pm.Normal("b_max_sgr", 0, 1)
    b_min_pn = pm.Normal("b_min_pn", 0, 1)
    b_max_pn = pm.Normal("b_max_pn", 0, 1)

# Expectation
    p = pm.Deterministic(
        "p", 
        pm.math.invlogit(
            a[patient_idx] + 
            b_time*time +
            b_hematocrit*hematocrit +
            b_min_bp*min_BP +
            b_max_bp*max_BP +
            b_min_temp*min_temp +
            b_max_temp*max_temp +
            b_min_rsp*min_rsp +
            b_max_rsp*max_rsp +
            b_min_pls*min_pls + 
            b_max_pls*max_pls +
            b_min_sgr*min_sgr +
            b_max_sgr*max_sgr +
            b_min_pn*min_pn +
            b_max_pn*max_pn
        ),
        dims="obs"
    )
    likelihood = pm.Bernoulli("likelihood", p = p, observed = transferred, dims="obs")

The only piece that I am still struggling with is the variable “epsilon_a” which represents the unobserved variation between patients. I am unable to set the dims of “epsilon_a” to be the number of patients in the data set. I get broadcasting errors if I do.

You will need to make a mapping from patients to rows of data. Assuming you have a column in your dataset like patient_id, you can do patient_idx, patients = pd.factorize(df.patient_id), then give epsilon_a the dimension patient and expand it to the number of rows with epsilon_a[patient_idx]

1 Like

@jessegrabowski Thank you very much! Your suggestion fixed the problems I was having with epsilon_a. Previously, when I was assigning epsilon_a dims of patients I was not indexing epsilon_a[patient_idx]. I apologize for this amateur mistake on my part. I do appreciate your help, again thank you!

1 Like

No worries, 99.9% of errors are shape errors. Don’t be shy about asking for help when you get stumped!

1 Like

Thanks man, honestly feeling real embarrassed by my mistake. I appreciate your help and kindness. Have a nice rest of your day!

1 Like