Model Comparison: How to constrain a model to positive mu-differences?

I still wasn’t able to figure out how to use patsy’s dmatrices, no clue. :frowning_face:

In the case of a model that doesn’t use intercepts and slopes, but instead models the differences in standard deviations directly, in the way that was initially posted by @junpenglao: how do I merge the different priors so I can use them with the observed variable? To be clear: there are several variables that define sigma for the different conditions/blocks/directions, but I have only one observed variable with only one sigma argument. How do I map the different sigma variables to their respective measurements?

I’ve been trying to find this information for weeks now, and it doesn’t seem to be clearly explained anywhere. It’s not obvious to me how dims, coords, and indexing work; they aren’t well documented.

With the mock-up data in the gist of my previous post, how would I provide these different mu distributions as one to the observed variable?
For a variable that is dummy coded with 0 and 1, like direction in my case, I tried it this way:

with pm.Model(coords=coords) as proj_model:
    direction_idx = pm.Data('direction_idx', direction_indices, dims='obs_id')
    user_idx = pm.Data('user_idx', user_indices, dims='obs_id')

    # Hyperpriors:
    global_ortho = pm.Gamma('ortho_global', mu=3, sigma=10)
    global_ortho_sd = pm.HalfCauchy('ortho_global_sd', 5)
    global_diff = pm.Gamma('diff_global', mu=3, sigma=3)
    global_diff_sd = pm.HalfCauchy('diff_global_sd', 5)
    global_parallel = pm.Deterministic('parallel_global', global_ortho+global_diff)

    # Priors
    ortho_user = pm.Gamma('ortho_user', mu=global_ortho, sigma=global_ortho_sd, dims='User')
    diff_user = pm.Gamma('diff_user', mu=global_diff, sigma=global_diff_sd, dims='User')
    parallel_user = pm.Deterministic('parallel_user', ortho_user+diff_user)

    # Expected deviation per user:
    theta = ortho_user[user_idx] * (1-direction_idx) + parallel_user[user_idx] * direction_idx

    projection = pm.Normal('projection', 0.0, sigma=theta, observed=df['projection'], dims='obs_id')

Due to the multiplication with either 0 or 1, rows that have direction==0 (orthogonal) use the ortho_user variable as prior, and rows with direction==1 use the parallel_user variable.

But for a variable like block that takes the values 0 to 2, I can’t simply multiply by the index, because the last level would get multiplied by 2 instead of acting as a 0/1 indicator. I tried to work around it this way:

with pm.Model(coords=coords) as block2_model:
    user_idx = pm.Data('user_idx', user_indices, dims='obs_id')
    block_idx = pm.Data('block_idx', block_indices, dims='obs_id')
    #block_id_idx = pm.Data('block_id_idx', block_id_indices, dims='obs_id')

    # Hyperpriors:
    global_blocks13 = pm.Gamma('blocks13_global', mu=5, sigma=10)
    global_blocks13_sd = pm.HalfCauchy('blocks13_global_sd', 5)
    global_diff = pm.Gamma('diff_global', mu=5, sigma=3)
    global_diff_sd = pm.HalfCauchy('diff_global_sd', 5)
    global_block2 = pm.Deterministic('block2_global', global_blocks13+global_diff)

    # Priors
    blocks13_user = pm.Gamma('blocks13_user', mu=global_blocks13, sigma=global_blocks13_sd, dims='User')
    diff_block2 = pm.Gamma('diff_block2', mu=global_diff, sigma=global_diff_sd, dims='User')
    block2_user = pm.Deterministic('block2_user', blocks13_user+diff_block2)

    # Expected deviation per user:
    theta = blocks13_user[user_idx] * (block_idx.eval() != 1) + block2_user[user_idx] * (block_idx.eval() == 1)

    projection = pm.Normal('projection', 0.0, sigma=theta, observed=df['projection'], dims='obs_id')

But, honestly, using eval() to get the values doesn’t seem like the appropriate way to do it?
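Maybe the comparison can stay symbolic so the .eval() isn’t needed at all. A minimal sketch of what I mean, inside the model block (using pm.math.eq and pm.math.switch; the name is_block2 is just a placeholder):

    # Symbolic 0/1 indicator instead of evaluating the shared variable:
    is_block2 = pm.math.eq(block_idx, 1)
    theta = pm.math.switch(is_block2,
                           block2_user[user_idx],
                           blocks13_user[user_idx])
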
And when I need to get more complex, with interactions between blocks and direction, this will get very complicated. I guess this is where the DesignMatrix should come in handy? But I don’t know where to put it or what to do with it.

It’s a lot of puzzle pieces that I don’t know how to fit together.

I found some resources that explain design matrices theoretically (video1, video2, wiki), and according to this stackexchange answer there’s a slight difference between the design matrix and the model matrix (based on Montgomery, D. (2009). Design and Analysis of Experiments, 7th Edition. John Wiley & Sons Inc.).

This notebook makes use of patsy and PyMC3, but doesn’t explain what’s going on there. Another notebook on a sleep deprivation study is a bit more informative, because it shows how to model only random intercepts, only random slopes, or both, but it also doesn’t go into much detail about the why and how. It doesn’t provide an interpretation of the results, unfortunately.
A hierarchical model from this thread uses an indexing scheme that I don’t quite grasp. And just to collect some more useful information, here’s a thread that further discusses hierarchical models.
There seems to be a lack of a beginner-friendly, step-by-step, in-depth guide with best practices for HLMs in PyMC3. Mostly it’s just notebooks replicating some calculations without much explanation.

Based on the sleep deprivation study notebook, I tried applying its approach to my case.

import numpy as np
import patsy
import pymc3 as pm
import theano.tensor as tt

# Fixed effects. Commonly denoted as X.
X = patsy.dmatrix('1 + block * direction', samples, return_type='dataframe')
X = np.asarray(X)

# Design matrix for the random effects. Intercept and slope are modelled separately to have more control over the priors.
Z_intercept = patsy.dmatrix('0 + user', data=samples, return_type='dataframe')
Z_intercept = np.asarray(Z_intercept)
Z_slope = patsy.dmatrix('0 + user:direction', data=samples, return_type='dataframe')
Z_slope = np.asarray(Z_slope)
Z = np.concatenate((Z_intercept, Z_slope), axis=1)
Y = np.asarray(samples['log_projection_sq'])

with pm.Model(coords=coords) as hlm_model:
    ## Fixed effect
    beta_X_intercept = pm.Normal('beta_X_ortho', mu=0, sd=10) # TODO: constrain it to positive values
    beta_X_slope_b2 = pm.Normal('beta_X_b2_ortho_offset', mu=0, sd=10)
    beta_X_slope_b3 = pm.Normal('beta_X_b3_ortho_offset', mu=0, sd=10)
    beta_X_slope_para = pm.Normal('beta_X_parallel_offset', mu=0, sd=10)
    beta_X_slope_b2_para = pm.Normal('beta_X_b2_parallel_offset', mu=0, sd=10)
    beta_X_slope_b3_para = pm.Normal('beta_X_b3_parallel_offset', mu=0, sd=10)
    beta_X = tt.stack([beta_X_intercept,
                       beta_X_slope_b2,
                       beta_X_slope_b3,
                       beta_X_slope_para,
                       beta_X_slope_b2_para,
                       beta_X_slope_b3_para
                      ])
    
    estimate_X = pm.math.dot(X, beta_X)
    
    ## Random effect
    sigma_Z_intercept = pm.HalfNormal('sigma_Z_intercept', sd=10)
    gamma_Z_intercept_raw = pm.Normal('gamma_Z_intercept_raw', mu=0, sd=1, shape=Z_intercept.shape[1])
    gamma_Z_intercept = pm.Deterministic('gamma_Z_intercept', gamma_Z_intercept_raw * sigma_Z_intercept)

    sigma_Z_slope = pm.HalfNormal('sigma_Z_slope', sd=10)
    gamma_Z_slope_raw = pm.Normal('gamma_Z_slope_raw', mu=0, sd=1, shape=Z_slope.shape[1])
    gamma_Z_slope = pm.Deterministic('gamma_Z_slope', gamma_Z_slope_raw * sigma_Z_slope)
    
    estimate_Z = pm.math.dot(Z_intercept, gamma_Z_intercept) + pm.math.dot(Z_slope, gamma_Z_slope) 
    
    ## likelihood
    mu_estimate = pm.Deterministic('mu_estimate', estimate_X + estimate_Z)
    sigma_unexplained = pm.HalfNormal('sigma_unexplained', sd=10) # unexplained variability
    y_likelihood = pm.Normal('y_likelihood', mu=mu_estimate, sd=sigma_unexplained, observed=Y)

Sampling was slow and I got all kinds of warnings:

There were 19 divergences after tuning. Increase target_accept or reparameterize.
There were 55 divergences after tuning. Increase target_accept or reparameterize.
There were 156 divergences after tuning. Increase target_accept or reparameterize.
The acceptance probability does not match the target. It is 0.6974308250411304, but should be close to 0.8. Try to increase the number of tuning steps.
There were 18 divergences after tuning. Increase target_accept or reparameterize.
The number of effective samples is smaller than 10% for some parameters.
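
As an immediate (probably insufficient) fix I guess I can follow the warnings and raise target_accept and the number of tuning steps, something like this (the values are just guesses):

with hlm_model:
    # Higher target_accept and longer tuning, as suggested by the warnings.
    # Divergences usually also hint at a parameterization problem, though.
    trace = pm.sample(draws=2000, tune=2000, target_accept=0.9)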

I’m also not clear on how to interpret the estimates when using design matrices. Are all the slopes offsets from the orthogonal projection in block 1 (the intercept)? Or is ‘beta_X_b2_parallel_offset’, for example, the difference to ‘beta_X_parallel_offset’?
Also, in order to stack all the slopes I had to write them down one by one, because I couldn’t stack tensors with different numbers of columns, e.g. if I had one beta_X_slope with shape 6. If I added condition and more interactions, this would get repetitive quite quickly. Is there a vectorized way to do this?
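
For the interpretation, I suppose the first step is to look at the column names patsy generates before converting to an array; a small sketch (X_df is just a placeholder name):

X_df = patsy.dmatrix('1 + block * direction', samples, return_type='dataframe')
print(X_df.columns.tolist())  # shows which coefficient belongs to which term

If I read the treatment coding right, every main-effect column is an offset from the reference cell (block 1, orthogonal), and an interaction coefficient like ‘beta_X_b2_parallel_offset’ is an additional offset on top of the block-2 and parallel main effects, not a difference to ‘beta_X_parallel_offset’ alone, but I’d like confirmation on that.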

Obviously this wasn’t the best attempt. How can it be improved? How can the coords be utilized?
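
Regarding the coords, my current guess (just a sketch, not verified) is that the design-matrix columns could become a named dimension, which would also replace the manual stacking ('X_cols' is a name I made up):

# Assuming X_df is the dataframe version of the fixed-effects matrix from above.
coords['X_cols'] = X_df.columns.tolist()

with pm.Model(coords=coords) as hlm_model_v2:
    # One coefficient per design-matrix column instead of six hand-written scalars.
    beta_X = pm.Normal('beta_X', mu=0, sd=10, dims='X_cols')
    estimate_X = pm.math.dot(X, beta_X)
    # ... random effects and likelihood as above ...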

This thread has become quite messy with my many confusions, so it’s probably difficult to find useful information in it, but I’ll keep adding to my journey into the Bayesian abyss anyway, in the hope it may help somebody else.

In the meantime I found a comment on a blog post about the indexing syntax, to which @twiecki replied and explained what it does. Say you’ve got something like:

with pm.Model() as hierarchical_model:
    ...
    # Intercept for each county, distributed around group mean mu_a
    a = pm.Normal('alpha', mu=mu_a, sigma=sigma_a, shape=len(data.county.unique()))    
    # Expected value
    radon_est = a[county_idx]
    ....

You have a: a vector with one distribution per county. Your data, however, has multiple values per county. So what you want to do is create a version of a that has the same length as your data and associates each data point with the right county distribution.
For example, say we have 2 counties, each with 3 data points. Then a = [county_1, county_2]; to associate that with your data you index it as a[[0, 0, 0, 1, 1, 1]], which gives a vector of length 6 that lines up correctly with your data.
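
In plain NumPy terms this is just fancy indexing; here’s a tiny standalone example of that mapping (made-up numbers):

import numpy as np

a = np.array([1.5, 2.7])                    # one value per county
county_idx = np.array([0, 0, 0, 1, 1, 1])   # county of each data point
print(a[county_idx])                        # [1.5 1.5 1.5 2.7 2.7 2.7]

The same mechanism works on a PyMC3 random variable, which is why a[county_idx] inside the model lines up one entry of a with every observation.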