Model Comparison: How to constrain a model to positive mu-differences?

I think I have to clarify some more why I turned to pymc3 and I’m not just doing a Bayesian ANOVA in JASP or jamovi.
I’m especially interested in predictions from 2 hypotheses I haven’t mentioned before.
1: low variability for orthogonal in all blocks, high variability for parallel projection in block 1 & 3, but not in block 2.

| .       . parallel
|  \     /
|   \   /
|    \./
| .___.___. orthogonal
|------------> block
  1   2   3

2: same as before except that orthogonal variability rises in block 2.

|.       . parallel
| \     /
|  \   /
|   \./
|   /*\
| ./   \.  orthogonal
|------------> block
  1  2  3

The ANOVA on the Synergy Index (a measure of the variability ratio) can’t discriminate these 2 predictions, only that there is some difference in block 2 vs 1 & 3, which doesn’t help much. Therefore, I hope to model these 2 predictions in pymc3, amongst other ones.
That’s why I think counting the samples where parallel > orthogonal would not be able to give me the answer I seek, because it would give me the same result for these 2 examples?

1 Like

Ah I see, the observed is a square array (n_subject * n_trials). In that case, my suggestion would be to flatten the observed and use indexing to construct the linear predictors, it is better for missing trials etc - something like:

with pm.Model() as projection_model: = "Main Effect Projection"
    # better to also model mu
    mu_ortho = pm.Normal(...)
    mu_parallel = pm.Normal(...)
    sigma_ortho = pm.Gamma("sigma_ortho", mu=3.34, sigma=1.44, shape=n_users)
    sigma_parallel = pm.Gamma("sigma_parallel", mu=7.0, sigma=3.17, shape=n_users)
    obs_parallel = pm.Normal(f"obs_parallel",
    obs_orthogonal = pm.Normal(f"obs_orthogonal",
    sigma_diff = pm.Deterministic('sigma_diff', sigma_parallel - sigma_ortho)

This is reaction time data? maybe try log-normal? I am not sure what is the recommended best practice nowadays as well.

Well, the language of hypothesis testing is always problematic to me as it usually dont answer what we really want to know - is this conclusion going to be useful (for making decision, for advancing our understanding, etc). As it is only useful when you have a non-tiny effect, looking at things like effect size is much more natural. In a linear model (like ANOVA), all the hypothesis you want to test are ultimately comparisons of parameters/coefficients, so instead of fitting multiple models and compare marginal likelihood or Bayes Factor, I usually recommend to fit the most comprehensive model with all the information (in this case would mean fit both conditions and coded for all subject, trial and block), and look at the distribution of the posterior contrasts (an example:

You need to coded the block in your model as well - for example, you can have a indexing of [0, 1, 2, …, 5] to indicate parallel_block_1, parallel_block_2, …, orthogonal_block_3, and index to a shape=6 vector of parameter.


Thank you very much for taking the time again to answer my questions. I try to understand it and why it may be the better approach.

I agree, that’s why I’ve come here. :smile: Too bad I’m still a noob at Bayesian inference.

No. It’s a motoric task measuring 2 end effector positions, thus forming a trial-to-trial point cloud in a cartesian 2D-coordinate system. The demeaned 2D-coordinate vectors get projected onto a vector and its orthogonal, basically just a change of basis. “parallel” and “orthogonal” data are merely the coefficients in \vec{v} = a_\parallel\hat{i} + a_\perp \hat{j}.

Yes, in my first post this was the case. That’s why I modelled mu which then was the parameter I was interested in, since in that case mu resembled the variance of the untransformed projection data. It was also log-transformed to pull in extremes (due to the squares) and stabilize the variance of mu itself (variance of variance) etc. It may have been more complex than it needs to be.
For the non-squared data in my second post, however, the mean is guaranteed to be 0, that’s why I didn’t model mu, only sigma.

By flattening you mean throwing all participants together? I don’t quite get your code example. mu and sigma are of shape n_users, but you only estimate these for 1 user and using the data of all users at the same time? I’m confused.
Since participants may deploy different strategies in the motoric task (assigned to different conditions as between factor and each participant may come up with their own solution), I want to estimate model parameters per participant. I started with having each single model per each participant before I discovered I could vectorize the parameters. Having all models per particpant eliminated missing data. Only after vectorizing parameters instead of having each single model for each participant I was facing the issue of missing data, since participants differ in their trial count per block (1 column per participant and trials being the rows in the observed data).

Yes, I only left out the blocks from my previous examples for brevity, since they weren’t relevant to those simple models, and for trying to understand the basics first. If I understand you correctly, this is the approach where I have 1 uber-model per participant? And for variables I’m interested in (9 differences) I would create deterministic variables,e.g. sigma[2] - sigma[1] for difference of parallel deviation block2-block1, and sigma[1]-sigma[4] for difference between parallel and orthogonal deviation in block 2?
But if I had all blocks in 1 vectorized form (trials as rows, block_<projection> as columns) I’d again have missing data, since the trial count differs between blocks? This can be solved though by having block1, block2, block3 with shape 2 each (parallel & orthogonal).

Having those 9 estimates and their uncertainty would be a good thing. Thanks!
Only, having all 9 estimates isn’t fully satisfying my question, since there are competing theoretical predictions about the sizes of those differences, see above ASCII diagrams. I’d like to know whose theory is most compatible with the observed data. Wouldn’t that involve a secondary step of a model selection with the estimated differences as observed data, including their uncertainty somehow?

The approach I had with several models was to state from the beginning what I’d expect the differences to look like for each theory. Null model would have no differences whatsoever. Main effect projection would state that there is a constant unknown difference (>0) between the projections across blocks. And then more sofisticated models with statements about the differences between individual blocks/projections. I feel like this was the other way around of what you suggested, first getting the estimates ofthose differences and then making statements about them.

1 Like

Since they are orthogonal you can model them with univariate distribution that has similar shape and the same support (e.g., positive only) as your observed.

Usually, you can coded your data as a table like:

subject_id | condition1 | condition2 | observation
         0 |   parallel |    block_1 |       .3452

Instead of having 1 column per participant and trials being the rows in the observed data, you have 1 column for all your data, and participant and trials are indicator (subject_id and trial_id).

Then in your model, you can either use GLM like approach to construct a design matrix, or use indexing:

subject_mu = pm.Normal(..., shape=n_subject)  <= shape is n_subject
subject_mu_all_trial = subject_mu[subject_id] <= shape is now number of row in your data frame

which estimate the parameter independently for each subject.

No matter what approach you decide to go with in terms of model comparison / hypothesis testing, I think you should use this formulation as it is a more general way to do so, and connect nicely with classical GLM formulation. You can also consult the recent book Regression and other stories


Thanks for the clarification on the data format. So just have it in long format.

I’m sorry, I still don’t understand what type of variable subject_id in subject_mu_all_trial = subject_mu[subject_id] shall be in your example. Is that the single ID of a participant (e.g. single integer)? But then why have subject_mu with shape n_subjects, if I’d have to loop over all subject_id anyway and call with pm.Model() as model: ... within the loop? Or is subject_id an array of all unique subject IDs [0,1,2,3,…]?
Or is it the first column of the data with duplicates [0,0,0,0,0,…,1,1,1,1,1,…,2,2,2,2,2,2,…]? If the shape of subject_mu_all_trials was the number of rows in the long format dataframe (total number of all trials*2) it would be n_users*n_user_trials*n_directions (e.g. 40x90x2=7200)? Wouldn’t that just give me the values of every single trial? I obviously don’t understand what you are trying to tell me. :confused:

Or did you mean to do it in 2 steps:

df = ... # long format dataframe with column 'user' as ID for participants.
n_users = df['user'].nunique()
mus = list()
with pm.Model() as model:
    user_mu = pm.Normal(..., shape=n_users)

for user_id in df['user'].unique():
    with model:
        user_mu_all_trials = user_mu[user_id]
    mus.append(user_mu)  # Otherwise it would be overwritten each iteration.

That doesn’t seem to be what you intended. Thanks for your effort and patience with me!

1 Like


It wouldn’t as you dont have parameter related to each trial - this is essentially what IID data or repeat measurement means. Maybe it is clearer in code:

data1 = np.random.normal(loc=0., scale=1., size=10)
data2 = np.random.normal(loc=10., scale=5., size=10)
data_all = np.concatenate([data1, data2], axis=0)
sbj_idx = np.concatenate([np.zeros(10), np.ones(10)], axis=0).astype(int)

with pm.Model() as model1:
  mu = pm.Normal('mu', 0., 10., shape=2)
  sigma = pm.HalfNormal('sigma', 5., shape=2)
  obs = pm.Normal('obs', mu, sigma, observed=np.stack([data1, data2]).T)

with pm.Model() as model2:
  mu = pm.Normal('mu', 0., 10., shape=2)
  sigma = pm.HalfNormal('sigma', 5., shape=2)
  obs = pm.Normal('obs', mu[sbj_idx], sigma[sbj_idx], observed=data_all)

model1 and model2 are identical but model2 is much more flexible as it allows different length of data among subjects.


Thank you so much for the code example. This is great info. I don’t know if that is even anywhere in the documentation. Using the graphviz plot I see what’s going on. I imagine it now as mapping the observed data to the mus. I scoured the internet in an effort to find any explanation for how to use it, to not occupy too much more of your time, but I didn’t find a single example using the indexing method. So I am once again kindly asking for your help.

I found a well illustrated primer on different shape types. And a blog post which explains that shapes in PyMC3 are, unfortunately, one hot mess. It also states:

distribution_shape == batch_shape + event_shape
[observed] data should have a shape that looks like sample_shape + distribution_shape .

If I understood correctly, for a model where I estimate 6 parameters (e.g. \sigma for n_blocks(3)*n_projections(2)) for each user by drawing from independent univariate distributions (same distribution family), my sample shape is n_trials (per block), batch shape is n_users*n_blocks*n_projections, and event shape = ().

I have the data in long format like you suggested:

user block  direction    projection   
0    1      parallel     7.129121
0    1      parallel     3.414878
0    1      parallel     -6.330880
0    1      parallel     0.428940
0    1      parallel     -3.650233
...  ...    ...          ...
1    3      orthogonal   -2.223337
1    3      orthogonal   -1.275683
1    3      orthogonal   -0.796934
1    3      orthogonal   2.221719
1    3      orthogonal   -5.626795

I could unstack the ‘direction’ if that was of any use, as these are guaranteed to be the same length for each individual block.
I tried different shape settings and I don’t know which one is correct. For example,

with pm.Model() as model3:
    sigma = pm.HalfNormal('sigma', 5., shape=12)  # 2 users * 6 parameters. batch shape?
    obs = pm.Normal('obs', 0.0, sigma[samples['user']], shape=12, observed=samples['projection'])

But it’s hard to identify which parameter (sigma 0-11) belongs to which user/block/direction coordinate.
Whatever else I tried with the indexing and shape argument, I always got this error:

RecursionError: maximum recursion depth exceeded in comparison

If I manage to use it as intended I may write something about how to vectorize a model properly without missing data using your indexing method to ease the struggle for other beginners.

1 Like

So, you can still use indexing to do this, but it will be quite inefficient as you need to create a index of direction + block, for example, index 0 would be parallel block 1.

A better approach is to formulate a design matrix:

from patsy import dmatrices
outcome, predictors = dmatrices("projection ~ block + direction + block * direction", data)

which is what usually call mixed effect model or hierarchical model. The pymc3 doc have quite an extensive session on this (e.g.,

The differences is that, here you your case the linear prediction will goes into sigma instead of mu


Hi, we know cross validation and model comparison can be tricky because in many cases there are several correct ways of comparing the models. We are actually working on some resources to help with that, they are still a work in progress, but it should be useful already. Here is the notebook covering multiple likelihoods. The link is to my fork which has the open PR, feedback will be most appreciated. You may also be interested in the other notebooks of the section:

Having said that, there is also this example in R which I think will be much closer to your model and could also be helpful:

I will try to read this topic with more time to see if I can help further. I think this multiple measurement and multiple subject models are ideal to teach the nooks and crannies of CV.

By any chance, do you know about some public dataset other than the rats one with similar characteristics?


I needed some time to go through the ressources you both provided, dust off my knowledge about HLM (Mixed Model) and think about how this would apply to my actual problem. I’ll discuss now the 2 different approaches, one being mixed model, and the other I call the intuitive approach.

First I want to stress and illustrate again what my research question is. There are several hypothesis by different authors about how the variances of the data should differ between the blocks and direction of projection.
How the data is nested:

3 between-conditions
  N users (N varies between conditions)
    3 within-conditions: blocks
        X trials (varies per block and user)
            2 outcome directions: parallel, orthogonal
                outcome: projection (scalar)

The question of interest is: Which prediction about the variance structure in the data is more probable?
To illustrate this I present again some ASCII-diagrams for different hypotheses. This first illustration makes it easy to think about the data in terms of its collection process (one block after the other, 2 outcomes). These only show block and direction as independent categorical variables and the standard devation as the depedent variableand neglect the between-factor and different users.

Main Effect Projection  Main Effect Block2    Main Effects Projection + Block2
sigma                   sigma                 sigma
^                       ^                     ^     o        o - parallel
|                       |                     |    / \       x - orthogonal
| o__o__o               |     o               |   /   \
|                       |    /x\              |  /  x  \
|                       |   // \\             | o  / \  o
|                       |  //   \\            |   /   \
| x__x__x               | //     \\           |  /     \
|                       |ox       ox          | x       x
|----------> block      |------------> block  |------------> block
  1  2  3                 1   2   3            1   2   3

Block x Projection v1   Block x Proj. v2      Block x Projection v3
sigma                   sigma                 sigma
^                       ^                     ^
|o       o              | o       o           | o
| \     /               |  \     /            |  \
|  \   /                |   \   /             |   \
|   \o/                 |    \ /              |    \
|   /x\                 |     V               |     o___o
|  /   \                |     o               |
| x     x               | x___x___x           | x___x___x
|----------> block      |------------> block  |------------> block
  1  2  3                 1   2   3             1   2   3

Block x Projection v4   Null Model      
sigma                   sigma                 
^                       ^                     
|                       |            
|                       |            
| o__o__o               |            
|    x                  |            
|   / \                 | xo__xo__xo 
|  /   \                |            
| x     x               |            
|----------> block      |------------> block  
  1  2  3                 1   2   3           

Multiple users (e.g. 3) may look like this:

Block2 x Projection v1  Block2 x Proj. v2     Main Effects Block2 + Projection
sigma                   sigma                 sigma o
^                       ^                     ^
|                       |                     |     oo         o - parallel
|                       |         o           |     o          x - orthogonal
| oo      o             | oo      o           | oo      o
| o       oo            | o       o           |         oo
|     oo                |                     | o   xx
|     xox               |                     |     x
| x   x                 | x   oo  xx          | x
| xx      xxx           | xx  xxx x           | xx      xxx
|------------> block    |------------> block  |------------> block
  1   2   3               1   2   3             1   2   3

These would be intuitive to formulate as separate models with restrictions on average sigma per block x direction.

So let’s look at how I would go about the intuitive approach. Here’s a code example with 3 models and 3 users behaving differently.

# %%
import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc3 as pm
import seaborn as sns

#          Dummy Data           #
# %%
# Create some data conforming to null model: all deviations on average equal across blocks.
n_trials = 90 - np.random.randint(36)
df0 = pd.DataFrame({'user': 0, 'block': 1 + np.random.randint(3, size=n_trials),
                    'parallel': np.random.normal(0.0, 5.5, size=n_trials),
                    'orthogonal': np.random.normal(0.0, 5.5, size=n_trials)})
# Make sure mean of each block is zero as with real data.
df0[['parallel', 'orthogonal']] = df0.groupby('block')[['parallel', 'orthogonal']].transform(lambda x: x - x.mean())

# %%
# Create some data conforming to main effect of projection model: variance in parallel > orthogonal.
n_trials = 90 - np.random.randint(36)
df1 = pd.DataFrame({'user': 1, 'block': 1 + np.random.randint(3, size=n_trials),
                    'parallel': np.random.normal(0.0, 7.0, size=n_trials),
                    'orthogonal': np.random.normal(0.0, 3.3, size=n_trials)})
# Make sure mean of each block is zero as with real data.
df1[['parallel', 'orthogonal']] = df1.groupby('block')[['parallel', 'orthogonal']].transform(lambda x: x - x.mean())

# %%
# Create some data conforming to interaction effect version 1:
# blocks 1&3 have smallest orthogonal deviations, but largest parallel deviations. Strong synergy.
n_trials = 30 - np.random.randint(12, size=3)
block1 = pd.DataFrame({'user': 2, 'block': 1,
                       'parallel': np.random.normal(0.0, 7.0, size=n_trials[0]),
                       'orthogonal': np.random.normal(0.0, 3.0, size=n_trials[0])})
block3 = pd.DataFrame({'user': 2, 'block': 3,
                       'parallel': np.random.normal(0.0, 7.0, size=n_trials[1]),
                       'orthogonal': np.random.normal(0.0, 3.0, size=n_trials[1])})
# Block 2 has smaller parallel deviations than blocks 1&3, but higher orthogonal deviations than blocks 1&3.
# In this case sigma_block2_parallel - sigma_block2_orthogonal = 0
# No synergy.
block2 = pd.DataFrame({'user': 2, 'block': 2,
                       'parallel': np.random.normal(0.0, 5.0, size=n_trials[2]),
                       'orthogonal': np.random.normal(0.0, 5.0, size=n_trials[2])})
df2 = pd.concat((block1, block2, block3), axis='index')
df2[['parallel', 'orthogonal']] = df2.groupby('block')[['parallel', 'orthogonal']].transform(lambda x: x - x.mean())

# %%
df = pd.concat((df0, df1, df2), axis='index')

# %%
df[['user', 'block']] = df[['user', 'block']].astype('category')
df.sort_values(by=['user', 'block'], inplace=True)

# %%
# Reshape data into long format.
samples = df.melt(id_vars=['user', 'block'], value_vars=['parallel', 'orthogonal'],
                  var_name='direction', value_name='projection')
# Dimensions.
n_users = samples['user'].nunique()
n_blocks = samples['block'].nunique()
n_directions = samples['direction'].nunique()

# %%
# Vizualize the data.
g = sns.FacetGrid(samples, col='block', row='user', hue="direction")
g = (, "projection").add_legend(title="Projection", loc='upper right'))
for ax in g.axes.flat:
    ax.set_ylabel("Probability Density")
    ax.set_xlabel("Projection Length")

# %%
#          Null Model           #
with pm.Model() as null_model: = "Null Model"
    sigma = pm.Gamma("sigma", alpha=2.2, beta=0.4, shape=n_users)

    obs = pm.Normal(f"obs", mu=0.0, sigma=sigma[samples['user']], observed=samples['projection'])

# %%

# %%
with null_model:
    idata0 = pm.sample(1000, tune=800, return_inferencedata=True)
az.summary(idata0, var_names=["Null Model_sigma"])

# %%
# Main Effect Projection Model  #
with pm.Model() as projection_model: = "Main Effect Projection"
    sigma_ortho = pm.Gamma("sigma_ortho", mu=3.34, sigma=1.44, shape=n_users)
    # The sum of two gamma-distributed variables is gamma-distributed.
    sigma_diff = pm.Gamma('sigma_diff', mu=4.0, sigma=5.0, shape=n_users)
    sigma_parallel = pm.Deterministic("sigma_parallel", sigma_ortho + sigma_diff)

    obs_parallel = pm.Normal("obs_parallel", mu=0.0, 
                             sigma=sigma_parallel[samples.loc[samples['direction'] == 'parallel', 'user']], 
                             observed=samples.loc[samples['direction'] == 'parallel', 'projection'])
    obs_orthogonal = pm.Normal("obs_orthogonal", mu=0.0,
                               sigma=sigma_ortho[samples.loc[samples['direction'] == 'orthogonal', 'user']], 
                               observed=samples.loc[samples['direction'] == 'orthogonal', 'projection'])

# %%

# %%
with projection_model:
    idata1 = pm.sample(1000, tune=1000, return_inferencedata=True)
az.summary(idata1, var_names=["Main Effect Projection_sigma"], filter_vars='like')

# %%
#idata1.log_likelihood['projection_diff'] = (idata1.log_likelihood['Main Effect Projection_obs_parallel'] 
#                                            - idata1.log_likelihood['Main Effect Projection_obs_orthogonal'])
#az.loo(idata1, var_name='projection_diff')

# %%
# Main Effects Block2 + Projection Model #
with pm.Model() as both_main_effects_model: = "2 Main Effects"
    sigma_ortho13 = pm.Gamma("sigma_ortho_blocks_1_3", mu=3.34, sigma=1.44, shape=n_users)
    # The sum of two gamma-distributed variables is gamma-distributed.
    sigma_diff_proj = pm.Gamma('sigma_diff_projection', alpha=4., beta=1.0, shape=n_users)
    sigma_diff_block2 = pm.Gamma('sigma_diff_block2', alpha=1., beta=1.0, shape=n_users)

    sigma_ortho2 = pm.Deterministic('sigma_ortho_block_2', sigma_ortho13 + sigma_diff_block2)
    sigma_parallel13 = pm.Deterministic('sigma_parallel_blocks_1_3', sigma_ortho13 + sigma_diff_proj)
    sigma_parallel2 = pm.Deterministic('sigma_parallel_block_2', sigma_parallel13 + sigma_diff_block2)

    data_parallel13 = samples.query('`direction`=="parallel" & `block` in [1, 3]')
    data_ortho13 = samples.query('`direction`=="orthogonal" & `block` in [1, 3]')
    data_parallel2 = samples.query('`direction`=="parallel" & `block`==2')
    data_ortho2 = samples.query('`direction`=="orthogonal" & `block`==2')

    obs_parallel13 = pm.Normal("obs_blocks_1_3_parallel", mu=0.0, sigma=sigma_parallel13[data_parallel13['user']], 
    obs_parallel2 = pm.Normal("obs_block2_parallel", mu=0.0, sigma=sigma_parallel2[data_parallel2['user']], 
    obs_orthogonal13 = pm.Normal("obs_blocks_1_3_orthogonal", mu=0.0, sigma=sigma_ortho13[data_ortho13['user']], 
    obs_orthogonal2 = pm.Normal("obs_block2_orthogonal", mu=0.0, sigma=sigma_ortho2[data_ortho2['user']], 

# %%

# %%
with both_main_effects_model:
    idata2 = pm.sample(1000, tune=1000, return_inferencedata=True)
az.summary(idata2, var_names=["2 Main Effects_sigma_diff"], filter_vars='like')

What’s missing in these models is the partial pooling, but I guess this could be easily done by just adding more priors on top. To get from these models to answering the research question is where the bridge sampling should come into play. I have to figure out how to get the traces and model in the vectorized form suited for the marginal_llk-function.

The Mixed Model approach does have advantages like partial-pooling and being able to include within- and between-conditions. The disadvantage is that it is not as intuitive for me for the research question at hand and fitting one model without restrictions also takes me only halfway to what I really want to know.

Let me illustrate the same predictions from above as regression models. The i stands for intercept and the b are the slopes. i12 is abbreviation for intercept block 1 & intercept block 2. The orthogonal direction is coded as 0 and parallel as 1.

    Main Effect Projection  Main Effect Block2    Main Effects Projection + Block2
    sigma                   sigma                 sigma  / b2    
    ^                       ^                     ^     /        
    |      / b123           |                     |    /         
    |     /               i2|_________b2          |   /  / b13   
    |    /                  |                     |  /  /        
    |   /                   |                     | /  /         
    |  /                    |                   i2|/  /          
    | /                     |                     |  /           
i123|/                   i13|_________b13         | /            
    |                       |                  i13|/             
    |---------->            |---------->          |---------->   
    0      1                0      1              0      1       
    ortho  parallel         ortho  parallel       ortho  parallel

    Block x Projection v1   Block x Proj. v2      Block x Projection v3
    sigma                   sigma                 sigma          
    ^                       ^                     ^              
    |      / b13            |      / b13          |      / b1    
    |     /                 |     /               |     /        
    |    /                  |    /                |    /         
  i2|___/___ b2             |   /                 |   /          
    |  /                    |  /                  |  /           
    | /                     | /                   | /            
 i13|/                  i123|/______ b2       i123|/______ b23   
    |                       |                     |              
    |---------->            |---------->          |---------->   
    0      1                0      1              0      1       
    ortho  parallel         ortho  parallel       ortho  parallel

    Block x Projection v4   Null Model
    sigma                   sigma          
    ^                       ^              
  i2|______                 |              
    | b2  /                 |              
    |    /                  |              
    |   / b13               |              
    |  /                i123|________      
    | /                     |  b123        
 i13|/                      |              
    |                       |              
    |---------->            |---------->   
    0      1                0      1       
    ortho  parallel         ortho  parallel

If I fitted a model without restrictions on intercepts and slopes I would have to compare the estimates I get for all of them to these hypothesis which describe the illustrations above in order to answer my research question:

Main Effect Projection:
    intercepts and slopes are fixed
    b1 = b2 = b3 > 0
    i1 = i2 = i3

Main Effect Block2:
    b1 = b2 = b3 (fixed slopes)
    i2 > i1, i3
    i3 - i1 = 0

Main Effects Projection + Block2:
    i2 > i1, i3
    b1 = b2 = b3

Block x Projection v1:
    i3 - i1 = 0
    i2 > i1, i3
    b3 - b1 = 0
    b3, b1 > 0
    b3 > i2 - i3
    b1 > i2 - i1
    b2 = 0

Block x Proj. v2:
    i1 = i2 = i3
    b1, b3 > 0
    b3 - b1 = 0
    b2 = 0

Block x Projection v3:
    i1 = i2 = i3
    b1 > 0
    b2 = b3 = 0

Block x Projection v4:
    i1 = i3
    i2 > i1, i3
    b2 = 0
    b1 = i2 - i1
    b3 = i2 - i3

Null Model:
    i1 = i2 = i3
    b1 = b2 = b3 = 0

One way I could imagine to do it quantitavely is to take the estimates as input to models with these restrictions and get the likelihoods. But the “intuitve” solution with bridge sampling appears more straight forward to me.
What do you think?

Thanks for the explanation, personally, I am not sure the data could answer the question like:“hey after comparing 8 different model here, we are sure that model X is the generative process”. And building a 8*8 comparison table of all the models seems a bit difficult to reason. The mixed effect model is more like: “we assume every conditions are different, and build a full model the encodes all the information, then look at the posterior to see if they are really all different (i.e., no overlapping)” - basically you plot the posterior and see which ASCII-diagrams for different hypotheses it fits the best.
But since you have strong theoretical motivation, it certainly sound like a reasonable approach.

I agree with your opinion about a less restrictive hierarchical model having the potential to reveal more about the underlying data. I’ve tried to work my way up to the full model after reading this informative Primer on Bayesian Methods for Multilevel Modeling, but I’m stuck early on after getting to the user-level and trying to add the block-level.
Using the dims keyword argument and coords seems more elegant than the indexing scheme from this post.

Example data with multilevel predictors

    user	experience condition  block rating	direction	projection
0	9	    0.0        A      	  1	    2.0     parallel	-1.058687
1	9	    0.0        A      	  1	    2.0     parallel	0.201144
2	9	    0.0        A      	  1	    2.0     parallel	-0.806420

experience is a user-level predictor and rating a block-level “predictor” (actually, rating is measured afterwards, so it’s correlated with block).

The indexing/design matrix still poses a problem to me. I don’t understand how to use the predictors when doing outcome, predictors = dmatrices("projection ~ condition * user * block * direction", df). The example didn’t utilize the design matrix, stating the glm() function does not play nice with hierarchical models yet.
This is an excerpt from what I get with aforementioned formula.

‘Intercept’ (column 0)
‘condition’ (columns 1:3)
‘user’ (columns 3:32)
‘block’ (columns 32:34)
‘direction’ (column 34)
‘condition:user’ (columns 35:93)
‘condition:block’ (columns 93:97)
‘user:block’ (columns 97:155)
‘condition:direction’ (columns 155:157)
‘user:direction’ (columns 157:186)
‘block:direction’ (columns 186:188)
‘condition:user:block’ (columns 188:304)
‘condition:user:direction’ (columns 304:362)
‘condition:block:direction’ (columns 362:366)
‘user:block:direction’ (columns 366:424)
‘condition:user:block:direction’ (columns 424:540)

From what I gathered from a tutorial in R about rat growth that @OriolAbril provided, in R this formula would be kind of this \sigma_{proj} \sim (condition * block * direction)/(1|trial) + (1|user), meaning trials are viewed as nested random effects, and users as random effects. It seems patsy doesn’t support nesting and random effects.

Following the primer I’ve come so far as to include users, as they are what counties are in the radon example. The difference is that I model the standard deviation sigma, not the mu. When trying to add rating as a pendent to uranium in the radon example, I ran into broadcasting issues.

Model with only direction x user:

conditions = df['condition'].cat.categories.tolist()
n_conditions = df['condition'].nunique()
condition_indices = df['condition'] # Condition index for every single value.
condition_lookup = pd.Series(*pd.factorize(df['condition'].cat.categories), name='condition_idx_lookup')

# We need to be able to index by user. But the user IDs of the sample don't start at 0. We create an index and a mapping between them.
users = df['user'].cat.categories.tolist()
n_users = df['user'].nunique()
user_indices = df['user']  # User index for every single value.
user_lookup = pd.Series(*pd.factorize(df['user'].cat.categories), name='user_idx_lookup')
# User level predictors. More options: age group, gender.
# experience is not on an interval scale (no equidistance between items), but ordinal.
experience = df.drop_duplicates(subset="user")['experience']  # Shape = user_indices

blocks = df['block'].cat.categories.tolist()
n_blocks = df['block'].nunique()  # Same as using cat.categories.size
block_indices = df['block'] # Block index for every single value.
block_lookup = pd.Series(*pd.factorize(df['block'].cat.categories), name='block_idx_lookup')
# Block level predictors.
# Note: Ratings are correlated with block index.
ratings = df.drop_duplicates(subset=["user", 'block'])['rating']  # Shape = block_indices

# Coded direction of projection: 0 = orthogonal, 1 = parallel.
directions = df['direction'].cat.categories.tolist()
n_directions = df['direction'].nunique()
direction_indices = df['direction']
direction_lookup = pd.Series(*pd.factorize(df['direction'].cat.categories), name='direction_idx_lookup')

coords = {'Direction': directions, 'User': users, 'Block': blocks, 'Condition': conditions, 'obs_id': np.arange(direction_indices

with pm.Model(coords=coords) as varying_intercept_slope_by_user:
    direction_idx = pm.Data('direction_idx', direction_indices, dims='obs_id')
    user_idx = pm.Data('user_idx', user_indices, dims='obs_id')

    # Hyperpriors:
    intercept = pm.Gamma('intercept', mu=4.0, sigma=2.0)
    intercept_sd = pm.Exponential('sigma_intercept', 1.0)
    slope = pm.Normal('slope', mu=0.0, sigma=1.0)
    slope_sd = pm.Exponential('sigma_slope', 0.5)

    # Varying intercepts:
    intercept_user = pm.Gamma('intercept_user', mu=intercept, sigma=intercept_sd, dims='User')
    # Varying slopes:
    slope_user = pm.Normal('slope_user', mu=slope, sigma=slope_sd, dims='User')

    # Expected deviation per user:
    theta = intercept_user[user_idx] + slope_user[user_idx] * direction_idx

    projection = pm.Normal('projection', 0.0, sigma=theta, observed=df['projection'], dims='obs_id')

How do I add the other levels? Condition would be like state in the radon example. I try, but I keep hitting walls. In the meantime I try again and share my progress. I guess I have to formulate linear expressions for each level and use those as inputs to the levels below.

I just realized that using the ratings (for blocks) as I did there isn’t going to work with how I defined blocks. Block indices are only 0,1,2, but each user gave an individual rating for each block leading to n_users*n_blocks ratings. If it was a block-level predictor, there would only be 3 values, right? I could make an index for each block (0-89), but then I wouldn’t be able to compare each first, second and third block across users.

I tried adding a new level, but I’m stuck with the broadcasting issue. I really don’t know what I’m supposed to do with the dims or shape keyword arguments and the slicing of the parameters. Unfortunately, I didn’t find a comprehensible documentation on how this works anywhere. All the examples just have models with 1 or 2 levels, nothing nested deeper than that.

What I’d like to get is an estimate for intercept and slope of projection variance for Blocks 1,2, and 3, while each user can have their own intercepts and slopes in each block as different baselines (random effect).

In addition to the above I tried with blocks (0,1,2) and an ID for each single block.

coords['Block_ID'] = block_ids  # 0..n_users*n_blocks

with pm.Model(coords=coords) as by_block_model:
    user_idx = pm.Data('user_idx', user_indices, dims='obs_id')
    block_id_idx = pm.Data('block_id_idx', block_id_indices, dims='obs_id')
    block_idx = pm.Data('block_idx', block_indices, dims='obs_id')
    direction_idx = pm.Data('direction_idx', direction_indices, dims='obs_id')
    experience = pm.Data('experience', gaming_exp, dims='User')
    rating = pm.Data('rating', ratings, dims='Block_ID')  # Block has indices 0,1,2. But ratings are size of n_users*n_blocks. Try block_id?
    # Hyperpriors:
    global_intercept = pm.Gamma('global_intercept', mu=4.0, sigma=2.0)
    global_intercept_sd = pm.Exponential('global_intercept_sd', 1.0)
    global_slope = pm.Normal('global_slope', mu=1.5, sigma=1.0)  # We expect the slope to be positive most of the time.
    global_slope_sd = pm.Exponential('global_slope_sd', 0.5)

    # Users.
    # Varying intercepts per user:
    intercept_user = pm.Gamma('intercept_user', mu=global_intercept, sigma=global_intercept_sd, dims='User')
    intercept_user_sd = pm.Exponential('intercept_user_sd', 1.0, dims='User')
    # Varying slopes per user:
    slope_user = pm.Normal('slope_user', mu=global_slope, sigma=global_slope_sd, dims='User')
    slope_user_sd = pm.Exponential('slope_user_sd', 1.0, dims='User')

    # Blocks.
    # Varying intercepts per block:
    intercept_block = pm.Gamma('intercept_block', mu=intercept_user, sigma=intercept_user_sd, dims='Block')  # FixMe: ValueError: operands could not be broadcast together with shapes (3,) (30,) 
    # Varying slopes per block:
    slope_block = pm.Normal('slope_block', mu=slope_user, sigma=slope_user_sd, dims='Block')

    # Expected deviation per block:
    theta = intercept_block[block_idx] + slope_block[block_idx] * direction_idx

    projection = pm.Normal('projection', 0.0, sigma=theta, observed=df['projection'], dims='obs_id')

Once I know how to add more levels, I’d like to cluster users by condition, which may have an effect on standard deviation.

I tried resolving the broadcasting issues by repeating the intercept_user and slope_user parameters.
block_id_idx 0, 1, 2 belong to user 0, block_id_idx 3, 4, 5 to user 1 and so on. intercept_user should then be like 0, 0, 0, 1, 1, 1, 2, 2, 2 etc.

import theano.tensor as tt

with pm.Model(coords=coords) as by_block_model:
    user_idx = pm.Data('user_idx', user_indices, dims='obs_id')
    block_id_idx = pm.Data('block_id_idx', block_id_indices, dims='obs_id')
    block_idx = pm.Data('block_idx', block_indices, dims='obs_id')
    direction_idx = pm.Data('direction_idx', direction_indices, dims='obs_id')
    experience = pm.Data('experience', gaming_exp, dims='User')
    rating = pm.Data('rating', ratings, dims='Block_ID')  # Block has indices 0,1,2. But ratings are size of n_users*n_blocks. Try block_id?
    # Hyperpriors:
    global_intercept = pm.Gamma('global_intercept', mu=4.0, sigma=2.0)
    global_intercept_sd = pm.Exponential('global_intercept_sd', 1.0)
    global_slope = pm.Normal('global_slope', mu=1.5, sigma=1.0)  # We expect the slope to be positive most of the time.
    global_slope_sd = pm.Exponential('global_slope_sd', 0.5)

    # Users.
    # Varying intercepts per user:
    intercept_user = tt.repeat(pm.Gamma('intercept_user', mu=global_intercept, sigma=global_intercept_sd, dims='User'), n_blocks)
    intercept_user_sd = tt.repeat(pm.Exponential('intercept_user_sd', 1.0, dims='User'), n_blocks)
    # Varying slopes per user:
    slope_user = tt.repeat(pm.Normal('slope_user', mu=global_slope, sigma=global_slope_sd, dims='User'), n_blocks)
    slope_user_sd = tt.repeat(pm.Exponential('slope_user_sd', 1.0, dims='User'), n_blocks)

    # Blocks.
    # Varying intercepts per block:
    intercept_block = pm.Gamma('intercept_block', mu=intercept_user, sigma=intercept_user_sd, dims='Block_ID')
    # Varying slopes per block:
    slope_block = pm.Normal('slope_block', mu=slope_user, sigma=slope_user_sd, dims='Block_ID')

    # Expected deviation per block:
    theta = intercept_block[block_idx] + slope_block[block_idx] * direction_idx

    projection = pm.Normal('projection', 0.0, sigma=theta, observed=df['projection'], dims='obs_id')

As you can see the model gets created now. There are a few Data containers floating around. I’m using the Block_ID (0…89) to create a slope and intercept for each single block in the data (30 users * block1, 30 * block2, 30 * block3). But since I want to estimate intercept and slope across users for each block (0,1,2), I used block_idx (0,1, or 2 for each data point) in the linear regression for the expected standard deviation in the data theta = intercept_block[block_idx] + slope_block[block_idx] * direction_idx. When using block_idx (either 0, 1, or 2) instead of block_id_idx (0…89), sampling time increases like 10x. Regardless of whether I choose block_idx or block_id_idx, I get every sampling error you can imagine: thousands of divergences, rhat > 1.05, acceptance probability, number of effective samples is smaller than 200.
If I used block_id_idx in the linear expression, is there a way to sample from the posterior using the block_idx? Like group every first, second and third block and get their group estimates?

How can I reparameterize the model? The between factor of condition isn’t even in there yet.
I just want estimates on the differences between standard deviations for blocks 1, 2 and 3, which each user does (within), but users are in different conditions (between). As a bonus it would be nice to know if experience (user-level) has an influence, or the individual rating of blocks. Doesn’t sound too hard.

I often read a common answer to some topics that everything should be straight forward to do in PyMC3. Unfortunately I don’t have a PhD in Bayesian Statistics and 15 years of computer science under the belt, so I appreciate any help! To me it’s everything but straight forward. I’m on a tight schedule, desperate and on the verge of going back to frequentist statistics. :slightly_frowning_face: The devil you know…

IIUC, you have condition, block, direction which potentially be different, and user could also display different strategy - in that case, I recommend you to try parsing the linear formula σproj∼(condition∗block∗direction*user) in patsy, and use indexing for trial (the random effect.)

In general, frequentist testing like ANOVA are GLMs, and GLMs are pretty straightforward to express in Bayesian - I would love to give you a bit more concrete suggestion, maybe it is easier you share a jupyter notebook with some simulation data?

That sounds about right. I don’t really care about trials, e.g. if all trials with number 1 differ or not. I did try to use patsy and got an outcome and a DesignMatrix. Unfortunately, I don’t have the slightest clue about what to do with the DesignMatrix object. I tried multiplying it with a distribution and only got an AssertionError. I also don’t know what to index where?

Yes, and you only have to drag some variables into the right slots in a GUI software most of the time. :sweat_smile:

That is very kind of you and much appreciated! I prepared a gist with some mocked up data. There’s not much of a model, because I don’t know how to use the DesignMatrix and indexing.

I still wasn’t able to figure out how to use patsy’s dmatrices, no clue. :frowning_face:

In the case of a model that isn’t using intercepts and slopes, but trying to model the differences in standard deviations directly in the way that was initially posted by @junpenglao, how do I merge the different priors, so I can use it with the observed variable? To be clear: there are several variables that define sigma for the different conditions/blocks/directions, but I have only 1 observed variable with only 1 sigma argument. How do I map the different sigma variables to their respective measurements?!

I’ve been trying to get this information for weeks now, and it doesn’t seem to be clearly explained anywhere? It’s not obvious to me. dims, coords and how indexing works aren’t well documented.

With the mock-up data in the gist of my previous post, how would I provide these different mu distributions as one to the observed variable?
For a variable that is dummy coded with 0 and 1, like direction in my case, I tried it this way:

with pm.Model(coords=coords) as proj_model:
    direction_idx = pm.Data('direction_idx', direction_indices, dims='obs_id')
    user_idx = pm.Data('user_idx', user_indices, dims='obs_id')

    # Hyperpriors:
    global_ortho = pm.Gamma('ortho_global', mu=3, sigma=10)
    global_ortho_sd = pm.HalfCauchy('ortho_global_sd', 5)
    global_diff = pm.Gamma('diff_global', mu=3, sigma=3)
    global_diff_sd = pm.HalfCauchy('diff_global_sd', 5)
    global_parallel = pm.Deterministic('parallel_global', global_ortho+global_diff)

    # Priors
    ortho_user = pm.Gamma('ortho_user', mu=global_ortho, sigma=global_ortho_sd, dims='User')
    diff_user = pm.Gamma('diff_user', mu=global_diff, sigma=global_diff_sd, dims='User')
    parallel_user = pm.Deterministic('parallel_user', ortho_user+diff_user)

    # Expected deviation per user:
    theta = ortho_user[user_idx] * (1-direction_idx) + parallel_user[user_idx] * direction_idx

    projection = pm.Normal('projection', 0.0, sigma=theta, observed=df['projection'], dims='obs_id')

Due to the multiplication with either 0 or 1, rows that have direction==0 (orthogonal), use the ortho_user variable as prior, and rows with direction==1 use the parallel_user variable.

But for a variable like block that goes from 0 to 2, I can’t simply multiply with the dummy coded variable, as it would multiply by 2. I tried to work around it this way:

with pm.Model(coords=coords) as block2_model:
    user_idx = pm.Data('user_idx', user_indices, dims='obs_id')
    block_idx = pm.Data('block_idx', block_indices, dims='obs_id')
    #block_id_idx = pm.Data('block_id_idx', block_id_indices, dims='obs_id')

    # Hyperpriors:
    global_blocks13 = pm.Gamma('blocks13_global', mu=5, sigma=10)
    global_blocks13_sd = pm.HalfCauchy('blocks13_global_sd', 5)
    global_diff = pm.Gamma('diff_global', mu=5, sigma=3)
    global_diff_sd = pm.HalfCauchy('diff_global_sd', 5)
    global_block2 = pm.Deterministic('block2_global', global_blocks13+global_diff)

    # Priors
    blocks13_user = pm.Gamma('blocks13_user', mu=global_blocks13, sigma=global_blocks13_sd, dims='User')
    diff_block2 = pm.Gamma('diff_block2', mu=global_diff, sigma=global_diff_sd, dims='User')
    block2_user = pm.Deterministic('block2_user', blocks13_user+diff_block2)

    # Expected deviation per user:
    theta = blocks13_user[user_idx] * (block_idx.eval() != 1) + block2_user[user_idx] * (block_idx.eval() == 1)

    projection = pm.Normal('projection', 0.0, sigma=theta, observed=df['projection'], dims='obs_id')

But, honestly, using eval() to get the values doesn’t seem like the appropriate way to do it?
And when I need to get more complex with interactions between blocks and direction, this will get very complicated. I guess, this is where the DesignMatrix should come in handy? But I don’t know where to put it and what to do with it.

It’s a lot of puzzle pieces that I don’t know how to fit together.

I found some resources that explain design matrices theoretically (video1, video2,wiki), and according to this stackexchange answer there’s a slight difference between the design matrix and the model matrix, (based on Montgomery, D. (2009). Design and Analysis of Experiments, 7th Edition. John Wiley & Sons Inc.).

This notebook makes use of patsy and PyMC3, but doesn’t explain what’s going on there. Another notebook on a sleep deprivation study is a bit more informative, because it shows how to do only random intercepts/slopes or both, but also doesn’t go into much detail about the why and how. It doesn’t provide an interpretation of the results, unfortunately.
A hierarchical model from this thread is using some indexing scheme that I don’t quite grasp. And just to collect some more useful information, here’s a thread that further discusses hierarchical models.
There seems to be a lack of a step-by-step in-depth guide with best practices on HLMs in PyMC3 that’s beginner-friendly. Mostly it’s just notebooks replicating some calculations without much explanation.

Based on the sleep deprivation study notebook I tried applying it to my case.

# Fixed effects. Commonly denoted as X.
X = patsy.dmatrix('1 + block * direction', samples, return_type='dataframe')
X = np.asarray(X)

# Design matrix for the random effects. Intercept and slope are modelled separately to have more control on the prior.
Z_intercept = patsy.dmatrix('0 + user', data=samples, return_type='dataframe')
Z_intercept = np.asarray(Z_intercept)
Z_slope = patsy.dmatrix('0 + user:direction', data=samples, return_type='dataframe')
Z_slope = np.asarray(Z_slope)
Z = np.concatenate((Z_intercept, Z_slope), axis=1)
Y = np.asarray(samples['log_projection_sq'])

with pm.Model(coords=coords) as hlm_model:
    ## Fixed effect
    beta_X_intercept = pm.Normal('beta_X_ortho', mu=0, sd=10) # contrain it to positive values
    beta_X_slope_b2 = pm.Normal('beta_X_b2_ortho_offset', mu=0, sd=10)
    beta_X_slope_b3 = pm.Normal('beta_X_b3_ortho_offset', mu=0, sd=10)
    beta_X_slope_para = pm.Normal('beta_X_parallel_offset', mu=0, sd=10)
    beta_X_slope_b2_para = pm.Normal('beta_X_b2_parallel_offset', mu=0, sd=10)
    beta_X_slope_b3_para = pm.Normal('beta_X_b3_parallel_offset', mu=0, sd=10)
    beta_X = tt.stack([beta_X_intercept,
    estimate_X =, beta_X)
    ## Random effect
    sigma_Z_intercept = pm.HalfNormal('sigma_Z_intercept', sd=10)
    gamma_Z_intercept_raw = pm.Normal('gamma_Z_intercept_raw', mu=0, sd=1, shape=Z_intercept.shape[1])
    gamma_Z_intercept = pm.Deterministic('gamma_Z_intercept', gamma_Z_intercept_raw * sigma_Z_intercept)

    sigma_Z_slope = pm.HalfNormal('sigma_Z_slope', sd=10)
    gamma_Z_slope_raw = pm.Normal('gamma_Z_slope_raw', mu=0, sd=1, shape=Z_slope.shape[1])
    gamma_Z_slope = pm.Deterministic('gamma_Z_slope', gamma_Z_slope_raw * sigma_Z_slope)
    estimate_Z =, gamma_Z_intercept) +, gamma_Z_slope) 
    ## likelihood
    mu_estimate = pm.Deterministic('mu_estimate', estimate_X + estimate_Z)
    sigma_unexplained = pm.HalfNormal('sigma_unexplained', sd=10) # unexplained variability
    y_likelihood = pm.Normal('y_likelihood', mu=mu_estimate, sd=sigma_unexplained, observed=Y)

Sampling was slow and I got all kinds of errors:

There were 19 divergences after tuning. Increase target_accept or reparameterize.
There were 55 divergences after tuning. Increase target_accept or reparameterize.
There were 156 divergences after tuning. Increase target_accept or reparameterize.
The acceptance probability does not match the target. It is 0.6974308250411304, but should be close to 0.8. Try to increase the number of tuning steps.
There were 18 divergences after tuning. Increase target_accept or reparameterize.
The number of effective samples is smaller than 10% for some parameters.

I’m also not clear on how to interpret the estimates when using design matrices. Are all the slopes offsets to the orthogonal projection in block 1 (intercept)? Or is ‘beta_X_b2_parallel_offset’, for example, the difference to ‘beta_X_parallel_offset’?
Also in order to stack all the slopes I had to write them down one-by-one, because I couldn’t stack tensors with different amount of columns, e.g. if I had one beta_X_slope with shape 6. If I added condition and more interactions this would get repetitive quite quickly. Is there a vectorized way for this?

Obviously this wasn’t the best attempt. How can it be improved? How can the coords be utilized?

This thread has become quite messy with my many confusions, so that it’s probably difficult to find useful information in it, but I add to my journey into the Bayesian abyss anyway in the hopes it may help somebody else.

In the meantime I found a comment on a blog post about the indexing syntax to which @twiecki replied and explained what the indexing syntax does. Say, you got something like:

with pm.Model() as hierarchical_model:
    # Intercept for each county, distributed around group mean mu_a
    a = pm.Normal('alpha', mu=mu_a, sigma=sigma_a, shape=len(data.county.unique()))    
    # Expected value
    radon_est = a[county_idx]

You have a – a vector with one distribution per county. Your data however, has multiple values per county. So what you want to do is, create a version of a that has the same length as your data and associates each data point with the right county-distribution.
For example, say we have 2 counties each with 3 data points. Then you have a = [county_1, county_2], to then associate that to your data you could do: a[0, 0, 0, 1, 1, 1] to get the right vector of length 6 that syncs up correctly to your data.