Model Comparison: How to constrain a model to positive mu-differences?

I needed some time to go through the resources you both provided, dust off my knowledge of HLMs (mixed models), and think about how this applies to my actual problem. I'll now discuss the two different approaches: one is the mixed model, the other is what I call the intuitive approach.

First I want to stress and illustrate again what my research question is. There are several hypotheses by different authors about how the variances of the data should differ between blocks and directions of projection.
How the data is nested:

3 between-conditions
  N users (N varies between conditions)
    3 within-conditions: blocks
        X trials (varies per block and user)
            2 outcome directions: parallel, orthogonal
                outcome: projection (scalar)

The question of interest is: Which prediction about the variance structure in the data is more probable?
To illustrate this I present again some ASCII diagrams for the different hypotheses. This first set of illustrations makes it easy to think about the data in terms of its collection process (one block after the other, 2 outcomes). These only show block and direction as independent categorical variables and the standard deviation as the dependent variable, neglecting the between-factor and different users.

Main Effect Projection  Main Effect Block2    Main Effects Projection + Block2
sigma                   sigma                 sigma
^                       ^                     ^     o        o - parallel
|                       |                     |    / \       x - orthogonal
| o__o__o               |     o               |   /   \
|                       |    /x\              |  /  x  \
|                       |   // \\             | o  / \  o
|                       |  //   \\            |   /   \
| x__x__x               | //     \\           |  /     \
|                       |ox       ox          | x       x
|----------> block      |------------> block  |------------> block
  1  2  3                 1   2   3            1   2   3

Block x Projection v1   Block x Proj. v2      Block x Projection v3
sigma                   sigma                 sigma
^                       ^                     ^
|o       o              | o       o           | o
| \     /               |  \     /            |  \
|  \   /                |   \   /             |   \
|   \o/                 |    \ /              |    \
|   /x\                 |     V               |     o___o
|  /   \                |     o               |
| x     x               | x___x___x           | x___x___x
|----------> block      |------------> block  |------------> block
  1  2  3                 1   2   3             1   2   3

Block x Projection v4   Null Model      
sigma                   sigma                 
^                       ^                     
|                       |            
|                       |            
| o__o__o               |            
|    x                  |            
|   / \                 | xo__xo__xo 
|  /   \                |            
| x     x               |            
|----------> block      |------------> block  
  1  2  3                 1   2   3           

With multiple users (e.g. 3), the data may look like this:

Block2 x Projection v1  Block2 x Proj. v2     Main Effects Block2 + Projection
sigma                   sigma                 sigma o
^                       ^                     ^
|                       |                     |     oo         o - parallel
|                       |         o           |     o          x - orthogonal
| oo      o             | oo      o           | oo      o
| o       oo            | o       o           |         oo
|     oo                |                     | o   xx
|     xox               |                     |     x
| x   x                 | x   oo  xx          | x
| xx      xxx           | xx  xxx x           | xx      xxx
|------------> block    |------------> block  |------------> block
  1   2   3               1   2   3             1   2   3

These would be intuitive to formulate as separate models with restrictions on the average sigma per block x direction combination.

So let’s look at how I would go about the intuitive approach. Here’s a code example with 3 models and 3 users behaving differently.

# %%
import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc3 as pm
import seaborn as sns


#################################
#          Dummy Data           #
#################################
# %%
# Create some data conforming to the null model: all deviations are on average equal across blocks.
n_trials = 90 - np.random.randint(36)
df0 = pd.DataFrame({'user': 0, 'block': 1 + np.random.randint(3, size=n_trials),
                    'parallel': np.random.normal(0.0, 5.5, size=n_trials),
                    'orthogonal': np.random.normal(0.0, 5.5, size=n_trials)})
# Make sure mean of each block is zero as with real data.
df0[['parallel', 'orthogonal']] = df0.groupby('block')[['parallel', 'orthogonal']].transform(lambda x: x - x.mean())

# %%
# Create some data conforming to main effect of projection model: variance in parallel > orthogonal.
n_trials = 90 - np.random.randint(36)
df1 = pd.DataFrame({'user': 1, 'block': 1 + np.random.randint(3, size=n_trials),
                    'parallel': np.random.normal(0.0, 7.0, size=n_trials),
                    'orthogonal': np.random.normal(0.0, 3.3, size=n_trials)})
# Make sure mean of each block is zero as with real data.
df1[['parallel', 'orthogonal']] = df1.groupby('block')[['parallel', 'orthogonal']].transform(lambda x: x - x.mean())

# %%
# Create some data conforming to interaction effect version 1:
# blocks 1&3 have smallest orthogonal deviations, but largest parallel deviations. Strong synergy.
n_trials = 30 - np.random.randint(12, size=3)
block1 = pd.DataFrame({'user': 2, 'block': 1,
                       'parallel': np.random.normal(0.0, 7.0, size=n_trials[0]),
                       'orthogonal': np.random.normal(0.0, 3.0, size=n_trials[0])})
block3 = pd.DataFrame({'user': 2, 'block': 3,
                       'parallel': np.random.normal(0.0, 7.0, size=n_trials[1]),
                       'orthogonal': np.random.normal(0.0, 3.0, size=n_trials[1])})
# Block 2 has smaller parallel deviations than blocks 1&3, but higher orthogonal deviations than blocks 1&3.
# In this case sigma_block2_parallel - sigma_block2_orthogonal = 0
# No synergy.
block2 = pd.DataFrame({'user': 2, 'block': 2,
                       'parallel': np.random.normal(0.0, 5.0, size=n_trials[2]),
                       'orthogonal': np.random.normal(0.0, 5.0, size=n_trials[2])})
df2 = pd.concat((block1, block2, block3), axis='index')
df2[['parallel', 'orthogonal']] = df2.groupby('block')[['parallel', 'orthogonal']].transform(lambda x: x - x.mean())

# %%
df = pd.concat((df0, df1, df2), axis='index')

# %%
df[['user', 'block']] = df[['user', 'block']].astype('category')
df.sort_values(by=['user', 'block'], inplace=True)

# %%
# Reshape data into long format.
samples = df.melt(id_vars=['user', 'block'], value_vars=['parallel', 'orthogonal'],
                  var_name='direction', value_name='projection')
# Dimensions.
n_users = samples['user'].nunique()
n_blocks = samples['block'].nunique()
n_directions = samples['direction'].nunique()

# %%
# Visualize the data.
g = sns.FacetGrid(samples, col='block', row='user', hue="direction")
g.despine(offset=10)
g = (g.map(sns.distplot, "projection").add_legend(title="Projection", loc='upper right'))
for ax in g.axes.flat:
    ax.set_ylabel("Probability Density")
    ax.set_xlabel("Projection Length")
plt.tight_layout()
plt.show()

# %%
#################################
#          Null Model           #
#################################
with pm.Model() as null_model:
    null_model.name = "Null Model"
    sigma = pm.Gamma("sigma", alpha=2.2, beta=0.4, shape=n_users)

    obs = pm.Normal(f"obs", mu=0.0, sigma=sigma[samples['user']], observed=samples['projection'])

# %%
pm.model_to_graphviz(null_model)

# %%
with null_model:
    idata0 = pm.sample(1000, tune=800, return_inferencedata=True)
az.summary(idata0, var_names=["Null Model_sigma"])
az.loo(idata0)

# %%
#################################
# Main Effect Projection Model  #
#################################
with pm.Model() as projection_model:
    projection_model.name = "Main Effect Projection"
    sigma_ortho = pm.Gamma("sigma_ortho", mu=3.34, sigma=1.44, shape=n_users)
    # Adding a gamma-distributed (hence strictly positive) difference
    # constrains sigma_parallel > sigma_ortho.
    sigma_diff = pm.Gamma('sigma_diff', mu=4.0, sigma=5.0, shape=n_users)
    sigma_parallel = pm.Deterministic("sigma_parallel", sigma_ortho + sigma_diff)

    obs_parallel = pm.Normal("obs_parallel", mu=0.0, 
                             sigma=sigma_parallel[samples.loc[samples['direction'] == 'parallel', 'user']], 
                             observed=samples.loc[samples['direction'] == 'parallel', 'projection'])
    obs_orthogonal = pm.Normal("obs_orthogonal", mu=0.0,
                               sigma=sigma_ortho[samples.loc[samples['direction'] == 'orthogonal', 'user']], 
                               observed=samples.loc[samples['direction'] == 'orthogonal', 'projection'])

# %%
pm.model_to_graphviz(projection_model)

# %%
with projection_model:
    idata1 = pm.sample(1000, tune=1000, return_inferencedata=True)
az.summary(idata1, var_names=["Main Effect Projection_sigma"], filter_vars='like')

# %%
#idata1.log_likelihood['projection_diff'] = (idata1.log_likelihood['Main Effect Projection_obs_parallel'] 
#                                            - idata1.log_likelihood['Main Effect Projection_obs_orthogonal'])
#az.loo(idata1, var_name='projection_diff')

# %%
##########################################
# Main Effects Block2 + Projection Model #
##########################################
with pm.Model() as both_main_effects_model:
    both_main_effects_model.name = "2 Main Effects"
    sigma_ortho13 = pm.Gamma("sigma_ortho_blocks_1_3", mu=3.34, sigma=1.44, shape=n_users)
    # Adding gamma-distributed (hence strictly positive) differences
    # constrains the larger sigmas to exceed the smaller ones.
    sigma_diff_proj = pm.Gamma('sigma_diff_projection', alpha=4., beta=1.0, shape=n_users)
    sigma_diff_block2 = pm.Gamma('sigma_diff_block2', alpha=1., beta=1.0, shape=n_users)

    sigma_ortho2 = pm.Deterministic('sigma_ortho_block_2', sigma_ortho13 + sigma_diff_block2)
    sigma_parallel13 = pm.Deterministic('sigma_parallel_blocks_1_3', sigma_ortho13 + sigma_diff_proj)
    sigma_parallel2 = pm.Deterministic('sigma_parallel_block_2', sigma_parallel13 + sigma_diff_block2)

    data_parallel13 = samples.query('`direction`=="parallel" & `block` in [1, 3]')
    data_ortho13 = samples.query('`direction`=="orthogonal" & `block` in [1, 3]')
    data_parallel2 = samples.query('`direction`=="parallel" & `block`==2')
    data_ortho2 = samples.query('`direction`=="orthogonal" & `block`==2')

    obs_parallel13 = pm.Normal("obs_blocks_1_3_parallel", mu=0.0, sigma=sigma_parallel13[data_parallel13['user']], 
                               observed=data_parallel13['projection'])
    obs_parallel2 = pm.Normal("obs_block2_parallel", mu=0.0, sigma=sigma_parallel2[data_parallel2['user']], 
                              observed=data_parallel2['projection'])
    obs_orthogonal13 = pm.Normal("obs_blocks_1_3_orthogonal", mu=0.0, sigma=sigma_ortho13[data_ortho13['user']], 
                                 observed=data_ortho13['projection'])
    obs_orthogonal2 = pm.Normal("obs_block2_orthogonal", mu=0.0, sigma=sigma_ortho2[data_ortho2['user']], 
                                 observed=data_ortho2['projection'])

# %%
pm.model_to_graphviz(both_main_effects_model)

# %%
with both_main_effects_model:
    idata2 = pm.sample(1000, tune=1000, return_inferencedata=True)
az.summary(idata2, var_names=["2 Main Effects_sigma_diff"], filter_vars='like')

What's missing in these models is partial pooling, but I guess this could easily be added by putting more priors on top; see the sketch below. Getting from these models to an answer to the research question is where the bridge sampling should come into play. I still have to figure out how to get the traces and the model into the vectorized form suited for the marginal_llk function.
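For instance, a partially pooled version of the Main Effect Projection model might look roughly like this — a minimal sketch in which the hyperprior values are placeholders I made up, and the likelihood terms stay exactly as in projection_model above:

# %%
# Sketch of partial pooling: per-user sigmas are drawn from population-level
# hyperpriors instead of being estimated independently.
with pm.Model() as pooled_projection_model:
    mu_ortho = pm.Gamma("mu_ortho", mu=3.5, sigma=1.5)    # population mean of sigma_ortho
    sd_ortho = pm.HalfNormal("sd_ortho", sigma=1.0)       # between-user spread
    sigma_ortho = pm.Gamma("sigma_ortho", mu=mu_ortho, sigma=sd_ortho, shape=n_users)

    mu_diff = pm.Gamma("mu_diff", mu=4.0, sigma=2.0)
    sd_diff = pm.HalfNormal("sd_diff", sigma=1.0)
    sigma_diff = pm.Gamma("sigma_diff", mu=mu_diff, sigma=sd_diff, shape=n_users)
    sigma_parallel = pm.Deterministic("sigma_parallel", sigma_ortho + sigma_diff)
    # ... likelihood terms exactly as in projection_model above.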

The mixed-model approach does have advantages, like partial pooling and the ability to include both within- and between-conditions. The disadvantage is that it is less intuitive to me for the research question at hand, and fitting one model without restrictions also takes me only halfway to what I really want to know.

Let me illustrate the same predictions from above as regression models. The i stands for an intercept and the b for a slope; i13, for example, abbreviates a shared intercept for blocks 1 and 3. The orthogonal direction is coded as 0 and parallel as 1.

    Main Effect Projection  Main Effect Block2    Main Effects Projection + Block2
    sigma                   sigma                 sigma  / b2    
    ^                       ^                     ^     /        
    |      / b123           |                     |    /         
    |     /               i2|_________b2          |   /  / b13   
    |    /                  |                     |  /  /        
    |   /                   |                     | /  /         
    |  /                    |                   i2|/  /          
    | /                     |                     |  /           
i123|/                   i13|_________b13         | /            
    |                       |                  i13|/             
    |---------->            |---------->          |---------->   
    0      1                0      1              0      1       
    ortho  parallel         ortho  parallel       ortho  parallel

    Block x Projection v1   Block x Proj. v2      Block x Projection v3
    sigma                   sigma                 sigma          
    ^                       ^                     ^              
    |      / b13            |      / b13          |      / b1    
    |     /                 |     /               |     /        
    |    /                  |    /                |    /         
  i2|___/___ b2             |   /                 |   /          
    |  /                    |  /                  |  /           
    | /                     | /                   | /            
 i13|/                  i123|/______ b2       i123|/______ b23   
    |                       |                     |              
    |---------->            |---------->          |---------->   
    0      1                0      1              0      1       
    ortho  parallel         ortho  parallel       ortho  parallel

    Block x Projection v4   Null Model
    sigma                   sigma          
    ^                       ^              
  i2|______                 |              
    | b2  /                 |              
    |    /                  |              
    |   / b13               |              
    |  /                i123|________      
    | /                     |  b123        
 i13|/                      |              
    |                       |              
    |---------->            |---------->   
    0      1                0      1       
    ortho  parallel         ortho  parallel

If I fitted a model without restrictions on intercepts and slopes (see the sketch after this list), I would have to compare the estimates I get for all of them against these hypotheses, which describe the illustrations above, in order to answer my research question:

Main Effect Projection:
    intercepts and slopes are fixed
    b1 = b2 = b3 > 0
    i1 = i2 = i3

Main Effect Block2:
    b1 = b2 = b3 (fixed slopes)
    i2 > i1, i3
    i3 - i1 = 0

Main Effects Projection + Block2:
    i2 > i1, i3
    b1 = b2 = b3

Block x Projection v1:
    i3 - i1 = 0
    i2 > i1, i3
    b3 - b1 = 0
    b3, b1 > 0
    b3 > i2 - i3
    b1 > i2 - i1
    b2 = 0

Block x Proj. v2:
    i1 = i2 = i3
    b1, b3 > 0
    b3 - b1 = 0
    b2 = 0

Block x Projection v3:
    i1 = i2 = i3
    b1 > 0
    b2 = b3 = 0

Block x Projection v4:
    i1 = i3
    i2 > i1, i3
    b2 = 0
    b1 = i2 - i1
    b3 = i2 - i3

Null Model:
    i1 = i2 = i3
    b1 = b2 = b3 = 0
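
For reference, here is what such an unrestricted model could look like — a minimal sketch that ignores users and the between-factor, and assumes a log link to keep sigma positive (that link choice is mine, not something established above):

# %%
# Sketch of the unrestricted variance regression: one intercept and one
# slope per block on the log scale of sigma.
block_idx = samples['block'].cat.codes.values
is_parallel = (samples['direction'] == 'parallel').values.astype(int)
with pm.Model() as unrestricted_model:
    intercept = pm.Normal("intercept", mu=1.5, sigma=1.0, shape=n_blocks)  # log sigma, orthogonal
    slope = pm.Normal("slope", mu=0.0, sigma=1.0, shape=n_blocks)          # added for parallel
    log_sigma = intercept[block_idx] + slope[block_idx] * is_parallel
    obs = pm.Normal("obs", mu=0.0, sigma=pm.math.exp(log_sigma),
                    observed=samples['projection'].values)

The posterior draws for intercept and slope would then have to be checked against the equality and sign constraints listed above.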

One way I could imagine doing this quantitatively is to take the estimates as input to models with these restrictions and get the likelihoods. But the "intuitive" solution with bridge sampling appears more straightforward to me.
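Alternatively, if each model is rewritten with a single likelihood, the candidates could also be ranked by out-of-sample predictive accuracy with az.compare instead of marginal likelihoods — a minimal sketch for the Main Effect Projection model (not the bridge-sampling route, just a possible sanity check):

# %%
# Sketch: same Main Effect Projection model, but with one observed variable,
# so az.loo / az.compare apply directly.
user_idx = samples['user'].cat.codes.values
is_parallel = (samples['direction'] == 'parallel').values.astype(int)
with pm.Model() as projection_model_single:
    sigma_ortho = pm.Gamma("sigma_ortho", mu=3.34, sigma=1.44, shape=n_users)
    sigma_diff = pm.Gamma("sigma_diff", mu=4.0, sigma=5.0, shape=n_users)
    # Per-trial sigma: orthogonal trials get sigma_ortho,
    # parallel trials get sigma_ortho + sigma_diff.
    sigma = sigma_ortho[user_idx] + is_parallel * sigma_diff[user_idx]
    obs = pm.Normal("obs", mu=0.0, sigma=sigma, observed=samples['projection'].values)
    idata1_single = pm.sample(1000, tune=1000, return_inferencedata=True)

az.compare({"Null Model": idata0, "Main Effect Projection": idata1_single}, ic="loo")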
What do you think?