Model Comparison: How to constrain a model to positive mu-differences?

Hi! I’m new to PyMC3 and bayesian modelling in general. I’m not even sure my question is appropiate to my problem. I’m not sure about my modelling choices. So please be a bit patient with me, thanks!
There’s a code example following the explanations.

Background:
I’ve got data from an experiment with a repeated measures/within-subject design (the real data is from a mixed design, though the between factor doesn’t concern me right now). Each participant does 3 consecutive blocks of 30 trials. Due to outlier removal and missing data the resulting blocks do not have the same length. The data are lengths of projections of 2D points onto a vector (called parallel in the data) and onto its orthogonal vector. I’m interested in the difference in variance between these projection directions parallel (\parallel) and orthogonal (\perp).

Goal:
I’ve done the classic ANOVA approach (on synergy indices as a measure of the ratio between projections in an uncontrolled manifold analysis, for those who are interested), but I’d like to learn from the data how plausible certain theoretical models about the structure of variance are, given the observed data. My goal is to receive this kind of comparison between models. I used @junpenglao’s bridge sampling implementation, but didn’t parameterize the models very well.

Approach:
The mean of the projections is zero in each block. Therefore, the mean of the squared projections is the variance. So I could compare mean squared projections. Squared projections are all positive values, but there are a few extremes, the variance seems to increase with the mean, the squared data are skewed to the right, and the range of values is large. These were indicators to me that I should log-transform the squared data.

Here I’ll start with 2 simple models: The null model which proposes all data is generated by the same distribution. And a model for a main effect of projection direction (parallel > orthogonal). To come back to my question: How do I make sure that for the main effect model the estimated \mu_\parallel > \mu_\perp ?

The difference in the means of transformed projections between participants is so large that there’s overlap in my priors for \mu_\parallel & \mu_\perp . So it could happen that the model generates 2 almost identical \mu_\parallel & \mu_\perp, when in theory I want the likelihood of the data, given that \mu_\parallel > \mu_\perp, and subsequently P(\mu_\parallel > \mu_\perp|D) vs. P(\mu_\parallel = \mu_\perp|D).
In my mind I would need to standardize the data somehow, so I can make the probabilities for \mu_\parallel and \mu_\perp non-overlapping? Or somehow model the difference between the 2 to be at least > 0? You see, I’m a bit confused about it.

# %%
import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc3 as pm
import seaborn as sns


# %%
# Create some data conforming to null model: all deviations on average equal across blocks.
n_trials = 90 - np.random.randint(36)
df0 = pd.DataFrame({'user': 0, 'block': 1 + np.random.randint(3, size=n_trials),
                    'parallel': np.random.normal(0.0, 4.0, size=n_trials),
                    'orthogonal': np.random.normal(0.0, 4.0, size=n_trials)})
# Make sure mean of each block is zero as with real data.
df0[['parallel', 'orthogonal']] = df0.groupby('block')[['parallel', 'orthogonal']].transform(lambda x: x - x.mean())

# %%
# Create some data conforming to main effect of projection model: variance in parallel > orthogonal.
n_trials = 90 - np.random.randint(36)
df1 = pd.DataFrame({'user': 1, 'block': 1 + np.random.randint(3, size=n_trials),
                    'parallel': np.random.normal(0.0, 5.0, size=n_trials),
                    'orthogonal': np.random.normal(0.0, 1.0, size=n_trials)})
# Make sure mean of each block is zero as with real data.
df1[['parallel', 'orthogonal']] = df1.groupby('block')[['parallel', 'orthogonal']].transform(lambda x: x - x.mean())

# %%
df = pd.concat((df0, df1), axis='index')

# %%
df[['user', 'block']] = df[['user', 'block']].astype('category')
df.sort_values(by=['user', 'block'], inplace=True)

# log-transform squared deviations.
df[['parallel', 'orthogonal']] = df[['parallel', 'orthogonal']].transform('square').transform('log')


# %%
# Vizualize the tranformed data.
samples = df.melt(id_vars='user', value_vars=['parallel', 'orthogonal'], var_name='projection', value_name='ln(proj_sq)')
g = sns.FacetGrid(samples, col='user', hue="projection")
g.despine(offset=10)
g = (g.map(sns.distplot, "ln(proj_sq)").add_legend(title="Projection", loc='upper right'))
for ax in g.axes.flat:
    ax.set_ylabel("Probability Density")
    ax.set_xlabel("Ln(Squared Projection length)")
plt.tight_layout()
plt.show()

# %%
# Massage data into single distributions per user. Each column holds all data for 1 user.
# Since not all users have the same amount of data, we will have missing data at the ends of the columns.
null_data = df.melt(id_vars='user', value_vars=['parallel', 'orthogonal'], var_name='projection', value_name='ln(proj_sq)').drop('projection', axis='columns')
null_data['idx'] = null_data.groupby('user').cumcount()
null_data = null_data.pivot(index='idx', columns='user', values='ln(proj_sq)')
n_users = df['user'].nunique()

# %%
with pm.Model() as null_model:
    null_model.name = "Null Model"
    mu = pm.Normal("mu", mu=0.8, sigma=1.25, shape=n_users)
    sigma = pm.Gamma("sigma", alpha=25.5, beta=9, shape=n_users)

    obs = pm.Normal(f"obs", mu=mu, sigma=sigma, shape=df['user'].nunique(), observed=null_data)

# %%
with null_model:
    idata0 = pm.sample(1000, tune=800, return_inferencedata=True)

# %%
az.summary(idata0, var_names=["Null Model_mu", "Null Model_sigma"])

# %%
# Massage data into single distributions per user and projection direction.
proj_effect_data = df.loc[:, ['user', 'parallel', 'orthogonal']]
proj_effect_data['idx'] = proj_effect_data.groupby('user').cumcount()
proj_effect_data = proj_effect_data.pivot(index='idx', columns='user', values=['parallel', 'orthogonal'])

# %%
with pm.Model() as projection_model:
    projection_model.name = "Main Effect Projection"
    mu_ortho = pm.Normal("mu_ortho", mu=0.09, sigma=2.55, shape=n_users)
    mu_parallel = pm.Normal("mu_parallel", mu=1.5, sigma=2.5, shape=n_users)
    sigma = pm.Gamma("sigma", mu=2.65, sigma=0.25, shape=n_users)
    
    obs_parallel = pm.Normal(f"obs_parallel", mu=mu_parallel, sigma=sigma, shape=n_users, 
                                observed=proj_effect_data['parallel'])
    obs_orthogonal = pm.Normal(f"obs_orthogonal", mu=mu_ortho, sigma=sigma, shape=n_users, 
                                observed=proj_effect_data['orthogonal'])

    diff = pm.Deterministic('mu_diff', mu_parallel - mu_ortho)

# %%
with projection_model:
    idata1 = pm.sample(1000, tune=800, return_inferencedata=True)

# %%
az.summary(idata1, var_names=["Main Effect Projection_mu", "Main Effect Projection_sigma"], filter_vars='like')

There are more models I want to evaluate for main effects of block and interactions (e.g. only parallel projection variance is smaller in block 2), so I need to understand the basics.

Previous approach:
Before I decided to log-transform the data, \mu_\perp and a \Delta\mu were modelled using gamma distributions and obs_parallel had \mu_\perp + \Delta\mu as its mu-parameter to construct \mu_\parallel.

I also created the models for each participant to feed the result to the bridge sampling. Then I found out I could vectorize the model. So now I would need to figure out, if and how the bridge sampling works with the vectorized form. The whole project can be found here.

1 Like

To answer your question:

You can model the differences as positive value, and add it to mu_ortho:

with pm.Model() ...
    ...
    mu_ortho = pm.Normal("mu_ortho", mu=0.09, sigma=2.55, shape=n_users)
    mu_diff = pm.HalfNormal("mu_diff", 1., shape=n_users)
    mu_parallel = pm.Deterministic("mu_parallel", mu_ortho+mu_diff)

However, instead of computing the variance of the 2 conditions and log transforming them, I would prefer modeling the single trial - in that case, you can give informative prior to the variance, and works much better for single subject modeling. In addition, I wouldnt even restrict parallel > orthogonal, and just look at the posterior samples count how many parallel > orthogonal

Also, for model comparison, generally LOO-CV based method is better.

2 Likes

Thanks for your help! I believe your answer is the solution to the question I had. Thanks for your recommendations on the modelling as well. As I mentioned I wasn’t sure about the model choices and methods to use. If you don’t mind, I have some follow-up questions.

I’m not sure what you mean by modeling single trials. Having a probability distribution for each single trial? The observed data I produce in the example below (and above) contains up to 90 trials per participant (trial count differs per block and participant). My previous code was transforming the single trials so their mean would give the variance.
Having a random variable for the standard deviation would give me what I need, right? Or did you mean to rather do the modelling for the untransformed projections? In that case I changed the example to the one below with untransformed projections. Though I’m not sure a normal distribution is the right fit for the real data. I think a Shapiro-Wilk test indicated non-normality.

short version:

with pm.Model() as projection_model:
    projection_model.name = "Main Effect Projection"
    sigma_ortho = pm.Gamma("sigma_ortho", mu=3.34, sigma=1.44, shape=n_users)
    sigma_parallel = pm.Gamma("sigma_parallel", mu=7.0, sigma=3.17, shape=n_users)
    # Alternatively: The sum of two gamma-distributed variables is gamma-distributed.
    #sigma_diff = pm.Gamma('sigma_diff', mu=4.0, sigma=5.0, shape=n_users)
    #sigma_parallel = pm.Deterministic("sigma_parallel", sigma_ortho + sigma_diff)
    
    obs_parallel = pm.Normal(f"obs_parallel", mu=0.0, sigma=sigma_parallel, shape=n_users, observed=proj_effect_data['parallel'])
    obs_orthogonal = pm.Normal(f"obs_orthogonal", mu=0.0, sigma=sigma_ortho, shape=n_users, observed=proj_effect_data['orthogonal'])
    sigma_diff = pm.Deterministic('sigma_diff', sigma_parallel - sigma_ortho)

full example:

# %%
import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc3 as pm
import seaborn as sns


# %%
# Create some data conforming to null model: all deviations on average equal across blocks.
n_trials = 90 - np.random.randint(36)
df0 = pd.DataFrame({'user': 0, 'block': 1 + np.random.randint(3, size=n_trials),
                    'parallel': np.random.normal(0.0, 5.5, size=n_trials),
                    'orthogonal': np.random.normal(0.0, 5.5, size=n_trials)})
# Make sure mean of each block is zero as with real data.
df0[['parallel', 'orthogonal']] = df0.groupby('block')[['parallel', 'orthogonal']].transform(lambda x: x - x.mean())

# %%
# Create some data conforming to main effect of projection model: variance in parallel > orthogonal.
n_trials = 90 - np.random.randint(36)
df1 = pd.DataFrame({'user': 1, 'block': 1 + np.random.randint(3, size=n_trials),
                    'parallel': np.random.normal(0.0, 7.0, size=n_trials),
                    'orthogonal': np.random.normal(0.0, 3.3, size=n_trials)})
# Make sure mean of each block is zero as with real data.
df1[['parallel', 'orthogonal']] = df1.groupby('block')[['parallel', 'orthogonal']].transform(lambda x: x - x.mean())

# %%
df = pd.concat((df0, df1), axis='index')

# %%
df[['user', 'block']] = df[['user', 'block']].astype('category')
df.sort_values(by=['user', 'block'], inplace=True)

# %%
# Vizualize the tranformed data.
samples = df.melt(id_vars='user', value_vars=['parallel', 'orthogonal'], var_name='direction', value_name='projection')
g = sns.FacetGrid(samples, col='user', hue="direction")
g.despine(offset=10)
g = (g.map(sns.distplot, "projection").add_legend(title="Projection", loc='upper right'))
for ax in g.axes.flat:
    ax.set_ylabel("Probability Density")
    ax.set_xlabel("Projection Length")
plt.tight_layout()
plt.show()

# %%
# Massage data into single distributions per user. Each column holds all data for 1 user.
# Since not all users have the same amount of data, we will have missing data at the ends of the columns.
null_data = df.melt(id_vars='user', value_vars=['parallel', 'orthogonal'], var_name='direction', value_name='projection').drop('direction', axis='columns')
null_data['idx'] = null_data.groupby('user').cumcount()
null_data = null_data.pivot(index='idx', columns='user', values='projection')
n_users = df['user'].nunique()

# %%
with pm.Model() as null_model:
    null_model.name = "Null Model"
    sigma = pm.Gamma("sigma", alpha=2.2, beta=0.4, shape=n_users)

    obs = pm.Normal(f"obs", mu=0.0, sigma=sigma, shape=n_users, observed=null_data)

# %%
with null_model:
    idata0 = pm.sample(1000, tune=800, return_inferencedata=True)

# %%
az.summary(idata0, var_names=["Null Model_sigma"])

# %%
# Massage data into single distributions per user and projection direction.
proj_effect_data = df.loc[:, ['user', 'parallel', 'orthogonal']]
proj_effect_data['idx'] = proj_effect_data.groupby('user').cumcount()
proj_effect_data = proj_effect_data.pivot(index='idx', columns='user', values=['parallel', 'orthogonal'])

# %%
with pm.Model() as projection_model:
    projection_model.name = "Main Effect Projection"
    sigma_ortho = pm.Gamma("sigma_ortho", mu=3.34, sigma=1.44, shape=n_users)
    sigma_parallel = pm.Gamma("sigma_parallel", mu=7.0, sigma=3.17, shape=n_users)
    # Alternatively: The sum of two gamma-distributed variables is gamma-distributed.
    #sigma_diff = pm.Gamma('sigma_diff', mu=4.0, sigma=5.0, shape=n_users)
    #sigma_parallel = pm.Deterministic("sigma_parallel", sigma_ortho + sigma_diff)

    
    obs_parallel = pm.Normal(f"obs_parallel", mu=0.0, sigma=sigma_parallel, shape=n_users, observed=proj_effect_data['parallel'])
    obs_orthogonal = pm.Normal(f"obs_orthogonal", mu=0.0, sigma=sigma_ortho, shape=n_users, observed=proj_effect_data['orthogonal'])
    sigma_diff = pm.Deterministic('sigma_diff', sigma_parallel - sigma_ortho)


# %%
with projection_model:
    idata1 = pm.sample(1000, tune=800, return_inferencedata=True)

# %%
az.summary(idata1, var_names=["Main Effect Projection_sigma"], filter_vars='like')

My final goal is to compare the marginal likelihoods of several different theoretically driven models of directed hypotheses (null, main effect parallel > orthogonal, main effect of block 2 on both directions, different interaction effects, …)
I don’t yet understand how the parallel > orthogonal counts would lead me to the marginal likelihoods of the models.
I was previously using your bridge sampling implementation this kind of way (not exactly):

def get_marginal_likelihood(model, trace):
    with model:
        marginal_log_likelihood = bridgesampler.marginal_llk(trace, model=model, maxiter=10000)["logml"]
    return marginal_log_likelihood

# models and traces for each particpant, I hadn't vectorized it yet at the time.
users = [1,2,3,4,5]  # and so on.
model_posterior = dict()
for user in users:
    model_posterior[user] = np.array([get_marginal_likelihood(*model) for model in zip(models[user], traces[user])])  # models[user] would contain all (unvectorized) models for this user.
    # Exponentiate and normalize to turn into a probability distribution.
    model_posterior[user] = np.exp(model_posterior-np.logaddexp.reduce(model_posterior[user]))

So you’re saying using a LOO-CV based method would yield more accurate results?
I tried az.compare({null_model.name: idata0, projection_model.name: idata1}), but it fails with:

TypeError: Found several log likelihood arrays ['Projection Effect_obs_parallel', 'Projection Effect_obs_orthogonal'], var_name cannot be None

I read about the error in this answer, but I don’t understand it enough to translate it to my application.

1 Like

I think I have to clarify some more why I turned to pymc3 and I’m not just doing a Bayesian ANOVA in JASP or jamovi.
I’m especially interested in predictions from 2 hypotheses I haven’t mentioned before.
1: low variability for orthogonal in all blocks, high variability for parallel projection in block 1 & 3, but not in block 2.

sigma
^
| .       . parallel
|  \     /
|   \   /
|    \./
| .___.___. orthogonal
|------------> block
  1   2   3

2: same as before except that orthogonal variability rises in block 2.

sigma
^
|.       . parallel
| \     /
|  \   /
|   \./
|   /*\
| ./   \.  orthogonal
|------------> block
  1  2  3

The ANOVA on the Synergy Index (a measure of the variability ratio) can’t discriminate these 2 predictions, only that there is some difference in block 2 vs 1 & 3, which doesn’t help much. Therefore, I hope to model these 2 predictions in pymc3, amongst other ones.
That’s why I think counting the samples where parallel > orthogonal would not be able to give me the answer I seek, because it would give me the same result for these 2 examples?

1 Like

Ah I see, the observed is a square array (n_subject * n_trials). In that case, my suggestion would be to flatten the observed and use indexing to construct the linear predictors, it is better for missing trials etc - something like:

with pm.Model() as projection_model:
    projection_model.name = "Main Effect Projection"
    # better to also model mu
    mu_ortho = pm.Normal(...)
    mu_parallel = pm.Normal(...)
    sigma_ortho = pm.Gamma("sigma_ortho", mu=3.34, sigma=1.44, shape=n_users)
    sigma_parallel = pm.Gamma("sigma_parallel", mu=7.0, sigma=3.17, shape=n_users)
    
    obs_parallel = pm.Normal(f"obs_parallel",
                             mu=mu_parallel[parallel_idx],
                             sigma=sigma_parallel[parallel_idx],
                             observed=parallel_data_flatten)
    obs_orthogonal = pm.Normal(f"obs_orthogonal",
                               mu=mu_ortho[orthogonal_idx],
                               sigma=sigma_ortho[orthogonal_idx],
                               observed=orthogonal_data_flatten)
    sigma_diff = pm.Deterministic('sigma_diff', sigma_parallel - sigma_ortho)

This is reaction time data? maybe try log-normal? I am not sure what is the recommended best practice nowadays as well.

Well, the language of hypothesis testing is always problematic to me as it usually dont answer what we really want to know - is this conclusion going to be useful (for making decision, for advancing our understanding, etc). As it is only useful when you have a non-tiny effect, looking at things like effect size is much more natural. In a linear model (like ANOVA), all the hypothesis you want to test are ultimately comparisons of parameters/coefficients, so instead of fitting multiple models and compare marginal likelihood or Bayes Factor, I usually recommend to fit the most comprehensive model with all the information (in this case would mean fit both conditions and coded for all subject, trial and block), and look at the distribution of the posterior contrasts (an example: https://docs.pymc.io/notebooks/BEST.html)

You need to coded the block in your model as well - for example, you can have a indexing of [0, 1, 2, …, 5] to indicate parallel_block_1, parallel_block_2, …, orthogonal_block_3, and index to a shape=6 vector of parameter.

2 Likes

Thank you very much for taking the time again to answer my questions. I try to understand it and why it may be the better approach.

I agree, that’s why I’ve come here. :smile: Too bad I’m still a noob at Bayesian inference.

No. It’s a motoric task measuring 2 end effector positions, thus forming a trial-to-trial point cloud in a cartesian 2D-coordinate system. The demeaned 2D-coordinate vectors get projected onto a vector and its orthogonal, basically just a change of basis. “parallel” and “orthogonal” data are merely the coefficients in \vec{v} = a_\parallel\hat{i} + a_\perp \hat{j}.

Yes, in my first post this was the case. That’s why I modelled mu which then was the parameter I was interested in, since in that case mu resembled the variance of the untransformed projection data. It was also log-transformed to pull in extremes (due to the squares) and stabilize the variance of mu itself (variance of variance) etc. It may have been more complex than it needs to be.
For the non-squared data in my second post, however, the mean is guaranteed to be 0, that’s why I didn’t model mu, only sigma.

By flattening you mean throwing all participants together? I don’t quite get your code example. mu and sigma are of shape n_users, but you only estimate these for 1 user and using the data of all users at the same time? I’m confused.
Since participants may deploy different strategies in the motoric task (assigned to different conditions as between factor and each participant may come up with their own solution), I want to estimate model parameters per participant. I started with having each single model per each participant before I discovered I could vectorize the parameters. Having all models per particpant eliminated missing data. Only after vectorizing parameters instead of having each single model for each participant I was facing the issue of missing data, since participants differ in their trial count per block (1 column per participant and trials being the rows in the observed data).

Yes, I only left out the blocks from my previous examples for brevity, since they weren’t relevant to those simple models, and for trying to understand the basics first. If I understand you correctly, this is the approach where I have 1 uber-model per participant? And for variables I’m interested in (9 differences) I would create deterministic variables,e.g. sigma[2] - sigma[1] for difference of parallel deviation block2-block1, and sigma[1]-sigma[4] for difference between parallel and orthogonal deviation in block 2?
But if I had all blocks in 1 vectorized form (trials as rows, block_<projection> as columns) I’d again have missing data, since the trial count differs between blocks? This can be solved though by having block1, block2, block3 with shape 2 each (parallel & orthogonal).

Having those 9 estimates and their uncertainty would be a good thing. Thanks!
Only, having all 9 estimates isn’t fully satisfying my question, since there are competing theoretical predictions about the sizes of those differences, see above ASCII diagrams. I’d like to know whose theory is most compatible with the observed data. Wouldn’t that involve a secondary step of a model selection with the estimated differences as observed data, including their uncertainty somehow?

The approach I had with several models was to state from the beginning what I’d expect the differences to look like for each theory. Null model would have no differences whatsoever. Main effect projection would state that there is a constant unknown difference (>0) between the projections across blocks. And then more sofisticated models with statements about the differences between individual blocks/projections. I feel like this was the other way around of what you suggested, first getting the estimates ofthose differences and then making statements about them.

1 Like

Since they are orthogonal you can model them with univariate distribution that has similar shape and the same support (e.g., positive only) as your observed.

Usually, you can coded your data as a table like:

subject_id | condition1 | condition2 | observation
         0 |   parallel |    block_1 |       .3452
...

Instead of having 1 column per participant and trials being the rows in the observed data, you have 1 column for all your data, and participant and trials are indicator (subject_id and trial_id).

Then in your model, you can either use GLM like approach to construct a design matrix, or use indexing:

subject_mu = pm.Normal(..., shape=n_subject)  <= shape is n_subject
subject_mu_all_trial = subject_mu[subject_id] <= shape is now number of row in your data frame

which estimate the parameter independently for each subject.

No matter what approach you decide to go with in terms of model comparison / hypothesis testing, I think you should use this formulation as it is a more general way to do so, and connect nicely with classical GLM formulation. You can also consult the recent book Regression and other stories

2 Likes

Thanks for the clarification on the data format. So just have it in long format.

I’m sorry, I still don’t understand what type of variable subject_id in subject_mu_all_trial = subject_mu[subject_id] shall be in your example. Is that the single ID of a participant (e.g. single integer)? But then why have subject_mu with shape n_subjects, if I’d have to loop over all subject_id anyway and call with pm.Model() as model: ... within the loop? Or is subject_id an array of all unique subject IDs [0,1,2,3,…]?
Or is it the first column of the data with duplicates [0,0,0,0,0,…,1,1,1,1,1,…,2,2,2,2,2,2,…]? If the shape of subject_mu_all_trials was the number of rows in the long format dataframe (total number of all trials*2) it would be n_users*n_user_trials*n_directions (e.g. 40x90x2=7200)? Wouldn’t that just give me the values of every single trial? I obviously don’t understand what you are trying to tell me. :confused:

Or did you mean to do it in 2 steps:

df = ... # long format dataframe with column 'user' as ID for participants.
n_users = df['user'].nunique()
mus = list()
with pm.Model() as model:
    user_mu = pm.Normal(..., shape=n_users)

for user_id in df['user'].unique():
    with model:
        user_mu_all_trials = user_mu[user_id]
    mus.append(user_mu)  # Otherwise it would be overwritten each iteration.

That doesn’t seem to be what you intended. Thanks for your effort and patience with me!

1 Like

Yes

It wouldn’t as you dont have parameter related to each trial - this is essentially what IID data or repeat measurement means. Maybe it is clearer in code:

data1 = np.random.normal(loc=0., scale=1., size=10)
data2 = np.random.normal(loc=10., scale=5., size=10)
data_all = np.concatenate([data1, data2], axis=0)
sbj_idx = np.concatenate([np.zeros(10), np.ones(10)], axis=0).astype(int)

with pm.Model() as model1:
  mu = pm.Normal('mu', 0., 10., shape=2)
  sigma = pm.HalfNormal('sigma', 5., shape=2)
  obs = pm.Normal('obs', mu, sigma, observed=np.stack([data1, data2]).T)

with pm.Model() as model2:
  mu = pm.Normal('mu', 0., 10., shape=2)
  sigma = pm.HalfNormal('sigma', 5., shape=2)
  obs = pm.Normal('obs', mu[sbj_idx], sigma[sbj_idx], observed=data_all)

model1 and model2 are identical but model2 is much more flexible as it allows different length of data among subjects.

3 Likes

Thank you so much for the code example. This is great info. I don’t know if that is even anywhere in the documentation. Using the graphviz plot I see what’s going on. I imagine it now as mapping the observed data to the mus. I scoured the internet in an effort to find any explanation for how to use it, to not occupy too much more of your time, but I didn’t find a single example using the indexing method. So I am once again kindly asking for your help.

I found a well illustrated primer on different shape types. And a blog post which explains that shapes in PyMC3 are, unfortunately, one hot mess. It also states:

distribution_shape == batch_shape + event_shape
[observed] data should have a shape that looks like sample_shape + distribution_shape .

If I understood correctly, for a model where I estimate 6 parameters (e.g. \sigma for n_blocks(3)*n_projections(2)) for each user by drawing from independent univariate distributions (same distribution family), my sample shape is n_trials (per block), batch shape is n_users*n_blocks*n_projections, and event shape = ().

I have the data in long format like you suggested:

user block  direction    projection   
0    1      parallel     7.129121
0    1      parallel     3.414878
0    1      parallel     -6.330880
0    1      parallel     0.428940
0    1      parallel     -3.650233
...  ...    ...          ...
1    3      orthogonal   -2.223337
1    3      orthogonal   -1.275683
1    3      orthogonal   -0.796934
1    3      orthogonal   2.221719
1    3      orthogonal   -5.626795

I could unstack the ‘direction’ if that was of any use, as these are guaranteed to be the same length for each individual block.
I tried different shape settings and I don’t know which one is correct. For example,

with pm.Model() as model3:
    sigma = pm.HalfNormal('sigma', 5., shape=12)  # 2 users * 6 parameters. batch shape?
    obs = pm.Normal('obs', 0.0, sigma[samples['user']], shape=12, observed=samples['projection'])

But it’s hard to identify which parameter (sigma 0-11) belongs to which user/block/direction coordinate.
Whatever else I tried with the indexing and shape argument, I always got this error:

RecursionError: maximum recursion depth exceeded in comparison

If I manage to use it as intended I may write something about how to vectorize a model properly without missing data using your indexing method to ease the struggle for other beginners.

1 Like

So, you can still use indexing to do this, but it will be quite inefficient as you need to create a index of direction + block, for example, index 0 would be parallel block 1.

A better approach is to formulate a design matrix:

from patsy import dmatrices
outcome, predictors = dmatrices("projection ~ block + direction + block * direction", data)

which is what usually call mixed effect model or hierarchical model. The pymc3 doc have quite an extensive session on this (e.g., https://docs.pymc.io/notebooks/GLM-hierarchical.html)

The differences is that, here you your case the linear prediction will goes into sigma instead of mu

2 Likes

Hi, we know cross validation and model comparison can be tricky because in many cases there are several correct ways of comparing the models. We are actually working on some resources to help with that, they are still a work in progress, but it should be useful already. Here is the notebook covering multiple likelihoods. The link is to my fork which has the open PR, feedback will be most appreciated. You may also be interested in the other notebooks of the section: https://nbviewer.jupyter.org/github/arviz-devs/Exploratory-Analysis-of-Bayesian-Models/tree/master/content/Section_04/

Having said that, there is also this example in R which I think will be much closer to your model and could also be helpful: https://avehtari.github.io/modelselection/rats_kcv.html

I will try to read this topic with more time to see if I can help further. I think this multiple measurement and multiple subject models are ideal to teach the nooks and crannies of CV.

By any chance, do you know about some public dataset other than the rats one with similar characteristics?

2 Likes

I needed some time to go through the ressources you both provided, dust off my knowledge about HLM (Mixed Model) and think about how this would apply to my actual problem. I’ll discuss now the 2 different approaches, one being mixed model, and the other I call the intuitive approach.

First I want to stress and illustrate again what my research question is. There are several hypothesis by different authors about how the variances of the data should differ between the blocks and direction of projection.
How the data is nested:

3 between-conditions
  N users (N varies between conditions)
    3 within-conditions: blocks
        X trials (varies per block and user)
            2 outcome directions: parallel, orthogonal
                outcome: projection (scalar)

The question of interest is: Which prediction about the variance structure in the data is more probable?
To illustrate this I present again some ASCII-diagrams for different hypotheses. This first illustration makes it easy to think about the data in terms of its collection process (one block after the other, 2 outcomes). These only show block and direction as independent categorical variables and the standard devation as the depedent variableand neglect the between-factor and different users.

Main Effect Projection  Main Effect Block2    Main Effects Projection + Block2
sigma                   sigma                 sigma
^                       ^                     ^     o        o - parallel
|                       |                     |    / \       x - orthogonal
| o__o__o               |     o               |   /   \
|                       |    /x\              |  /  x  \
|                       |   // \\             | o  / \  o
|                       |  //   \\            |   /   \
| x__x__x               | //     \\           |  /     \
|                       |ox       ox          | x       x
|----------> block      |------------> block  |------------> block
  1  2  3                 1   2   3            1   2   3

Block x Projection v1   Block x Proj. v2      Block x Projection v3
sigma                   sigma                 sigma
^                       ^                     ^
|o       o              | o       o           | o
| \     /               |  \     /            |  \
|  \   /                |   \   /             |   \
|   \o/                 |    \ /              |    \
|   /x\                 |     V               |     o___o
|  /   \                |     o               |
| x     x               | x___x___x           | x___x___x
|----------> block      |------------> block  |------------> block
  1  2  3                 1   2   3             1   2   3

Block x Projection v4   Null Model      
sigma                   sigma                 
^                       ^                     
|                       |            
|                       |            
| o__o__o               |            
|    x                  |            
|   / \                 | xo__xo__xo 
|  /   \                |            
| x     x               |            
|----------> block      |------------> block  
  1  2  3                 1   2   3           

Multiple users (e.g. 3) may look like this:

Block2 x Projection v1  Block2 x Proj. v2     Main Effects Block2 + Projection
sigma                   sigma                 sigma o
^                       ^                     ^
|                       |                     |     oo         o - parallel
|                       |         o           |     o          x - orthogonal
| oo      o             | oo      o           | oo      o
| o       oo            | o       o           |         oo
|     oo                |                     | o   xx
|     xox               |                     |     x
| x   x                 | x   oo  xx          | x
| xx      xxx           | xx  xxx x           | xx      xxx
|------------> block    |------------> block  |------------> block
  1   2   3               1   2   3             1   2   3

These would be intuitive to formulate as separate models with restrictions on average sigma per block x direction.

So let’s look at how I would go about the intuitive approach. Here’s a code example with 3 models and 3 users behaving differently.

# %%
import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc3 as pm
import seaborn as sns


#################################
#          Dummy Data           #
#################################
# %%
# Create some data conforming to null model: all deviations on average equal across blocks.
n_trials = 90 - np.random.randint(36)
df0 = pd.DataFrame({'user': 0, 'block': 1 + np.random.randint(3, size=n_trials),
                    'parallel': np.random.normal(0.0, 5.5, size=n_trials),
                    'orthogonal': np.random.normal(0.0, 5.5, size=n_trials)})
# Make sure mean of each block is zero as with real data.
df0[['parallel', 'orthogonal']] = df0.groupby('block')[['parallel', 'orthogonal']].transform(lambda x: x - x.mean())

# %%
# Create some data conforming to main effect of projection model: variance in parallel > orthogonal.
n_trials = 90 - np.random.randint(36)
df1 = pd.DataFrame({'user': 1, 'block': 1 + np.random.randint(3, size=n_trials),
                    'parallel': np.random.normal(0.0, 7.0, size=n_trials),
                    'orthogonal': np.random.normal(0.0, 3.3, size=n_trials)})
# Make sure mean of each block is zero as with real data.
df1[['parallel', 'orthogonal']] = df1.groupby('block')[['parallel', 'orthogonal']].transform(lambda x: x - x.mean())

# %%
# Create some data conforming to interaction effect version 1:
# blocks 1&3 have smallest orthogonal deviations, but largest parallel deviations. Strong synergy.
n_trials = 30 - np.random.randint(12, size=3)
block1 = pd.DataFrame({'user': 2, 'block': 1,
                       'parallel': np.random.normal(0.0, 7.0, size=n_trials[0]),
                       'orthogonal': np.random.normal(0.0, 3.0, size=n_trials[0])})
block3 = pd.DataFrame({'user': 2, 'block': 3,
                       'parallel': np.random.normal(0.0, 7.0, size=n_trials[1]),
                       'orthogonal': np.random.normal(0.0, 3.0, size=n_trials[1])})
# Block 2 has smaller parallel deviations than blocks 1&3, but higher orthogonal deviations than blocks 1&3.
# In this case sigma_block2_parallel - sigma_block2_orthogonal = 0
# No synergy.
block2 = pd.DataFrame({'user': 2, 'block': 2,
                       'parallel': np.random.normal(0.0, 5.0, size=n_trials[2]),
                       'orthogonal': np.random.normal(0.0, 5.0, size=n_trials[2])})
df2 = pd.concat((block1, block2, block3), axis='index')
df2[['parallel', 'orthogonal']] = df2.groupby('block')[['parallel', 'orthogonal']].transform(lambda x: x - x.mean())

# %%
df = pd.concat((df0, df1, df2), axis='index')

# %%
df[['user', 'block']] = df[['user', 'block']].astype('category')
df.sort_values(by=['user', 'block'], inplace=True)

# %%
# Reshape data into long format.
samples = df.melt(id_vars=['user', 'block'], value_vars=['parallel', 'orthogonal'],
                  var_name='direction', value_name='projection')
# Dimensions.
n_users = samples['user'].nunique()
n_blocks = samples['block'].nunique()
n_directions = samples['direction'].nunique()

# %%
# Vizualize the data.
g = sns.FacetGrid(samples, col='block', row='user', hue="direction")
g.despine(offset=10)
g = (g.map(sns.distplot, "projection").add_legend(title="Projection", loc='upper right'))
for ax in g.axes.flat:
    ax.set_ylabel("Probability Density")
    ax.set_xlabel("Projection Length")
plt.tight_layout()
plt.show()

# %%
#################################
#          Null Model           #
#################################
with pm.Model() as null_model:
    null_model.name = "Null Model"
    sigma = pm.Gamma("sigma", alpha=2.2, beta=0.4, shape=n_users)

    obs = pm.Normal(f"obs", mu=0.0, sigma=sigma[samples['user']], observed=samples['projection'])

# %%
pm.model_to_graphviz(null_model)

# %%
with null_model:
    idata0 = pm.sample(1000, tune=800, return_inferencedata=True)
az.summary(idata0, var_names=["Null Model_sigma"])
az.loo(idata0)

# %%
#################################
# Main Effect Projection Model  #
#################################
with pm.Model() as projection_model:
    projection_model.name = "Main Effect Projection"
    sigma_ortho = pm.Gamma("sigma_ortho", mu=3.34, sigma=1.44, shape=n_users)
    # The sum of two gamma-distributed variables is gamma-distributed.
    sigma_diff = pm.Gamma('sigma_diff', mu=4.0, sigma=5.0, shape=n_users)
    sigma_parallel = pm.Deterministic("sigma_parallel", sigma_ortho + sigma_diff)

    obs_parallel = pm.Normal("obs_parallel", mu=0.0, 
                             sigma=sigma_parallel[samples.loc[samples['direction'] == 'parallel', 'user']], 
                             observed=samples.loc[samples['direction'] == 'parallel', 'projection'])
    obs_orthogonal = pm.Normal("obs_orthogonal", mu=0.0,
                               sigma=sigma_ortho[samples.loc[samples['direction'] == 'orthogonal', 'user']], 
                               observed=samples.loc[samples['direction'] == 'orthogonal', 'projection'])

# %%
pm.model_to_graphviz(projection_model)

# %%
with projection_model:
    idata1 = pm.sample(1000, tune=1000, return_inferencedata=True)
az.summary(idata1, var_names=["Main Effect Projection_sigma"], filter_vars='like')

# %%
#idata1.log_likelihood['projection_diff'] = (idata1.log_likelihood['Main Effect Projection_obs_parallel'] 
#                                            - idata1.log_likelihood['Main Effect Projection_obs_orthogonal'])
#az.loo(idata1, var_name='projection_diff')

# %%
##########################################
# Main Effects Block2 + Projection Model #
##########################################
with pm.Model() as both_main_effects_model:
    both_main_effects_model.name = "2 Main Effects"
    sigma_ortho13 = pm.Gamma("sigma_ortho_blocks_1_3", mu=3.34, sigma=1.44, shape=n_users)
    # The sum of two gamma-distributed variables is gamma-distributed.
    sigma_diff_proj = pm.Gamma('sigma_diff_projection', alpha=4., beta=1.0, shape=n_users)
    sigma_diff_block2 = pm.Gamma('sigma_diff_block2', alpha=1., beta=1.0, shape=n_users)

    sigma_ortho2 = pm.Deterministic('sigma_ortho_block_2', sigma_ortho13 + sigma_diff_block2)
    sigma_parallel13 = pm.Deterministic('sigma_parallel_blocks_1_3', sigma_ortho13 + sigma_diff_proj)
    sigma_parallel2 = pm.Deterministic('sigma_parallel_block_2', sigma_parallel13 + sigma_diff_block2)

    data_parallel13 = samples.query('`direction`=="parallel" & `block` in [1, 3]')
    data_ortho13 = samples.query('`direction`=="orthogonal" & `block` in [1, 3]')
    data_parallel2 = samples.query('`direction`=="parallel" & `block`==2')
    data_ortho2 = samples.query('`direction`=="orthogonal" & `block`==2')

    obs_parallel13 = pm.Normal("obs_blocks_1_3_parallel", mu=0.0, sigma=sigma_parallel13[data_parallel13['user']], 
                               observed=data_parallel13['projection'])
    obs_parallel2 = pm.Normal("obs_block2_parallel", mu=0.0, sigma=sigma_parallel2[data_parallel2['user']], 
                              observed=data_parallel2['projection'])
    obs_orthogonal13 = pm.Normal("obs_blocks_1_3_orthogonal", mu=0.0, sigma=sigma_ortho13[data_ortho13['user']], 
                                 observed=data_ortho13['projection'])
    obs_orthogonal2 = pm.Normal("obs_block2_orthogonal", mu=0.0, sigma=sigma_ortho2[data_ortho2['user']], 
                                 observed=data_ortho2['projection'])

# %%
pm.model_to_graphviz(both_main_effects_model)

# %%
with both_main_effects_model:
    idata2 = pm.sample(1000, tune=1000, return_inferencedata=True)
az.summary(idata2, var_names=["2 Main Effects_sigma_diff"], filter_vars='like')

What’s missing in these models is the partial pooling, but I guess this could be easily done by just adding more priors on top. To get from these models to answering the research question is where the bridge sampling should come into play. I have to figure out how to get the traces and model in the vectorized form suited for the marginal_llk-function.

The Mixed Model approach does have advantages like partial-pooling and being able to include within- and between-conditions. The disadvantage is that it is not as intuitive for me for the research question at hand and fitting one model without restrictions also takes me only halfway to what I really want to know.

Let me illustrate the same predictions from above as regression models. The i stands for intercept and the b are the slopes. i12 is abbreviation for intercept block 1 & intercept block 2. The orthogonal direction is coded as 0 and parallel as 1.

    Main Effect Projection  Main Effect Block2    Main Effects Projection + Block2
    sigma                   sigma                 sigma  / b2    
    ^                       ^                     ^     /        
    |      / b123           |                     |    /         
    |     /               i2|_________b2          |   /  / b13   
    |    /                  |                     |  /  /        
    |   /                   |                     | /  /         
    |  /                    |                   i2|/  /          
    | /                     |                     |  /           
i123|/                   i13|_________b13         | /            
    |                       |                  i13|/             
    |---------->            |---------->          |---------->   
    0      1                0      1              0      1       
    ortho  parallel         ortho  parallel       ortho  parallel

    Block x Projection v1   Block x Proj. v2      Block x Projection v3
    sigma                   sigma                 sigma          
    ^                       ^                     ^              
    |      / b13            |      / b13          |      / b1    
    |     /                 |     /               |     /        
    |    /                  |    /                |    /         
  i2|___/___ b2             |   /                 |   /          
    |  /                    |  /                  |  /           
    | /                     | /                   | /            
 i13|/                  i123|/______ b2       i123|/______ b23   
    |                       |                     |              
    |---------->            |---------->          |---------->   
    0      1                0      1              0      1       
    ortho  parallel         ortho  parallel       ortho  parallel

    Block x Projection v4   Null Model
    sigma                   sigma          
    ^                       ^              
  i2|______                 |              
    | b2  /                 |              
    |    /                  |              
    |   / b13               |              
    |  /                i123|________      
    | /                     |  b123        
 i13|/                      |              
    |                       |              
    |---------->            |---------->   
    0      1                0      1       
    ortho  parallel         ortho  parallel

If I fitted a model without restrictions on intercepts and slopes I would have to compare the estimates I get for all of them to these hypothesis which describe the illustrations above in order to answer my research question:

Main Effect Projection:
    intercepts and slopes are fixed
    b1 = b2 = b3 > 0
    i1 = i2 = i3

Main Effect Block2:
    b1 = b2 = b3 (fixed slopes)
    i2 > i1, i3
    i3 - i1 = 0

Main Effects Projection + Block2:
    i2 > i1, i3
    b1 = b2 = b3

Block x Projection v1:
    i3 - i1 = 0
    i2 > i1, i3
    b3 - b1 = 0
    b3, b1 > 0
    b3 > i2 - i3
    b1 > i2 - i1
    b2 = 0

Block x Proj. v2:
    i1 = i2 = i3
    b1, b3 > 0
    b3 - b1 = 0
    b2 = 0

Block x Projection v3:
    i1 = i2 = i3
    b1 > 0
    b2 = b3 = 0

Block x Projection v4:
    i1 = i3
    i2 > i1, i3
    b2 = 0
    b1 = i2 - i1
    b3 = i2 - i3

Null Model:
    i1 = i2 = i3
    b1 = b2 = b3 = 0

One way I could imagine to do it quantitavely is to take the estimates as input to models with these restrictions and get the likelihoods. But the “intuitve” solution with bridge sampling appears more straight forward to me.
What do you think?

Thanks for the explanation, personally, I am not sure the data could answer the question like:“hey after comparing 8 different model here, we are sure that model X is the generative process”. And building a 8*8 comparison table of all the models seems a bit difficult to reason. The mixed effect model is more like: “we assume every conditions are different, and build a full model the encodes all the information, then look at the posterior to see if they are really all different (i.e., no overlapping)” - basically you plot the posterior and see which ASCII-diagrams for different hypotheses it fits the best.
But since you have strong theoretical motivation, it certainly sound like a reasonable approach.

I agree with your opinion about a less restrictive hierarchical model having the potential to reveal more about the underlying data. I’ve tried to work my way up to the full model after reading this informative Primer on Bayesian Methods for Multilevel Modeling, but I’m stuck early on after getting to the user-level and trying to add the block-level.
Using the dims keyword argument and coords seems more elegant than the indexing scheme from this post.

Example data with multilevel predictors

    user	experience condition  block rating	direction	projection
0	9	    0.0        A      	  1	    2.0     parallel	-1.058687
1	9	    0.0        A      	  1	    2.0     parallel	0.201144
2	9	    0.0        A      	  1	    2.0     parallel	-0.806420
...

experience is a user-level predictor and rating a block-level “predictor” (actually, rating is measured afterwards, so it’s correlated with block).

The indexing/design matrix still poses a problem to me. I don’t understand how to use the predictors when doing outcome, predictors = dmatrices("projection ~ condition * user * block * direction", df). The example didn’t utilize the design matrix, stating the glm() function does not play nice with hierarchical models yet.
This is an excerpt from what I get with aforementioned formula.

Terms:
‘Intercept’ (column 0)
‘condition’ (columns 1:3)
‘user’ (columns 3:32)
‘block’ (columns 32:34)
‘direction’ (column 34)
‘condition:user’ (columns 35:93)
‘condition:block’ (columns 93:97)
‘user:block’ (columns 97:155)
‘condition:direction’ (columns 155:157)
‘user:direction’ (columns 157:186)
‘block:direction’ (columns 186:188)
‘condition:user:block’ (columns 188:304)
‘condition:user:direction’ (columns 304:362)
‘condition:block:direction’ (columns 362:366)
‘user:block:direction’ (columns 366:424)
‘condition:user:block:direction’ (columns 424:540)

From what I gathered from a tutorial in R about rat growth that @OriolAbril provided, in R this formula would be kind of this \sigma_{proj} \sim (condition * block * direction)/(1|trial) + (1|user), meaning trials are viewed as nested random effects, and users as random effects. It seems patsy doesn’t support nesting and random effects.

Following the primer I’ve come so far as to include users, as they are what counties are in the radon example. The difference is that I model the standard deviation sigma, not the mu. When trying to add rating as a pendent to uranium in the radon example, I ran into broadcasting issues.

Model with only direction x user:

conditions = df['condition'].cat.categories.tolist()
n_conditions = df['condition'].nunique()
condition_indices = df['condition'].cat.codes.values # Condition index for every single value.
condition_lookup = pd.Series(*pd.factorize(df['condition'].cat.categories), name='condition_idx_lookup')

# We need to be able to index by user. But the user IDs of the sample don't start at 0. We create an index and a mapping between them.
users = df['user'].cat.categories.tolist()
n_users = df['user'].nunique()
user_indices = df['user'].cat.codes.values  # User index for every single value.
user_lookup = pd.Series(*pd.factorize(df['user'].cat.categories), name='user_idx_lookup')
# User level predictors. More options: age group, gender.
# experience is not on an interval scale (no equidistance between items), but ordinal.
experience = df.drop_duplicates(subset="user")['experience']  # Shape = user_indices

blocks = df['block'].cat.categories.tolist()
n_blocks = df['block'].nunique()  # Same as using cat.categories.size
block_indices = df['block'].cat.codes.values # Block index for every single value.
block_lookup = pd.Series(*pd.factorize(df['block'].cat.categories), name='block_idx_lookup')
# Block level predictors.
# Note: Ratings are correlated with block index.
ratings = df.drop_duplicates(subset=["user", 'block'])['rating']  # Shape = block_indices

# Coded direction of projection: 0 = orthogonal, 1 = parallel.
directions = df['direction'].cat.categories.tolist()
n_directions = df['direction'].nunique()
direction_indices = df['direction'].cat.codes.values
direction_lookup = pd.Series(*pd.factorize(df['direction'].cat.categories), name='direction_idx_lookup')

coords = {'Direction': directions, 'User': users, 'Block': blocks, 'Condition': conditions, 'obs_id': np.arange(direction_indices
.size)}

with pm.Model(coords=coords) as varying_intercept_slope_by_user:
    direction_idx = pm.Data('direction_idx', direction_indices, dims='obs_id')
    user_idx = pm.Data('user_idx', user_indices, dims='obs_id')

    # Hyperpriors:
    intercept = pm.Gamma('intercept', mu=4.0, sigma=2.0)
    intercept_sd = pm.Exponential('sigma_intercept', 1.0)
    slope = pm.Normal('slope', mu=0.0, sigma=1.0)
    slope_sd = pm.Exponential('sigma_slope', 0.5)

    # Varying intercepts:
    intercept_user = pm.Gamma('intercept_user', mu=intercept, sigma=intercept_sd, dims='User')
    # Varying slopes:
    slope_user = pm.Normal('slope_user', mu=slope, sigma=slope_sd, dims='User')

    # Expected deviation per user:
    theta = intercept_user[user_idx] + slope_user[user_idx] * direction_idx

    projection = pm.Normal('projection', 0.0, sigma=theta, observed=df['projection'], dims='obs_id')

How do I add the other levels? Condition would be like state in the radon example. I try, but I keep hitting walls. In the meantime I try again and share my progress. I guess I have to formulate linear expressions for each level and use those as inputs to the levels below.

I just realized that using the ratings (for blocks) as I did there isn’t going to work with how I defined blocks. Block indices are only 0,1,2, but each user gave an individual rating for each block leading to n_users*n_blocks ratings. If it was a block-level predictor, there would only be 3 values, right? I could make an index for each block (0-89), but then I wouldn’t be able to compare each first, second and third block across users.

I tried adding a new level, but I’m stuck with the broadcasting issue. I really don’t know what I’m supposed to do with the dims or shape keyword arguments and the slicing of the parameters. Unfortunately, I didn’t find a comprehensible documentation on how this works anywhere. All the examples just have models with 1 or 2 levels, nothing nested deeper than that.

What I’d like to get is an estimate for intercept and slope of projection variance for Blocks 1,2, and 3, while each user can have their own intercepts and slopes in each block as different baselines (random effect).

In addition to the above I tried with blocks (0,1,2) and an ID for each single block.

coords['Block_ID'] = block_ids  # 0..n_users*n_blocks

with pm.Model(coords=coords) as by_block_model:
    user_idx = pm.Data('user_idx', user_indices, dims='obs_id')
    block_id_idx = pm.Data('block_id_idx', block_id_indices, dims='obs_id')
    block_idx = pm.Data('block_idx', block_indices, dims='obs_id')
    direction_idx = pm.Data('direction_idx', direction_indices, dims='obs_id')
    experience = pm.Data('experience', gaming_exp, dims='User')
    rating = pm.Data('rating', ratings, dims='Block_ID')  # Block has indices 0,1,2. But ratings are size of n_users*n_blocks. Try block_id?
    
    # Hyperpriors:
    global_intercept = pm.Gamma('global_intercept', mu=4.0, sigma=2.0)
    global_intercept_sd = pm.Exponential('global_intercept_sd', 1.0)
    global_slope = pm.Normal('global_slope', mu=1.5, sigma=1.0)  # We expect the slope to be positive most of the time.
    global_slope_sd = pm.Exponential('global_slope_sd', 0.5)

    # Users.
    # Varying intercepts per user:
    intercept_user = pm.Gamma('intercept_user', mu=global_intercept, sigma=global_intercept_sd, dims='User')
    intercept_user_sd = pm.Exponential('intercept_user_sd', 1.0, dims='User')
    # Varying slopes per user:
    slope_user = pm.Normal('slope_user', mu=global_slope, sigma=global_slope_sd, dims='User')
    slope_user_sd = pm.Exponential('slope_user_sd', 1.0, dims='User')

    # Blocks.
    # Varying intercepts per block:
    intercept_block = pm.Gamma('intercept_block', mu=intercept_user, sigma=intercept_user_sd, dims='Block')  # FixMe: ValueError: operands could not be broadcast together with shapes (3,) (30,) 
    # Varying slopes per block:
    slope_block = pm.Normal('slope_block', mu=slope_user, sigma=slope_user_sd, dims='Block')

    # Expected deviation per block:
    theta = intercept_block[block_idx] + slope_block[block_idx] * direction_idx

    projection = pm.Normal('projection', 0.0, sigma=theta, observed=df['projection'], dims='obs_id')

Once I know how to add more levels, I’d like to cluster users by condition, which may have an effect on standard deviation.

I tried resolving the broadcasting issues by repeating the intercept_user and slope_user parameters.
block_id_idx 0, 1, 2 belong to user 0, block_id_idx 3, 4, 5 to user 1 and so on. intercept_user should then be like 0, 0, 0, 1, 1, 1, 2, 2, 2 etc.

import theano.tensor as tt

with pm.Model(coords=coords) as by_block_model:
    user_idx = pm.Data('user_idx', user_indices, dims='obs_id')
    block_id_idx = pm.Data('block_id_idx', block_id_indices, dims='obs_id')
    block_idx = pm.Data('block_idx', block_indices, dims='obs_id')
    direction_idx = pm.Data('direction_idx', direction_indices, dims='obs_id')
    experience = pm.Data('experience', gaming_exp, dims='User')
    rating = pm.Data('rating', ratings, dims='Block_ID')  # Block has indices 0,1,2. But ratings are size of n_users*n_blocks. Try block_id?
    
    # Hyperpriors:
    global_intercept = pm.Gamma('global_intercept', mu=4.0, sigma=2.0)
    global_intercept_sd = pm.Exponential('global_intercept_sd', 1.0)
    global_slope = pm.Normal('global_slope', mu=1.5, sigma=1.0)  # We expect the slope to be positive most of the time.
    global_slope_sd = pm.Exponential('global_slope_sd', 0.5)

    # Users.
    # Varying intercepts per user:
    intercept_user = tt.repeat(pm.Gamma('intercept_user', mu=global_intercept, sigma=global_intercept_sd, dims='User'), n_blocks)
    intercept_user_sd = tt.repeat(pm.Exponential('intercept_user_sd', 1.0, dims='User'), n_blocks)
    # Varying slopes per user:
    slope_user = tt.repeat(pm.Normal('slope_user', mu=global_slope, sigma=global_slope_sd, dims='User'), n_blocks)
    slope_user_sd = tt.repeat(pm.Exponential('slope_user_sd', 1.0, dims='User'), n_blocks)

    # Blocks.
    # Varying intercepts per block:
    intercept_block = pm.Gamma('intercept_block', mu=intercept_user, sigma=intercept_user_sd, dims='Block_ID')
    # Varying slopes per block:
    slope_block = pm.Normal('slope_block', mu=slope_user, sigma=slope_user_sd, dims='Block_ID')

    # Expected deviation per block:
    theta = intercept_block[block_idx] + slope_block[block_idx] * direction_idx

    projection = pm.Normal('projection', 0.0, sigma=theta, observed=df['projection'], dims='obs_id')


As you can see the model gets created now. There are a few Data containers floating around. I’m using the Block_ID (0…89) to create a slope and intercept for each single block in the data (30 users * block1, 30 * block2, 30 * block3). But since I want to estimate intercept and slope across users for each block (0,1,2), I used block_idx (0,1, or 2 for each data point) in the linear regression for the expected standard deviation in the data theta = intercept_block[block_idx] + slope_block[block_idx] * direction_idx. When using block_idx (either 0, 1, or 2) instead of block_id_idx (0…89), sampling time increases like 10x. Regardless of whether I choose block_idx or block_id_idx, I get every sampling error you can imagine: thousands of divergences, rhat > 1.05, acceptance probability, number of effective samples is smaller than 200.
If I used block_id_idx in the linear expression, is there a way to sample from the posterior using the block_idx? Like group every first, second and third block and get their group estimates?

How can I reparameterize the model? The between factor of condition isn’t even in there yet.
I just want estimates on the differences between standard deviations for blocks 1, 2 and 3, which each user does (within), but users are in different conditions (between). As a bonus it would be nice to know if experience (user-level) has an influence, or the individual rating of blocks. Doesn’t sound too hard.

I often read a common answer to some topics that everything should be straight forward to do in PyMC3. Unfortunately I don’t have a PhD in Bayesian Statistics and 15 years of computer science under the belt, so I appreciate any help! To me it’s everything but straight forward. I’m on a tight schedule, desperate and on the verge of going back to frequentist statistics. :slightly_frowning_face: The devil you know…

IIUC, you have condition, block, direction which potentially be different, and user could also display different strategy - in that case, I recommend you to try parsing the linear formula σproj∼(condition∗block∗direction*user) in patsy, and use indexing for trial (the random effect.)

In general, frequentist testing like ANOVA are GLMs, and GLMs are pretty straightforward to express in Bayesian - I would love to give you a bit more concrete suggestion, maybe it is easier you share a jupyter notebook with some simulation data?

That sounds about right. I don’t really care about trials, e.g. if all trials with number 1 differ or not. I did try to use patsy and got an outcome and a DesignMatrix. Unfortunately, I don’t have the slightest clue about what to do with the DesignMatrix object. I tried multiplying it with a distribution and only got an AssertionError. I also don’t know what to index where?

Yes, and you only have to drag some variables into the right slots in a GUI software most of the time. :sweat_smile:

That is very kind of you and much appreciated! I prepared a gist with some mocked up data. There’s not much of a model, because I don’t know how to use the DesignMatrix and indexing.