PyMC3+ArviZ: improve your workflow with labeled coords and dims

I have written a blog post about PyMC3 coords and dims and their integration with ArviZ (using the ArviZ development version).

It tries to extend the example in the ArviZ cookbook and focuses on integrating PyMC3<->ArviZ in a more natural, step-by-step flow.

Let me know what you think!

Also, would you add examples like this to the PyMC3 docs or the ArviZ docs? Or to both? And if so, where?

10 Likes

Thanks for your post. I’m working my way through it right now. I have a question about the rugby example:

with pm.Model(coords=coords) as rugby_model:
    ...
    # team-specific model parameters
    atts_star = pm.Normal("atts_star", mu=0, sigma=sd_att, dims="team")
    defs_star = pm.Normal("defs_star", mu=0, sigma=sd_def, dims="team")

    atts = pm.Deterministic('atts', atts_star - tt.mean(atts_star), dims="team")
    defs = pm.Deterministic('defs', defs_star - tt.mean(defs_star), dims="team")

Wouldn’t atts and defs have dims="team" even if we didn’t provide it explicitly? I thought they would have the same dimensionality as atts_star and defs_star.

At least when defining the model without the explicit dims for atts and defs that seems to be the case (judging by the graphviz output).

Yes, they would have the same shape, but unfortunately the dimension labels are not kept when operations are performed on model variables. Thus, in the resulting InferenceData, they would have the right shape but neither the right dimension name nor the right coordinate values.
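
As a quick illustration (assuming the rugby model above and ArviZ's default naming), here is roughly what you would see:

with rugby_model:
    idata = pm.sample(return_inferencedata=True)

# with dims="team" on the Deterministic, the labels carry over
print(idata.posterior["atts"].dims)   # ('chain', 'draw', 'team')

# without dims=..., the shape would be identical, but ArviZ would fall back
# to a default dimension name such as 'atts_dim_0' with integer coordinates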

1 Like

In the linear regression example:

with pm.Model(coords=coords) as linreg_model:
    x = pm.Data("x", x_, dims="year")
    
    a = pm.Normal("a", 0, 3)
    b = pm.Normal("b", 0, 2)
    sigma = pm.HalfNormal("sigma", 2)
    
    y = pm.Normal("y", a + b * x, observed=y_, dims="year")

What does sigma do? It doesn’t seem to be connected to anything.

And do you have an estimate of when the extend method of InferenceData will land in a stable release? I messed up my environment trying to install the development version of ArviZ.

2 Likes

It should be the standard deviation in y = pm.Normal("y", a + b * x, observed=y_, dims="year"); I completely forgot about it, I’ll fix it. Thanks again!
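
Presumably the fixed line would look like this:

y = pm.Normal("y", a + b * x, sigma=sigma, observed=y_, dims="year")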

Soon. It’s hard to give an accurate estimate, but it should definitely be less than a month, hopefully 1-2 weeks.

1 Like

Great tutorial! Thank you for writing it.

Sorry if this question is off topic, but I have just started playing around with models in PyMC3. How do I incorporate the fact that I have a very good idea of some of the team-specific parameters?
For example, there are 5 teams in total and I know the mean value of atts_star for teams 2 and 3. Do I just change the atts_star distribution to atts_star = pm.Normal("atts_star", mu=[0, 5, 7, 0, 0], sigma=sd_att, dims="team")? This is what I have done before, but I have not seen it in any of the tutorials or examples, so I think I am missing something.

1 Like

That should work, yes, and you could also use an array to set a different sigma for each team.
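
Something along these lines (the team names and numbers are made up):

import numpy as np
import pymc3 as pm

teams = ["team_1", "team_2", "team_3", "team_4", "team_5"]  # hypothetical
team_mu = np.array([0.0, 5.0, 7.0, 0.0, 0.0])   # informative means for teams 2 and 3
team_sd = np.array([2.0, 0.5, 0.5, 2.0, 2.0])   # tighter priors where we know more

with pm.Model(coords={"team": teams}) as rugby_model:
    atts_star = pm.Normal("atts_star", mu=team_mu, sigma=team_sd, dims="team")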

I’m trying to incorporate coords on the likelihood so I can plot groups of the posterior_predictive. The model seems to be built correctly, but when pm.sample is about to finish I get an error about mismatching dims (5 != 3). My data consists of three levels: exercise (10), date (45) & obs_id (829). It works when I only have obs_id as the dims for the likelihood, and I assume the 3 expected dims are chain, draw & obs_id. Any ideas?

prior_exercise_idx, prior_exercises = pd.factorize(data_model['Exercise'])
prior_date_idx, prior_dates = pd.factorize(data_model['Workout Start Time'])

prior_velocity_obs = data_model['Max Avg Velocity (m/s)']
prior_weight_obs = data_model['Weight']

prior_obs_id = data_model.index

prior_age_weights = scale_by_age(prior_dates)
prior_mvts = data_model['Exercise'].apply(find_mvt, args = [mvt_idata])
prior_velocity_weights = scale_by_velocity(prior_velocity_obs, prior_mvts)

prior_coords = {'exercise': prior_exercises,
                'date': prior_dates,
                'obs_id': prior_obs_id}

with pm.Model(coords = prior_coords) as prior_model:
    # Inputs
    exercise_idx = pm.Data('exercise_idx', prior_exercise_idx, dims = 'obs_id')
    date_idx = pm.Data('date_idx', prior_date_idx, dims = 'obs_id')

    velocity_obs = pm.Data('velocity_obs', prior_velocity_obs, dims = 'obs_id')
    weight_obs = pm.Data('weight_obs', prior_weight_obs, dims = 'obs_id')

    age_weights = pm.Data('age_weights', prior_age_weights, dims = 'date')
    velocity_weights = pm.Data('velocity_weights', prior_velocity_weights, dims = 'obs_id')

    # Global priors
    global_alpha = pm.Gamma('global_alpha', mu = 120, sigma = 5)
    global_alpha_sd = pm.HalfNormal('global_alpha_sd', sigma = 70)

    global_beta_pos = pm.Gamma('global_beta_pos', mu = 80, sigma = 5)
    global_beta = pm.Deterministic('global_beta', -global_beta_pos)
    global_beta_sd = pm.HalfNormal('global_beta_sd', sigma = 60)

    # Exercise priors
    exercise_alpha_offset = pm.Normal('exercise_alpha_offset', mu = 0, sigma = 1, dims = 'exercise')
    exercise_alpha = pm.Deterministic('exercise_alpha', global_alpha + global_alpha_sd*exercise_alpha_offset, dims = 'exercise')
    exercise_alpha_sd = pm.HalfNormal('exercise_alpha_sd', sigma = 70, dims = 'exercise')

    exercise_beta_offset = pm.Normal('exercise_beta_offset', mu = 0, sigma = 1, dims = 'exercise')
    exercise_beta = pm.Deterministic('exercise_beta', global_beta + global_beta_sd*exercise_beta_offset, dims = 'exercise')
    exercise_beta_sd = pm.HalfNormal('exercise_beta_sd', sigma = 60, dims = 'exercise')

    # Date priors
    date_alpha_offset = pm.Normal('date_alpha_offset', mu = 0, sigma = 1, dims = ['date', 'exercise'])
    date_alpha = pm.Deterministic('date_alpha', exercise_alpha + exercise_alpha_sd*date_alpha_offset, dims = ['date', 'exercise'])

    date_beta_offset = pm.Normal('date_beta_offset', mu = 0, sigma = 1, dims = ['date', 'exercise'])
    date_beta = pm.Deterministic('date_beta', exercise_beta + exercise_beta_sd*date_beta_offset, dims = ['date', 'exercise'])

    # Model error
    sigma = pm.HalfNormal('sigma', sigma = 2)
    sigma_age_weighted = pm.Deterministic('sigma_age_weighted', scale_sigma(sigma, age_weights), dims = 'date')
    sigma_velocity_weighted = pm.Deterministic('sigma_velocity_weighted', scale_sigma(sigma_age_weighted[date_idx], velocity_weights), dims = 'obs_id')

    # Expected value
    mu = date_alpha[date_idx, exercise_idx] + date_beta[date_idx, exercise_idx]*velocity_obs

    # Likelihood of the observations
    likelihood = pm.Normal('likelihood',
                           mu = mu,
                           sigma = sigma_velocity_weighted,
                           observed = weight_obs,
                           dims = 'obs_id')

1 Like

I am not sure what is happening yet; I will need some more info. I would start with two checks to make sure that the conversion of the log likelihood is the issue:

  1. Pass idata_kwargs={"log_likelihood": False} to pm.sample and see whether everything else works without dim/shape problems.
  2. Now, without the idata_kwargs, remove the dims argument from likelihood (use shape if necessary) and check the shape of the DataArray in the log_likelihood group.
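
A rough sketch of the two checks (pm.sample arguments as in recent PyMC3 releases; prior_model_no_dims is a hypothetical copy of the model with dims removed from likelihood):

# check 1: skip the log likelihood conversion entirely to see whether the
# rest of the model converts to InferenceData without dim/shape problems
with prior_model:
    idata = pm.sample(return_inferencedata=True,
                      idata_kwargs={"log_likelihood": False})

# check 2: keep the log likelihood but build the model without dims on the
# likelihood variable, then inspect the default-named dims ArviZ assigned
with prior_model_no_dims:
    idata2 = pm.sample(return_inferencedata=True)
print(idata2.log_likelihood["likelihood"].dims)
print(idata2.log_likelihood["likelihood"].shape)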


I hope some general coord/dim debugging info will also be useful to other people finding this post, and it should help explain why I asked the two questions above. In my experience, the two most common error messages related to coords and dims are the following:

...

ValueError: different number of dimensions on data and dims: 3 vs 4

This is trying to tell us that our data (the numpy array with samples/sample_stats/observations… that we are trying to convert to an xarray object) has 3 dimensions, but ArviZ has 4 dimension names to assign to this variable, e.g. ary.shape = (4, 200, 8) while its dims are (chain, draw, level, obs_id).

This error should not happen the other way around: if the array has 4 dimensions and only 3 dims are named, ArviZ will use the default dim name <var_name>_dim_# for the fourth dimension and carry on.
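
A minimal sketch that reproduces this first situation outside of PyMC3, using az.from_dict (the exact wording may vary between versions):

import numpy as np
import arviz as az

ary = np.random.randn(4, 200, 8)        # 3 dimensions: (chain, draw, 8)
idata = az.from_dict(
    posterior={"x": ary},
    dims={"x": ["level", "obs_id"]},     # chain + draw + 2 names = 4 names
)
# ValueError: different number of dimensions on data and dims: 3 vs 4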

...

ValueError: Input dimension mis-match. (input[0].shape[1] = 7, input[1].shape[1] = 8)

This is trying to tell us that while the number of dimensions of the data agrees with the number of names to be assigned, their shapes don’t. This is therefore generally due to the coordinate values: we have coords={"dim_1": np.arange(8)} and are trying to use it to index a dimension whose length is 7.
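
A related situation can be sketched with az.from_dict as well; the exact error text differs from the Theano message quoted above, but the cause is the same, namely 8 coordinate values for a dimension of length 7:

import numpy as np
import arviz as az

ary = np.random.randn(4, 200, 7)         # last dimension has length 7
idata = az.from_dict(
    posterior={"x": ary},
    dims={"x": ["dim_1"]},
    coords={"dim_1": np.arange(8)},       # 8 coordinate values for a length-7 dim
)
# raises a ValueError about the mismatching lengths for dim_1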


Back to this particular case: the log_likelihood trick will confirm that the issue is with the likelihood variable and not with any of the variables in the posterior group, and using no dims will have ArviZ fall back to defaults defined from the shape of the data, so there won’t be mismatches. Things like broadcasting gotchas, Theano-numpy incompatibilities or PyMC3 shape bugs can generate arrays whose shapes do not match our expectations and thus don’t match the dims we assign them either. This should give us the actual shape of the array and some insight into why that shape is not compatible with the dims we set.

4 Likes

Thanks for helping!

I’ve now done both:

  1. With the dims for the likelihood set up as ['obs_id', 'date', 'exercise'], I’m getting the error ValueError: different number of dimensions on data and dims: 1 vs 3. This kind of makes sense, since it works when I only have obs_id as the dim. I thought that perhaps I needed to set up the dims of the pm.Data() for obs_id the same way, but then I got an error compiling the model that basically says the same as the previous one: Length of dims must match the dimensions of the dataset. (actual 3 != expected 1)

  2. With no dims for the likelihood, the log_likelihood looks like this:
    [screenshot of the log_likelihood DataArray omitted]

Could it have something to do with the fact that not all dates exist for all exercises?
My source data is structured like this:
[screenshot of the source data frame omitted]

1 Like

It looks like your data is 1D and thus you should only pass obs_id as dims.

Even though your data has exercise and date levels, these are not encoded as dimensions; the data is flattened and each observation carries its exercise and date id. It might be possible to create an array where date and exercise are encoded as dimensions, but I am not sure about it. In that case, you would probably not have obs_id anymore, unless the date-exercise pairs had multiple observations.

It is possible to reshape the result from obs_id to (date, exercise), see for example this Stack Overflow answer, but in your case, where there are exercises without a date, those positions would be filled with NaNs; I am not sure if that is what you want.
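
One way to sketch that reshape with pandas and xarray (hypothetical column names; not necessarily the approach in the linked answer) is to let xarray build the dimensions from an index, with missing date-exercise pairs becoming NaN automatically:

import pandas as pd

# hypothetical flat data, one row per observation
df = pd.DataFrame({
    "date":     ["2021-01-01", "2021-01-08", "2021-01-01"],
    "exercise": ["squat", "squat", "bench"],
    "weight":   [100.0, 105.0, 80.0],
})

# promote the id columns to an index and let xarray turn them into dimensions
ds = df.set_index(["date", "exercise"]).to_xarray()
print(ds["weight"].dims)    # ('date', 'exercise')
print(ds["weight"].values)  # the bench / 2021-01-08 cell is NaN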

1 Like

You are correct, so it’s one step forward, three steps back for me :slight_smile:
I’ve now changed the dataset to an xarray object with three dimensions, so the shape of my data is (10, 45, 15). Now I “just” need to handle all the NaNs in the data somehow, but that’s a separate thread. I will report back here for other readers once I have a solution.

1 Like

Are there any good resources on how to index / set up a structure of several levels from longitudinal data?

My problem is that my dataset has rows of data with several subcategories that I think would work well as a partially pooled model with several layers (for example, if the dataset also had categories for each team such as country and division). I keep stumbling on how to set up the indexes between the different levels.

2 Likes

That sounds like the problem I had. Each level would probably need to be another dimension, and a data frame only allows 1-2 dimensions; xarray, however, can handle them all.

The issue I still haven’t found a solution for is that missing data appears if you don’t have observations for all the cells in the matrix.

In my example, I have 3 dimensions: 10 exercises, 45 dates and 15 observations.
Each observation contains two values.
But I won’t always have 15 observations for every exercise/date pair.
And I won’t always have 45 dates for every exercise. My matrix is basically filled with NaNs :see_no_evil:

1 Like

I’ve now figured out how to create the model:
Creating Hierarchical Models Using 3-dimensional Xarray Data

Also, to handle NaNs, make use of xr.Dataset.to_masked_array()
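
A minimal, hypothetical sketch of that NaN-handling step: to_masked_array is available on DataArray objects, and PyMC3 imputes masked observations automatically as extra latent variables (the data, priors and dims below are made up):

import numpy as np
import xarray as xr
import pymc3 as pm

# hypothetical 3-D observations with gaps, dims (exercise, date, obs)
da = xr.DataArray(
    np.array([[[100.0, np.nan], [np.nan, np.nan]],
              [[80.0, 82.5], [85.0, np.nan]]]),
    dims=("exercise", "date", "obs"),
    coords={"exercise": ["squat", "bench"], "date": ["d1", "d2"], "obs": [0, 1]},
)

observed = da.to_masked_array()  # NaNs become masked entries

with pm.Model(coords={dim: da.coords[dim].values for dim in da.dims}) as model:
    mu = pm.Normal("mu", 0.0, 100.0)
    sigma = pm.HalfNormal("sigma", 10.0)
    # masked cells are treated as missing and imputed during sampling
    weight = pm.Normal("weight", mu=mu, sigma=sigma,
                       observed=observed, dims=("exercise", "date", "obs"))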

3 Likes

Hey! This is amazing. Thank you for sharing this.

I love xarray; I was first introduced to it through ArviZ (thank you for your work on it!) but have come to appreciate it in its own right. I truly believe that labeled coordinates and dimensions are the future of scientific computing. I’m happy to see PyMC3 embracing this!

3 Likes

@OriolAbril this functionality is absolute gold, and thanks hugely for the guide too. I just refactored a model and a manual PPC (it can’t use pymc3.sample_posterior_predictive because it has observations in the posterior). It was getting out of control, but now both the model and the manual PPC use coords throughout, including for indexing and summary stats, and it has really made life a lot simpler!

2 Likes

This explanation was great & very useful! I am trying to apply the combination of pm.Data and dims to my model but I’m getting an error when sampling. Here’s the model:

coords = {
    'LoT': np.arange(len(L)),
    'cat': np.arange(L.shape[1]),
    'obs': np.arange(len(category_i))
}

with pm.Model(coords=coords) as model:
    
    L_data = pm.Data('L', L)#, dims=('LoT', 'cat'))
    cat_i_data = pm.Data('cat_i', category_i)#, dims='obs')
    out_i_data = pm.Data('outcome_i', outcome_i)#, dims='obs')
        
    L_dims = L.shape
    n_lots, n_cats = L.shape

    sigma = pm.HalfNormal('sigma', sigma=2, dims='cat')
    a_0 = pm.Normal('a_0', mu=0, sigma=2)
    a_1 = pm.Normal('a_1', mu=0, sigma=2)

    z_logp = tt.zeros((n_lots,), dtype='float')
    for z_i in range(n_lots):
        zi_logp = 0.
        for cat in range(n_cats):
            obs_idx = np.where(cat_i_data==cat)[0]
            muZ = a_0 + a_1 * L[z_i, cat]
            zi_logp = zi_logp + tt.sum(
                pm.Normal
                .dist(mu=muZ, sd=sigma[cat])
                .logp(out_i_data[obs_idx])
            )
        z_logp = tt.set_subtensor(z_logp[z_i], zi_logp)
    lp3 = pm.Deterministic('z_logp', z_logp, dims='LoT')
    tot_logp = pm.math.logsumexp(z_logp)
    pot = pm.Potential('e', tot_logp)

The model compiles fine, but when I try to sample I get an error (output not shown here).

And in the trace z_logp is always 0, so clearly the sampling isn’t working.

However, when I use the arrays directly instead of using their pm.Data version, everything works fine, and when I comment out

    lp3 = pm.Deterministic('z_logp', z_logp, dims='LoT')
    tot_logp = pm.math.logsumexp(z_logp)
    pot = pm.Potential('e', tot_logp)

everything also works fine (well, except that it’s a different model, but I mean that it runs fine). Also, when I comment out just pot = pm.Potential('e', tot_logp) the problem is still there. So my guess is that it has something to do with the Deterministic z_logp, but I’m not sure what about it is problematic. Thanks in advance for any ideas!

I think this should be a new topic so it can get more answers. I don’t think Theano shared variables should have that limitation, but I may be wrong. Moreover, this could already have been fixed in v4 without my knowing about it.

1 Like

Getting the hang of it can be a bit tricky, and the xarray docs are a bit dispersed and can be hard to search, but I am more and more in awe of all the automagical capabilities it has, as well as of the explicitness of the code (even if it’s obviously more verbose to use a dimension name instead of an integer).

Glad you liked it and thanks for the kind words, but I can’t take credit for the functionality; it has been a team effort both in PyMC3 and in ArviZ.

1 Like