PyMC+ArviZ: how to make the most of labeled coords and dims in PyMC 4.0

Nearly two years ago I wrote a blog post about the integration between PyMC3 and ArviZ (which I also shared here on discourse). I have now adapted and updated it to showcase the further improvements to dims and coords in the latest 4.0 release!

I hope it is also useful!

The blog post mentions the old one a couple of times, but it is completely independent and doesn’t assume you know PyMC 3.x or how dims and coords worked there. Whether you want to jump directly into 4.0 using dims and coords, or you want to see what improvements 4.0 brings to the labeled-arrays game, this blog post has both cases covered.

10 Likes

Thanks for working on this.
Note the code in your blog isn’t wrapping around. When reading on my phone, I couldn’t see some of the dims code.

Thanks, I don’t really know how to fix it, but I would be happy to review and merge a PR that does. The blog source is on GitHub.

Awesome blog post! One issue with reproducing the code: I got an error in 4.0 on Windows when I tried to sample the model with flat priors, so I had to change these to something else uninformative.

On the subject of setting coords: is there any way to get a multi-index into the idata? If I were working with the rugby data, a first step in my EDA would be to do something like:

df_rugby.set_index(['home_team', 'away_team', 'year']).sort_index(level=[0, 1, 2])

It would be nice to be able to do something like post_pred.sel({'home_team': 'France', 'away_team': 'Italy'}) to quickly decompose post-estimation plots to the group level, especially when working with long-form data (which you must do when there are missing values, as far as I can tell).
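To illustrate the kind of selection I mean, here is a sketch with made-up data standing in for the posterior predictive (the values and team/year labels are arbitrary, and I stack the dims into one long-form dimension so the level names become selectable):

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in for the rugby posterior predictive: a labeled
# cube with one value per (home_team, away_team, year) combination.
da = xr.DataArray(
    np.arange(8).reshape(2, 2, 2),
    dims=["home_team", "away_team", "year"],
    coords={
        "home_team": ["France", "Italy"],
        "away_team": ["France", "Italy"],
        "year": [2013, 2014],
    },
)

# Stacking builds a pandas MultiIndex over the three levels, and .sel
# then accepts any subset of the level names:
matches = da.stack(match=["home_team", "away_team", "year"])
france_italy = matches.sel(home_team="France", away_team="Italy")
print(france_italy.values)  # the two France-vs-Italy values, one per year
```

With long-form data you would start already stacked, so the question is how to get that multi-index onto the idata in the first place.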

It also makes it a bit easier to look at the output of az.summary when there are missing values. For example, score_missing[35]… which one was 35 again? Sometimes I’ll make a string representation out of the index tuple (whales_france_2014), but that’s not ideal for easy indexing.

1 Like

I know xarray supports multi-indexes (and whenever you do dataset.stack a multi-index is created under the hood). I think (emphasis on think) that if you use a multi-index as coords it will be passed as-is to xarray, which will handle it correctly. I am not sure how one should go about indexing it, though.
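The stack claim is easy to check directly (the variable and dim names here are arbitrary):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Minimal check that stacking an xarray Dataset really does create a
# pandas MultiIndex under the hood.
ds = xr.Dataset(
    {"y": (("chain", "draw"), np.zeros((2, 3)))},
    coords={"chain": [0, 1], "draw": [0, 1, 2]},
)
stacked = ds.stack(sample=["chain", "draw"])
print(type(stacked.indexes["sample"]))  # a pandas MultiIndex
```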

This is something I’d like to see as an example notebook in the gallery though. handling multilevel models (with more than 2 levels) and making the most of pm.Data, coords, xarray with the output.

1 Like

I had a quick go at directly passing a multi-index as a dim in a use-case where a >2 level model might be justified: a geographic hierarchy. It seems like there is a hiccup in the dimension checking, though. Here is the example notebook with the dimension error.

I’ll try to muddle through, but I’m hoping you can quickly see something I can’t. It seems to be a conflict between the fact that the array representation is only 2d, but I want to label 5 dimensions.

1 Like

The issue here might be with the automatic extraction of the index as coords. IIUC, using a dataframe creates a 2d variable: one dimension has coordinates (the index), but the other one (the columns) does not. Moreover, from the last print it looks like the multi-index is being converted to a tuple (I think to prevent modifying coordinates without using the set_data/set_coords methods).

To try and isolate what is happening, can you try two things: providing the coords explicitly when calling pm.Data, and also defining the dim in pm.Data but without coords, passing the coords as idata_kwargs instead?

It’s definitely related to the extraction of coords. The determine_coords function in data.py is hard-coded to only look for two dimensions: index and columns. pm.Data also explicitly checks that the length of dims is equal to the ndim of the data, so it won’t let you “overload” the dimension by decomposing the multi-index into several coords.

The multi-index is being converted to a tuple of tuples in the model.add_coord method (this also converts the multi-index to a tuple-of-tuples if it is passed in via the idata_kwargs keyword). If it could survive that step, xarray is happy to take a pd.MultiIndex as a coord, and then all the work is done. To demonstrate this, I pass no coords to the model, then slip in the multi-index after the fact:

prior.prior_predictive.coords.update({'likelihood_dim_0':df.index})

Then you can use .sel as expected on a multi-dimensional index, e.g. prior.prior_predictive.sel({'country': 'A'}) returns the prior predictive for all sub-regions in country A: A11, A12, etc. This would be quite nice for quickly doing PPCs by different groupings.

As far as implementation goes, I guess either the add_coord method could be modified to let the multi-index through, or a new routine could be added to backends.arviz.InferenceDataConverter to look for the tuple-of-tuples structure, rebuild the multi-index with pd.MultiIndex.from_tuples, and then set the coords. Neither solution seems very clean, but the second would probably risk fewer unintended consequences.
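For anyone hitting this in the meantime, here is a user-side sketch of the rebuild, done after the fact on a stand-in Dataset (the dim name mirrors the auto-generated "likelihood_dim_0" from the thread; the data and level names are made up). Instead of attaching a raw pd.MultiIndex directly, I attach the levels as coords and promote them with set_index, which sidesteps version differences in how xarray accepts a bare MultiIndex:

```python
import numpy as np
import xarray as xr

# The coords as they come out of model.add_coord: a tuple of tuples.
tuple_coords = (("A", "A11"), ("A", "A12"), ("B", "B21"))

# Stand-in for the prior predictive group.
ds = xr.Dataset({"obs": (("likelihood_dim_0",), np.array([1.0, 2.0, 3.0]))})

# Rebuild the levels from the tuples and promote them to a multi-index.
ds = ds.assign_coords(
    country=("likelihood_dim_0", [t[0] for t in tuple_coords]),
    region=("likelihood_dim_0", [t[1] for t in tuple_coords]),
).set_index(likelihood_dim_0=["country", "region"])

# Level-based selection now works:
country_a = ds.sel(country="A")
print(country_a["obs"].values)  # the two country-A values
```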

3 Likes

It looks like we’d need to undo Coerce coords into np.array in numpy_to_data_array by eidanch · Pull Request #1695 · arviz-devs/arviz · GitHub on the ArviZ side, and change the check from testing for a tuple to testing for a multi-index. Or use a try/except to use the coords as-is, and only convert them to an array if that fails. That would fix the issue when using idata_kwargs; after that, we’d need to look at how to support this better from PyMC (I am not very familiar with the internals of coord handling in v4 yet and can’t really see what unexpected consequences each of the options would have).
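Roughly what I have in mind, as an illustrative sketch (the function name is made up; this is not the actual ArviZ code):

```python
import numpy as np
import pandas as pd

def coerce_coords(values):
    """Keep a pandas MultiIndex as-is; coerce everything else to an array.

    The isinstance check has to come first: np.asarray would not fail on
    a MultiIndex, it would silently flatten it into an object array of
    tuples, which is exactly the behavior we want to avoid.
    """
    if isinstance(values, pd.MultiIndex):
        return values
    try:
        return np.asarray(values)
    except Exception:
        return values

midx = pd.MultiIndex.from_tuples([("A", "A11"), ("B", "B21")])
assert coerce_coords(midx) is midx                    # multi-index survives
assert isinstance(coerce_coords([1, 2]), np.ndarray)  # plain lists still coerced
```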

2 Likes

Thanks for this thread, guys! I was hitting this multi-index issue and it helped me get clarity quickly :ok_hand:

2 Likes