PyMC+ArviZ: how to make the most of labeled coords and dims in PyMC 4.0

OriolAbril · June 6, 2022, 11:34pm

Nearly two years ago I wrote a blog post about the integration between PyMC3 and ArviZ (which I also shared here on discourse). I have now adapted and updated it to showcase all the further improvements on dims and coords in the latest 4.0 release!

I hope it is also useful!

The blog post mentions the old one a couple times, but it is completely independent and it doesn’t assume you know PyMC 3.x or how dims and coords worked there. Whether you want to jump directly on 4.0 using dims and coords or you want to see what improvements does 4.0 bring to the labeled arrays game this blog post has both cases covered.

wmbelk · June 7, 2022, 10:39am

Thanks for working on this.
Note the code in your blog isn’t wrapping around. When reading on my phone, I couldn’t see some of the dims code.

OriolAbril · June 7, 2022, 2:49pm

Thanks, I don’t really know how to fix it but would be happy to review and merge a PR that does. The blog source is on GitHub

jessegrabowski · June 8, 2022, 9:09am

Awesome blog post! One issue with the reproducing the code – I get an error in 4.0 on Windows when I tried to sample the model with flat priors, so I had to change these to something else uninformative.

On the subject of setting coords: is there any way to get a multi-index into the idata? If I were working with the rugby data, a first step in my EDA would be to do something like:

df_rugby.set_index(['home_team', 'away_team', 'year']).sort_index(level=[0, 1, 2])

It would be nice to be able to do something like post_pred.sel({'home_team':'France', 'away_team':'Italy'} to quickly decompose post-estimation plots to the group level, especially when working with long-form data (which you must do when there are missing values, as far as I can tell).

It also makes it a bit easier to look at the output of az.summary when there missing values. For example score_missing[35]… which one was 35 again? Sometimes I’ll make a string representation out of the index tuple (whales_france_2014), but that’s not ideal for easy indexing.

OriolAbril · June 9, 2022, 8:57pm

I know xarray supports multiindex (and whenever you do dataset.stack a multiindex is created under the hood). I think (emphasis on think) if you use a multiindex as coords it will be passed as is to xarray which will handle it correctly. I am not sure how should one go about indexing it though.

This is something I’d like to see as an example notebook in the gallery though. handling multilevel models (with more than 2 levels) and making the most of pm.Data, coords, xarray with the output.

jessegrabowski · June 10, 2022, 4:23am

I had a quick go of trying to directly pass a multi-index as a dim in a use-case where a >2 level model might be justified, in a geographic hierarchy. It seems like there is a hiccup in the dimension checking, though. Here is the example notebook with the dimension error.

I’ll try to muddle though, but hoping you can quickly see something I won’t. It seems to be a conflict between the fact that the array representation is only 2d, but I want to label 5 dimensions.

OriolAbril · June 10, 2022, 6:40am

The issue here might be with the automatic extraction of index as coords. IIUC, using a dataframe creates a 2d variable, one dimension has coordinates, the index, but the other one, the columns, does not. Moreover, from the last print it looks like the multiindex is being converted to a tuple (I think to prevent modifying coordinates without using the set_data/set_coords methods).

To try and isolate what is happening, can you try providing the coords explicitly when calling pm.Data and also defining the dim in pm.Data but without coords and passing the coords as idata_kwargs?

jessegrabowski · June 10, 2022, 7:34am

It’s definitely related to the extraction of coords. The determine_coords function in data.py is hard-coded to only look for two dimensions: index and column. pm.Data also explicitly checks that the length of dims is equal to the ndims of the data, so it won’t let you “overload” the dimension by decomposing the multi-index into several coords.

The multi-index is being converted to a tuple of tuples in the model.add_cord method (this also converts the multi-index to a tuple-of-tuples if it is passed in via the idata_kwargs keyword) If it could survive that step, xarray is happy to take a pd.MultiIndex as a coord, and then all the work is done. To demonstrate this, I pass no coords to the model then slip in the multi-index after the fact:

prior.prior_predictive.coords.update({'likelihood_dim_0':df.index})

Then you can use .sel as expected on a multi-dimensional index, e.g. prior.prior_predictive.sel({'country':'A'} returns the prior predictive for all sub-regions in country A: A11, A12, etc. This would be quite nice for quickly doing PPCs by different groupings.

As far as implementation goes, I guess either the add_coord method could be modified to allow the mutli-index through, or a new routine could be added to backends.arviz.InferenceDataConverter to look for the tuple-of-tuples structure, rebuild the multi-index with pd.MultiIndex.from_tuples, and then set the coords. Neither solution seems very clean, but the second would probably risk fewer unintended consequences.

OriolAbril · June 11, 2022, 7:26am

It looks like we’d need to undo Coerce coords into np.array in numpy_to_data_array by eidanch · Pull Request #1695 · arviz-devs/arviz · GitHub on ArviZ side, and change the test from testing a tuple to testing a multiindex. Or use a try except to use the coords as is and only if that fails conveet them to an array. That would fix the issue when using idata_kwargs, after that we’d need to look on how to support this better from pymc (I am not very familiar with internals of coord handling in v4 yet and can’t really see what unexpected consequences would each of the options have)

AlexAndorra · November 11, 2022, 8:12pm

Thanks for that thread guys! I was hitting this Multi-index issue and helped me get clarity quickly

Topic		Replies	Views
PyMC3+ArviZ: improve your workflow with labeled coords and dims Sharing doc	20	5816	April 5, 2021
Understanding coords, indexation, Data, ..., for multilevel models v5 modeling	1	3965	April 29, 2022
Stochastic labels Questions	10	766	October 5, 2020
Indexing constantdata by label v5	0	123	January 24, 2024
Use xarray for traces Development	5	1775	June 15, 2017

PyMC+ArviZ: how to make the most of labeled coords and dims in PyMC 4.0

Related topics