Dims in pm.Data

Ansul · November 21, 2023, 2:57pm

Hi, I was going through this post by @OriolAbril: https://oriolabril.github.io/oriol_unraveled/python/arviz/pymc3/xarray/2020/09/22/pymc3-arviz.html
and I could not figure out how to specify the dimensions inside pm.Data.

with pm.Model() as model:

    input_data = pm.MutableData("input_data", 
                                value=df,
                                dims=tuple(df.columns))

My df has 26 columns. I read the pm.Data/pm.MutableData documentation for the ‘value’ and ‘dims’ parameters, and so I passed a tuple of the column names:

value : array_like or pandas.Series, pandas.Dataframe
A value to associate with this variable.
dims : str or tuple of str, optional
Dimension names of the random variables (as opposed to the shapes of these
random variables). Use this when value is a pandas Series or DataFrame. The
dims will then be the name of the Series / DataFrame’s columns. See ArviZ
documentation for more information about dimensions and coordinates:

But I still get the error below:

Length of `dims` must match the dimensions of the dataset. (actual 26 != expected 2)

Can anyone please help here to correctly specify the value and dims inside pm.MutableData?

jessegrabowski · November 21, 2023, 3:25pm

You are passing the labels (the names of each “thing” within a dimension) to dims, which expects one name per dimension of the data.

That is, your data is a matrix. It has two dimensions, “index” and “columns”. That’s why expected is 2. The 26 column names are labels within the second dimension (the “columns” dimension), not dimensions themselves.

What you need to do is make a coords dictionary, like this:

coords = {'obs_idx': df.index, 'feature':df.columns}

Use this coords dict to tell PyMC what dims are expected, and what they contain. After PyMC knows this, you can use dims to label dimensions of your model objects:

with pm.Model(coords=coords) as model:

    input_data = pm.MutableData("input_data", 
                                value=df,
                                dims=['obs_idx', 'feature'])
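
To make the dims/labels distinction concrete, here is a minimal pure-Python sketch of the two rules involved. The helper `check_dims` is hypothetical and is not PyMC's implementation; it only mirrors the error message quoted above, and the column names and the 142-row shape are assumed for illustration.

```python
def check_dims(shape, dims, coords):
    """Validate dims and coords against a data shape (illustrative only)."""
    # Rule 1: one dim name per axis, so a 2-D table needs exactly two names.
    if len(dims) != len(shape):
        raise ValueError(
            f"Length of `dims` must match the dimensions of the dataset. "
            f"(actual {len(dims)} != expected {len(shape)})"
        )
    # Rule 2: each dim needs one label per entry along its axis.
    for axis_len, dim in zip(shape, dims):
        if len(coords[dim]) != axis_len:
            raise ValueError(
                f"{dim!r} has {len(coords[dim])} labels for an axis of length {axis_len}"
            )


shape = (142, 26)  # rows x columns (assumed shape of df)
columns = [f"column_{i}" for i in range(1, 27)]  # assumed column names
coords = {"obs_idx": range(142), "feature": columns}

# Two dim names for a 2-D table: passes silently.
check_dims(shape, ("obs_idx", "feature"), coords)

# The original mistake: 26 labels passed where 2 dim names are expected.
try:
    check_dims(shape, tuple(columns), coords)
except ValueError as err:
    message = str(err)
```

The column names still reach the model, but through coords, where they label the entries of the 'feature' dimension.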

Also you can check out the tutorial on data containers, which will make you an expert on everything data-related in PyMC.

Ansul · November 21, 2023, 3:55pm

Thanks @jessegrabowski, that worked. However, when I call pm.set_data, it looks up the column names as if they were model variables.
I am passing set_data a different dataframe, with a different length from df above. This is how I call it:

with model:
    pm.set_data(df_2.to_dict())
    y_test = pm.sample_posterior_predictive(idata)

But this throws a KeyError:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/opt/conda/lib/python3.9/site-packages/pymc/model/core.py in __getitem__(self, key)
   1519             try:
-> 1520                 return self.named_vars[self.name_for(key)]
   1521             except KeyError:

KeyError: 'column_1'

Am I supposed to define each of the 26 columns as its own pm.MutableData inside the with context?

jessegrabowski · November 21, 2023, 3:58pm

pm.set_data expects a dictionary with the names of data containers as keys and new data as values. In your case, this will be pm.set_data({'input_data':df_2}). You will also need to use coords_mutable for the index dimension if you plan to set new data, and provide new labels to it via set_data.
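
As a rough illustration of why the column-keyed dictionary fails, here is a pure-Python sketch. The set `registered` and the helper `unknown_names` are hypothetical stand-ins for the model's name lookup, not PyMC code, and the nested dict only imitates the shape of what `df_2.to_dict()` returns.

```python
# Hypothetical stand-in for the model's name registry; the model in this
# thread registers a single data container called "input_data".
registered = {"input_data"}


def unknown_names(new_data, registry):
    """Return the keys of new_data that do not name a registered variable."""
    return [name for name in new_data if name not in registry]


# df_2.to_dict() is keyed by column name: {"column_1": {0: ..., 1: ...}, ...}
per_column = {"column_1": {0: 1.0, 1: 2.0}, "column_2": {0: 3.0, 1: 4.0}}
bad = unknown_names(per_column, registered)  # every column name is unknown

# Keyed by container name instead, with the whole table as the value.
by_container = {"input_data": per_column}
ok = unknown_names(by_container, registered)  # nothing is unknown
```

The first unknown key is what surfaces as KeyError: 'column_1' in the traceback above.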

I strongly encourage you to work through the linked tutorial about data containers.

Ansul · November 22, 2023, 10:56am

Hi @jessegrabowski, I went through the tutorial and changed the code as follows:

coords = {'obs_id': df.index, 'feature':df.columns}
coords_mutable = {'obs_id': np.arange(len(df))}
with pm.Model(coords=coords,coords_mutable=coords_mutable) as model:

    input_data = pm.MutableData("input_data", 
                                value=df,
                                dims=['obs_id', 'feature'])

Here df.shape is (142, 26) and data_2.shape is (65, 26).
Then, for out-of-sample predictions, here is what I did:

with model:
    pm.set_data({"input_data":data_2},
               coords=np.arange(len(data_2)))
    y_test = pm.sample_posterior_predictive(idata)

This gives me the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_336/3347906026.py in <module>
      1 with model:
----> 2     pm.set_data({"input_data":data_2},
      3                coords=np.arange(len(data_2)))
      4     y_test = pm.sample_posterior_predictive(idata)
/opt/conda/lib/python3.9/site-packages/pymc/model/core.py in set_data(new_data, model, coords)
   2019 
   2020     for variable_name, new_value in new_data.items():
-> 2021         model.set_data(variable_name, new_value, coords=coords)
   2022 
   2023 

/opt/conda/lib/python3.9/site-packages/pymc/model/core.py in set_data(self, name, values, coords)
   1145         values = convert_observed_data(values)
   1146         dims = self.named_vars_to_dims.get(name, None) or ()
-> 1147         coords = coords or {}
   1148 
   1149         if values.ndim != shared_object.ndim:

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

jessegrabowski · November 22, 2023, 11:07am

The coords argument in pm.set_data needs to be a dictionary that maps dimension names to their new labels. See the docstring for details.
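
A small NumPy-only sketch of why the bare array fails at the `coords = coords or {}` line in the traceback, and of the dictionary form that avoids it. The variable names are illustrative, and the PyMC call appears only as a comment because it needs the model and data from this thread.

```python
import numpy as np

new_labels = np.arange(65)  # one label per row of data_2

# `or` asks for the truth value of its left operand, and NumPy refuses to
# give one for an array with more than one element.
try:
    coords = new_labels or {}
    error = None
except ValueError as err:
    error = str(err)

# Wrapping the labels in a dict keyed by the dim name avoids the problem:
# a non-empty dict is simply truthy.
coords = {"obs_id": new_labels}
coords = coords or {}

# With the model from this thread, the corrected call would then be:
# pm.set_data({"input_data": data_2}, coords=coords)
```

The dictionary key has to match the dim name used when the data container was declared ('obs_id' in the model above).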
