Dims in pm.Data

Hi, I was going through this post by @OriolAbril: https://oriolabril.github.io/oriol_unraveled/python/arviz/pymc3/xarray/2020/09/22/pymc3-arviz.html
and I could not figure out how to specify the dimensions inside pm.Data.

with pm.Model() as model:

    input_data = pm.MutableData("input_data", 
                                value=df,
                                dims=tuple(df.columns))

My df has 26 columns. I read the pm.Data/pm.MutableData documentation for the ‘value’ and ‘dims’ parameters, so I passed a tuple of the column names:

value : array_like or pandas.Series, pandas.Dataframe
A value to associate with this variable.
dims : str or tuple of str, optional
Dimension names of the random variables (as opposed to the shapes of these
random variables). Use this when value is a pandas Series or DataFrame. The
dims will then be the name of the Series / DataFrame’s columns. See ArviZ
documentation for more information about dimensions and coordinates:

But I still get the error below:

Length of `dims` must match the dimensions of the dataset. (actual 26 != expected 2)

Can anyone please help here to correctly specify the value and dims inside pm.MutableData?

You are passing the labels (the names of each “thing” within a dimension) to dims, which expects one name per dimension of the data.

That is, your data is a matrix. It has two dimensions, “index” and “columns”, which is why expected is 2. The 26 column names are labels within the second dimension (the “columns” dimension), not dimensions themselves.
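To make the distinction concrete, here is a minimal sketch using only pandas and NumPy. The DataFrame is a hypothetical stand-in with the same shape as the one in the question, and the column names and dimension names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 142 x 26 DataFrame from the question.
df = pd.DataFrame(
    np.zeros((142, 26)),
    columns=[f"column_{i}" for i in range(1, 27)],
)

n_dims = df.ndim            # number of axes: rows and columns
n_labels = len(df.columns)  # number of labels along the column axis

print(n_dims)    # 2
print(n_labels)  # 26

# `dims` needs one name per axis, so its length must equal df.ndim,
# not the number of columns.
dims = ("obs_idx", "feature")
assert len(dims) == n_dims
```

Passing tuple(df.columns) gives 26 names for a 2-axis object, which is the mismatch the error message reports.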

What you need to do is make a coords dictionary, like this:

coords = {'obs_idx': df.index, 'feature':df.columns}

Use this coords dict to tell PyMC what dims are expected, and what they contain. After PyMC knows this, you can use dims to label dimensions of your model objects:

with pm.Model(coords=coords) as model:

    input_data = pm.MutableData("input_data", 
                                value=df,
                                dims=['obs_idx', 'feature'])

Also you can check out the tutorial on data containers, which will make you an expert on everything data-related in PyMC.


Thanks @jessegrabowski, that worked. Now pm.set_data seems to be looking up the column names as model variables.
I am passing set_data a different dataframe, df_2, whose length differs from df above. This is how I call it:

with model:
    pm.set_data(df_2.to_dict())
    y_test = pm.sample_posterior_predictive(idata)

But this throws a KeyError:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/opt/conda/lib/python3.9/site-packages/pymc/model/core.py in __getitem__(self, key)
   1519             try:
-> 1520                 return self.named_vars[self.name_for(key)]
   1521             except KeyError:

KeyError: 'column_1'

Am I supposed to define each of the 26 variables in the mutable data function inside the with context?

pm.set_data expects a dictionary with the names of data containers as keys and new data as values. In your case, this will be pm.set_data({'input_data':df_2}). You will also need to use coords_mutable for the index dimension if you plan to set new data, and provide new labels to it via set_data.
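The difference between the two dictionaries can be checked without a model. This is a small sketch with a hypothetical stand-in for df_2 (the real column names beyond column_1 are not shown in the thread, so the ones here are assumed).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the new 65 x 26 DataFrame.
columns = [f"column_{i}" for i in range(1, 27)]
df_2 = pd.DataFrame(np.zeros((65, 26)), columns=columns)

# DataFrame.to_dict() is keyed by column name, so set_data would look for
# a model variable called "column_1", which was never registered.
wrong = df_2.to_dict()
print(list(wrong)[:3])  # ['column_1', 'column_2', 'column_3']

# Keyed by the data container's name instead, with the whole frame as value.
right = {"input_data": df_2}
print(list(right))  # ['input_data']
```

Only the second dictionary has a key that matches a name registered in the model, which is why the first one raises KeyError: 'column_1'.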

I strongly encourage you to work through the linked tutorial about data containers.

Hi @jessegrabowski, I went through the link and changed the code as follows:

coords = {'obs_id': df.index, 'feature':df.columns}
coords_mutable = {'obs_id': np.arange(len(df))}
with pm.Model(coords=coords,coords_mutable=coords_mutable) as model:

    input_data = pm.MutableData("input_data", 
                                value=df,
                                dims=['obs_id', 'feature'])

Here df.shape is (142, 26) and data_2.shape is (65, 26).
For out-of-sample predictions, this is what I did:

with model:
    pm.set_data({"input_data":data_2},
               coords=np.arange(len(data_2)))
    y_test = pm.sample_posterior_predictive(idata)

This gives me the following error -

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_336/3347906026.py in <module>
      1 with model:
----> 2     pm.set_data({"input_data":data_2},
      3                coords=np.arange(len(data_2)))
      4     y_test = pm.sample_posterior_predictive(idata)
/opt/conda/lib/python3.9/site-packages/pymc/model/core.py in set_data(new_data, model, coords)
   2019 
   2020     for variable_name, new_value in new_data.items():
-> 2021         model.set_data(variable_name, new_value, coords=coords)
   2022 
   2023 

/opt/conda/lib/python3.9/site-packages/pymc/model/core.py in set_data(self, name, values, coords)
   1145         values = convert_observed_data(values)
   1146         dims = self.named_vars_to_dims.get(name, None) or ()
-> 1147         coords = coords or {}
   1148 
   1149         if values.ndim != shared_object.ndim:

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

The coords argument in pm.set_data needs to be a dictionary that maps each dimension name to its new labels, not a bare array. See the docstring for details.
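The traceback points at the line `coords = coords or {}`, and the failure can be reproduced with NumPy alone. The sketch below assumes the dimension name obs_id from the model above and 65 new rows; the final pm.set_data call is shown only as a comment and is not executed here.

```python
import numpy as np

n_new = 65  # number of rows in the new data, as in the thread
labels = np.arange(n_new)

# set_data evaluates `coords or {}`. With a NumPy array of more than one
# element, that truth test is ambiguous and raises ValueError.
try:
    labels or {}
    raised = False
except ValueError:
    raised = True
print(raised)  # True

# A dict mapping the dimension name to its new labels has a well-defined
# truth value, so the same expression passes it through unchanged.
new_coords = {"obs_id": labels}
result = new_coords or {}
print(result is new_coords)  # True

# With the model from the thread, the call would then be (not run here):
# pm.set_data({"input_data": data_2}, coords=new_coords)
```

The dictionary form also tells PyMC which dimension the new labels belong to, something a bare array cannot express.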