with pm.Model() as model:
    input_data = pm.MutableData("input_data",
                                value=df,
                                dims=tuple(df.columns))
My df has 26 columns. I read the pm.Data/pm.MutableData documentation for the ‘value’ and ‘dims’ parameters and therefore passed a tuple of the column names:
value : array_like or pandas.Series, pandas.Dataframe
    A value to associate with this variable.
dims : str or tuple of str, optional
    Dimension names of the random variables (as opposed to the shapes of these
    random variables). Use this when value is a pandas Series or DataFrame. The
    dims will then be the name of the Series / DataFrame’s columns. See ArviZ
    documentation for more information about dimensions and coordinates:
But I am still getting the error below:
Length of `dims` must match the dimensions of the dataset. (actual 26 != expected 2)
Can anyone please help me specify value and dims correctly in pm.MutableData?
You are passing the labels (the names of each “thing” within a dimension) to dims, which expects one name per dimension and so tells PyMC how many dimensions there are.
That is, your data is a matrix. It has two dimensions, “index” and “columns”. That’s why expected is 2. Within the second dimension (the “columns” dimension) you want to provide labels, because there are 26 columns.
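To make the mismatch in the error message concrete, here is a minimal sketch with a stand-in DataFrame of the same shape. The zero-filled data and the `col_i` column names are made up for illustration; only the shape comes from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the DataFrame in the question (142 rows, 26 columns).
df = pd.DataFrame(np.zeros((142, 26)), columns=[f"col_{i}" for i in range(26)])

n_axes = df.ndim                    # the length that `dims` has to match
n_columns = len(tuple(df.columns))  # the length of what was passed as `dims`
print(n_axes, n_columns)  # 2 26
```

The two numbers are the "expected 2" and "actual 26" from the error message.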
What you need to do is make a coords dictionary, like this:
coords = {'obs_idx': df.index, 'feature': df.columns}
Use this coords dict to tell PyMC what dims are expected, and what they contain. After PyMC knows this, you can use dims to label dimensions of your model objects:
with pm.Model(coords=coords) as model:
    input_data = pm.MutableData("input_data",
                                value=df,
                                dims=['obs_idx', 'feature'])
Thanks @jessegrabowski, that worked. However, when I call pm.set_data, it looks for the column names. I am passing set_data a different DataFrame, df_2, which has a different length from df above. This is how I call it:
with model:
    pm.set_data(df_2.to_dict())
    y_test = pm.sample_posterior_predictive(idata)
pm.set_data expects a dictionary with the names of data containers as keys and new data as values. In your case, this will be pm.set_data({'input_data':df_2}). You will also need to use coords_mutable for the index dimension if you plan to set new data, and provide new labels to it via set_data.
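The difference between the two dictionaries can be seen without PyMC at all. The sketch below uses a small made-up DataFrame (three columns named `a`, `b`, `c`, which are assumptions for illustration) to show what `df_2.to_dict()` is keyed by, which is presumably why set_data appeared to be "searching for columns".

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_2; the column names are made up.
df_2 = pd.DataFrame(np.zeros((65, 3)), columns=["a", "b", "c"])

# DataFrame.to_dict() is keyed by column name, so each column name would be
# treated as the name of a data container.
wrong = df_2.to_dict()
print(list(wrong))  # ['a', 'b', 'c']

# Keyed by the name given to pm.MutableData, with the whole DataFrame as value.
new_data = {"input_data": df_2}
print(list(new_data), new_data["input_data"].shape)  # ['input_data'] (65, 3)
```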
I strongly encourage you to work through the linked tutorial about data containers.
Hi @jessegrabowski, I went through the link and changed the code as below:
coords = {'obs_id': df.index, 'feature': df.columns}
coords_mutable = {'obs_id': np.arange(len(df))}
with pm.Model(coords=coords, coords_mutable=coords_mutable) as model:
    input_data = pm.MutableData("input_data",
                                value=df,
                                dims=['obs_id', 'feature'])
Here df.shape is (142, 26) and data_2.shape is (65, 26).
Then, for out-of-sample predictions, here is what I did:
with model:
    pm.set_data({"input_data":data_2},
                coords=np.arange(len(data_2)))
    y_test = pm.sample_posterior_predictive(idata)
This gives me the following error:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_336/3347906026.py in <module>
      1 with model:
----> 2     pm.set_data({"input_data":data_2},
      3                 coords=np.arange(len(data_2)))
      4     y_test = pm.sample_posterior_predictive(idata)

/opt/conda/lib/python3.9/site-packages/pymc/model/core.py in set_data(new_data, model, coords)
   2019
   2020     for variable_name, new_value in new_data.items():
-> 2021         model.set_data(variable_name, new_value, coords=coords)
   2022
   2023

/opt/conda/lib/python3.9/site-packages/pymc/model/core.py in set_data(self, name, values, coords)
   1145         values = convert_observed_data(values)
   1146         dims = self.named_vars_to_dims.get(name, None) or ()
-> 1147         coords = coords or {}
   1148
   1149         if values.ndim != shared_object.ndim:

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
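The line flagged in the traceback, `coords = coords or {}`, suggests that set_data expects coords to be a dictionary (or None) and not a bare array. The sketch below reproduces that one expression in isolation with NumPy only. The helper `normalise_coords` is hypothetical and simply wraps the expression from the traceback; it is not part of PyMC.

```python
import numpy as np

def normalise_coords(coords):
    # Hypothetical helper: the same expression as the flagged line in the traceback.
    return coords or {}

new_index = np.arange(65)  # stand-in for np.arange(len(data_2))

try:
    normalise_coords(new_index)  # a multi-element array has no single truth value
    raised = False
except ValueError:
    raised = True
print(raised)  # True

# A dict keyed by dimension name passes through the same expression unchanged.
coords = normalise_coords({"obs_id": new_index})
print(list(coords))  # ['obs_id']
```

On that reading, the call in the question would pass coords={"obs_id": np.arange(len(data_2))} instead of the bare array. This has not been run against the model above.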