Trace_to_dataframe errors after trace.remove_values

Hi PyMC3 community,

I ran into the following issue with PyMC3 3.8: trace_to_dataframe raises a KeyError if called on a trace that had an RV removed via remove_values. Here is an MWE:

import pymc3 as pm
import numpy as np

data = np.random.normal(loc=2, scale=0.5, size=10)

with pm.Model() as m:
    mu = pm.Normal("mu", mu=3, sigma=1)
    sig = pm.InverseGamma("sigma", alpha=1, beta=1)
    pm.Normal("d", mu=mu, sigma=sig, observed=data)

trace = pm.sample(model=m, chains=2, cores=1)

trace.remove_values("sigma")
pm.summary(trace)
df = pm.trace_to_dataframe(trace)
df.head()

The error is

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Untitled-1 in <module>
     14 trace.remove_values("sigma")
     15 pm.summary(trace)
---> 16 df = pm.trace_to_dataframe(trace)
     17 df.head()

~\miniconda3\envs\dive\lib\site-packages\pymc3\backends\tracetab.py in trace_to_dataframe(trace, chains, varnames, include_transformed)
     36     var_dfs = []
     37     for v in varnames:
---> 38         vals = trace.get_values(v, combine=True, chains=chains)
     39         flat_vals = vals.reshape(vals.shape[0], -1)
     40         var_dfs.append(pd.DataFrame(flat_vals, columns=flat_names[v]))

~\miniconda3\envs\dive\lib\site-packages\pymc3\backends\base.py in get_values(self, varname, burn, thin, combine, chains, squeeze)
    471         try:
    472             results = [self._straces[chain].get_values(varname, burn, thin)
--> 473                        for chain in chains]
    474         except TypeError:  # Single chain passed.
    475             results = [self._straces[chains].get_values(varname, burn, thin)]

~\miniconda3\envs\dive\lib\site-packages\pymc3\backends\base.py in <listcomp>(.0)
    471         try:
    472             results = [self._straces[chain].get_values(varname, burn, thin)
--> 473                        for chain in chains]
    474         except TypeError:  # Single chain passed.
    475             results = [self._straces[chains].get_values(varname, burn, thin)]

~\miniconda3\envs\dive\lib\site-packages\pymc3\backends\ndarray.py in get_values(self, varname, burn, thin)
    286         A NumPy array
    287         """
--> 288         return self.samples[varname][burn::thin]
    289 
    290     def _slice(self, idx):

KeyError: 'sigma'

Is there a workaround for this? I also see that trace_to_dataframe is going to be deprecated and removed (issue #3907). What should I use instead to convert a trace to a pandas DataFrame?

Thanks!

Hi,
Yeah, I think removing an RV messes up the trace’s metadata, which would explain the error. I think I remember that @OriolAbril already encountered this case, didn’t you?
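To make the suspected failure mode concrete, here is a toy sketch of how stale metadata can produce this kind of KeyError. ToyTrace and to_rows are hypothetical stand-ins written for illustration, not PyMC3's actual classes or code; the only assumption is that the default variable list comes from metadata that remove_values does not update.

```python
# Toy stand-in for a trace backend; NOT PyMC3's actual implementation.
class ToyTrace:
    def __init__(self, samples):
        self.samples = dict(samples)                      # name -> list of draws
        self.var_shapes = {name: () for name in samples}  # metadata kept separately

    def remove_values(self, name):
        # Drops the samples but leaves the metadata untouched.
        del self.samples[name]

    def get_values(self, name):
        return self.samples[name]


def to_rows(trace, varnames=None):
    # The default variable list is read from the (possibly stale) metadata.
    if varnames is None:
        varnames = list(trace.var_shapes)
    columns = [trace.get_values(v) for v in varnames]
    return varnames, list(zip(*columns))


trace = ToyTrace({"mu": [2.1, 1.9, 2.0], "sigma": [0.5, 0.6, 0.4]})
trace.remove_values("sigma")

try:
    to_rows(trace)
    stale_error = None
except KeyError as err:
    stale_error = err.args[0]  # the name still listed in the stale metadata

# Naming the surviving variables explicitly sidesteps the stale default.
names, rows = to_rows(trace, varnames=list(trace.samples))
```

If the real cause is similar, passing varnames explicitly to trace_to_dataframe (the parameter is visible in the traceback signature) might avoid the error, but that is untested against PyMC3 itself.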

As for handling the traces, PyMC’s plots and diagnostics are now handled by ArviZ, which uses InferenceData objects to manipulate traces. These are much more powerful than pandas DataFrames or NumPy arrays for handling multidimensional traces, which are very common in Bayesian modeling.
In particular, you can name the dimensions of the variables and use them to slice the trace, without having to bother with shape handling. I would advise reading the quickstart for a more detailed example.

Hope this helps :vulcan_salute:

As @AlexAndorra says, in most cases xarray datasets will better fit the data stored in the trace object, which is bound to be multidimensional and not table-like. Moreover, arviz.from_pymc3 will work even after removing variables from the trace.

I don’t really know how trace_to_dataframe works, but if you really want to use pandas instead of xarray, xarray.Dataset objects have a to_dataframe converter method. Thus, something like this would produce a dataframe (hopefully similar to the one generated by trace_to_dataframe):

import arviz as az

# Convert the MultiTrace to InferenceData, then flatten the posterior group
idata = az.from_pymc3(trace, model=m)
df = idata.posterior.to_dataframe()

Having said that, I would argue that most situations do not really require removing variables from the trace. For simple cases, using the var_names argument in summary and plots would suffice, and for more complicated ones (or for convenience with simpler ones), you can create a subset of the posterior group (no extra memory usage) and work with it:

idata = az.from_pymc3(trace, model=m)
data = idata.posterior[["mu", ...]] # select only desired variables

The main advantage is that sigma stays available if it is needed later, without having to rerun the sampling. For example, calculating pointwise likelihood values (needed for waic or loo) requires all the variables in the original trace, and so does sample_posterior_predictive (which can be used either for posterior predictive checks or to predict new values based on the model).

Thank you both, this is very helpful!

Should I file an issue regarding remove_values?

The only reason I am converting to a pandas dataframe is to generate a CSV file that I could then import into an old MATLAB workflow (which I will port over to Python stepwise). What is the most efficient route to save a pymc3.MultiTrace as a CSV file?

Glad this was helpful!
I’m not sure this is an issue actually: as Oriol said, removing variables from the trace will prevent computing information criteria like WAIC or LOO as well as doing posterior predictive checks. This is a pretty big dent in the principled Bayesian workflow, so I’m not sure making it easier is the way to go.

As for saving to a CSV, I would look at the xarray docs: since xarray builds on pandas, there is a chance it has a way to save directly to CSV; or at least you can do to_dataframe as Oriol said, then to_csv.
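For the CSV route, the layout to expect is one row per (chain, draw) pair with one column per variable, which is roughly what idata.posterior.to_dataframe() followed by pandas' to_csv should give for scalar variables. A stdlib-only sketch of that layout, using made-up numbers and a hypothetical posterior_to_csv helper (not an ArviZ or xarray function):

```python
import csv
import io

# Hypothetical posterior draws, indexed as samples[var][chain][draw].
samples = {
    "mu": [[2.1, 1.9], [2.0, 2.2]],
    "sigma": [[0.5, 0.6], [0.4, 0.7]],
}


def posterior_to_csv(samples, handle):
    """Write one row per (chain, draw) with one column per variable."""
    names = sorted(samples)
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["chain", "draw", *names])
    first = samples[names[0]]  # assumes every variable has the same chain/draw layout
    for chain, draws in enumerate(first):
        for draw in range(len(draws)):
            writer.writerow([chain, draw, *(samples[n][chain][draw] for n in names)])


buffer = io.StringIO()
posterior_to_csv(samples, buffer)
csv_text = buffer.getvalue()
```

A flat file like this, with a header row and numeric columns, is the kind of table an existing MATLAB import script would usually expect.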
I don’t know if MATLAB reads NetCDF files, but I just want to highlight that ArviZ can save the trace to NetCDF, in case it’s useful :man_shrugging:

Thank you, NetCDF is the solution! One can even skip the explicit InferenceData intermediate.

For future reference, here’s the full workflow:

Python 3.7.6 / PyMC3 3.8 / ArviZ 0.6.1:

import pymc3 as pm
import numpy as np
import arviz as az

data = np.random.normal(loc=2, scale=0.5, size=10)

with pm.Model() as m:
    mu = pm.Normal("mu", mu=3, sigma=1)
    sig = pm.InverseGamma("sigma", alpha=1, beta=1)
    pm.Normal("d", mu=mu, sigma=sig, observed=data)

trace = pm.sample(model=m, chains=2, cores=1)

# Save the posterior to a NetCDF file that MATLAB can read
az.to_netcdf(trace, "results.nc")

MATLAB:

mu = ncread('results.nc','/posterior/mu');
plot(mu)