Hello,
I am trying to sample from my posterior predictive distribution, but `sample_posterior_predictive` takes a very long time (a few minutes) before the progress bar even shows, and it is slow even when predicting a single sample. After some code reading, I believe the `dataset_to_point_list` function in `util.py` is causing the performance issue.
In `sampling.py`, `sample_posterior_predictive()`:

```python
elif isinstance(trace, xarray.Dataset):
    idata_kwargs["coords"].setdefault("draw", trace["draw"])
    idata_kwargs["coords"].setdefault("chain", trace["chain"])
    _trace = dataset_to_point_list(trace)
    nchain, len_trace = chains_and_samples(trace)
```
And `dataset_to_point_list` builds its result with nested Python loops over chains and draws (plus a dict comprehension over variables), calling label-based `.sel()` once per draw, which can be quite slow.

In `util.py`, `dataset_to_point_list()`:

```python
for c in ds.chain:
    for d in ds.draw:
        points.append({vn: da.sel(chain=c, draw=d).values for vn, da in ds.items()})
```
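For illustration, a rough sketch of the kind of change I have in mind (the function name `dataset_to_point_list_fast` is mine, not PyMC's): extract each variable as a plain numpy array once up front, then use integer indexing in the inner loops instead of a per-draw `.sel()` call.

```python
import numpy as np
import xarray as xr


def dataset_to_point_list_fast(ds: xr.Dataset):
    # Pull each variable out as a numpy array once, with (chain, draw) as the
    # leading dimensions, so the inner loops do cheap integer indexing instead
    # of label-based .sel() lookups.
    arrays = {vn: da.transpose("chain", "draw", ...).values for vn, da in ds.items()}
    points = []
    for c in range(ds.sizes["chain"]):
        for d in range(ds.sizes["draw"]):
            points.append({vn: arr[c, d] for vn, arr in arrays.items()})
    return points
```

This keeps the same output shape (a list of dicts, one per draw) while avoiding the repeated label lookups.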
My trace data is rather big:

- 7000 draws per model parameter
- 13 model parameters, each 4-dimensional, so 52 in total
- I am saving my model trace to an .h5 file via `trace.posterior.to_netcdf()`
- version: PyMC 4.0.1
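For reference, this is roughly my save/load round trip, with a small dummy dataset standing in for `trace.posterior` (the file name and variable are made up):

```python
import numpy as np
import xarray as xr

# Dummy stand-in for trace.posterior: 2 chains x 3 draws of one variable.
posterior = xr.Dataset({"x": (("chain", "draw"), np.arange(6.0).reshape(2, 3))})

posterior.to_netcdf("posterior.h5")       # how I save the trace today
loaded = xr.open_dataset("posterior.h5")  # how I load it back later
# `loaded` (an xarray.Dataset) is then passed as `trace=` to
# sample_posterior_predictive, which routes it through the slow
# dataset_to_point_list path shown above.
```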
My questions:

- What is the recommended way to save & load a model trace?
  - It seems that `dataset_to_point_list` will not be called if
    - `isinstance(trace, MultiTrace)` (how do I save & load a `MultiTrace` object?), or
    - `isinstance(trace, list) and all(isinstance(x, dict) for x in trace)` (this seems to be the output of `dataset_to_point_list`).
  - I suppose in my case I could save & load `dataset_to_point_list(trace)` instead, but that loses the chain information. Is there a better way?
- Is there any particular reason why `dataset_to_point_list` is implemented with nested for-loops? Will there be a more efficient version in the future?
Thanks!