Memory mapping `arviz.InferenceData`

Here is some extra background that I hope will help. Roughly speaking, this is the memory footprint of `pm.sample` (the actual values are meaningless and only illustrative):

The blue solid line is tuning and sampling. The array needed is allocated first, then its values are filled in as sampling progresses. The dashed lines that come later are optional and not part of sampling; they relate to the conversion to `InferenceData`. By default only the `posterior`, `sample_stats` and, if any, observed/constant data groups are present. All of these are data already stored in RAM from sampling, so converting to `InferenceData` doesn't really need any extra memory beyond what was used during sampling.
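As a minimal sketch of that default behaviour (the model and data here are purely illustrative), sampling returns an `InferenceData` whose groups are filled directly from the arrays already in memory:

```python
import pymc as pm

with pm.Model():
    mu = pm.Normal("mu", 0, 1)
    pm.Normal("y", mu, 1, observed=[0.1, -0.3, 0.5])
    idata = pm.sample()

# By default only posterior, sample_stats and observed_data are present
print(idata.groups())
```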

There are, however, functions in ArviZ that need more information than the posterior samples and sampling stats, so it is possible to pass `log_likelihood=True` and `log_prior=True` to `pm.sample` via `idata_kwargs`. These quantities are then computed and stored (orange and green lines respectively); since they are not used during sampling, they do require allocating extra arrays.
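Something along these lines (the exact `idata_kwargs` accepted may depend on the PyMC version, so treat this as a sketch) adds the two extra groups:

```python
import pymc as pm

with pm.Model():
    mu = pm.Normal("mu", 0, 1)
    pm.Normal("y", mu, 1, observed=[0.1, -0.3, 0.5])
    idata = pm.sample(
        idata_kwargs={"log_likelihood": True, "log_prior": True}
    )

# The two extra groups are allocated on top of what sampling already used
print("log_likelihood" in idata.groups(), "log_prior" in idata.groups())
```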

`log_prior` should have the same shape as the posterior, but it will always have float dtype (whereas the posterior can contain integer variables, for example), so it should roughly double the RAM needed compared to the default behaviour.
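A rough back-of-the-envelope illustration of that doubling (the sizes here are made up, only the ratio matters):

```python
chains, draws = 4, 1000
n_posterior_values = 5_000   # total flattened values per draw (illustrative)

posterior_bytes = chains * draws * n_posterior_values * 8   # float64 posterior
log_prior_bytes = chains * draws * n_posterior_values * 8   # same shape, always float
print(f"posterior alone ~{posterior_bytes / 1e6:.0f} MB, "
      f"with log_prior ~{(posterior_bytes + log_prior_bytes) / 1e6:.0f} MB")
```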

`log_likelihood` has shape `(chain, draw, n_observations)`, so it is often larger (or much larger) than the posterior, but not necessarily: in hierarchical models, for example, where it is common to store variables with that same shape in the posterior, it will be smaller than the posterior.
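A sketch of that hierarchical case (variable names and sizes are made up): a per-observation latent variable already stored in the posterior has the same `(chain, draw, n_observations)` shape as the log likelihood, so the `log_likelihood` group ends up no bigger than the posterior:

```python
import numpy as np
import pymc as pm

y_obs = np.random.default_rng(0).normal(size=200)

with pm.Model():
    mu = pm.Normal("mu", 0, 1)
    sigma = pm.HalfNormal("sigma", 1)
    # per-observation latent variable stored in the posterior
    theta = pm.Normal("theta", mu, sigma, shape=y_obs.size)
    pm.Normal("y", theta, 0.5, observed=y_obs)
    idata = pm.sample(idata_kwargs={"log_likelihood": True})

print(idata.posterior["theta"].shape)    # (chains, draws, 200)
print(idata.log_likelihood["y"].shape)   # (chains, draws, 200)
```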

Note: a range of PyMC versions also stored the log likelihood by default, but that is no longer the case.
