Memory mapping `arviz.InferenceData`

I am observing quite a large memory footprint (> 5 GiB) for the following simple hierarchical linear regression code:

import pytensor
# macOS workaround so PyTensor's C compiler finds the SDK headers
pytensor.config.gcc__cxxflags += " -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk"
import numpy as np
import pandas as pd
import pymc as pm
import arviz as az
import matplotlib.pyplot as plt

def generate_data(num_groups: int = 5, num_samples_per_group: int = 20) -> pd.DataFrame:
    rng = np.random.default_rng(seed=42)
    alpha = 2.0
    beta = -0.5
    omega = 0.1
    tau = 0.1
    sigma = 1.0
    result = {"group": [], "y": [], "x": []}

    for i in range(num_groups):
        alpha_g = rng.normal(loc=alpha, scale=omega)
        beta_g = rng.normal(loc=beta, scale=tau)
        x = rng.normal(size=num_samples_per_group)
        eps = rng.normal(size=num_samples_per_group)
        y = alpha_g + beta_g * x + sigma * eps
        result["group"] += [i] * num_samples_per_group
        result["y"] += y.tolist()
        result["x"] += x.tolist()
    return pd.DataFrame(result)


def make_model(frame: pd.DataFrame) -> pm.Model:
    coords = {"group": frame["group"].unique()}
    with pm.Model(coords=coords) as model:
        # Data
        g = pm.MutableData("g", frame["group"])
        x = pm.MutableData("x", frame["x"])
        y = pm.ConstantData("y", frame["y"])

        # Population level
        alpha = pm.Normal("alpha")
        beta = pm.Normal("beta")

        omega = pm.HalfNormal("omega")
        sigma = pm.HalfNormal("sigma")
        tau = pm.HalfNormal("tau")

        # Group level
        epsilon_alpha_g = pm.ZeroSumNormal("epsilon_alpha_g", dims="group")
        alpha_g = pm.Deterministic("alpha_g", alpha + omega * epsilon_alpha_g)

        epsilon_beta_g = pm.ZeroSumNormal("epsilon_beta_g", dims="group")
        beta_g = pm.Deterministic("beta_g", beta + tau * epsilon_beta_g)

        # Linear model
        mu_ = pm.Deterministic("mu_", alpha_g[g] + beta_g[g] * x)

        # Likelihood
        pm.Normal("y_obs", mu=mu_, sigma=sigma, observed=y, shape=mu_.shape)

    return model


if __name__ == "__main__":
    frame = generate_data(num_groups=5000)
    model = make_model(frame)
    with model:
        trace = pm.sample()
    summary = az.summary(trace, var_names=["alpha", "beta", "omega", "sigma", "tau"]).loc[:, ["mean", "hdi_3%", "hdi_97%", "ess_bulk", "ess_tail", "r_hat"]]
    print(summary)

I assume this is due to the large number of groups, which results in a large number of data variables in the arviz.InferenceData object generated by the pm.sample call. Would it be possible to use "memory mapping" so that the trace is written directly to disk instead of claiming a lot of RAM when sampling from the model?

You can also remove the pm.Deterministics, which can take a lot of memory. Those can always be recomputed later anyway.
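
For example, with the model above, the deterministics can be rebuilt from the raw posterior draws with plain xarray arithmetic. A minimal sketch (untested; frame and trace refer to the code above):

import xarray as xr

post = trace.posterior

# (chain, draw) scalars broadcast against the (chain, draw, group) offsets
alpha_g = post["alpha"] + post["omega"] * post["epsilon_alpha_g"]
beta_g = post["beta"] + post["tau"] * post["epsilon_beta_g"]

# Vectorized label lookup: one group label per observation
g = xr.DataArray(frame["group"].values, dims="obs")
x = xr.DataArray(frame["x"].values, dims="obs")
mu = alpha_g.sel(group=g) + beta_g.sel(group=g) * x  # dims: (chain, draw, obs)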


Thanks @ricardoV94. Does that mean it is in general discouraged to use deterministic variables (for models with many groups and/or observations) because they can result in a huge memory footprint?

How would one then go about re-computing, e.g., the likelihood mean mu_ when generating predictions on unseen data? Make a dedicated model for predictions, akin to what's described in Out of model predictions with PyMC - PyMC Labs?
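
For concreteness, the kind of workflow I mean is something like this (a sketch; it assumes mu_ stays a named Deterministic in the model, and new_frame is hypothetical unseen data):

with model:
    # Swap in the unseen data (new_frame is hypothetical here)
    pm.set_data({"g": new_frame["group"], "x": new_frame["x"]})
    preds = pm.sample_posterior_predictive(
        trace, var_names=["mu_", "y_obs"], predictions=True
    )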

Ideally I'd like to fix the definition of those deterministic variables when I specify the model for sampling ("fitting to training data") but not necessarily store them during posterior sampling. Then, when generating predictions on unseen data, I could simply "activate" them to compute the mean of mu_ without having to define a new model.

Having to omit those variables just to "reduce the memory footprint" of my traces also leads to harder-to-read code IMHO, e.g. the likelihood definition becomes

pm.Normal("y_obs", mu=alpha + omega * epsilon_alpha_g[g] + (beta + tau * epsilon_beta_g[g]) * x + sigma=sigma)

While this still looks comprehensible, I am afraid I would quickly lose sight of things once I have a more complicated multi-level hierarchy and/or, e.g., an errors-in-variables model.

I haven't found an easy way to swap in a backend that stores the trace directly to disk. The default (numpy-based) backend holds all variables in memory as far as I understand; is that correct?

Thanks a lot for the great help and support.

Regarding readability, you can still assign expressions to intermediate Python variables:

mu = alpha + omega * epsilon_alpha_g[g] + (beta + tau * epsilon_beta_g[g]) * x
pm.Normal("y_obs", mu=mu + sigma=sigma)

Regarding not having to define two models: we are working on it in Add var_names argument to sample by fonnesbeck · Pull Request #7206 · pymc-devs/pymc · GitHub.

In the meantime here is a hack you can use:

with pm.Model() as m:
    ...

# Move deterministics "outside" of the model
deterministics = m.deterministics
m.deterministics = []

with m:
    idata = pm.sample()

# Restore deterministics
m.deterministics = deterministics
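
With the deterministics restored on the model object, pm.sample_posterior_predictive(idata, var_names=["mu_"]) should then still be able to recompute them from the stored posterior draws.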

It's not discouraged; it just depends on whether 1) you need them and 2) you have enough RAM. That being said, many times people use Deterministics when the only thing they want is to define intermediate variables while writing model code, which you can definitely do without them. See the discussion in Consider renaming `Deterministic` to `store_in_trace` · Issue #6695 · pymc-devs/pymc · GitHub.

Regarding saving things to disk directly. I think you can do that with GitHub - pymc-devs/mcbackend: A backend for storing MCMC draws.

@michaelosthege is there an example of doing that? (I didn't check if there's one in the README.)
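
From a quick look, usage is roughly along these lines (an unverified sketch adapted from the mcbackend README; it assumes a running ClickHouse server as the on-disk store):

import clickhouse_driver
import mcbackend

# Draws get streamed to the database instead of accumulating in RAM
ch_client = clickhouse_driver.Client("localhost")
backend = mcbackend.ClickHouseBackend(ch_client)

with model:
    idata = pm.sample(trace=backend)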


Thanks for the quick reply, this is super helpful! Great to see that I am not overlooking something trivial here. Also thanks for the mcbackend reference, I will try it out as well.

As for the "assign intermediate expressions" part: very good point, thank you. I think I will replace most of my pm.Deterministic variables with normal expressions then.

One small inconvenience with removing mu = pm.Deterministic(...) is that I was relying on mu.shape to inform the "shape of the likelihood"…

mu.shape still works without Deterministic. pm.Deterministic does not change anything about a PyMC model; it just says "this is a variable I care about, so show me what it evaluates to when sampling".
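
Concretely, the model above can drop all its Deterministics and keep the same likelihood definition, along these lines:

# Plain PyTensor expressions; nothing extra is stored in the trace
alpha_g = alpha + omega * epsilon_alpha_g
beta_g = beta + tau * epsilon_beta_g
mu = alpha_g[g] + beta_g[g] * x

# mu.shape still works to size the likelihood
pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y, shape=mu.shape)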


Yes, of course. Thanks a lot for pointing that out.

And thanks for the quick and very helpful replies!

Giving some extra background, hoping it will help. Roughly speaking, this is the memory footprint of pm.sample over time (the actual values are meaningless and only illustrative):

[Figure: illustrative plot of RAM usage over time during pm.sample and the conversion to InferenceData]

The blue solid line is tuning and sampling. The needed arrays are allocated first, then their values are filled in as sampling progresses. The dashed lines that come later are optional and are not part of sampling but of the conversion to InferenceData. By default, only the posterior, sample_stats and, if any, observed/constant data groups are present. All of these are data already stored in RAM from sampling, so converting to InferenceData doesn't really need any extra memory beyond what was used for sampling.

There are functions in ArviZ, however, that need more information than the posterior samples and sampling stats. For those it is possible to provide log_likelihood=True and log_prior=True to pm.sample via idata_kwargs. If so, these quantities are computed and stored (the orange and green lines, respectively); since they are not used during sampling, they do require allocating extra arrays.
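
In code that is:

with model:
    trace = pm.sample(
        idata_kwargs={"log_likelihood": True, "log_prior": True},
    )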

log_prior should have the same shape as the posterior, but it will always have a float dtype, whereas the posterior could contain integer variables for example, so it should roughly double the RAM needed compared to the default behaviour.

log_likelihood has shape (chain, draw, n_observations), so it is often larger (or much larger) than the posterior, but not necessarily: in hierarchical models, for example, where it is common to store variables of that same shape in the posterior, it will be smaller than the posterior.

Note: a range of PyMC versions stored the log likelihood by default too, but that is no longer the case.


Thanks @OriolAbril for the additional background. What exactly do you mean by "the same shape as the posterior"? Does it mean shape=(chain, draw) or shape=(chain, draw, # of RVs in the model), and is that why it effectively doubles the memory footprint?

Equivalent to the second one, but instead of a single array with that shape you have multiple arrays. In your case you have:

- alpha, beta, omega, sigma and tau with shape (chain, draw),
- epsilon_alpha_g, alpha_g, epsilon_beta_g and beta_g with shape (chain, draw, group),
- mu_ with shape (chain, draw, n_observations).

Thus, your posterior is equivalent to an array of floats with shape (chain, draw, 5 + 4 * n_groups + n_observations). The log prior will not have the deterministics, but the variables it does have will be the same shape as in the posterior. Therefore, in your case (unless you drop all deterministics) it wouldn't quite double the memory needed; it would be equivalent to an array of shape (chain, draw, 5 + 2 * n_groups).

You only have a single observed variable, y_obs, so your log_likelihood and posterior_predictive will each be a single array with shape (chain, draw, n_observations), and you'd be in the case where the posterior needs more memory than the log_likelihood group. Once you stop storing mu_ this is inverted, though.
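
As a back-of-the-envelope check against the > 5 GiB reported at the top of the thread (a sketch assuming pm.sample's default 4 chains of 1000 draws and float64 storage):

# From generate_data(num_groups=5000): 5000 groups x 20 samples each
chains, draws = 4, 1000
n_groups = 5000
n_obs = n_groups * 20  # 100_000 observations

gib = 8 / 2**30  # one float64, expressed in GiB

# Posterior: 5 scalars + 4 group-level vectors + mu_ per observation
print(chains * draws * (5 + 4 * n_groups + n_obs) * gib)  # ~3.6 GiB
# log_prior: same variables minus the deterministics
print(chains * draws * (5 + 2 * n_groups) * gib)          # ~0.3 GiB
# log_likelihood: a single observed variable
print(chains * draws * n_obs * gib)                       # ~3.0 GiB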


Thanks a lot @OriolAbril for breaking this down for me in detail. Makes perfect sense. This is an excellent explanation. Thanks a lot for your time!