I am observing quite a large memory footprint (> 5 GiB) for the following simple hierarchical linear regression code.
import pytensor
pytensor.config.gcc__cxxflags += " -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk"

import numpy as np
import pandas as pd
import pymc as pm
import arviz as az
import matplotlib.pyplot as plt


def generate_data(num_groups: int = 5, num_samples_per_group: int = 20) -> pd.DataFrame:
    rng = np.random.default_rng(seed=42)
    alpha = 2.0
    beta = -0.5
    omega = 0.1
    tau = 0.1
    sigma = 1.0
    result = {"group": [], "y": [], "x": []}
    for i in range(num_groups):
        alpha_g = rng.normal(loc=alpha, scale=omega)
        beta_g = rng.normal(loc=beta, scale=tau)
        x = rng.normal(size=num_samples_per_group)
        eps = rng.normal(size=num_samples_per_group)
        y = alpha_g + beta_g * x + sigma * eps
        result["group"] += [i] * num_samples_per_group
        result["y"] += y.tolist()
        result["x"] += x.tolist()
    return pd.DataFrame(result)


def make_model(frame: pd.DataFrame) -> pm.Model:
    coords = {"group": frame["group"].unique()}
    with pm.Model(coords=coords) as model:
        # Data
        g = pm.MutableData("g", frame["group"])
        x = pm.MutableData("x", frame["x"])
        y = pm.ConstantData("y", frame["y"])
        # Population level
        alpha = pm.Normal("alpha")
        beta = pm.Normal("beta")
        omega = pm.HalfNormal("omega")
        sigma = pm.HalfNormal("sigma")
        tau = pm.HalfNormal("tau")
        # Group level
        epsilon_alpha_g = pm.ZeroSumNormal("epsilon_alpha_g", dims="group")
        alpha_g = pm.Deterministic("alpha_g", alpha + omega * epsilon_alpha_g)
        epsilon_beta_g = pm.ZeroSumNormal("epsilon_beta_g", dims="group")
        beta_g = pm.Deterministic("beta_g", beta + tau * epsilon_beta_g)
        # Linear model
        mu_ = pm.Deterministic("mu_", alpha_g[g] + beta_g[g] * x)
        # Likelihood
        pm.Normal("y_obs", mu=mu_, sigma=sigma, observed=y, shape=mu_.shape)
    return model


if __name__ == "__main__":
    frame = generate_data(num_groups=5000)
    model = make_model(frame)
    with model:
        trace = pm.sample()
    summary = az.summary(
        trace, var_names=["alpha", "beta", "omega", "sigma", "tau"]
    ).loc[:, ["mean", "hdi_3%", "hdi_97%", "ess_bulk", "ess_tail", "r_hat"]]
    print(summary)
I assume this is due to the large number of groups, which results in a large number of data variables in the arviz.InferenceData object generated by the pm.sample call. Would it be possible to use some kind of "memory mapping" so that the trace is written directly to disk instead of claiming a lot of RAM while sampling from the model?
Thanks @ricardoV94, does that mean it is generally discouraged to use deterministic variables (for models with many groups and/or observations) because they can result in a huge memory footprint?
How would one then go about re-computing, e.g., the mean of mu_ when generating predictions on unseen data? Make a dedicated model for predictions, akin to what's described here: Out of model predictions with PyMC - PyMC Labs?
Ideally I'd like to fix the definition of those deterministic variables when I specify the model for sampling ("fitting to training data"), but without necessarily storing them during posterior sampling. Then, when I generate predictions on unseen data, I can simply "activate" them in order to compute the mean of mu_ without having to define a new model.
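For reference, the kind of workflow I have in mind would look roughly like this (just a sketch: new_g and new_x stand in for the unseen data, and I am assuming pm.sample_posterior_predictive can recompute a Deterministic that is defined in the model via var_names):

with model:
    # point the MutableData containers at the unseen data
    pm.set_data({"g": new_g, "x": new_x})
    # recompute mu_ only now, instead of storing it during posterior sampling
    preds = pm.sample_posterior_predictive(
        trace, var_names=["mu_", "y_obs"], predictions=True
    )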
Having to omit those variables just in order to "reduce the memory footprint" of my traces also leads to harder-to-read code IMHO; e.g. the likelihood definition becomes
pm.Normal("y_obs", mu=alpha + omega * epsilon_alpha_g[g] + (beta + tau * epsilon_beta_g[g]) * x + sigma=sigma)
While this still looks comprehensible, I am afraid I would quickly lose sight of things once I have a more complicated multi-level hierarchy and/or, e.g., an errors-in-variables model, etc.
I haven't found an easy way to swap in a backend that directly stores the trace to disk. The default (NumPy-based) backend holds all variables in memory as far as I understand; is that correct?
with pm.Model() as m:
    ...

# Move deterministics "outside" of the model
deterministics = m.deterministics
m.deterministics = []

with m:
    idata = pm.sample()

# Restore deterministics
m.deterministics = deterministics
It's not discouraged, it just depends on whether 1) you need them and 2) you have enough RAM. That being said, many times people use Deterministics when the only thing they want is to define intermediate variables while writing model code, which you can definitely do without them. See the discussion in Consider renaming `Deterministic` to `store_in_trace` · Issue #6695 · pymc-devs/pymc · GitHub
Thanks for the quick reply. This is super helpful! Great to see that I am not overlooking something trivial here. Also thanks for the mcbackend reference, I will try it out as well.
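(For my own notes, based on a quick look at the McBackend README I expect the database-backed setup to look roughly like the following; ClickHouseBackend and passing the backend via the trace= argument of pm.sample are my reading of the docs and not something I have verified yet.)

import clickhouse_driver
import mcbackend
import pymc as pm

# assumes a ClickHouse server is reachable on localhost
ch_client = clickhouse_driver.Client("localhost")
backend = mcbackend.ClickHouseBackend(ch_client)

with model:
    # draws are written to ClickHouse while sampling instead of the default NumPy trace
    idata = pm.sample(trace=backend)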
As for assigning intermediate expressions without Deterministic: very good point, thank you. I think I will replace most of my pm.Deterministic variables with plain expressions then.
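Concretely, I think the group-level part of make_model above would then read something like this (a sketch using the same variable names; these are now plain PyTensor expressions, so nothing extra ends up in the trace):

epsilon_alpha_g = pm.ZeroSumNormal("epsilon_alpha_g", dims="group")
epsilon_beta_g = pm.ZeroSumNormal("epsilon_beta_g", dims="group")
# plain expressions instead of pm.Deterministic: still readable, not stored
alpha_g = alpha + omega * epsilon_alpha_g
beta_g = beta + tau * epsilon_beta_g
mu_ = alpha_g[g] + beta_g[g] * x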
One small inconvenience with removing the mu = pm.Deterministic(...) is that I was relying on mu.shape to inform the "shape of the likelihood"...
mu.shape still works without Deterministic. pm.Deterministic does not change anything about a PyMC model; it just says "this is a variable I care about, so show me what it evaluates to when I'm sampling".
Giving some extra background in the hope it helps. Roughly speaking, this is the memory footprint of pm.sample over the course of a run (the actual values in the plot are meaningless and only illustrative):
The blue solid line is tuning and sampling: the array needed is allocated first, then its values are filled in as sampling progresses. The dashed lines that come later are optional and not part of sampling; they relate to the conversion to InferenceData. By default only the posterior, sample_stats and, if present, observed/constant data groups are created. All of these are data that is already in RAM from sampling, so converting to InferenceData doesn't really need any extra memory beyond what was used for sampling.
There are functions in ArviZ, however, that need more information than the posterior samples and sampling stats. For those it is possible to pass log_likelihood=True and log_prior=True to pm.sample via idata_kwargs. If so, these quantities are computed and stored (orange and green lines respectively); they are not used during sampling, so they do require allocating extra arrays.
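For example, requesting those extra groups looks roughly like this (a sketch; the keys follow what is described above, and each one allocates additional arrays on top of the posterior):

with model:
    idata = pm.sample(
        idata_kwargs={"log_likelihood": True, "log_prior": True},
    )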
log_prior should have the same shape as the posterior, but it always has float dtype (whereas the posterior could contain integer variables, for example), so it should roughly double the RAM needed compared to the default behaviour.
log_likelihood has shape (chain, draw, n_observations), so it is often larger (or much larger) than the posterior, but not necessarily. In hierarchical models, for example, where it is common to store posterior variables with that same shape, it will be smaller than the posterior.
Note: a range of PyMC versions stored the log likelihood by default too, but that is no longer the case.
Thanks @OriolAbril for the additional background. What exactly do you mean by "the same shape as the posterior"? Does it mean shape=(chain, draw) or shape=(chain, draw, # of RVs in the model), and is that why it's effectively doubling the memory footprint?
Equivalent to the second one, but instead of a single array with that shape you have multiple arrays. In your case, alpha, beta, omega, sigma and tau have shape (chain, draw); epsilon_alpha_g, alpha_g, epsilon_beta_g and beta_g have shape (chain, draw, group); and mu_ has shape (chain, draw, n_observations). Thus, your posterior is equivalent to an array of floats with shape (chain, draw, 5 + 4 * n_groups + n_observations). The log prior will not include the deterministics, but the variables it does contain have the same shape as in the posterior. Therefore, in your case (unless you drop all deterministics) it wouldn't quite double the memory needed; it would be equivalent to an array of shape (chain, draw, 5 + 2 * n_groups).
You only have a single observed variable, y_obs, so your log_likelihood and posterior_predictive will each be a single array with shape (chain, draw, n_observations), and you'd be in the case where the posterior needs more memory than the log_likelihood group. Once you stop storing mu_, this is inverted though.
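To make that concrete with the sizes from the script at the top (a rough back-of-the-envelope calculation assuming the default 4 chains and 1000 draws, float64 storage, and ignoring sample_stats and other overhead):

chains, draws = 4, 1_000               # pm.sample defaults
n_groups = 5_000
n_obs = n_groups * 20                  # 20 samples per group in generate_data
posterior_floats = chains * draws * (5 + 4 * n_groups + n_obs)
log_lik_floats = chains * draws * n_obs
print(posterior_floats * 8 / 2**30)    # ~3.6 GiB for the posterior group alone
print(log_lik_floats * 8 / 2**30)      # ~3.0 GiB more if log_likelihood is stored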