Hello everyone!
I have come across some unexpectedly high memory usage whilst running a hierarchical model - so much so that my AWS Linux container (12GB of memory) is running out of memory.
Set-up:
2.1m observations
Params by tree level:
Level 0 - 3 params
Level 1 - 151 params
Level 2 - 576 params
Level 3 - 4587 params
Level 4 - 158199 params
Using float32, not float64
Running on CPU, not GPU
Using Theano
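As a quick sanity check on the parameter side, some back-of-the-envelope arithmetic (purely illustrative, using the counts above) suggests the posterior trace itself should be modest:

# Rough size of the posterior trace for one chain (float32 = 4 bytes per value)
level_params = [3, 151, 576, 4587, 158199]
total_params = sum(level_params)            # 163,516 parameters
trace_mb = total_params * 300 * 4 / 1e6     # 300 draws, float32
print(total_params, f"{trace_mb:.0f} MB")   # 163516, ~196 MB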
Code snippet:
with pm.Model() as bayesian_model:
    # Level 0: bounded (truncated) Normal coefficients
    b_0 = pm.Bound(pm.Normal, lower=hbr_lb, upper=hbr_ub)('b_0', mu=hbr_b_mu, sd=hbr_b_sig, shape=level_0_count)
    # Levels 1-4: each level's mean is indexed from its parent level's coefficients
    b_1 = pm.Bound(pm.Normal, lower=hbr_lb[level_1_link_zero - 1], upper=hbr_ub[level_1_link_zero - 1])('b_1', mu=b_0[level_1_link_prev - 1], sd=hbr_b_sig[level_1_link_zero - 1], shape=level_1_count)
    b_2 = pm.Bound(pm.Normal, lower=hbr_lb[level_2_link_zero - 1], upper=hbr_ub[level_2_link_zero - 1])('b_2', mu=b_1[level_2_link_prev - 1], sd=hbr_b_sig[level_2_link_zero - 1], shape=level_2_count)
    b_3 = pm.Bound(pm.Normal, lower=hbr_lb[level_3_link_zero - 1], upper=hbr_ub[level_3_link_zero - 1])('b_3', mu=b_2[level_3_link_prev - 1], sd=hbr_b_sig[level_3_link_zero - 1], shape=level_3_count)
    b_4 = pm.Bound(pm.Normal, lower=hbr_lb[level_4_link_zero - 1], upper=hbr_ub[level_4_link_zero - 1])('b_4', mu=b_3[level_4_link_prev - 1], sd=hbr_b_sig[level_4_link_zero - 1], shape=level_4_count)

    eps = pm.HalfCauchy('eps', beta=1)

    # Likelihood: each observation picks up its leaf-level coefficient
    Y_est = b_4[level_4 - 1] * X
    Y_like = pm.Normal('Y_like', mu=Y_est, sd=eps, observed=Y)

    idata = pm.sample(
        draws=300,
        tune=500,
        cores=1,
        chains=1,
        return_inferencedata=True,
        idata_kwargs={'log_likelihood': True})
As you can see, I am only taking 300 draws in total (a single chain) and 500 tuning steps.
I am also returning the log likelihood, since I need it for a later WAIC calculation. My understanding is that the log-likelihood array would be 1 chain * 300 draws * 2.1m observations * 4 bytes ≈ 2.5GB, so it's large but can't explain the container running out of memory (12GB).
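Spelled out (the float32 dtype for the stored pointwise log likelihood is an assumption on my part):

# Expected size of the pointwise log-likelihood array: (chains, draws, observations)
chains, draws, n_obs = 1, 300, 2_100_000
ll_gb = chains * draws * n_obs * 4 / 1e9   # 4 bytes per float32 value
print(f"{ll_gb:.2f} GB")                   # ~2.52 GB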
Are there other large data structures being kept in memory that could account for the high consumption, or is it perhaps an issue with the number of parameters I'm estimating (spread over several levels)?
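For reference, one way to inspect the actual in-memory size of each group after sampling (a sketch, assuming ArviZ's xarray-backed InferenceData, whose Datasets expose .nbytes):

# Print the in-memory footprint of every group in the returned InferenceData
for group in idata.groups():
    ds = getattr(idata, group)
    print(f"{group}: {ds.nbytes / 1e9:.2f} GB")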
Many thanks in advance!