Understand root cause of high memory utilization

Hello everyone!

I have come across some unexpectedly high memory usage whilst running a hierarchical model - so much so my AWS Linux container (12GB memory) is running out of memory.


2.1m observations

Params by tree level
Level 0 - 3 params
Level 1 - 151 params
Level 2 - 576 params
Level 3 - 4587 params
Level 4 - 158199 params

Using float32, not float64

Running on CPU, not GPU

Using Theano

Code snippet:

with pm.Model() as bayesian_model:

    b_0 = pm.Bound(pm.Normal, lower=hbr_lb, upper=hbr_ub)('b_0', mu=hbr_b_mu, sd=hbr_b_sig, shape=level_0_count)
    b_1 = pm.Bound(pm.Normal, lower=hbr_lb[level_1_link_zero - 1], upper=hbr_ub[level_1_link_zero - 1])('b_1', mu=b_1[level_1_link_prev - 1], sd=hbr_b_sig[level_1_link_zero - 1], shape=level_1_count)
    b_2 = pm.Bound(pm.Normal, lower=hbr_lb[level_2_link_zero - 1], upper=hbr_ub[level_2_link_zero - 1])('b_2', mu=b_2[level_2_link_prev - 1], sd=hbr_b_sig[level_2_link_zero - 1], shape=level_2_count)
    b_3 = pm.Bound(pm.Normal, lower=hbr_lb[level_3_link_zero - 1], upper=hbr_ub[level_3_link_zero - 1])('b_3', mu=b_3[level_3_link_prev - 1], sd=hbr_b_sig[level_3_link_zero - 1], shape=level_3_count)
    b_4 = pm.Bound(pm.Normal, lower=hbr_lb[level_4_link_zero - 1], upper=hbr_ub[level_4_link_zero - 1])('b_4', mu=b_4[level_4_link_prev - 1], sd=hbr_b_sig[level_4_link_zero - 1], shape=level_4_count)

    eps = pm.HalfCauchy('eps', beta=1)

    Y_est = b_4[level_4 - 1] * X

    Y_like = pm.Normal('Y_like', mu=Y_est, sd=eps, observed=Y)

    idata = pm.sample(
                    idata_kwargs = {'log_likelihood': True})

As you can see I am only taking 300 draws in total across all chains, and 500 tune steps.

I am also returning the log likelihood since I need that for a later WAIC calculation. My understanding is that the log likelihood size would be 1 * 300 * 2.1m * 4 bytes = 2.5GB, so it’s large but can’t explain the container running out of memory (12GB).

Are there other large data structures being stored in memory that could account for the high memory consumption - is it perhaps an issue with the number of parameters I’m estimating instead (over several levels)?

Many thanks in advance!

Yes I think it’s the number of parameters. I can imagine the computational graph (logp and gradient) easily taking what’s left of your RAM.

Deterministics also take a ton of RAM, and it doesn’t show up until after sampling when using JAX (they are computed in the “transforming” step). I had a model that would crash the kernel after an hour of sampling because I was saving a lot of intermediate computations.

Thanks ricardo - I’d like to understand a bit better what factors affect the size of the computational graph. I take it number of parameters is one, but do the below also affect it as well?

  • Number of tune steps
  • Number of draws
  • Number of observations
  • Number of levels (for the same number of parameters - i.e. dimensionality is relevant in this way)
  • Is logp only included if we’re returning the log likelihood, or is that needed anyway?

Just trying to understand what degrees of freedom I have :slight_smile:

Thanks again!

Thanks Jesse - by Deterministic I assume you’re referring to the ‘Bound’ construct I’m using?

No, I assumed there was more code in your actual model. If there isn’t you can ignore my comment.

1 Like

@cocodimama I think the bottleneck is the logp/dlogp function which holds several intermediate computations in RAM. It’s hard to say what freedom you have, because this depends on the exact model and operations you have.

There is a memory profiler for PyTensor functions but I have never used it myself: Profiling PyTensor function — PyTensor dev documentation

A crude approach is to just get the function and run it in a loop to see how much RAM that consumes:

with pm.Model() as m:
  ... # your model

ip = m.initial_point()
fn = m.compile_fn([m.logp(), m.dlogp()])
[fn(ip) for _ in range(10_000)]  # Run long enough so you have time to see RAM consumption

The log-likelihood is probably not the problem here since it is only computed at the end of sampling and you are already running out of memory during sampling (is that correct?)

If it proves to be a problem though, you can compute it manually via pm.compute_log_likelihood. This gives you the chance do do it iteratively by slicing the inferencedata → compute → save to disk → repeat so that it only has to keep a subset of the draws (or chains) at a time in RAM.


1 Like

Thank you - you’ve been a great help and super quick to reply.

Just one final question if that’s OK - if I drop the log likelihood computation altogether (and don’t calculate the WAIC for the model) and use ADVI instead, I’m assuming then that the logp would no longer be relevant since, if I understand correctly, this is specific to sampling methods rather than VI?

You can get draws from the VI fit and compute the log-likelihood from those. If the VI gives a good approximation to the posterior I don’t see why the same metrics wouldn’t be relevant, but I am not familiar with the theory either

1 Like