Unexpected increase in computation time when data size increases

I have a hierarchical (nested multilevel) logistic model which I started fitting with glmer() in lme4. glmer() gave warnings about a singular fit, which I understand can happen when the mode is at the boundary (zero), and which a Bayesian fit would not have a problem with, since the median and the mean would not be exactly at the boundary. I moved to brms and from there found Bambi/PyMC, which is EXACTLY what I have wanted for years. I have jaxlib with CUDA support, and everything seems to just work.

The problem is that the time per iteration grows so fast as the data size increases that I suspect something is wrong. As I am very new to Bayesian modelling, it may of course be my understanding that is wrong, but here is my take on it.

The model is a random slope model with random slopes at three levels. The lowest level has very many elements (not sure exactly how many, but on the order of 30,000). But since it is a nested model, there should be no need to create a matrix with 30,000 by 30,000 elements, so I am pretty sure the complexity of the model should not grow quadratically in the number of groups as data increases.

The model formula is the following:

model_hierarchical = bmb.Model(
  "p(deprived, size) ~ 1 + year"
  " + (1 | year.as.factor) + (year | superClusterID)"
  " + (year | country) + (year | m49.region)",
  df, family="binomial")

And to fit the model I use this call:

idata_hierarchical = model_hierarchical.fit(
  random_seed=random_seed, tune=2000, draws=2000,
  target_accept=0.95, method="nuts_numpyro",
  chain_method="vectorized")

year.as.factor is a character (factor) version of year.
year is numeric.
superClusterID is the lowest-level grouping factor, with ~30,000 distinct values.
country is the middle-level grouping factor, with ~90 distinct values.
m49.region is the highest-level grouping factor, with 7 distinct values.

The data set has about 60,000 rows (each superClusterID has at least two measurements, i.e. two rows in df).

In lme4 you do not have to explicitly state the nestedness, i.e. that only some of the superClusterIDs share the same country and should be pooled accordingly; (g)lmer() works that structure out by itself. Do I have to tell Bambi about the nested structure?

In addition to the rapidly increasing computation time, another thing that worries me is the RAM requirement. When I run the model in Stan, it requires about 2.3 GB of RAM; when I run it in Bambi, Bambi allocates 55.6 GB on the CPU and about 12 GB on the GPU. EDIT: 1.2 GB on the GPU.

The data set I use is available here: http://hansekbrand.se/temp/water.without.europe.csv
and below is the script I use to run the model:

import arviz as az
import bambi as bmb
import numpy as np
import pandas as pd

random_seed = 1234  # reproducibility seed (any integer)

df = pd.read_csv("water.without.europe.csv")
model_hierarchical = bmb.Model(
  "p(deprived, size) ~ 1 + year"
  " + (1 | year.as.factor) + (year | superClusterID)"
  " + (year | country) + (year | m49.region)",
  df, family="binomial")
idata_hierarchical = model_hierarchical.fit(
  random_seed=random_seed, tune=2000, draws=2000,
  target_accept=0.95, method="nuts_numpyro",
  chain_method="vectorized")

Am I right in that Bambi “blows up” on this model/data? Or did I simply have unrealistic expectations? I’m grateful for any help/hints!


@tcapretto


It depends on how the variables are coded. For example, if superClusterID == 1 can occur within two different countries, i.e. the codes are not unique, then you have to specify the nestedness. If the codes are unique across all grouping variables, there’s no need to worry about it. You’ll find valuable information here: mixed model - Crossed vs nested random effects: how do they differ and how are they specified correctly in lme4? - Cross Validated (stackexchange.com).
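
If your codes turn out not to be unique, one way to handle it is to build explicitly nested IDs before fitting. A minimal sketch in pandas (the column name superClusterID_nested is made up for illustration):

# Prefix each superClusterID with its country so that every group
# label identifies exactly one cluster.
df["superClusterID_nested"] = (
  df["country"].astype(str) + ":" + df["superClusterID"].astype(str)
)
# Then use (year | superClusterID_nested) in the model formula.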

On the other hand, Bambi relies on formulae to generate design matrices, and it currently generates a regular dense matrix for the group-specific effects. In this case, that is a very large matrix with almost all zeros, so a sparse matrix would have been better. I think this explains the large memory consumption; it is something we still need to improve on our end. However, this matrix is not used directly in the PyMC model: we use slicing to select only the non-zero values.
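
To see the scale of the problem, here is a rough back-of-the-envelope calculation based on the numbers in this thread (~60,000 rows, ~30,000 superClusterID values, and a random intercept plus a random slope per group):

# Dense design matrix for the (year | superClusterID) term alone
rows = 60_000
cols = 30_000 * 2             # one intercept + one slope column per group
print(rows * cols * 8 / 1e9)  # float64 is 8 bytes: ~28.8 GB for this term

so a dense representation of that single term already accounts for a large share of the 55.6 GB you observed.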

What could be good in this case is starting from the Bambi model and then trying to replicate it in PyMC, where you have more control over all the fine details.

These are unfinished ideas, but it may help you to understand how to map a Bambi model to a PyMC one.

Edit: Note that the examples rely on an older version of Bambi which used PyMC3. Almost everything would be identical now, but be aware you need to use Aesara instead of Theano and PyMC instead of PyMC3.
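
To make the mapping more concrete, here is a rough, hypothetical sketch of how the largest term, (year | superClusterID), could be written in raw PyMC using integer indexing instead of a dense design matrix. The priors and variable names are placeholders rather than Bambi's actual internals, and it assumes deprived holds success counts and size the number of trials:

import pandas as pd
import pymc as pm

df = pd.read_csv("water.without.europe.csv")
cluster_idx, clusters = pd.factorize(df["superClusterID"])

with pm.Model() as model:
  # Common ("fixed") effects
  intercept = pm.Normal("intercept", mu=0, sigma=5)
  beta_year = pm.Normal("beta_year", mu=0, sigma=5)

  # Non-centered group-specific intercepts and slopes per superClusterID
  sigma_a = pm.HalfNormal("sigma_a", sigma=1)
  sigma_b = pm.HalfNormal("sigma_b", sigma=1)
  a = pm.Normal("a_raw", mu=0, sigma=1, shape=len(clusters)) * sigma_a
  b = pm.Normal("b_raw", mu=0, sigma=1, shape=len(clusters)) * sigma_b

  # Indexing replaces the huge dense design matrix: each row only
  # touches the parameters of its own cluster.
  eta = (intercept + a[cluster_idx]
         + (beta_year + b[cluster_idx]) * df["year"].to_numpy())
  pm.Binomial("deprived", n=df["size"].to_numpy(),
              p=pm.math.invlogit(eta),
              observed=df["deprived"].to_numpy())

With indexing, memory grows linearly with the number of rows and groups rather than with their product, which is why the graph stays small even with ~30,000 clusters.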

Thanks a lot @tcapretto, really helpful!

I’ll try mimicking your PyMC code, but there is one thing I do not understand. In your PyMC code you use objects that are part of the Bambi Model object: model._design.common.design_matrix and model._design.group.design_matrix (and also the y values, which are taken from model._design.data.value.values). But when bambi.Model() fails due to not enough RAM, which it does on a new model I have, those objects are not available to me. Can you give some hints on how to create them without using bambi.Model()?

Kind regards, Hans Ekbrand

At the moment there’s no library that allows you to create design matrices for group-specific effects (aka random effects) using sparse matrices.

There’s formulaic, which allows you to build sparse design matrices. The problem is that this library does not support group-specific effects yet.
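
That said, you can build the design matrices without instantiating a full Bambi model by calling formulae, the package Bambi uses internally. A minimal sketch, assuming formulae's design_matrices() entry point; attribute names may differ between versions, and Bambi-specific response terms such as p(deprived, size) may not be available in plain formulae:

import pandas as pd
from formulae import design_matrices

df = pd.read_csv("water.without.europe.csv")
dm = design_matrices("deprived ~ 1 + year + (year | superClusterID)", df)

X = dm.common.design_matrix  # common ("fixed") effects matrix
Z = dm.group.design_matrix   # group-specific ("random") effects matrix

Note the group matrix will still be dense, so this does not by itself solve the memory problem.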