I have a hierarchical (nested multilevel) logistic model which I started fitting with glmer in lme4. glmer gave warnings about a singular fit, which I understand can happen when the mode is at the boundary (zero), and which a Bayesian fit would not have a problem with, since the median and the mean would not be exactly at the boundary. I went to brms and from there found bambi/pymc, which is EXACTLY what I have desired for years now. I have jaxlib with CUDA support, and everything seems to just work.

The problem is that the time per iteration increases so fast as the data grows that I suspect something is wrong. As I am very new to Bayesian modelling, it can of course be my understanding that is wrong, but here is my take on it.

The model is a random slope model with random slopes at three levels. The lowest level has very many elements (not sure exactly how many, but on the order of 30,000). But since it is a nested model, there should be no need to create a 30,000 by 30,000 matrix, so I am pretty sure the cost of the model should not grow quadratically as data increases. A quick count of the random-effect parameters is below.
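
To make that concrete, here is my count of the random-effect parameters (plain Python, using the approximate group sizes listed further down; the number of distinct years is a guess on my part):

```
# Approximate group counts, taken from the data description below.
n_superclusters = 30_000  # lowest level
n_countries = 90          # middle level
n_regions = 7             # highest level
n_years = 10              # my guess; the exact number depends on the data

# Each (year | group) term contributes a random intercept and a random
# slope per group; (1 | year.as.factor) contributes one intercept per year.
n_params = 2 * (n_superclusters + n_countries + n_regions) + n_years
print(n_params)  # ~60,204 -- linear in the number of groups, not quadratic
```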

The model formula is the following

```
model_hierarchical = bmb.Model(
    "p(deprived, size) ~ 1 + year"
    " + (1 | year.as.factor) + (year | superClusterID)"
    " + (year | country) + (year | m49.region)",
    df,
    family="binomial")
```

And to fit the model I use this line:

```
idata_hierarchical = model_hierarchical.fit(
    random_seed=random_seed, tune=2000, draws=2000,
    target_accept=0.95, method="nuts_numpyro",
    chain_method="vectorized")
```
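
If memory is the bottleneck, I could presumably run the chains sequentially instead of vectorized, which (assuming `chain_method` is simply forwarded to NumPyro's `MCMC`, as the fact that `"vectorized"` works suggests) should lower peak GPU memory at the cost of wall-clock time:

```
# Sketch: sequential chains trade wall-clock time for lower peak memory.
# This assumes chain_method is passed through to numpyro.infer.MCMC.
idata_sequential = model_hierarchical.fit(
    random_seed=random_seed, tune=2000, draws=2000,
    target_accept=0.95, method="nuts_numpyro",
    chain_method="sequential")
```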

`year.as.factor` is character.

`year` is numeric.

`superClusterID` is the lowest-level grouping factor with ~30,000 distinct values.

`country` is the middle-level grouping factor with ~90 distinct values.

`m49.region` is the highest-level grouping factor with 7 distinct values.

The data set has about 60,000 rows (each superClusterID has at least two measurements, i.e. two rows in `df`).
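
For what it's worth, the grouping structure can be checked with plain pandas:

```
# Sanity checks on the grouping structure described above.
print(len(df))                          # ~60,000 rows
print(df["superClusterID"].nunique())   # ~30,000 clusters
print(df["country"].nunique())          # ~90 countries
print(df["m49.region"].nunique())       # 7 regions
print(df.groupby("superClusterID").size().min())  # >= 2 rows per cluster
```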

In `lme4` you do not have to explicitly state the nestedness, i.e. that only some of the superClusterIDs share the same country and should be pooled accordingly; `(g)lmer()` figures that structure out by itself. Do I have to tell Bambi about the nested structure?
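
If so, I assume I would encode the nesting in the data itself by building combined grouping columns, something like this sketch (the `_nested` column names are my own invention, not anything Bambi requires):

```
# Sketch: make the nesting explicit by combining grouping columns, so that
# identical cluster codes in different countries are never pooled together.
df["country_nested"] = (
    df["m49.region"].astype(str) + ":" + df["country"].astype(str)
)
df["superClusterID_nested"] = (
    df["country"].astype(str) + ":" + df["superClusterID"].astype(str)
)
```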

In addition to the rapidly increasing computation time, another thing that worries me is the RAM requirement. When I run the model in Stan, it requires about 2.3 GB of RAM; when I run it in Bambi, Bambi allocates 55.6 GB on the CPU and about 12 GB on the GPU. EDIT: 1.2 GB on the GPU.
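
For perspective, my back-of-the-envelope estimate of the storage for the posterior draws alone (assuming 4 chains and float64 storage, both guesses on my part):

```
# Rough size of the posterior draws: draws * chains * parameters * 8 bytes.
# 4 chains and float64 storage are assumptions, not verified settings.
draws, chains, n_params = 2000, 4, 60_204
print(draws * chains * n_params * 8 / 1e9)  # ~3.9 GB
```

So roughly 4 GB for the draws themselves seems plausible, but that still leaves the 55.6 GB on the CPU unexplained.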

The data set I use is available here: http://hansekbrand.se/temp/water.without.europe.csv, and below is the script I use to run the model:

```
import arviz as az
import bambi as bmb
import numpy as np
import pandas as pd

df = pd.read_csv("water.without.europe.csv")

random_seed = 1234  # fixed seed for reproducibility

model_hierarchical = bmb.Model(
    "p(deprived, size) ~ 1 + year"
    " + (1 | year.as.factor) + (year | superClusterID)"
    " + (year | country) + (year | m49.region)",
    df,
    family="binomial")
idata_hierarchical = model_hierarchical.fit(
    random_seed=random_seed, tune=2000, draws=2000,
    target_accept=0.95, method="nuts_numpyro",
    chain_method="vectorized")
```

Am I right that Bambi "blows up" on this model/data? Or did I simply have unrealistic expectations? I'm grateful for any help/hints!