I’m fairly new to Bambi but think it could be great for some of the work I’m doing. However, I’m struggling with runtime and memory. I’m working on a fairly large machine with plenty of RAM and computing power. My dataset is extremely large (>1M observations) and my model has 4 fixed effects and 1 random effect. Is this going to be too large for Bambi?
A quick second question (and perhaps it’s because I have yet to get there)… is there an easy way to set_data with the Bambi model for out-of-sample predictions? I’m aware you can access the backend by doing something like model.backend.model.set_data, but will that act the same as pm.set_data?
Do you have a Bambi model which is already causing trouble, or you’re simply wondering if Bambi would work for your problem? The size of the model also depends on the nature of your effects (whether they’re numeric or categorical, and if categorical, how many levels they have).
For the second part, Bambi does not require you to do .set_data() manually for out-of-sample predictions. Bambi models have a .predict() method which has an optional data argument where you can pass a pandas.DataFrame for the out-of-sample data.
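For example, here is a minimal sketch with made-up data (the column names are just placeholders, not anything from your model):

```python
import bambi as bmb
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
train_df = pd.DataFrame({"x": rng.normal(size=100)})
train_df["y"] = 2 * train_df["x"] + rng.normal(size=100)
test_df = pd.DataFrame({"x": rng.normal(size=20)})

model = bmb.Model("y ~ x", train_df)
idata = model.fit()

# Out-of-sample predictions: pass the new data frame directly to .predict().
# Predictions are added to idata (in place by default); no manual
# set_data() call is needed.
model.predict(idata, data=test_df)
```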
I already have a Bambi model which is causing trouble: I attempt to sample and my kernel crashes. My model looks like the following in terms of data types: Numerical ~ Numerical1:Categorical1 + Categorical1*Categorical2 + Numerical2 + Numerical3 + (1 + Categorical1 | Categorical3). Apologies for not being able to post the model publicly. Categorical3 has a lot of levels (likely > 2k). Hope this helps outline my model.
The second part makes a ton of sense. Thank you so much!
Does it show any traceback? If yes, could you share it? I would like to see the error you get. You could change the names of the columns to keep the model details private.
What I would do in this case is start with a simple model (just a numerical predictor) and add one term at a time.
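For example, something along these lines (toy data and column names, not your actual model; short runs on a subset of the data are enough to see where memory blows up):

```python
import bambi as bmb
import numpy as np
import pandas as pd

# Toy data standing in for (a subset of) the real dataset.
rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame({
    "num1": rng.normal(size=n),
    "num2": rng.normal(size=n),
    "cat1": rng.choice(["a", "b", "c"], size=n),
    "group": rng.integers(0, 100, size=n).astype(str),
})
df["y"] = rng.normal(size=n)

# Add one term at a time and watch memory/runtime at each step.
formulas = [
    "y ~ num1",
    "y ~ num1 + num2",
    "y ~ num1 + num2 + cat1",
    "y ~ num1 + num2 + cat1 + (1 | group)",
    "y ~ num1 + num2 + cat1 + (1 + cat1 | group)",
]
for formula in formulas:
    model = bmb.Model(formula, df)
    idata = model.fit(draws=200, tune=200)  # short chains, just a smoke test
```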
I think the combination of the interaction effects you have (mixing numerical and categorical predictors) and the group-specific effect (random effect), which uses categorical variables for both the group and the predictor, is what creates a huge model under the hood.
It does not show any traceback, even when the memory crashes. Azure pops up with an error message at the top of the screen. Below is a screenshot of what I see; it stays like this until I receive the memory crash message. This is a different model than the one described above, but I’m still seeing the same issues. Any guidance on this would be great.
If it reaches the point where it starts sampling, it means the problem is not with the creation of the internal objects in Bambi. Have you tried creating smaller models first? Or maybe selecting a subset of the data? I would try that.
I have been attempting this now. Sampling works when I model with just the numerical variables (Dummy1 and Dummy**2) and even the random effect (1 | DummyGroup). The problem seems to stem from DummyCat. When I include it in the model, it doesn’t seem to run. Is this because NUTS can only sample from continuous features?
I am currently considering changing DummyCat to a random effect. I pinned the issue down to the prior I was defining. However, when I attempted a DiscreteUniform prior, the sampling still stalled even though it had set up a compound step where DummyCat was assigned to the Metropolis sampler. I’m unsure what sort of prior I would want to use otherwise. I’d essentially want each level in DummyCat to act like it would in lme4 when using as.factor(var), if that makes sense. I think a random effect might do the trick, though I’m not sure it will work in the previous model I discussed, since there are interactions with DummyCat.
Good that you narrowed it down to DummyCat! NUTS works for both types of features, so that’s not the issue. I guess the problem is that DummyCat contains many levels? How many levels does it have?
If you want DummyCat to work the same way as as.factor(var) in R, you need to include it the same way you’ve been doing so far. Using it as a random effect still includes it as a categorical variable, but since it’s a random effect you’ll have hierarchical priors and hence a shrinkage effect. I don’t know if that’s what you want.
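To make the contrast concrete, here’s a rough sketch with toy data (placeholder names, not your model) of including DummyCat as a common effect, which matches lme4’s as.factor(var), versus as a group-specific effect, which adds partial pooling:

```python
import bambi as bmb
import numpy as np
import pandas as pd

# Tiny toy dataset; column names mirror the placeholders used in this thread.
rng = np.random.default_rng(1)
n = 1_000
df = pd.DataFrame({
    "Dummy1": rng.normal(size=n),
    "DummyCat": rng.choice(list("abcde"), size=n),
    "DummyGroup": rng.integers(0, 50, size=n).astype(str),
})
df["y"] = rng.normal(size=n)

# Common ("fixed") effect: one coefficient per non-reference level, with
# independent priors. This is the as.factor(var) behaviour from lme4.
m_common = bmb.Model("y ~ Dummy1 + DummyCat + (1 | DummyGroup)", df)

# Group-specific ("random") effect: the level effects share a hierarchical
# prior, so they are partially pooled (shrunk toward a common mean).
m_group = bmb.Model("y ~ Dummy1 + (1 | DummyCat) + (1 | DummyGroup)", df)
```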
I think what’s going on here is that DummyCat contains so many levels that Bambi is building a huge design matrix under the hood. Then, there’s the computation of a dot product between that design matrix and a vector of parameters. And that is what can take so much memory. This is a problem I’m aware of with the current implementation of Bambi and I’m looking for alternative solutions (e.g. sparse matrices).
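As a back-of-the-envelope illustration of how a dense design matrix gets out of hand (the row and column counts below are assumptions based on the numbers mentioned earlier in this thread, roughly 1M observations and a grouping factor with about 2k levels):

```python
n_obs = 1_000_000        # assumed number of rows (observations)
n_group_levels = 2_000   # assumed number of levels in the grouping factor
n_cols_per_level = 2     # e.g. a group-specific intercept plus one slope

n_cols = n_group_levels * n_cols_per_level
gigabytes = n_obs * n_cols * 8 / 1e9  # float64 entries
print(f"Dense group-specific design matrix: ~{gigabytes:.0f} GB")
# ~32 GB, even though the matrix is almost entirely zeros,
# which is why a sparse representation would help a lot here.
```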
If you’re familiar with PyMC, we can try to come up with an equivalent model in PyMC that doesn’t rely on that design matrix.
Thank you for all your help. DummyCat contains 9 different levels. I attempted one-hot encoding, and it has the same issue. Even when I let Bambi choose the priors automatically, the issue occurs, so I’m not sure how to continue on this front.
I am familiar with PyMC, so I’d be happy to talk through some ideas on how to build the more complex model (i.e. something like this: Numerical ~ Numerical1:Categorical1 + Categorical1*Categorical2 + Numerical2 + Numerical3 + (1 + Categorical1 | Categorical3)).
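Great. To get the ball rolling, here’s a simplified sketch with toy data (not a full translation of your formula; it only covers the Numerical1:Categorical1 term, one extra numeric covariate, and the (1 + Categorical1 | Categorical3) part) showing how the group-specific effects can be written in PyMC with integer indexing instead of a dense dummy-coded design matrix:

```python
import numpy as np
import pandas as pd
import pymc as pm

# Toy data with the same structure as the placeholders in this thread.
rng = np.random.default_rng(2)
n = 5_000
df = pd.DataFrame({
    "num1": rng.normal(size=n),
    "num2": rng.normal(size=n),
    "cat1": rng.choice(["a", "b", "c"], size=n),
    "group": rng.integers(0, 200, size=n).astype(str),  # stands in for Categorical3
})
df["y"] = rng.normal(size=n)

cat1_idx, cat1_levels = pd.factorize(df["cat1"])
group_idx, group_levels = pd.factorize(df["group"])
coords = {"cat1": cat1_levels, "group": group_levels}

with pm.Model(coords=coords) as model:
    # Common effects
    intercept = pm.Normal("intercept", 0, 5)
    b_num1_by_cat1 = pm.Normal("b_num1_by_cat1", 0, 1, dims="cat1")  # num1:cat1
    b_num2 = pm.Normal("b_num2", 0, 1)

    # Group-specific effects: (1 + cat1 | group) expressed with integer
    # indexing, so no dense n x (n_groups * n_cat1) dummy matrix is built.
    sd_a = pm.HalfNormal("sd_a", 1)
    a_group = pm.Normal("a_group", 0, sd_a, dims="group")
    sd_b = pm.HalfNormal("sd_b", 1)
    b_group = pm.Normal("b_group", 0, sd_b, dims=("group", "cat1"))

    mu = (
        intercept
        + b_num1_by_cat1[cat1_idx] * df["num1"].to_numpy()
        + b_num2 * df["num2"].to_numpy()
        + a_group[group_idx]
        + b_group[group_idx, cat1_idx]
    )

    sigma = pm.HalfNormal("sigma", 1)
    pm.Normal("y", mu=mu, sigma=sigma, observed=df["y"].to_numpy())

    idata = pm.sample()
```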