Negative binomial fit over millions of data points

I am fitting a negative binomial (NB) model to some count data. The goal is to get a distribution estimate for the mean of the NB distribution. The problem is that I have 1 million data points (or more). How would you suggest scaling this up?

The model will be as simple as:

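import pymc3 as pm
import pymc3.sampling_jax  # experimental submodule; requires jax and numpyro

fml = "repurchase_count_4week ~ -1 + intercept_col"
with pm.Model() as model:
    # Intercept-only GLM with a negative binomial likelihood
    pm.glm.GLM.from_formula(formula=fml, data=df_sample, family=pm.glm.families.NegativeBinomial())
    trace = pm.sampling_jax.sample_numpyro_nuts(1000, tune=1000, target_accept=0.9)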

As you can see, I'm trying to use NumPyro and JAX, which helps. But I can't find any documentation on how to enable the GPU. Right now, on my CPU machine with 10k data points, it takes 11 minutes, and I need to scale to 100 times that size.

If your model is as simple as you describe, then MCMC sampling might be overkill. MLE is probably going to give you the very same results.
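
To make the MLE suggestion concrete, here is a minimal sketch of a maximum-likelihood fit for an intercept-only NB model using SciPy. Everything in it is an assumption for illustration: the function name `fit_nb_mle`, the simulated data, and the mean/dispersion parametrization (Var[y] = mu + mu**2 / alpha) are not from the original post, and the standard error is a plain large-sample approximation rather than a full posterior.

```python
import numpy as np
from scipy import optimize, stats


def fit_nb_mle(y):
    """Maximum-likelihood fit of an intercept-only negative binomial.

    Returns (mu_hat, alpha_hat, se_mu) in the mean/dispersion
    parametrization, where Var[y] = mu + mu**2 / alpha.
    """
    y = np.asarray(y)

    def neg_log_lik(params):
        # Optimize on the log scale so mu and alpha stay positive.
        log_mu, log_alpha = params
        mu, alpha = np.exp(log_mu), np.exp(log_alpha)
        # scipy's nbinom uses (n, p): n = alpha, p = alpha / (alpha + mu)
        return -stats.nbinom.logpmf(y, alpha, alpha / (alpha + mu)).sum()

    start = np.array([np.log(y.mean() + 1e-8), 0.0])
    res = optimize.minimize(neg_log_lik, start, method="Nelder-Mead")
    mu_hat, alpha_hat = np.exp(res.x)
    # Large-sample standard error of the mean: sqrt(Var[y] / n)
    se_mu = np.sqrt((mu_hat + mu_hat**2 / alpha_hat) / y.size)
    return mu_hat, alpha_hat, se_mu


# Simulated stand-in for the real counts (hypothetical values).
rng = np.random.default_rng(0)
true_mu, true_alpha = 3.0, 2.0
y = rng.negative_binomial(true_alpha, true_alpha / (true_alpha + true_mu), size=100_000)

mu_hat, alpha_hat, se_mu = fit_nb_mle(y)
print(f"mu = {mu_hat:.3f} +/- {se_mu:.3f}, alpha = {alpha_hat:.3f}")
```

With this many observations the interval mu_hat +/- 2 * se_mu should be close to what a posterior for the mean would give under weak priors, which is the sense in which MCMC may be unnecessary here.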


Thanks, I’ve considered that. Ultimately I want to build a framework to generate predictive distributions, but right now I’m testing the feasibility of using PyMC3.

Some background: Y is the purchase count and X are some features. Using a GLM with an NB distribution, I want to predict Y_hat for each person. After the model is built, I want to identify the people with the most purchases, but with some uncertainty attached. Even if person A is better than person B in terms of the MLE, I want to know the probability of that for later use.

I think this kind of use case is at the boundary of what you can do with PyMC3 and NumPyro, or any framework that tries to fit all the data at the same time. A huge problem here is numerical error: most PPLs compute sum(log_prob(proposal)) - sum(log_prob(current)), but you would need sum(log_prob(proposal) - log_prob(current)), or something like the Kahan summation algorithm (see Wikipedia).
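
For readers unfamiliar with it, here is a minimal pure-Python sketch of Kahan (compensated) summation next to a plain running sum. The helper names and the toy input of one million copies of 0.1 are illustrative assumptions; they stand in for a long vector of per-observation log-probability terms.

```python
import math


def naive_sum(values):
    """Plain left-to-right floating-point accumulation."""
    total = 0.0
    for x in values:
        total += x
    return total


def kahan_sum(values):
    """Kahan compensated summation."""
    total = 0.0
    compensation = 0.0  # running estimate of the low-order bits lost so far
    for x in values:
        y = x - compensation            # apply the correction to the next term
        t = total + y                   # low-order bits of y may be lost here
        compensation = (t - total) - y  # recover what was just lost
        total = t
    return total


values = [0.1] * 1_000_000
naive_total = naive_sum(values)
kahan_total = kahan_sum(values)
exact_total = math.fsum(values)  # correctly rounded reference sum

print("naive error:", abs(naive_total - exact_total))
print("kahan error:", abs(kahan_total - exact_total))
```

The explicit `naive_sum` loop is used instead of the built-in `sum` because recent Python versions already apply a compensated algorithm to floats, which would hide the effect.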

Currently, TensorFlow Probability is the only one I am aware of that is making an effort to handle these use cases: Add `experimental_use_kahan_sum` argument to `tfd.Independent` and `t… · tensorflow/probability@ad9ccda · GitHub
(The work is very much ongoing, so there is no good example of how to use it in MCMC yet)

Otherwise, I recommend using variational inference, where you can subsample the training data and handle large data sets with existing deep learning infrastructure (for training and deploying the model).
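
The reason subsampling is workable is that a minibatch log-likelihood, rescaled by N / batch_size, is an unbiased estimate of the full-data log-likelihood, so each optimization step only touches a small batch. The NumPy/SciPy sketch below checks that property numerically. It is not a variational inference implementation, and the function names, the simulated data, and the fixed parameter values are assumptions made for illustration.

```python
import numpy as np
from scipy import stats


def nb_log_lik(y, mu, alpha):
    """Summed NB log-likelihood in the mean/dispersion parametrization."""
    return stats.nbinom.logpmf(y, alpha, alpha / (alpha + mu)).sum()


def minibatch_log_lik(y, mu, alpha, batch_size, rng):
    """Rescaled log-likelihood of a random minibatch (sampled with replacement)."""
    idx = rng.integers(0, y.size, size=batch_size)
    return (y.size / batch_size) * nb_log_lik(y[idx], mu, alpha)


# Simulated stand-in for the real counts (hypothetical values).
rng = np.random.default_rng(1)
mu, alpha = 3.0, 2.0
y = rng.negative_binomial(alpha, alpha / (alpha + mu), size=200_000)

full = nb_log_lik(y, mu, alpha)
estimates = np.array(
    [minibatch_log_lik(y, mu, alpha, 2_000, rng) for _ in range(200)]
)

print("full-data log-likelihood:", full)
print("mean of minibatch estimates:", estimates.mean())
print("relative spread of one estimate:", estimates.std() / abs(full))
```

Each single estimate is noisy, which is why stochastic optimizers average over many steps. In PyMC3 the same rescaling is requested with the `total_size` argument on the observed variable when feeding it a `pm.Minibatch`.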
