Negative binomial fit over millions of data

snowdustdj · February 12, 2021, 7:58pm

I am fitting a negative binomial (NB) model to some count data. The goal is to get a distribution estimate for the mean of the NB distribution. The problem is that I have 1 million data points (or more). What would you suggest for scaling this up?

The model will be as simple as:

import pymc3 as pm
import pymc3.sampling_jax  # experimental JAX/NumPyro samplers

fml = "repurchase_count_4week ~ -1 + intercept_col"
with pm.Model() as model:
    pm.glm.GLM.from_formula(formula=fml, data=df_sample, family=pm.glm.families.NegativeBinomial())
    trace = pm.sampling_jax.sample_numpyro_nuts(1000, tune=1000, target_accept=0.9)

As you can see, I’m trying to use NumPyro and JAX, which helps. But I can’t find any documentation on how to enable the GPU. Right now, on my CPU machine with 10k rows, it took 11 minutes, and I need to scale to 100 times that size.

ricardoV94 · February 12, 2021, 8:20pm

If your model is as simple as you write, then MCMC sampling might be overkill. MLE is probably going to give you the very same results.
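A minimal sketch of the MLE route for an intercept-only NB model, using only the standard library. It assumes the mean/dispersion parameterization (variance = mu + mu^2 / alpha); the helper names `nb_logpmf`, `fit_nb_intercept_only` and `simulate_nb` are hypothetical, and the data are simulated rather than taken from the thread. Because the likelihood depends only on how often each count value occurs, the data are first compressed into a frequency table, so the cost of the optimization does not grow with the number of rows. The standard error of the mean is a plug-in approximation, not a posterior.

```python
import math
import random
from collections import Counter


def nb_logpmf(y, mu, alpha):
    """Log-pmf of an NB with mean mu and dispersion alpha (var = mu + mu**2 / alpha)."""
    return (math.lgamma(y + alpha) - math.lgamma(alpha) - math.lgamma(y + 1)
            + alpha * math.log(alpha / (alpha + mu))
            + y * math.log(mu / (alpha + mu)))


def fit_nb_intercept_only(counts):
    """Return (mu_hat, alpha_hat, se_mu) for an intercept-only NB model.

    Assumes at least one nonzero count and overdispersed data.
    """
    n = len(counts)
    freq = Counter(counts)      # value -> number of rows with that value
    mu_hat = sum(counts) / n    # in this parameterization the MLE of mu is the sample mean

    def profile_loglik(log_alpha):
        alpha = math.exp(log_alpha)
        return sum(k * nb_logpmf(y, mu_hat, alpha) for y, k in freq.items())

    # Golden-section search for the maximizing log(alpha) on a fixed bracket.
    lo, hi = -5.0, 7.0
    invphi = (math.sqrt(5) - 1) / 2
    c, d = hi - invphi * (hi - lo), lo + invphi * (hi - lo)
    fc, fd = profile_loglik(c), profile_loglik(d)
    while hi - lo > 1e-6:
        if fc > fd:             # maximum lies in [lo, d]
            hi, d, fd = d, c, fc
            c = hi - invphi * (hi - lo)
            fc = profile_loglik(c)
        else:                   # maximum lies in [c, hi]
            lo, c, fc = c, d, fd
            d = lo + invphi * (hi - lo)
            fd = profile_loglik(d)
    alpha_hat = math.exp((lo + hi) / 2)

    # Plug-in standard error of the sample mean under the fitted NB variance.
    se_mu = math.sqrt((mu_hat + mu_hat ** 2 / alpha_hat) / n)
    return mu_hat, alpha_hat, se_mu


def simulate_nb(n, mu, alpha, rng):
    """Simulate NB counts as a gamma-Poisson mixture (demo helper only)."""
    out = []
    for _ in range(n):
        lam = rng.gammavariate(alpha, mu / alpha)   # shape alpha, scale mu / alpha
        # Knuth's Poisson sampler; adequate for the small rates used here.
        threshold, k, p = math.exp(-lam), 0, rng.random()
        while p > threshold:
            k += 1
            p *= rng.random()
        out.append(k)
    return out


rng = random.Random(0)
counts = simulate_nb(20_000, mu=3.0, alpha=2.0, rng=rng)
mu_hat, alpha_hat, se_mu = fit_nb_intercept_only(counts)
print(f"mu_hat={mu_hat:.3f}  alpha_hat={alpha_hat:.3f}  se(mu)={se_mu:.4f}")
```

With covariates the sample-mean shortcut no longer applies, and a numerical optimizer over the regression coefficients would be needed instead.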

snowdustdj · February 12, 2021, 8:45pm

Thanks, I’ve considered that. Ultimately I want to build a framework to generate predictive distributions, but right now I’m testing the feasibility of using PyMC3.

Some background: Y is the purchase count and X are some features. Using a GLM with an NB distribution, I want to predict Y_hat for each person. Once the model is built, I want to identify the people with the most purchases, but with some measure of uncertainty. Even if person A is better than person B in terms of the MLE, I want to know the probability of that for later use.

junpenglao · February 13, 2021, 9:21am

I think this kind of use case is at the boundary of what you can do with PyMC3 and NumPyro, or any framework that tries to fit all the data at once. A huge problem here is numerical error: most PPLs compute sum(log_prob(proposal)) - sum(log_prob(current)), but you need sum(log_prob(proposal) - log_prob(current)), or something like the Kahan summation algorithm (see Wikipedia).
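To illustrate the numerical issue, here is a small sketch of Kahan (compensated) summation in plain Python. The list of identical terms is only a stand-in for per-observation log-probability terms, and `kahan_sum` is a hypothetical helper, not part of any PPL. The naive accumulation is written as an explicit loop because recent Python versions already use a compensated algorithm inside the built-in `sum` for floats.

```python
import math


def kahan_sum(values):
    """Compensated summation: carry the rounding error of each addition forward."""
    total = 0.0
    compensation = 0.0              # running estimate of the lost low-order bits
    for x in values:
        y = x - compensation        # correct the incoming term
        t = total + y               # low-order bits of y may be lost here
        compensation = (t - total) - y  # recover what was lost
        total = t
    return total


terms = [0.1] * 1_000_000           # stand-in for a million log-probability terms

naive = 0.0
for x in terms:                     # plain left-to-right accumulation
    naive += x

reference = math.fsum(terms)        # correctly rounded sum, used as ground truth
compensated = kahan_sum(terms)

print("naive error:      ", abs(naive - reference))
print("compensated error:", abs(compensated - reference))
```

The error of the naive loop grows with the number of terms, while the compensated sum stays within a few units in the last place. In single precision, which GPU back ends often use, the gap would be correspondingly larger.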

Currently, TensorFlow Probability is the only one I am aware of that is making an effort to handle these use cases: Add `experimental_use_kahan_sum` argument to `tfd.Independent` and `t… · tensorflow/probability@ad9ccda · GitHub
(The work is very much ongoing, so there is no good example of how to use it in MCMC yet)

Otherwise, I recommend variational inference, where you can subsample the training data and handle large data sets with existing deep learning infrastructure (for training and deploying the model).
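The reason subsampling works can be sketched without any inference library: the log-likelihood of a random minibatch, rescaled by N / batch_size, is an unbiased estimate of the full-data log-likelihood, so each optimization step only needs to touch a small batch. As far as I know, PyMC3 exposes this idea through `pm.Minibatch` and the `total_size` argument of observed variables. The code below is only an illustration of the rescaling on synthetic counts, with arbitrary parameter values and hypothetical helper names, not a variational fit.

```python
import math
import random


def nb_logpmf(y, mu, alpha):
    """Log-pmf of an NB with mean mu and dispersion alpha (var = mu + mu**2 / alpha)."""
    return (math.lgamma(y + alpha) - math.lgamma(alpha) - math.lgamma(y + 1)
            + alpha * math.log(alpha / (alpha + mu))
            + y * math.log(mu / (alpha + mu)))


rng = random.Random(1)
N, batch_size = 5_000, 500

# Synthetic count data (assumption: any count data would do for this demo).
counts = rng.choices(range(8), weights=[30, 25, 18, 12, 7, 4, 3, 1], k=N)

# Arbitrary parameter values at which the likelihood is evaluated.
mu, alpha = 2.0, 1.5

# Full-data log-likelihood: touches all N rows.
full = sum(nb_logpmf(y, mu, alpha) for y in counts)


def minibatch_estimate():
    """Rescaled log-likelihood of one random minibatch (touches batch_size rows)."""
    batch = rng.sample(counts, batch_size)
    return (N / batch_size) * sum(nb_logpmf(y, mu, alpha) for y in batch)


estimates = [minibatch_estimate() for _ in range(200)]
avg = sum(estimates) / len(estimates)

print(f"full log-likelihood:         {full:.1f}")
print(f"mean of minibatch estimates: {avg:.1f}")
```

Individual minibatch estimates are noisy, but stochastic optimizers such as those used in ADVI only need the estimate to be correct on average.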
