Business question: What are your practices for using the BG/NBD model in terms of size of training, and how you scale to larger datasets?

In your experience as users, how do you approach estimating attrition, retention and projected spend for large datasets of customers? How big have your use cases been?

I ask because in the examples I’ve seen, the datasets used for BG/NBD are small - usually 5,000 or fewer customers. I understand the value of sharing a small dataset that can be operationalized as a proof of concept very rapidly. How have you approached the problem for larger sets of customers, say 1 million or more?

Do you train different models for different types of customers - say new vs existing vs lapsed, or by region or by line-of-business, retail banking vs online banking vs drive-thru banking?

Just looking to learn from the experience here.

I’m by no means an expert on BG/NBD models, but generally, if you have access to more data (and more predictors as you’ve alluded here), I think you have the opportunity to create a more robust model by capturing variability at different hierarchies of the data and with partial pooling. Training different models for different types of customers would be appropriate if you believe there’s nothing to be gained from sharing information across customer segments. But if there are common purchasing patterns across groups, a hierarchical model with pooling would be beneficial. I’m sure others can address BG/NBD more specifically.
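As a generic illustration of what partial pooling across customer segments looks like in PyMC (this is a toy Poisson purchase-count model with made-up data, not a BG/NBD implementation):

```python
# Toy example of partial pooling across segments in PyMC.
# Segment-level purchase rates are shrunk toward a shared population mean.
import numpy as np
import pymc as pm

rng = np.random.default_rng(42)
segments = ["retail", "online", "drive_thru"]
seg_idx = rng.integers(0, len(segments), size=500)            # segment of each customer
purchases = rng.poisson(np.array([2.0, 3.5, 1.0])[seg_idx])   # toy purchase counts

with pm.Model(coords={"segment": segments}) as model:
    # Population-level priors shared by all segments
    mu = pm.Normal("mu", 0.0, 1.0)
    sigma = pm.HalfNormal("sigma", 1.0)

    # Segment-level (log) purchase rates, partially pooled toward mu
    log_lam = pm.Normal("log_lam", mu=mu, sigma=sigma, dims="segment")

    pm.Poisson("obs", mu=pm.math.exp(log_lam)[seg_idx], observed=purchases)

    idata = pm.sample()
```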


@jeremyloscheider

How have you approached the problem for larger sets of customers, say 1 million or more?

For large datasets, BG/NBD and other BTYD models like Pareto/NBD are well-behaved when fitting point estimates of the parameters via Maximum a Posteriori (MAP), thanks to the conjugate prior assumptions underlying them (don’t use MAP for any other Bayesian model, though!). You’d lose credible intervals for parameters and predictions with MAP, but these models have a dimension for each customer, so RAM limitations are a consideration. On that note, pymc-marketing also has an open PR right now for ADVI, which could enable minibatch model fits on GPUs.
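Here’s a minimal sketch of a MAP fit with pymc-marketing. It assumes a version of the library where BetaGeoModel takes an RFM summary DataFrame with customer_id, frequency, recency, and T columns, so check the docs for your release:

```python
# Minimal sketch of a MAP fit. The three-row DataFrame is toy data,
# only there to show the expected input shape.
import pandas as pd
from pymc_marketing.clv import BetaGeoModel

summary = pd.DataFrame(
    {
        "customer_id": [0, 1, 2],
        "frequency": [4, 0, 7],        # repeat purchase counts
        "recency": [30.4, 0.0, 36.1],  # time of last repeat purchase
        "T": [38.9, 38.9, 38.9],       # customer "age" in the same time units
    }
)

model = BetaGeoModel(data=summary)
model.fit(fit_method="map")   # point estimates only; no credible intervals
print(model.fit_summary())
```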

@Keith_Min hierarchical support is definitely something I’m looking to add for the Gamma-Gamma spend model later this year, but there are additional considerations for BG/NBD and other transaction models: they have strong population assumptions that would be violated by segmenting on spending behavior alone. However, it could be viable for geographical regions.

Do you train different models for different types of customers - say new vs existing vs lapsed, or by region or by line-of-business, retail banking vs online banking vs drive-thru banking?

BG/NBD works best for retail transactions; don’t use it for subscription renewals. In a retail setting, customer lapse is unobservable, which is precisely what these models were built to estimate. Your banking example would be a good application for static covariates, which are currently supported by the Pareto/NBD model.
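Something along these lines, with the caveat that the exact model_config keys may differ depending on your pymc-marketing version, so double-check the ParetoNBDModel docs:

```python
# Hedged sketch of static covariates with the Pareto/NBD model. The
# model_config keys below (purchase_covariate_cols / dropout_covariate_cols)
# and the online_banking column are assumptions for illustration only.
import pandas as pd
from pymc_marketing.clv import ParetoNBDModel

summary = pd.DataFrame(
    {
        "customer_id": [0, 1, 2],
        "frequency": [4, 0, 7],
        "recency": [30.4, 0.0, 36.1],
        "T": [38.9, 38.9, 38.9],
        "online_banking": [1, 0, 1],  # hypothetical static covariate
    }
)

model = ParetoNBDModel(
    data=summary,
    model_config={
        "purchase_covariate_cols": ["online_banking"],
        "dropout_covariate_cols": ["online_banking"],
    },
)
model.fit(fit_method="map")
```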


Absolutely, we can create a more robust model with covariates and potentially with hierarchies. The question still remains about scaling and the engineering of it. If our smallest meaningful slice of customers in a segment is, say, 1 million customers, I’d want to know how others have successfully executed the creation and application of this model.

We’re using a VertexAI cluster right now with 16 vCPUs and 104 GB of RAM, and we’re seeing that the model only uses about 30% of CPU capacity. Running tests on estimating survival and future purchases over a time period, we’re potentially looking at days to fit a reasonably sized dataset.

Colt has suggested using point estimates for our forecasting and that’s on my agenda today. I’ll report back here.

We’re also looking at paths that include Dask for distributing the pandas dataframes, asyncio for scaling the CPU use, and PySpark, which would require something of a refactor of the methods on the model object. If you or anyone else has successfully scaled the BG/NBD model for a million or more customers, I’d love to learn how it was done.
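For the Dask path, the rough pattern I have in mind is below; score_partition is just a placeholder for whatever per-customer prediction call we end up using once the model is fit:

```python
# Rough sketch: fit once, then score customers in parallel partitions.
import dask.dataframe as dd
import pandas as pd

def score_partition(part: pd.DataFrame) -> pd.DataFrame:
    # Placeholder: apply the fitted model's predictions to one chunk
    part = part.copy()
    part["expected_purchases"] = 0.0
    return part

# Toy stand-in for the full RFM summary
customers = pd.DataFrame(
    {"customer_id": range(1_000_000), "frequency": 0, "recency": 0.0, "T": 52.0}
)

ddf = dd.from_pandas(customers, npartitions=64)
scored = ddf.map_partitions(score_partition).compute()
```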


Colt, thank you; MAP is my current path. Losing the credible intervals at this stage of our development is fine, but we may revisit that in the future. With our compute model we can scale and manage RAM fairly well.

Your point about hierarchical support is intriguing and I’d love to be part of a future discussion around that. If there are strong population assumptions that are violated by segmenting on spend alone, what would you think of segmentation based on preferred channel or tenure?

Your point on subscriptions is well-taken.


I missed the main point of your question! Sorry about that. Are you just using the .find_MAP method for the MAP? Also, @ColtAllen can you recommend some reads re: the conjugate prior assumptions for BG/NBD and Pareto/NBD?

There’s a list of conjugate prior distributions in the Wikipedia article on conjugate priors.

You can review the modeling assumptions for BG/NBD and Pareto/NBD in their respective papers.
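Roughly, for BG/NBD the heterogeneity assumptions are: while a customer is active, purchasing follows a Poisson process whose rate varies across customers, and after each purchase the customer drops out with some probability that also varies across customers:

```latex
% BG/NBD heterogeneity assumptions (Fader, Hardie & Lee 2005)
\begin{align*}
  \lambda &\sim \mathrm{Gamma}(r, \alpha)
    && \text{(conjugate to the Poisson purchase process)} \\
  p &\sim \mathrm{Beta}(a, b)
    && \text{(conjugate to the shifted-geometric dropout process)}
\end{align*}
```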


Sorry for the confusing choice of words in my last message; I meant to say purchasing behavior (i.e., frequency/recency/T) rather than spending behavior, so don’t segment on tenure. Preferred channel may work better as a covariate.

Fitting with model.fit(fit_method='map') allows us to scale sufficiently. I can train on 100k customers in about 6 seconds and 1 million in about 20 seconds.

Here’s a related question. If I have 20 million customers, would you recommend I train the model on all of them, or train on a representative sample of, say, 1 million or 5 million customers and then apply that model’s output to the out-of-sample customers?


This would depend on your use case. A truly representative sample may be difficult to obtain, but it isn’t necessarily required if you wish to compare results between geographical regions, for example.
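For what it’s worth, since the fitted parameters are population-level, one workable pattern is to fit on a sample and then run the prediction methods over the full customer base. A rough sketch, with the caveat that method names and signatures may differ across pymc-marketing versions:

```python
# Rough sketch: fit on a sample, then score every customer (including ones
# the model never saw) from their own frequency/recency/T.
import numpy as np
import pandas as pd
from pymc_marketing.clv import BetaGeoModel

rng = np.random.default_rng(0)
n = 10_000  # toy stand-in for the full 20M-customer RFM summary
frequency = rng.poisson(2, n)
full_summary = pd.DataFrame(
    {
        "customer_id": np.arange(n),
        "frequency": frequency,
        "recency": np.where(frequency == 0, 0.0, rng.uniform(1, 52, n).round(1)),
        "T": 52.0,
    }
)

# Fit population-level parameters on a sample of customers
model = BetaGeoModel(data=full_summary.sample(n=2_000, random_state=0))
model.fit(fit_method="map")

# Score the entire customer base with the fitted model
p_alive = model.expected_probability_alive(data=full_summary)
exp_purchases = model.expected_purchases(data=full_summary, future_t=90)
```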