Business question: What are your practices for using the BG/NBD model in terms of training set size, and how do you scale it to larger datasets?

Absolutely, we can create a more robust model with covariates and potentially with hierarchies. The open question is still scaling, the engineering side of it. If our smallest meaningful slice of customers in a segment is, say, 1 million customers, I'd want to know how others have successfully built and applied this model at that scale.

We’re currently using a VertexAI cluster with 16 vCPUs and 104 GB RAM, and the model only utilizes about 30% of CPU capacity. In tests estimating survival probability and future purchases over a time horizon, we’re potentially looking at days to fit a reasonably sized dataset.

Colt has suggested using point estimates for our forecasting, and that’s on my agenda today. I’ll report back here.
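
For reference, here's a minimal sketch of one point-estimate path, using the lifetimes library's BetaGeoFitter, which fits the BG/NBD parameters by maximum likelihood and gives closed-form per-customer estimates. This is just an assumption about what "point estimates" could look like in practice, not necessarily what Colt means; the input path, column names, and 90-day horizon are placeholders.

```python
import pandas as pd
from lifetimes import BetaGeoFitter

# Assumed RFM summary frame with one row per customer and columns
# "frequency", "recency", "T" (names and path are placeholders).
rfm = pd.read_parquet("rfm_summary.parquet")

# Maximum-likelihood fit -> point estimates of the BG/NBD parameters.
bgf = BetaGeoFitter(penalizer_coef=0.001)
bgf.fit(rfm["frequency"], rfm["recency"], rfm["T"])

# Vectorized per-customer estimates: P(alive) and expected purchases
# over the next 90 days (horizon is illustrative).
rfm["p_alive"] = bgf.conditional_probability_alive(
    rfm["frequency"], rfm["recency"], rfm["T"]
)
rfm["exp_purchases_90d"] = bgf.conditional_expected_number_of_purchases_up_to_time(
    90, rfm["frequency"], rfm["recency"], rfm["T"]
)
```

Both prediction calls are vectorized NumPy expressions, so scoring a million rows is cheap relative to fitting.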

We’re also looking at paths that include Dask for distributing the pandas DataFrames, asyncio for improving CPU utilization, and PySpark, which would require something of a refactor of the model object’s methods. If you or anyone else has successfully scaled the BG/NBD model to a million or more customers, I’d love to learn how it was done.
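
On the Dask path, here's a rough sketch of the kind of chunked scoring I have in mind, continuing with the `rfm` frame and fitted `bgf` from the sketch above (the partition count and horizon are arbitrary):

```python
import dask.dataframe as dd

def score_partition(pdf, model, horizon=90):
    """Add per-customer BG/NBD estimates to one pandas partition."""
    pdf = pdf.copy()
    pdf["p_alive"] = model.conditional_probability_alive(
        pdf["frequency"], pdf["recency"], pdf["T"]
    )
    pdf["exp_purchases"] = model.conditional_expected_number_of_purchases_up_to_time(
        horizon, pdf["frequency"], pdf["recency"], pdf["T"]
    )
    return pdf

# Split the customer-level frame into chunks the scheduler can spread
# across the 16 vCPUs; 64 partitions is an arbitrary starting point.
ddf = dd.from_pandas(rfm, npartitions=64)

# Infer the output schema from a tiny sample so Dask knows the dtypes.
meta = score_partition(rfm.head(2), bgf)

scored = ddf.map_partitions(score_partition, bgf, meta=meta).compute()
```

Because `map_partitions` hands each worker a plain pandas DataFrame, the model object's methods can be used as-is, which is the appeal of this route over the PySpark refactor.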
