What is the correct way to validate a customer being alive in a BTYD model?

@pymc-bot
I’m using the BetaGeoNBD, ParetoNBD and Modified BetaGeoNBD models to predict the likelihood of a customer being alive at some future week n, for a range of horizons, say n in [4, 8, 13, 17, 22, 26].

The models are trained on data aggregated weekly and the output is the likelihood of each customer being alive at 4 weeks, 8 weeks, etc. I want to validate these predictions against observed data for the same time period.

  1. Is it the correct interpretation to validate against a forecast that is 4 weeks, 8 weeks, etc. out from the end of the training period? I.e., if my training data ends on 5/31/2024, should I validate the 4-week prediction against 6/28/2024, which is 4 weeks later? Or is it 4 weeks from the t_start_eval parameter?

  2. What is the correct interpretation of “a customer is alive”? Because palive is monotonically decreasing from period 1 through period n where n>1, I have validated a customer as being alive at period n if the customer has transacted between period 1 and period n. However, my validation is not very strong, and I wonder if a customer is actually still alive from the model’s perspective if they transact at some point AFTER period n in the future.

In the context of validating a customer being alive in a Buy Till You Die (BTYD) model, such as BetaGeoNBD, ParetoNBD, or Modified BetaGeoNBD, the correct approach involves comparing the model’s predictions with actual observed data. When you predict the likelihood of a customer being alive at future weeks (e.g., 4, 8, 13 weeks, etc.), you should validate these predictions against the observed transactions that occur at those specific future weeks relative to the end of your training period.

For example, if your training data ends on 5/31/2024, you would validate the 4-week prediction against the observed data on 6/28/2024, which is 4 weeks later. This approach aligns the forecast period directly with the time elapsed since the end of the training data.
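
As a quick check of that date arithmetic, here is a minimal sketch that counts each horizon from the end of training. The end date and the horizons come from the question; the variable names are only illustrative.

```python
from datetime import date, timedelta

train_end = date(2024, 5, 31)             # last day of the training window
horizons_weeks = [4, 8, 13, 17, 22, 26]   # forecast horizons, in weeks

# Each validation date is measured from the end of training.
validation_dates = {n: train_end + timedelta(weeks=n) for n in horizons_weeks}

for n, d in validation_dates.items():
    print(f"{n:>2} weeks -> {d.isoformat()}")
```

The 4-week horizon lands on 2024-06-28, matching the date in the question.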

Regarding the interpretation of “a customer is alive,” it typically means that the customer is still active or likely to make a purchase. In the context of your validation, if a customer has transacted between period 1 and period n, they can be considered alive at period n. However, if a customer transacts after period n, it suggests they were indeed alive at period n, even if they did not transact exactly at that time. This is because the model’s perspective of a customer being alive is probabilistic and based on the likelihood of future transactions, not just past ones. Therefore, a customer transacting after period n supports the model’s prediction of them being alive at period n.
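
The two labelling rules discussed here can be compared side by side. The sketch below is only an illustration: the customer IDs and purchase dates are made up, and `bought_in_window` and `seen_at_or_after` are hypothetical helper names, not part of any library.

```python
from datetime import date, timedelta

train_end = date(2024, 5, 31)
cutoff = train_end + timedelta(weeks=4)   # the 4-week horizon

# Hypothetical holdout purchases per customer (all after train_end).
holdout = {
    "A": [date(2024, 6, 10)],   # buys inside the window, never again
    "B": [date(2024, 7, 15)],   # silent inside the window, buys later
    "C": [],                    # no purchases in the holdout period
}

def bought_in_window(purchases, start, end):
    """Rule from the question: any purchase in (start, end]."""
    return any(start < d <= end for d in purchases)

def seen_at_or_after(purchases, when):
    """Alternative rule: any purchase on or after `when`."""
    return any(d >= when for d in purchases)

labels = {
    cid: {
        "window_rule": bought_in_window(p, train_end, cutoff),
        "later_rule": seen_at_or_after(p, cutoff),
    }
    for cid, p in holdout.items()
}

for cid, row in labels.items():
    print(cid, row)
```

Customer A and customer B get opposite labels under the two rules. Under the usual BTYD assumption that a customer who has dropped out never returns, a purchase on or after the cutoff shows the customer was alive at the cutoff. A purchase inside the window only shows they were alive at the time of that purchase, and a customer with no later purchase (like C) may still be alive but inactive.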

This answer was generated by the PyMC Assistant bot in response to your command. Please let us know if it was helpful!

Those things you are mentioning like “BetaGeoNBD” are not models, but just components that can be used to define likelihoods. I don’t know how those are being used to predict whether a customer is alive. Are those distributions over their life span like in a survival model?

Yes. The idea in time series is that you can only fairly train on the past and predict the future. And when you validate probabilistic predictions, you’re not looking for whether the means match, but whether the predicted probability intervals are well calibrated. That is, if you look at 50% posterior intervals, how many of the true values do they contain? For a well-calibrated model it should be about half.
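
For a binary outcome like alive/not alive, one common way to check calibration is a reliability table: bin customers by predicted probability and compare the mean prediction in each bin with the observed fraction. This swaps in a binned check for the interval check described above, and the numbers below are made up purely for illustration.

```python
def calibration_table(p_pred, y_obs, n_bins=5):
    """Group predictions into equal-width bins and compare them to outcomes.

    Returns one tuple per non-empty bin:
    (bin_lower, bin_upper, count, mean_predicted, observed_fraction)
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(p_pred, y_obs):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))

    rows = []
    for i, items in enumerate(bins):
        if not items:
            continue
        mean_p = sum(p for p, _ in items) / len(items)
        frac = sum(y for _, y in items) / len(items)
        rows.append((i / n_bins, (i + 1) / n_bins, len(items), mean_p, frac))
    return rows

# Hypothetical predicted P(alive) at one horizon and observed 0/1 labels.
p_pred = [0.10, 0.15, 0.50, 0.55, 0.90, 0.95]
y_obs = [0, 0, 1, 0, 1, 1]

rows = calibration_table(p_pred, y_obs)
for lo, hi, n, mean_p, frac in rows:
    print(f"[{lo:.1f}, {hi:.1f}) n={n} predicted={mean_p:.3f} observed={frac:.3f}")
```

If the model is well calibrated, the predicted and observed columns should roughly agree in every bin once there are enough customers per bin.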

Technically, the only thing you know is the last time they showed signs of life. That means you need to model and then infer their alive/dead status somehow. In Bayes, you’ll be averaging over their alive/dead status based on the probability of each.
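
As one concrete instance of inferring that status, the BG/NBD model has a closed-form P(alive) given a customer's purchase count x, the time of their last purchase t_x, and the observation length T. The sketch below assumes that standard formula and averages it over posterior draws; the draws and the customer summary are hypothetical values, not fitted results.

```python
def bgnbd_p_alive(x, t_x, T, r, alpha, a, b):
    """P(alive at T) under BG/NBD for a customer with x repeat purchases.

    In this model a customer can only drop out right after a purchase,
    so a customer with no repeat purchases (x == 0) is alive with probability 1.
    """
    if x == 0:
        return 1.0
    ratio = (a / (b + x - 1)) * ((alpha + T) / (alpha + t_x)) ** (r + x)
    return 1.0 / (1.0 + ratio)

# Hypothetical posterior draws of (r, alpha, a, b).
draws = [
    (0.25, 4.0, 0.8, 2.5),
    (0.30, 4.5, 0.7, 2.2),
    (0.22, 3.8, 0.9, 2.8),
]

# Hypothetical customer: 2 repeat purchases, last one at week 10, observed for 30 weeks.
x, t_x, T = 2, 10.0, 30.0

per_draw = [bgnbd_p_alive(x, t_x, T, *theta) for theta in draws]
p_alive = sum(per_draw) / len(per_draw)   # posterior-averaged P(alive)
print(round(p_alive, 3))
```

The longer a previously active customer stays silent (T grows while t_x stays fixed), the lower this probability becomes, which is the "last sign of life" reasoning expressed as a formula.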