BG/NBD (Beta-Geometric/Negative Binomial) Model for Customer Lifetime Value (CLV)


#1

Hi,
I’m trying to implement the BG/NBD model from Fader et al 2005 (“Counting Your Customers” the Easy Way: An Alternative to the Pareto/NBD Model, link).

I think I’ve implemented the log-likelihood function correctly, but I’m having convergence issues that lead me to believe something is misspecified.
This is my first time implementing something like this myself, so I’d appreciate any help or advice out there. Thanks in advance!

Claus

(This is partially based on this post and this PyData tutorial related to the Pareto/NBD model.)

I’m using the CDNow dataset of purchasing data via the lifetimes library, which is used in the paper, so I’m pretty confident this should be applicable to a BG/NBD model.

import pandas as pd
from lifetimes.utils import summary_data_from_transaction_data
from lifetimes.datasets import load_dataset

cdnow_transactions = load_dataset(
    'CDNOW_sample.txt', 
    header=None, 
    delim_whitespace=True, 
    names=['customer_id', 'customer_index', 'date', 'quantity', 'amount'],
    converters={'date': lambda x: pd.to_datetime(x, format="%Y%m%d")}
)
rfm = summary_data_from_transaction_data(
    cdnow_transactions,
    'customer_id',
    'date',
    observation_period_end=pd.to_datetime('1997-09-30'),
    freq='W'
)

# Extract data for model following notation from Fader/Hardie
N = rfm.shape[0] # number of customers
x = rfm['frequency'].values
t_x = rfm['recency'].values
T = rfm['T'].values

Per the paper, this is the likelihood function:
L(\lambda, p | X=x, T) = (1-p)^x \lambda^x e^{-\lambda T} + \delta_{x>0}p(1-p)^{x-1}\lambda^x e^{-\lambda t_x}

Which I’ve transformed into this log-likelihood:
LL(\lambda, p | X=x, T) = xlog(1-p) + xlog(\lambda) -\lambda T \\+ \delta_{x>0}( log(p)+(x-1)log(1-p)+xlog(\lambda) -\lambda t_x )

And then implemented in PyMC3 like so:
(UPDATED per @junpenglao’s suggestion)

import pymc3 as pm
import theano.tensor as tt

class BGNBD(pm.Continuous):
    """
    Custom distribution class for BG/NBD likelihood.
    """

def __init__(self, lambda_, p, *args, **kwargs):
    super(BGNBD, self).__init__(*args, **kwargs)
    self.lambda_ = lambda_
    self.p = p
    
def logp(self, x, t_x, T):
    """
    Loglikelihood function for an individual customer's purchasing rate \lambda
    and probability \p of becoming inactive given their 
        - frequency (x)
        - recency (t_x) 
        - time since first purchase (T)
    """        
    lambda_ = self.lambda_
    p = self.p
    log_lambda = tt.log(lambda_)
    log_p = tt.log(p)
    log_1_p = tt.log(1-p)
    
    delta_x = tt.where(x>0, 1, 0)
    
    A1 = x*log_1_p + x*log_lambda - lambda_*T
    A2 = delta_x * (log_p + (x-1)*log_1_p + x*log_lambda - lambda_*t_x)
    
    return tt.log(tt.exp(A1)+tt.exp(A2))

I then specified the model with the following priors:

bgnbd_model = pm.Model()

with bgnbd_model:
    
    # Hyperparameters for lambda, i.e. rate of number of transactions
    r = pm.HalfCauchy('r', beta=2)
    alpha = pm.HalfCauchy('alpha', beta=2)

    # Gamma prior on purchasing rate parameter lambda
    # Note: per Fader et al, we use r and alpha to parameterize Gamma(lambda)
    lambda_ = pm.Gamma('lambda', alpha=r, beta=alpha, 
                        shape=N, testval=np.random.rand(N))
    
    # Hyperparameters for p, i.e. probabibility of dropout
    a = pm.HalfCauchy('a', beta=2)
    b = pm.HalfCauchy('b', beta=2)

    # Gamma prior on probability of dropout p
    p = pm.Beta('p', alpha=a, beta=b, shape=N, testval=np.random.rand(N))
    
    # Custom distribution for BG-NBD likelihood function
    loglikelihood = BGNBD("loglikelihood", p=p, lambda_=lambda_, 
                           observed={'x': x, 't_x': t_x, 'T': T})

I sampled this with anywhere from 2000 to 15,000 draws with various levels of tuning etc, e.g.:

# from Austin Rochford
SEED = 22 # from random.org, for reproducibility

SAMPLE_KWARGS = {
    'chains': 2,
    'n_jobs': 2,
    'draws': 2000,
    'tune': 1000,
    'target_accept': 0.8,
    'random_seed': [
        SEED,
        SEED + 1
    ]
}
with bgnbd_model:
    trace = pm.sample(**SAMPLE_KWARGS)

But encounter a lot of divergence and effective sample size issues:

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (3 chains in 4 jobs)
NUTS: [p_logodds__, b_log__, a_log__, lambda_log__, alpha_log__, r_log__]
100%|██████████| 6000/6000 [04:48<00:00, 20.78it/s]
There were 1 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.7110400521268769, but should be close to 0.8. Try to increase the number of tuning steps.
The estimated number of effective samples is smaller than 200 for some parameters.

#2

i didnt run the code but shouldn’t it be tt.log(tt.exp(A1)+tt.exp(A2))?
You should validate the logp implementation in numpy first.


#3

Thanks @junpenglao! I’m dating myself, but it’s been 30 years since I’ve had pre-calc, so I’m not sure how you got from the likelihood function to add those extra transformations?

I’ve updated the code and reran it. The traces look better now, but there are still issues resulting in very low effective samples.

I used lifetimes to get values for those parameters as benchmarks:

r: 0.29, alpha: 6.85, a: 0.67, b: 2.32

And here are the values from my trace, which except for a seem to be nowhere near. (b at least seems feasible.)


#4

@junpenglao also, how would go about testing this implementation of the log-likelihood function (in terms of lambda and p) in numpy given that the inputs are distributions? I’ve seen other implementations of this in numpy in terms of r, alpha, a and b that use MLE to solve it, but nothing using distributions for the unknowns.


#5

Sorry I didnt meant to sound harsh :sweat_smile: As a non-mathematician myself I make similar mistakes everyday, so a rule of thumb to me at least is to implement the likelihood function in numpy, take the log, and plug in some number to compare with the log version of the function.

Yes the estimation seems to be off, is the value from lifetimes MLE? The trace could definitely improve, but if there are no divergence it is not too bad. I would do sample_ppc to check how well the model fits, and then try replacing the HalfCauchy with HalfNormal and use more informative priors.


#6

No worries, I appreciate your prompt help, as always!

Yes, Cam essentially implemented the model from the paper in lifetimes, which uses MLE to estimate the parameter values.


#7

Reporting back. Turns out my original log-likelihood function with HalfNormal priors seems to do better in terms of convergence and effective samples. Any thoughts?

(moved logp inside the model for simplicty):

import theano.tensor as tt
from pymc3.math import log, exp

bgnbd_model = pm.Model()

with bgnbd_model:
    
    # Hyperparameters for lambda, i.e. rate of number of transactions
    r = pm.HalfNormal('r', sd=.5)
    alpha = pm.HalfNormal('alpha', sd=.5)

    # Gamma prior on purchasing rate parameter lambda
    # Note: per Fader et al, we use r and alpha to parameterize Gamma(lambda)
    lambda_ = pm.Gamma('lambda', alpha=r, beta=alpha, shape=N, testval=np.random.rand(N))
    
    # Hyperparameters for p, i.e. probability of dropout
    a = pm.HalfNormal('a', sd=.5)
    b = pm.HalfNormal('b', sd=.5)

    # Gamma prior on probabibility of dropout p
    p = pm.Beta('p', alpha=a, beta=b, shape=N, testval=np.random.rand(N))
    
    def logp(x, t_x, T):
        """
        Loglikelihood function for an individual customer's purchasing rate \lambda
        and probability \p of becoming inactive given their 
            - frequency (x)
            - recency (t_x) 
            - time since first purchase (T)
        """        
        
        log_lambda = log(lambda_)
        log_p = log(p)
        log_1_p = log(1-p)
        
        delta_x = tt.where(x>0, 1, 0)
        
        A1 = x*log_1_p + x*log_lambda - lambda_*T
        A2 = delta_x * (log_p + (x-1)*log_1_p + x*log_lambda - lambda_*t_x)

        return A1 + A2
    
    # Custom distribution for BG-NBD likelihood function
    loglikelihood = pm.DensityDist("loglikelihood", logp, observed={'x': x, 't_x': t_x, 'T': T}) 


#8

The fit looks quite good.

The estimation is still different from lifetimes, but most likely that’s due to the difference in parameterisation and preprocessing. If you really want to recover the same value, try dig into the R code and see if there are any normalisation and whether the model likelihood is coded the same way.


#9

Very cool @clausherther! Some thoughts:

  1. what dataset are you sampling on / how large is it? I wonder if its a small dataset - the priors may be having a strong influence on the results, hence the differences from lifetimes.
  2. Are you using any regularizer in lifetimes?

#10

Hi @CamDavidsonPilon!
The dataset is the small CDNow dataset included with lifetimes (N=2357). I run your RFM summary function on it and use the out-of-the-box BG/NBD fitter on it. I think this should closely reflect the setup from the paper?
I wasn’t sure about how to best apply the penalizer in this case to create a baseline for my projects. Do you have any recommendations?
Thanks!
Claus

from lifetimes.datasets import load_dataset
from lifetimes.utils import summary_data_from_transaction_data

cdnow_transactions = load_dataset(
    'CDNOW_sample.txt', 
    header=None, 
    delim_whitespace=True, 
    names=['customer_id', 'customer_index', 'date', 'quantity', 'amount'],
    converters={'date': lambda x: pd.to_datetime(x, format="%Y%m%d")}
) 

    
rfm = summary_data_from_transaction_data(
    cdnow_transactions,
    'customer_id',
    'date',
    observation_period_end=pd.to_datetime('1997-09-30'),
    freq='W'
)

N = rfm.shape[0] # number of customers
x = rfm['frequency'].values
t_x = rfm['recency'].values
T = rfm['T'].values # length of training period

print(N)

2,357

Then fit the BGF:

from lifetimes import BetaGeoFitter
bgf = BetaGeoFitter()
bgf.fit(x, t_x, T)

print(bgf.params_)

Output:

OrderedDict([('r', 0.2909361637762657),
             ('alpha', 6.8520660597477745),
             ('a', 0.6652655153880851),
             ('b', 2.320357432053001)])

#11

I’m still tinkering with this for a blog post, and it turns out this version of log-likelihood function seems to work the best in terms of recovering the parameter values of the original paper:

Note: I’m bringing back the logaddexp function @junpenglao originally suggested, but I move the delta function inside of it, but outside the exp() function.

def logp(x, t_x, T):
    """
    Loglikelihood function for an individual customer's purchasing rate \lambda
    and probability \p of becoming inactive given their 
        - frequency (x)
        - recency (t_x) 
        - time since first purchase (T)
    """    
    delta_x = where(x>0, 1, 0)

    A1 = x*log(1-p) + x*log(lambda_) - lambda_*T
    A2 = (log(p) + (x-1)*log(1-p) + x*log(lambda_) - lambda_*t_x)

    A3 = log(exp(A1) + delta_x * exp(A2))

    return A3

While this LL function doesn’t converge as well (Gelman-Rubin is under 1.1, but effective sample size for a and b is < 200 in 2000 samples) it recovers the following parameter values, which compare nicely with the original paper, as well as those returned by the lifetimes library using MLE on the same dataset:

r        0.242691
alpha    4.419125
a        0.792734
b        2.456735

from lifetimes:

r        0.2425929
alpha    4.413521
a        0.79289
b        2.42576

Full Model (I added fairly uninformative HalfNormal hyperparameters for the sd parameters):

from pymc3.math import log, exp, where, logsumexp

bgnbd_model = pm.Model()

with bgnbd_model:
    
    # Hyperparameters for lambda, i.e. rate of number of transactions
    sigma_r = pm.HalfNormal('sigma_r', 5.)
    sigma_alpha = pm.HalfNormal('sigma_alpha', 5.)
    r = pm.HalfNormal('r', sd=sigma_r)
    alpha = pm.HalfNormal('alpha', sd=sigma_alpha)

    # Hyperparameters for p, i.e. probability of dropout
    sigma_a = pm.HalfNormal('sigma_a', 5.)
    sigma_b = pm.HalfNormal('sigma_b', 5.)
    a = pm.HalfNormal('a', sd=sigma_a)
    b = pm.HalfNormal('b', sd=sigma_b)

    # Gamma prior on purchasing rate parameter lambda
    # Note: per Fader et al, we use r and alpha to parameterize Gamma(lambda)
    lambda_ = pm.Gamma('lambda', alpha=r, beta=alpha, shape=N, testval=np.random.rand(N))
    
    # Beta prior on probabibility of dropout p
    p = pm.Beta('p', alpha=a, beta=b, shape=N, testval=np.random.rand(N))
    
    def logp(x, t_x, T):
        """
        Loglikelihood function for an individual customer's purchasing rate \lambda
        and probability \p of becoming inactive given their 
            - frequency (x)
            - recency (t_x) 
            - time since first purchase (T)
        """    
        delta_x = where(x>0, 1, 0)
        
        A1 = x*log(1-p) + x*log(lambda_) - lambda_*T
        A2 = (log(p) + (x-1)*log(1-p) + x*log(lambda_) - lambda_*t_x)

        A3 = log(exp(A1) + delta_x * exp(A2))
        
        return A3
    
    # Custom distribution for BG-NBD likelihood function
    loglikelihood = pm.DensityDist("loglikelihood", logp, observed={'x': x, 't_x': t_x, 'T': T})


#12

Note: this uses the following dataset (n=2,357), which seems to match the dataset used in the paper the best

from lifetimes.datasets import load_cdnow_summary

rfm = load_cdnow_summary()

N = rfm.shape[0] # number of customers
x = rfm['frequency'].values
t_x = rfm['recency'].values
T = rfm['T'].values # length of training period

#13

Cool! Looking forward to see the blog post!