Shifted Beta Geometric (sBG) distribution in pymc3

I’m kind of new to Bayesian modeling and very new to Python. I’m trying to model the shifted beta geometric (sBG) distribution just like Daniel Weitzenfeld did. The model is based on a paper published by Fader and Hardie. The problem is that I can’t get close to Daniel’s results, which are very similar to the ones in the paper. He’s using pymc and I’m using pymc3 (with the DensityDist function). My beta posterior keeps growing up to whatever upper boundary I set on its prior. I may be messing something up inside the model context, but I can’t figure it out. Any help will be appreciated!
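
For reference, these are the sBG recursions from the Fader and Hardie paper that the code below implements (churn probability per period and the survival function):

P(T=1) = alpha / (alpha + beta)
P(T=t) = ((beta + t - 2) / (alpha + beta + t - 1)) * P(T=t-1)   for t >= 2
S(t)   = S(t-1) - P(T=t),   with S(1) = 1 - P(T=1)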

%matplotlib inline
import numpy as np
import scipy.stats as stats
from scipy import optimize
import matplotlib.pyplot as plt
import pymc3 as pm
import theano.tensor as tt
example_data = np.array([1000, 869, 743, 653, 593, 551, 517, 491])
def n_lost(data):
    lost = [None]
    for i in range(1, len(data)):
        lost.append(data[i - 1] - data[i])
    return lost

example_data_n_lost = n_lost(example_data)
n = len(example_data)
data = np.asarray((example_data, example_data_n_lost))
with pm.Model() as model:
    alpha = pm.Uniform('alpha', 0.00001, 1000.0, testval=1)
    beta = pm.Uniform('beta', 0.00001, 1000.0, testval=1)

    def P_T_is_t(alpha=alpha, beta=beta, num_periods=n):
        p = np.array([None, alpha / (alpha + beta)])
        for t in range(2, num_periods):
            pt = ((beta + t - 2) / (alpha + beta + t - 1)) * p[t-1]
            p = np.append(p,pt)
        return p

    def survival_function(num_periods=n):
        p = P_T_is_t()
        s = np.array([None, 1 - p[1]])
        for t in range(2, num_periods):
            s = np.append(s, s[t-1] - p[t])
        return s

    def logp(value):

        active = value[0,:]
        lost = value[1,:]

        # Those who've churned along the way...
        p = P_T_is_t()
        died = np.multiply(np.log(p[1:]), lost[1:])

        # and those still active in last period
        sf = survival_function()
        still_active = np.log(sf[-1]) * active[-1]
        return  sum(died) + still_active


    retention = pm.DensityDist('retention', logp, observed=data)
    trace = pm.sample(10000, step=pm.Metropolis())
    burned_trace = trace[3000:]

burned_trace['alpha'].mean()
burned_trace['beta'].mean()

Your implementation doesn’t seem correct. For example, checking the log-likelihood, it does not match the one in the post:

retention.logp(model.test_point)
array(-16884.090589033225)

The problem is that you need to rewrite the numpy part in theano. Something like this should work (and samples with NUTS, which is :100:):

with pm.Model() as model:
    alpha = pm.Uniform('alpha', 0.00001, 1000.0, testval=1)
    beta = pm.Uniform('beta', 0.00001, 1000.0, testval=1)

    p = [0., alpha / (alpha + beta)]
    s = [0., 1 - p[1]]
    for t in range(2, n):
        pt = ((beta + t - 2) / (alpha + beta + t - 1)) * p[t-1]
        p.append(pt)
        s.append(s[t-1] - p[t])
    p = tt.stack(p)
    s = tt.stack(s)
    
    def logp(value):
        active = value[0,:]
        lost = value[1,:]

        # Those who've churned along the way...
        died = tt.mul(tt.log(p[1:]), lost[1:])

        # and those still active in last period
        still_active = tt.log(s[-1]) * active[-1]
        return  tt.sum(died) + still_active

    retention = pm.DensityDist('retention', logp, observed=data)
    trace = pm.sample(1000)
pm.traceplot(trace)
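
A quick sanity check, assuming the model context above is still in scope: evaluate the log-likelihood at the test point again (to compare with the value quoted above) and look at the posterior means of the two parameters:

retention.logp(model.test_point)             # log-likelihood at the test point
trace['alpha'].mean(), trace['beta'].mean()  # posterior means of the sBG parameters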

Thank you!
You’re the best!!! :smiley:


Hi @arielaguirre1420

did you manage to implement it for multiple cohorts?
If so, could you share the code?
Regards

Here is my code for multi-cohort sBG fitting, but I am running into an error from PyMC. I am new to Bayesian modeling and very new to Python. I am trying to implement the sBG model for multiple cohorts, but I get ValueError: setting an array element with a sequence. Can someone help?
@junpenglao @arielaguirre1420

import numpy as np
import scipy.stats as stats
from scipy import optimize
import matplotlib.pyplot as plt
import pymc3 as pm
import theano.tensor as tt
multi_cohort_data = np.array([[10000, 8000, 6480, 5307, 4391], 
                              [10000, 8000, 6480, 5307],
                              [10000, 8000, 6480], 
                              [10000, 8000]])

multi_cohort_data
n = len(multi_cohort_data) + 1  # = 5 periods for this data
with pm.Model() as model:
    alpha = pm.Uniform('alpha', 0.00001, 1000.0, testval=1)
    beta = pm.Uniform('beta', 0.00001, 1000.0, testval=1)

    p = [0., alpha / (alpha + beta)]
    s = [0., 1 - p[1]]
    for t in range(2, n):
        pt = ((beta + t - 2) / (alpha + beta + t - 1)) * p[t-1]
        p.append(pt)
        s.append(s[t-1] - p[t])
    p = tt.stack(p)
    s = tt.stack(s)
    
    def logp(value):
        df = value
        cohorts = len(df)
        total = 0

        # Those who've churned along the way...
        # and those still active in last period
        for i, cohort in enumerate(df):
            for j in range(len(cohort) - 1):
                total = total + (cohort[j] - cohort[j+1])*tt.log(p[j])
                
            total = total + (cohort[-1] * tt.log(s[cohorts - i -1]))
        return total 
        
        #another formulation
        #for i in range(len(df[0:]) -1 ):
        #    for j in range(i,len(df[i,:])-1):
        #        total = total + tt.sum((df[i,j]-df[i,j+1])*tt.log(p[j]))
        #    total = total + df[i,-1]*tt.log(s[cohorts-i-1])
        #return total

    retention = pm.DensityDist('retention', logp, observed=multi_cohort_data)
    trace = pm.sample(1000, step=pm.NUTS(), cores=1)
pm.traceplot(trace)


Did you try flattening multi_cohort_data into a 1-D array? Note that in the original model, the cohort data is generated outside of pm.Model (as it is static), so the data that goes into observed is actually a 2 x N array.
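
For example, just as a sketch of what flattening into 1-D means here (reusing the n_lost helper from the first post; the full index bookkeeping is worked out in the replies below):

flat_active = [x for cohort in multi_cohort_data for x in cohort]
flat_lost = [x for cohort in map(n_lost, multi_cohort_data) for x in cohort]
data = np.array((flat_active, flat_lost))  # 2 x N, with None marking each cohort's first period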

Hi, thanks for your remark. Following your suggestion I modified my code as below. Unfortunately, I got a result different from the authors’ (alpha = 3.8, beta = 15.19): I got 4.369 for alpha and 17.583 for beta.

# the sBG model for contractual setting (LTV)
import numpy as np
import scipy
import pymc3 as pm
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
import theano.tensor as tt

# here is an example of active customers per period


multi_cohort_data = [[10000, 8000, 6480, 5307, 4391],
                     [10000, 8000, 6480, 5307],
                     [10000, 8000, 6480],
                     [10000, 8000]]

# To calculate the likelihood, we'll need the number of customers lost at each period.
# we define a function to calculate it



def transform_array(data):
    # flatten the list of cohorts into a single 1-D list
    flat = []
    for cohort in data:
        flat = flat + cohort
    return flat


# customers lost per period for each cohort, flattened,
# with None marking the first period of each cohort
def n_lost_2(data, n):
    lost=[None]
    t = 0
    while n >= 2 :
        for i in range(t + 1, t + n):
            lost.append(data[i-1] - data[i])
        if n > 2:
            lost.append(None)
        t += n
        n -= 1
    return lost

num_periods = len(multi_cohort_data[0])  # the longest cohort defines the number of periods
res = transform_array(multi_cohort_data)
res_n_lost = n_lost_2(res, 5)

data = np.array((res, res_n_lost)) #2*N data with active customer and churned customer

    
with pm.Model() as model:
    alpha = pm.Uniform('alpha', 0.00001, 1000.0, testval=1)
    beta = pm.Uniform('beta', 0.00001, 1000.0, testval=1)

    p = [0., alpha / (alpha + beta)]
    s = [0., 1 - p[1]]
    for t in range(2, num_periods):
        pt = ((beta + t - 2) / (alpha + beta + t - 1)) * p[t-1]
        p.append(pt)
        s.append(s[t-1] - p[t])
    p = tt.stack(p)
    s = tt.stack(s)
    
    def logp(value):
        
        active = value[0,:]
        lost = value[1,:]
        still_alive = 0
        died = 0
        n = num_periods
        m = num_periods
        t = 0
        
        # Those who've churned along the way...      
        while n >= 2 :
            j=1
            for i in range(t + 1, t + n):
                died += lost[i] * tt.log(p[j])
                j += 1
            t += n
            n -= 1
        
        # and those still active in last period
        while m >=2:
            still_alive += active[m-1] * tt.log(s[m-1])
            m -= 1
        
        return still_alive + died

    retention = pm.DensityDist('retention', logp, observed=data)
    trace = pm.sample(10000, step=pm.HamiltonianMC(), cores=1)
pm.traceplot(trace)

I rewrote the log_prob computation a bit so it is more efficient and easier to scale up.
The trick is to flatten everything into 1-D so you don’t need a for loop in the log_prob computation, and to use indexing since the observed data is ragged.

import numpy as np
import pymc3 as pm
import theano
import theano.tensor as tt

# here is an example of active customers per period
multi_cohort_data = [[10000, 8000, 6480, 5307, 4391],
                     [10000, 8000, 6480, 5307],
                     [10000, 8000, 6480],
                     [10000, 8000]]

num_periods = max(list(map(len, multi_cohort_data)))

def n_lost(data):
    lost = [None]
    for i in range(1, len(data)):
        lost.append(data[i - 1] - data[i])
    return lost

multi_cohort_data_lost = list(map(n_lost, multi_cohort_data))
active_flatten = [i[-1] for i in multi_cohort_data]
active_index = [len(i)-2 for i in multi_cohort_data]
lost_flatten = [j for i in multi_cohort_data_lost for j in i[1:]]
lost_index = [j for i in multi_cohort_data_lost for j in np.arange(len(i[1:]))]

# 2*N data with active customer and churned customer
active_flatten, active_index, lost_flatten, lost_index
# ([4391, 5307, 6480, 8000],
#  [3, 2, 1, 0],
#  [2000, 1520, 1173, 916, 2000, 1520, 1173, 2000, 1520, 2000],
#  [0, 1, 2, 3, 0, 1, 2, 0, 1, 0])


with pm.Model() as model:
    alpha = pm.Uniform('alpha', 0.00001, 1000.0, testval=1)
    beta = pm.Uniform('beta', 0.00001, 1000.0, testval=1)
    p1 = alpha / (alpha + beta)
    s1 = 1. - p1

    def scan_body(t, pt, st):
        pt_1 = ((beta + t - 2) / (alpha + beta + t - 1)) * pt
        st_1 = st - pt_1
        return pt_1, st_1
    (pt, st), _ = theano.scan(
        scan_body,
        sequences=np.arange(2, num_periods),
        outputs_info=[p1, s1])
    pt = tt.concatenate([p1[..., None], pt])
    st = tt.concatenate([s1[..., None], st])
    
    # compute the logp
    # Those who've churned along the way...
    died = tt.log(pt[lost_index]) * lost_flatten

    # and those still active in last period
    still_active = tt.log(st[active_index]) * active_flatten
    retention = pm.Potential('retention', tt.sum(died) + tt.sum(still_active))
    
    trace = pm.sample(cores=4, return_inferencedata=True)
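
With return_inferencedata=True the trace is an ArviZ InferenceData object, so (assuming arviz is installed, which this option requires anyway) the posterior can be summarized with:

import arviz as az
az.summary(trace, var_names=['alpha', 'beta'])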

However, it also gives me estimates of alpha around 4.3 and beta around 17.5 - where does the other estimate come from?


Thanks for the corrections.
I got the values from the Excel sheet of the authors [who constructed the model] accompanying the article https://www.brucehardie.com/notes/017/

Not sure if you are still working on this, but FWIW, I had a similar issue when looking at the means of my sBG posteriors. The average parameters were larger than the values in the paper, but this was because the posteriors were skewed and there was a strong correlation between alpha and beta. I think if you run:

import seaborn
import matplotlib.pyplot as plt

plt.figure(figsize=(25, 20))
seaborn.jointplot(trace['alpha'], trace['beta'], kind='hex', color='green')
plt.plot(3.8, 15.19, 'ro')  # the values from the paper / Excel sheet
plt.xlabel(r'$\alpha$')
plt.ylabel(r'$\beta$');

you would find that the dot aligns well with the bulk of the posterior. Hope this is useful.
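
If you want a point estimate that is directly comparable to the MLE values in the Excel sheet, one option (just a sketch, assuming the sBG model context from one of the snippets above is named model; since the priors are flat Uniforms the posterior mode should sit close to the MLE, though find_MAP optimizes on the transformed parameters so it can differ slightly):

with model:
    map_estimate = pm.find_MAP()  # posterior mode, back-transformed to the original scale
map_estimate['alpha'], map_estimate['beta']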