Hierarchical Logistic Regression Use Case

DataEgg · February 24, 2021, 6:11pm

Hi everyone. I’m fairly new to PyMC and to Bayesian modeling more generally, and I’ve been struggling to fix the problems I keep running into with my model (of which there appear to be many).
I am trying to build a logistic regression model for the following use case:
I want to model the probability that a given customer is going to churn (1 if churn, 0 otherwise) from my company. However, the design of my model is a little quirky. I want the churn probability for a given customer to be informed by, and updated after, each event I observe for that customer (for now, an event is simply a product purchase, but it could later include returns, complaints, etc.). As an example, I may believe that Customer A has a p = 0.5 probability of churning. I then observe Customer A place a really good order with the company and update my belief about Customer A’s probability of churning accordingly (i.e. it would now be lower). Essentially, I am trying to model the impact of a given event within the customer it belongs to (Event is level 1, Customer is level 2). To do this, I want my model to learn what events look like for churned customers (churn = 1) and what they look like for non-churned customers (churn = 0).
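
To make the "update after each event" idea concrete, here is a minimal sketch of Bayes' rule in odds form. It is not the hierarchical model itself; the function name `update_churn_probability` and the likelihood ratios are made up for illustration, and in practice the ratio P(event | churn) / P(event | no churn) would have to come from a fitted model.

```python
def update_churn_probability(prior_p: float, likelihood_ratio: float) -> float:
    """Update a churn probability after one event using Bayes' rule in odds form.

    likelihood_ratio is P(event | churn) / P(event | no churn); the values
    used below are hypothetical, not estimated from any data.
    """
    if not 0.0 < prior_p < 1.0:
        raise ValueError("prior_p must be strictly between 0 and 1")
    if likelihood_ratio <= 0.0:
        raise ValueError("likelihood_ratio must be positive")
    prior_odds = prior_p / (1.0 - prior_p)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


# Customer A starts at p = 0.5. A "really good order" is assumed to be half
# as likely for churners as for non-churners (ratio 0.5); a bad event is
# assumed to be twice as likely for churners (ratio 2.0).
p = 0.5
history = [p]
for ratio in (0.5, 0.5, 2.0):
    p = update_churn_probability(p, ratio)
    history.append(p)

print(history)
```

Ratios below 1 pull the churn probability down and ratios above 1 push it up, which is the behavior described for Customer A.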

My data (which unfortunately I cannot share here) is set up as follows: each row is an observed event for a given customer. My target variable is a binary “churn” label which is 1 if that event corresponds to a customer who eventually churns and 0 if it belongs to a customer who does not.
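
Since the real data cannot be shared, here is a small sketch of the layout with made-up values: one entry per event, a customer label on each event, and a churn label that is repeated for every event of the same customer. The names `customer`, `churn_by_customer`, and `customer_idx` are hypothetical.

```python
import numpy as np

# Toy event table (made-up values): one entry per observed event.
customer = np.array(["A", "A", "B", "C", "C", "C"])

# Customer-level outcome: 1 if the customer eventually churns, 0 otherwise.
churn_by_customer = {"A": 1, "B": 0, "C": 0}

# Integer index of the customer each event belongs to (level 2 of the hierarchy).
customer_names, customer_idx = np.unique(customer, return_inverse=True)

# Event-level target: the customer's label copied onto each of its events.
churn = np.array([churn_by_customer[c] for c in customer])

# Number of events per customer.
events_per_customer = np.bincount(customer_idx)

print(customer_idx)         # which customer each row belongs to
print(churn)                # label repeated within each customer
print(events_per_customer)  # rows contributed by each customer
```

One thing this layout makes visible is that the label is constant within a customer, so six rows here carry only three distinct customer outcomes.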

This design makes sense in my head, but I’ve been struggling to write it down in code. I’ve approached this as a logistic regression problem using a Bernoulli distribution for the likelihood. I repurposed some code I found here to do that, although I’m not certain that it’s the appropriate approach.

import arviz as az
import numpy as np  # needed for np.asarray(design_matrix) further down
import pymc3 as pm
import theano
from statsmodels.formula.api import glm as glm_sm
import statsmodels.api as sm
import matplotlib.pyplot as plt
from patsy import dmatrix
import theano.tensor as tt
import patsy

with pm.Model() as Logistic_Model:

        data =  event_df 
        outcome = "Churn"  # DV column name
        formula = 'Profit + Days_Late'  # Patsy-style formula 
        submodel = 'Customer_Name'  # submodel index column
        
        design_matrix = dmatrix(formula, data)
        submodel_names = data[submodel].unique() #Get unique Customer Name
        sub_ix = data[submodel].replace(
            {r: i for i, r in enumerate(submodel_names)}).values #Change labels from customer name to numeric ID
        betas = []

        for n in design_matrix.design_info.column_names: 

            n = n.replace('"', '').replace("'", '').replace(' ', '').replace(',','') 
            μ = pm.StudentT(name='μ_' + n, nu=1,mu = 0, sd=1) 
            σ = pm.HalfStudentT(name='σ_' + n, sd=1) 
            δ = [pm.Normal(
                name='δ_(%s=%s)_(condition=%s)' % (submodel, r, n), mu=0., sd=10.
            ) for r in submodel_names]
            β = [pm.Deterministic(
                'β_(%s=%s)_(condition=%s)' % (submodel, r, n), μ + d * σ
            ) for d, r in zip(δ, submodel_names)]
            betas.append(β)

            
            
        B = tt.stack(betas, axis=0).T
        p = pm.invlogit(tt.sum(np.asarray(design_matrix) * B[sub_ix], axis=1)) 

        pm.Bernoulli(
            name=outcome,
            p=p,
            observed=data[outcome].values
        )
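
For reference, this is the model the code above appears to define, written out as equations. The notation is my own: i indexes events, j[i] is the customer that event i belongs to, k indexes the columns of the design matrix (intercept, Profit, Days_Late), and the ν = 1 on the half-Student-t assumes the PyMC3 default, since the code does not set it.

```latex
\begin{align}
\mu_k &\sim \text{StudentT}(\nu = 1,\ 0,\ 1) \\
\sigma_k &\sim \text{HalfStudentT}(\nu = 1,\ 1) \\
\delta_{jk} &\sim \text{Normal}(0,\ 10) \\
\beta_{jk} &= \mu_k + \delta_{jk}\,\sigma_k \\
p_i &= \operatorname{logit}^{-1}\Big(\sum_k x_{ik}\,\beta_{j[i],k}\Big) \\
y_i &\sim \text{Bernoulli}(p_i)
\end{align}
```

Written this way, each customer-level coefficient has an implied prior of Normal(μ_k, 10 σ_k), because the offsets δ have standard deviation 10 rather than 1.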

I have the following problems when trying to use this model:

1. I cannot include new predictors without breaking the model. I am currently using only two predictors (the profit of the order and the number of days the shipped order was late). If I include any more, I get the infamous ValueError when trying to sample:

ValueError: Mass matrix contains zeros on the diagonal. The derivative of RV [Whatever].ravel()[0] is zero.

2. With two predictors the model runs, but it’s a piece of poop. Divergences up the wazoo, acceptance probability not meeting the target, rhat statistics larger than 1.4; you name the warning, I’ve got it. I’ve been making an honest effort to solve these issues, but to no avail.
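
One possible contributor to the zero-derivative error is a predictor left on its raw scale; this is an assumption, since nothing above confirms how Profit is scaled. The NumPy sketch below uses made-up profit values and a hypothetical coefficient to show how the logistic function saturates to exactly 1.0 in floating point, so the gradient with respect to the coefficient becomes exactly zero, while standardized values keep it non-zero.

```python
import numpy as np


def invlogit(x):
    # Numerically stable logistic function written via tanh.
    return 0.5 * (1.0 + np.tanh(0.5 * x))


# Made-up raw-scale predictor (e.g. profit in currency units).
profit = np.linspace(500.0, 10_000.0, 20)
beta = 1.0  # hypothetical coefficient value visited by the sampler

# Raw scale: the logits are in the hundreds or thousands, so p rounds to 1.0.
p_raw = invlogit(beta * profit)
grad_raw = p_raw * (1.0 - p_raw) * profit  # d p / d beta

# Standardized scale: the logits stay within a few units of zero.
profit_z = (profit - profit.mean()) / profit.std()
p_z = invlogit(beta * profit_z)
grad_z = p_z * (1.0 - p_z) * profit_z

print("raw-scale gradients all zero:", bool(np.all(grad_raw == 0.0)))
print("standardized gradients all non-zero:", bool(np.all(grad_z != 0.0)))
```

If this is what is happening, standardizing each column before building the design matrix would be a cheap thing to try, though it may not be the only cause.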

I hope I did an adequate job of describing my use case and issues. I’m a newbie to PyMC, so I’d appreciate any tips, advice, or help anyone has to offer. Thanks in advance for your time!

Versions: (I can’t update my Python version hence the older package versions)
PyMC3 v3.10.0
ArviZ v0.11.0
Theano v1.0.11
Python v3.6.10

DataEgg · February 26, 2021, 2:15pm

Maybe an easier question to tackle: how would I run prior predictive checks for this model?
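
Inside the model context, PyMC3 provides `pm.sample_prior_predictive()` for this. As a rough illustration of what such a check looks at, here is a NumPy-only sketch that draws from the same priors as the model above and pushes them through the inverse logit. The design matrix, customer assignment, and sizes are made up, and the predictors are assumed to be standardized.

```python
import numpy as np

rng = np.random.default_rng(42)
n_draws, n_customers, n_events = 2000, 5, 50

# Made-up design matrix: intercept plus two standardized predictors.
X = np.column_stack([np.ones(n_events), rng.normal(size=(n_events, 2))])
customer_idx = rng.integers(0, n_customers, size=n_events)
n_coefs = X.shape[1]

# Priors as in the model: StudentT(nu=1) for mu, HalfStudentT(nu=1) for sigma
# (nu=1 assumed for sigma), Normal(0, 10) for delta.
mu = rng.standard_t(df=1, size=(n_draws, 1, n_coefs))
sigma = np.abs(rng.standard_t(df=1, size=(n_draws, 1, n_coefs)))
delta = rng.normal(0.0, 10.0, size=(n_draws, n_customers, n_coefs))
beta = mu + delta * sigma  # shape (n_draws, n_customers, n_coefs)

# Linear predictor per draw and event, then a stable inverse logit.
logit = np.einsum("ek,dek->de", X, beta[:, customer_idx, :])
p = 0.5 * (1.0 + np.tanh(0.5 * logit))

# Simulated churn labels implied by the priors alone.
y_sim = rng.binomial(1, p)

# Share of prior-implied probabilities that are essentially 0 or 1.
extreme = np.mean((p < 0.01) | (p > 0.99))
print(f"fraction of prior p below 0.01 or above 0.99: {extreme:.2f}")
```

A large fraction here would mean the priors already claim near-certainty about churn before seeing any data, which is the kind of thing a prior predictive check is meant to surface.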

Dirk_Nachbar · March 2, 2021, 4:23pm

I might have a deeper look later, but a good start might be Fader
