Estimating Population statistics from individual predictions

nicholas-miles · October 22, 2020, 9:12pm

Hi, more of a general question about how to approach a problem than a specific question about coding here!

Given a cohort of users, I’m trying to predict how many will eventually convert through a process that takes anywhere from 1 to 40 days per user. My previous method was to use scikit-learn to build a classifier that predicts the binary conversion outcome. For all users still going through the process, I use the predicted conversion probability, a value in (0, 1). For all users that have completed the process, I use their actual outcome (0 or 1). Finally, I draw a few thousand Bernoulli RVs for each user based on their conversion probability and build a credible interval from the per-trial sums.
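
For concreteness, the plug-in simulation described above might look roughly like the sketch below. The arrays `completed_outcomes` and `pending_probs` are made-up placeholders for the real cohort, and NumPy stands in for whatever was actually used to draw the samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical inputs: actual outcomes for users who finished the process,
# and classifier probabilities for users still going through it.
completed_outcomes = np.array([1, 0, 0, 1, 1, 0, 0, 0])
pending_probs = np.array([0.10, 0.35, 0.80, 0.55, 0.05, 0.60])

n_trials = 5000
# One Bernoulli draw per pending user per trial: shape (n_trials, n_pending).
draws = rng.binomial(1, pending_probs, size=(n_trials, pending_probs.size))
# Total conversions per trial = known conversions + simulated ones.
totals = completed_outcomes.sum() + draws.sum(axis=1)

lower, upper = np.percentile(totals, [2.5, 97.5])
print(f"mean={totals.mean():.2f}, 95% interval=({lower:.0f}, {upper:.0f})")
```

Each probability in `pending_probs` is held fixed across all trials here, so the interval only reflects Bernoulli noise.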

The problem with that approach is that I am treating all individual conversion probabilities as known, when in fact those estimates carry uncertainty of their own that I am essentially discarding. My hope is that I can use PyMC3 to build a hierarchical model to predict the mean and credible interval for the number of conversions in the cohort from the individual conversion probabilities. Is that doable?

So far I’ve built a logistic regression as per the below. Any idea on how to turn this into a population prediction?

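import pymc3 as pm
import theano

# Shared container so the design matrix can be swapped out after sampling
X_shared = theano.shared(X_train.values)
simple_model = f"{target} ~ X1 + X2 + X3"
with pm.Model() as logistic_model:
    pm.glm.GLM.from_formula(simple_model,
                            X_shared,
                            family=pm.glm.families.Binomial())
    trace = pm.sample(tune=1000, draws=1000, chains=4, init='adapt_diag', cores=4)
# Point the model at the users still in the process
X_shared.set_value(incomplete_data)
# sample_ppc takes the trace as its first argument
ppc = pm.sample_ppc(trace,
                    model=logistic_model,
                    samples=100)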

In other words, I managed to follow a tutorial. Now I’m lost! How do I turn these predictions on an individual level into population level inference?

AlexAndorra · October 26, 2020, 9:14am

Hi Nicholas,
Thanks for the question. Not sure what you’re calling a “population prediction” here – I understood it as predictions from the model on future data, in which case sample_posterior_predictive is indeed what you’re looking for.
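
If "population prediction" means the total number of conversions in the cohort, one way to get there is to sum the posterior predictive draws across users, one total per draw. The sketch below only illustrates that idea with NumPy: the coefficient draws are simulated stand-ins for a real trace, and the feature `x`, the coefficient values and the cohort size are all made up.

```python
import numpy as np

rng = np.random.default_rng(1)

n_users, n_draws = 200, 4000
x = rng.normal(size=n_users)  # hypothetical single feature for pending users

# Stand-in for posterior draws of the regression coefficients; in practice
# these would come from the trace (the values here are made up).
intercept = rng.normal(-0.5, 0.3, size=n_draws)
slope = rng.normal(1.0, 0.2, size=n_draws)

# Conversion probability for every (posterior draw, user) pair.
logits = intercept[:, None] + slope[:, None] * x[None, :]
p = 1.0 / (1.0 + np.exp(-logits))

# Posterior predictive: one Bernoulli outcome per user per draw, summed per draw.
totals = rng.binomial(1, p).sum(axis=1)

# Plug-in comparison: fix each user's probability at its posterior mean.
plugin_totals = rng.binomial(1, p.mean(axis=0), size=(n_draws, n_users)).sum(axis=1)

for name, t in [("posterior predictive", totals), ("plug-in", plugin_totals)]:
    lo, hi = np.percentile(t, [3, 97])
    print(f"{name}: mean={t.mean():.1f}, 94% interval=({lo:.0f}, {hi:.0f})")
```

Because every user shares the same coefficient draws, their probabilities move together from draw to draw, and that is what widens the interval relative to the plug-in version. With a real model, the array that sample_posterior_predictive returns for the observed variable would play the role of the Bernoulli draws, and the known outcomes of users who already finished can be added to each total as a constant.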

In any case, hierarchical models can definitely be implemented quite naturally in the Bayesian framework (and in PyMC3 in particular). I think you’ll find the PyMC3 port of Richard McElreath’s Statistical Rethinking very interesting in that regard (chapters 13 and 14 deal with hierarchical models). You can also read Osvaldo Martin’s Bayesian Analysis with Python, which is very good.
Hope this helps!
