Estimating population statistics from individual predictions

Hi, more of a general question about how to approach a problem than a specific question about coding here!

Given a cohort of users, I’m trying to predict how many will eventually convert through a process that takes anywhere from 1 to 40 days per user. My previous method was to build a scikit-learn classifier that predicts the binary conversion outcome. For users still going through the process, I used the predicted conversion probability (a value between 0 and 1). For users who had completed the process, I used their actual outcome (0 or 1). Finally, I drew a few thousand Bernoulli random variables for each user from their conversion probability and built a credible interval from the per-trial sums.
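A minimal sketch of that plug-in simulation, using only NumPy. The arrays `p_pending` and `y_done` are hypothetical stand-ins for the classifier probabilities and the known outcomes, and the trial count is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical inputs: predicted conversion probabilities for users still
# in progress, and actual 0/1 outcomes for users who have finished.
p_pending = np.array([0.10, 0.35, 0.60, 0.80, 0.25])
y_done = np.array([1, 0, 1, 1])

n_trials = 5000

# One row per trial, one column per pending user: a Bernoulli draw per cell.
draws = rng.random((n_trials, p_pending.size)) < p_pending

# Cohort total per trial = simulated conversions + conversions already observed.
totals = draws.sum(axis=1) + y_done.sum()

mean_total = totals.mean()
lower, upper = np.percentile(totals, [2.5, 97.5])
print(f"expected conversions: {mean_total:.2f}, 95% interval: [{lower:.0f}, {upper:.0f}]")
```

Note that every trial reuses the same `p_pending`, so the interval reflects only the randomness of the outcomes, not any uncertainty in the probabilities themselves.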

The problem with that approach is that I am treating every individual conversion probability as known, when in fact each estimate carries its own uncertainty, which I am essentially discarding. My hope is that I can use PyMC3 to build a hierarchical model that predicts the mean and credible interval for the number of conversions in the cohort from the individual conversion probabilities. Is that doable?
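One way to see what gets discarded is the law of total variance. As a sketch, assume the cohort total is $S = \sum_i Y_i$ with $Y_i \mid p_i \sim \mathrm{Bernoulli}(p_i)$, independent given the probabilities $p = (p_1, \dots, p_n)$ (the symbols $S$, $Y_i$, $p_i$ are introduced here for illustration):

```latex
\begin{aligned}
\operatorname{Var}(S)
  &= \mathbb{E}\big[\operatorname{Var}(S \mid p)\big]
   + \operatorname{Var}\big(\mathbb{E}[S \mid p]\big) \\
  &= \mathbb{E}\Big[\sum_i p_i (1 - p_i)\Big]
   + \operatorname{Var}\Big(\sum_i p_i\Big)
\end{aligned}
```

Plugging in fixed point estimates $\hat{p}_i$ keeps only a version of the first term, $\sum_i \hat{p}_i (1 - \hat{p}_i)$, and drops the second term entirely, so the resulting interval tends to be too narrow.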

So far I’ve built a logistic regression, shown below. Any idea how to turn this into a population prediction?

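import pymc3 as pm
import theano

# Shared variable so the predictors can be swapped out after fitting
X_shared = theano.shared(X_train.values)
simple_model = f"{target} ~ X1 + X2 + X3"
with pm.Model() as logistic_model:
    # Logistic regression: Binomial family with its default logit link
    pm.glm.GLM.from_formula(simple_model,
                            X_shared,
                            family=pm.glm.families.Binomial())
    trace = pm.sample(tune=1000, draws=1000, chains=4, init='adapt_diag', cores=4)
# Swap in the users who are still going through the process
X_shared.set_value(incomplete_data)
# Posterior predictive draws: the first argument is the trace, not the data
ppc = pm.sample_ppc(trace,
                    model=logistic_model,
                    samples=100)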

In other words, I managed to follow a tutorial :slight_smile:. Now I’m lost! How do I turn these individual-level predictions into population-level inference?

Hi Nicholas,
Thanks for the question. I’m not sure what you mean by a “population prediction” here – I understood it as predictions from the model on future data, in which case sample_posterior_predictive (the newer name for sample_ppc) is indeed what you’re looking for.
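If the goal is a cohort-level count, one option is to sum the posterior predictive draws across users. The sketch below uses NumPy only and simulates a stand-in for the sampler output: it assumes the draws for the pending users form a 0/1 array of shape (draws, users), and `ppc_y`, `y_done`, and the sizes are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

n_draws, n_pending = 2000, 50

# Stand-in for posterior predictive output (assumed shape: draws x users).
# Each draw gets its own intercept shift to mimic parameter uncertainty.
base_logit = rng.normal(-0.5, 1.0, size=n_pending)
intercept_shift = rng.normal(0.0, 0.5, size=(n_draws, 1))
p_draws = 1.0 / (1.0 + np.exp(-(base_logit[None, :] + intercept_shift)))
ppc_y = (rng.random((n_draws, n_pending)) < p_draws).astype(int)

# Hypothetical known outcomes for users who already finished the process.
y_done = rng.integers(0, 2, size=30)

# Sum over users within each draw, then add the conversions already observed.
cohort_totals = ppc_y.sum(axis=1) + y_done.sum()

mean_total = cohort_totals.mean()
lower, upper = np.percentile(cohort_totals, [2.5, 97.5])
print(f"expected conversions: {mean_total:.1f}, 95% interval: [{lower:.0f}, {upper:.0f}]")
```

Because each row comes from a different parameter draw, the spread of `cohort_totals` carries both the outcome randomness and the uncertainty in the conversion probabilities.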

In any case, hierarchical models can definitely be implemented quite naturally in the Bayesian framework (and in PyMC3 in particular). I think you’ll find the PyMC3 port of Richard McElreath’s Statistical Rethinking very interesting in that regard (chapters 13 and 14 deal with hierarchical models). You can also read Osvaldo Martin’s Bayesian Analysis with Python, which is very good :ok_hand:
Hope this helps :vulcan_salute: