Hi, more of a general question about how to approach a problem than a specific question about coding here!
Given a cohort of users, I'm trying to predict how many will eventually convert through a process that takes anywhere from 1 to 40 days per user. My previous method was to use scikit-learn to build a classifier that predicts the binary conversion outcome. For all users still going through the process, I use the predicted conversion probability, a value in (0, 1). For all users that have completed the process, I use their actual outcome (0 or 1). Finally, I draw a few thousand Bernoulli RVs for each user based on their conversion probability, sum the draws across users within each trial, and build a credible interval from the distribution of those sums.
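For concreteness, the simulation step of that previous approach looks roughly like the sketch below. The array names (`p_in_progress`, `y_completed`) and all of the numbers are made up for illustration; in practice the probabilities would come from the classifier's `predict_proba` output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical inputs (made-up values, not real data)
p_in_progress = np.array([0.10, 0.35, 0.80, 0.55])  # predicted probabilities, users still in the process
y_completed = np.array([1, 0, 0, 1, 1])             # actual outcomes, users who finished

n_trials = 5000
# One Bernoulli draw per in-progress user per trial: shape (n_trials, n_users)
draws = rng.binomial(1, p_in_progress, size=(n_trials, p_in_progress.size))
# Cohort total per trial = known conversions + simulated conversions
totals = y_completed.sum() + draws.sum(axis=1)

mean_total = totals.mean()
lower, upper = np.percentile(totals, [2.5, 97.5])
print(f"expected conversions: {mean_total:.2f}, 95% interval: [{lower:.0f}, {upper:.0f}]")
```

Note that every trial reuses the same fixed `p_in_progress`, so the interval only reflects Bernoulli noise and not any uncertainty in the probabilities themselves.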
The problem with that approach is that I am treating all individual conversion probabilities as known, when in fact each estimate carries uncertainty that I am essentially discarding. My hope is that I can use PyMC3 to build a hierarchical model that predicts the mean and credible interval for the number of conversions in the cohort from the individual conversion probabilities. Is that doable?
So far I've built the logistic regression shown below. Any idea how to turn this into a population-level prediction?
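import pymc3 as pm
import theano

features = ["X1", "X2", "X3"]
# Shared variable so the design matrix can be swapped out after sampling
X_shared = theano.shared(X_train[features].values)

with pm.Model() as logistic_model:
    # from_formula expects a DataFrame, so the shared array goes to GLM directly
    pm.glm.GLM(X_shared, X_train[target].values, labels=features,
               family=pm.glm.families.Binomial())
    trace = pm.sample(tune=1000, draws=1000, chains=4, init='adapt_diag', cores=4)

# incomplete_data: NumPy array with the same feature columns, users still in the process
X_shared.set_value(incomplete_data)
# sample_ppc takes the trace as its first argument, not the new data
ppc = pm.sample_ppc(trace,
                    model=logistic_model,
                    samples=100)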
In other words, I managed to follow a tutorial. Now I'm lost! How do I turn these individual-level predictions into population-level inference?