Hi there,
I have a half-theoretical, half-PyMC3 question.
I am trying to predict the conversion rates of certain online ads. Each advertisement gets only a few clicks, so the conversion rate I observe is quite degenerate, with a lot of zeros. This does not mean the true conversion rate is zero; we might just not have sampled enough. I thought of building a probabilistic model like the following.
import numpy as np
import pandas as pd
import pymc3 as pm

n = 10000

# Simulated features: one categorical (10 levels) and one numeric in [0, 1).
cate1 = np.random.randint(0, 10, size=n)
num1 = np.random.rand(n)
X = pd.DataFrame({'cate1': cate1, 'num1': num1})
X = pd.get_dummies(X, columns=['cate1'], sparse=False)  # TODO change me to sparse

# Simulated outcomes: at least one click per ad, 2% true conversion rate.
clicks = 1 + np.random.poisson(2, size=n)
conversions = np.random.binomial(clicks, .02)

m = X.shape[1]
X_values = X.values.astype('float64')  # plain float array for the dot products

with pm.Model() as model:
    # Log-linear models for the two Beta shape parameters.
    mu1 = pm.Normal('mu1', 0, 10, shape=m)
    mu0 = pm.Normal('mu0', 0, 1)
    gamma1 = pm.Normal('gamma1', 0, 10, shape=m)
    gamma0 = pm.Normal('gamma0', 0, 1)
    # pm.math.exp keeps the expression symbolic (np.exp is meant for NumPy arrays).
    alpha = pm.Deterministic('alpha', pm.math.exp(mu0 - pm.math.dot(X_values, mu1)))
    beta = pm.Deterministic('beta', pm.math.exp(gamma0 - pm.math.dot(X_values, gamma1)))
    y = pm.BetaBinomial('obs', n=clicks, alpha=alpha, beta=beta, observed=conversions)
    trace = pm.sample(1000)

with model:
    # Per-ad conversion rate, added after sampling and drawn from the fitted alpha and beta.
    p = pm.Beta('p', alpha, beta, shape=n)
    ppc = pm.sample_ppc(trace, 1000, vars=[p])
Note that although the output of my model is the number of successes, what I am really interested in here is the p parameter.
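To make the goal concrete, here is a minimal sketch of how p could be obtained for a single ad, assuming the standard Beta-Binomial conjugacy: if p ~ Beta(alpha, beta) and conversions ~ Binomial(clicks, p), then p given the data is Beta(alpha + conversions, beta + clicks - conversions). The helper name `posterior_p_draws` and the values alpha=0.5 and beta=20 are made up for illustration; in the model above they would come from one posterior draw of alpha and beta for that ad. It uses only the standard library.

```python
import random

def posterior_p_draws(alpha, beta, clicks, conversions, n_draws=1000, seed=0):
    """Draw p for one ad from Beta(alpha + conversions, beta + clicks - conversions)."""
    rng = random.Random(seed)
    post_a = alpha + conversions           # prior pseudo-successes plus observed successes
    post_b = beta + clicks - conversions   # prior pseudo-failures plus observed failures
    return [rng.betavariate(post_a, post_b) for _ in range(n_draws)]

# Hypothetical values: one ad with 3 clicks and 0 conversions.
draws = posterior_p_draws(alpha=0.5, beta=20.0, clicks=3, conversions=0)
posterior_mean = sum(draws) / len(draws)
exact_mean = 0.5 / (0.5 + 20.0 + 3)  # closed-form mean of the Beta posterior
print(posterior_mean, exact_mean)
```

In this sketch the estimate for an ad with zero conversions stays above zero and is pulled toward the rate implied by alpha and beta. By contrast, drawing p from Beta(alpha, beta) alone reflects only the features, not that ad's own clicks and conversions.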
The first obvious theoretical question is whether this model makes sense.
The second one is more PyMC3-related: the estimation does not progress and I get stuck in the first sampling loop:
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [gamma0, gamma1, mu0, mu1]
What could be causing the slowdown? In my use case both n and m are big (millions), so I am not sure this is a good start.
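One thing that may be worth checking, as a guess rather than a confirmed diagnosis, is how wide the implied prior on alpha (and likewise beta) is. The sketch below uses only the standard library to draw from the priors stated in the model, Normal(0, 1) for the intercept and Normal(0, 10) for the coefficients, for a single hypothetical row with num1 = 0.5 and one active category dummy. The function name `draw_alpha` is made up for illustration.

```python
import math
import random

rng = random.Random(0)

def draw_alpha(num1):
    """One prior draw of alpha = exp(mu0 - x . mu1) for a row with one active dummy."""
    mu0 = rng.gauss(0, 1)        # mu0 ~ Normal(0, 1)
    mu1_num = rng.gauss(0, 10)   # coefficient on num1 ~ Normal(0, 10)
    mu1_cat = rng.gauss(0, 10)   # coefficient on the single active category dummy
    return math.exp(mu0 - (num1 * mu1_num + mu1_cat))

alphas = [draw_alpha(num1=0.5) for _ in range(1000)]
orders_of_magnitude = math.log10(max(alphas) / min(alphas))
print(f"alpha spans about {orders_of_magnitude:.0f} orders of magnitude under the prior")
```

If the spread turns out to be very large, tighter priors on the coefficients might be a cheap experiment before scaling n and m up.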
For each observation I observe:
- the number of successes (conversions)
- the number of trials (clicks)
- a bunch of features
In the past I simply trained a logistic regression with conversions / clicks as my dependent variable, but this neglects the fact that too few samples might have been drawn to reveal the true conversion rate.
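To illustrate why the raw ratio is unreliable at low click counts, here is a small sketch that computes the probability of seeing zero conversions under a plain Binomial model. The function name `prob_zero_conversions` is made up for illustration, and the 2% rate is the one used in the simulated data above.

```python
def prob_zero_conversions(clicks, rate):
    """P(0 conversions | clicks, rate) under a Binomial model."""
    return (1.0 - rate) ** clicks

true_rate = 0.02  # same rate as in the simulated data
for clicks in (1, 3, 10, 100):
    print(clicks, round(prob_zero_conversions(clicks, true_rate), 3))
```

With 3 clicks and a 2% true rate, an observed rate of exactly zero is the most likely outcome by far (about 94% of the time), so conversions / clicks says very little about such an ad.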