I have a half theoretical / half pymc3 question.
I am trying to predict the conversion rates of certain online ads. Each ad gets only a few clicks, so the conversion rates I observe are degenerate, with a lot of zeros. This does not mean the true conversion rate is zero; we may simply not have sampled enough. I thought of building a probabilistic model like the following.
```python
import pymc3 as pm
import numpy as np
import theano.tensor as tt
import pandas as pd

n = 10000
# Synthetic features: one categorical (10 levels) and one numeric
cate1 = [np.random.randint(0, 10) for _ in range(n)]
num1 = [np.random.rand() for _ in range(n)]
X = pd.DataFrame(zip(cate1, num1), columns=['cate1', 'num1'])
X = pd.get_dummies(X, columns=['cate1'], sparse=False)  # TODO change me to sparse

# Synthetic outcomes: at least one click per ad, true conversion rate 0.02
clicks = [1 + np.random.poisson(2) for _ in range(n)]
conversions = [np.random.binomial(c, .02) for c in clicks]

m = X.shape[1]  # number of feature columns

with pm.Model() as model:
    mu1 = pm.Normal('mu1', 0, 10, shape=m)
    mu0 = pm.Normal('mu0', 0, 1)
    gamma1 = pm.Normal('gamma1', 0, 10, shape=m)
    gamma0 = pm.Normal('gamma0', 0, 1)
    alpha = pm.Deterministic('alpha', pm.math.exp(mu0 - pm.math.dot(X, mu1)))
    beta = pm.Deterministic('beta', pm.math.exp(gamma0 - pm.math.dot(X, gamma1)))
    y = pm.BetaBinomial('obs', n=clicks, alpha=alpha, beta=beta, observed=conversions)
    trace = pm.sample(1000)

with model:
    p = pm.Beta('p', alpha, beta, shape=n)
    ppc = pm.sample_ppc(trace, 1000, vars=[p])
```
Note that although the output of my model is the number of successes, what I am really interested in here is the p parameter.
The first, obvious theoretical question is whether this model makes sense.
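As a point of reference for the p parameter: conditional on alpha and beta, the Beta distribution is conjugate to the binomial likelihood, so each ad's rate has the closed-form posterior Beta(alpha + conversions, beta + clicks - conversions). The sketch below only illustrates that formula with NumPy. The helper name `posterior_rate` and the values of `alpha` and `beta` are made-up placeholders, not fitted values from the model.

```python
import numpy as np

def posterior_rate(alpha, beta, clicks, conversions):
    """Return the posterior Beta parameters and posterior mean of p,
    treating alpha and beta as known."""
    clicks = np.asarray(clicks, dtype=float)
    conversions = np.asarray(conversions, dtype=float)
    post_a = alpha + conversions            # prior successes + observed successes
    post_b = beta + clicks - conversions    # prior failures + observed failures
    return post_a, post_b, post_a / (post_a + post_b)

# Hypothetical prior values, chosen so that the prior mean is 0.02.
alpha, beta = 2.0, 98.0
clicks = np.array([3, 3, 200])
conversions = np.array([0, 1, 4])

post_a, post_b, post_mean = posterior_rate(alpha, beta, clicks, conversions)
print(post_mean)
```

With few clicks the posterior mean stays close to the prior mean alpha / (alpha + beta), and it moves toward conversions / clicks as the number of clicks grows.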
The second one is more pymc3 related: the estimation does not progress and I get stuck in the first sampling loop:
```
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [gamma0, gamma1, mu0, mu1]
```
What could be causing the slowdown? In my use case both n and m are big (millions), so I am not sure this is a good start.
For each observation I observe:
- the number of successes (conversions)
- the number of trials (clicks)
- a bunch of features
In the past I simply trained a logistic regression with conversions / clicks as my dependent variable, but this neglects the fact that too few samples might have been drawn to reveal the true conversion rate.
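The problem with the raw ratio can be seen in a small simulation that uses the same synthetic setup as the code above (a true rate of 0.02 and 1 + Poisson(2) clicks per ad). This is only an illustrative sketch; the seed and the variable names `raw_rate`, `frac_zero` and `pooled_rate` are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
n = 10000
true_rate = 0.02

# Same data-generating process as the synthetic example
clicks = 1 + rng.poisson(2, size=n)
conversions = rng.binomial(clicks, true_rate)

raw_rate = conversions / clicks                  # per-ad observed conversion rate
frac_zero = np.mean(raw_rate == 0)               # share of ads whose observed rate is 0
pooled_rate = conversions.sum() / clicks.sum()   # single estimate pooling all ads

print(f"ads with observed rate 0: {frac_zero:.1%}")
print(f"pooled rate: {pooled_rate:.4f}")
```

Most ads show an observed rate of exactly zero even though every ad shares the same nonzero true rate, while the pooled estimate lands near it. Using conversions / clicks as the target throws away the information carried by the number of clicks.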