Hierarchical Beta-Binomial for conversion rate prediction

Hi there,

I have a half theoretical / half pymc3 question.

I am trying to predict the conversion rates of certain online ads. Each advertisement gets only a few clicks, so the conversion rate I observe is quite degenerate, with a lot of zeros. This does not mean the true conversion rate is zero; we may simply not have sampled enough. I thought of building a probabilistic model like the following.

import numpy as np
import pandas as pd
import pymc3 as pm
import theano.tensor as tt

n = 10000
cate1 = [np.random.randint(0, 10) for _ in range(n)]
num1 = [np.random.rand() for _ in range(n)]
X = pd.DataFrame(list(zip(cate1, num1)), columns=['cate1', 'num1'])
X = pd.get_dummies(X, columns=['cate1'], sparse=False)  # TODO change me to sparse
clicks = [1 + np.random.poisson(2) for _ in range(n)]
conversions = [np.random.binomial(c, .02) for c in clicks]
m = X.shape[1]

with pm.Model() as model:
    mu1 = pm.Normal('mu1', 0, 10, shape=m)
    mu0 = pm.Normal('mu0', 0, 1)
    gamma1 = pm.Normal('gamma1', 0, 10, shape=m)
    gamma0 = pm.Normal('gamma0', 0, 1)
    alpha = pm.Deterministic('alpha', tt.exp(mu0 - pm.math.dot(X.values, mu1)))
    beta = pm.Deterministic('beta', tt.exp(gamma0 - pm.math.dot(X.values, gamma1)))
    y = pm.BetaBinomial('obs', n=clicks, alpha=alpha, beta=beta, observed=conversions)
    trace = pm.sample(1000)

with model:
    p = pm.Beta('p', alpha, beta, shape=n)
    ppc = pm.sample_posterior_predictive(trace, samples=1000, var_names=['p'])

Note that although the output of my model is the number of successes, what I am really interested in here is the p parameter.
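Since a Beta(alpha, beta) has mean alpha / (alpha + beta), a point estimate of p can also be read off the trace without sampling p explicitly: average that ratio over the posterior draws. A minimal numpy sketch, with simulated gamma draws standing in for one ad's slice of trace['alpha'] and trace['beta'] (the gamma shapes here are purely illustrative, not from the model above):

```python
import numpy as np

# Hypothetical posterior draws for a single ad, standing in for
# trace['alpha'][:, i] and trace['beta'][:, i] (illustrative only).
rng = np.random.default_rng(42)
alpha_draws = rng.gamma(2.0, 1.0, size=1000)
beta_draws = rng.gamma(50.0, 1.0, size=1000)

# For each draw, the conditional mean of p is alpha / (alpha + beta);
# averaging that ratio over draws gives a posterior-mean conversion rate.
p_mean = np.mean(alpha_draws / (alpha_draws + beta_draws))
```

This avoids adding the length-n `p` vector to the model at all when only point estimates are needed.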

The first obvious question is theoretical: does this model make sense?

The second one is more pymc3 related: the estimation does not progress and I get stuck in the first sampling loop:

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [gamma0, gamma1, mu0, mu1]

What could be the part that is causing the slowdown? In my use case both n and m are big (in the millions), so I am not sure this is a good start :confused: .

For each observation I observe:

  • the number of successes (conversions)
  • the number of trials (clicks)
  • A bunch of features

In the past I simply trained a logistic regression with conversions / clicks as my dependent variable, but this neglects the fact that too few samples might have been drawn to reveal the true conversion rate.
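To illustrate what that neglect costs, here is a quick empirical-Bayes shrinkage baseline, with no MCMC involved (this is not part of any model above, and the prior strength of 50 pseudo-clicks is an arbitrary illustrative choice): each ad's raw rate is pulled toward the pooled rate, and ads with few clicks are pulled hardest, so zero-conversion ads get a small positive estimate instead of exactly 0.

```python
import numpy as np

# Simulated clicks/conversions, mirroring the data-generating code above.
rng = np.random.default_rng(1)
clicks = 1 + rng.poisson(2, size=1000)
conversions = rng.binomial(clicks, 0.02)

# Pooled conversion rate across all ads.
pooled = conversions.sum() / clicks.sum()

# Beta(a0, b0) prior centred on the pooled rate; the prior strength
# (50 "pseudo-clicks") is a made-up choice for illustration.
strength = 50.0
a0, b0 = pooled * strength, (1.0 - pooled) * strength

# Posterior-mean rate per ad: low-click ads shrink strongly toward the
# pooled rate, high-click ads keep estimates close to their raw ratio.
p_shrunk = (conversions + a0) / (clicks + a0 + b0)
```

The hierarchical model is essentially learning this shrinkage (and the prior strength) from the data instead of hard-coding it.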


  1. I think I’d start with a simple Binomial regression (i.e., getting rid of the mixture part implied by the Beta). It’ll be easier to iterate on, improve and understand. And in the end, you’ll maybe realize you don’t need a Beta-Binomial likelihood, especially if you’re using a hierarchical model, which basically gives you the same flexibility as the Beta-Binomial mixture while often being easier to sample from.

  2. For the moment, your model is not hierarchical: your group-level parameters (mu1 and gamma1) would need one more level to enable pooling of information. But again, for the moment I’d keep it simple and add a hierarchical structure later.

  3. In the same vein, I’d build and improve the model on a subset of the dataset – running on millions of data points each time you change the model will probably take a lot of time!

  4. I’m not sure your parameters are interpretable as such – at least I can’t interpret them, but I’m no Beta-Binomial expert and you know your use-case better than me. I think I remember there is a more interpretable parametrization of the Beta-Binomial, where you deal directly with the probability of success p – which would serve you well here:

with pm.Model() as m12_1:
    a = pm.Normal("a", 0.0, 1.5, shape=2)
    phi = pm.Exponential("phi", 1.0)

    theta = pm.Deterministic("theta", phi + 2.0)
    pbar = pm.Deterministic("pbar", pm.math.invlogit(a[gid]))

    # gid, N and admit_df come from the UCBadmit example in
    # Statistical Rethinking 2 (gid: group index, N: number of applications).
    A = pm.BetaBinomial(
        "A", pbar * theta, (1.0 - pbar) * theta, N, observed=admit_df.admit.values
    )
That way, you do your regression on pbar, and your model is easier to understand and interpret. Here theta is the concentration of the Beta (it equals alpha + beta): the larger it is, the less overdispersed the counts, and the phi + 2.0 keeps theta above 2 so the latent probabilities don’t pile up at 0 and 1. This example is from Statistical Rethinking 2, chapter 12.
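To make point 1 concrete without touching PyMC3 at all: the same Binomial regression can be fit by maximum likelihood with a few Newton-Raphson steps. A plain-numpy sketch on simulated data (the design matrix and true coefficients are made up for illustration):

```python
import numpy as np

# Simulated data with made-up coefficients, for illustration only.
rng = np.random.default_rng(0)
n_obs = 5000
X = np.column_stack([np.ones(n_obs), rng.random(n_obs)])  # intercept + one feature
true_w = np.array([-4.0, 0.5])
clicks = 1 + rng.poisson(2, size=n_obs)
p_true = 1.0 / (1.0 + np.exp(-X @ true_w))
conversions = rng.binomial(clicks, p_true)

# Newton-Raphson on the Binomial log-likelihood:
# gradient = X^T (y - n p), Hessian = X^T diag(n p (1 - p)) X.
w = np.zeros(X.shape[1])
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-X @ w))
    grad = X.T @ (conversions - clicks * p)
    hess = X.T @ (X * (clicks * p * (1.0 - p))[:, None])
    w += np.linalg.solve(hess, grad)
```

This is the no-pooling, no-overdispersion baseline; once it runs and the fit looks sane, you can move the same linear predictor into a pm.Binomial likelihood and add hierarchy on top.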

Hope this helps :vulcan_salute:

Thanks for the suggestion. Indeed, simplifying things to a Binomial likelihood and a single linear regression makes my life easier.

I am still a bit uncertain about how to scale this up to categorical features and more observations, but I’ll probably comment on or check similar questions first.
