Modeling bivariate beta distributions

DISCLAIMER
This is a repost of a CrossValidated question that has received no answers so far.

Goal

My goal is to perform a Bayesian A/B test of the probabilities of success in two groups while accounting for a hypothesized non-zero covariance between those probabilities.

Bivariate beta distribution

I am trying to implement the bivariate beta distribution model proposed by Barry C. Arnold and Hon Keung Tony Ng in “Flexible bivariate beta distributions” using PyMC3. The main formula is equation (1) in the article:

The bivariate random variable (X, Y) is defined as a function of five Gamma-distributed variables U_i with scale parameters equal to 1:

U_i \sim Gamma(\alpha_i, 1)

X = \frac{U_1+U_3}{U_1+U_3+U_4+U_5}; Y = \frac{U_2+U_4}{U_2+U_3+U_4+U_5}

When constructed this way, (X, Y) form a (possibly) correlated pair of Beta random variables; the shared components U_3, U_4 and U_5 are what induce the dependence.

The marginal distributions of X and Y are beta distributions with parameters (\alpha_1+\alpha_3, \alpha_4+\alpha_5) and (\alpha_2+\alpha_4, \alpha_3+\alpha_5), respectively.
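
To sanity-check this construction, here is a quick simulation (my own sketch, with arbitrarily chosen \alpha_i) that compares the empirical marginal of X with the stated Beta distribution and confirms that X and Y are dependent:

from scipy import stats
import numpy as np

# Arbitrary example alphas, chosen only for illustration
alphas = np.array([2., 3., 1., 1.5, 4.])
u = np.random.gamma(shape=alphas, scale=1., size=(100000, 5))

x = (u[:, 0] + u[:, 2]) / (u[:, 0] + u[:, 2] + u[:, 3] + u[:, 4])
y = (u[:, 1] + u[:, 3]) / (u[:, 1] + u[:, 2] + u[:, 3] + u[:, 4])

# Marginal of X should match Beta(alpha_1+alpha_3, alpha_4+alpha_5)
beta_x = stats.beta(alphas[0] + alphas[2], alphas[3] + alphas[4])
print(x.mean(), beta_x.mean())  # close to each other
print(x.var(), beta_x.var())    # close to each other
print(np.corrcoef(x, y)[0, 1])  # non-zero: shared U_3, U_4, U_5 induce dependence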

Data
I have a small (generated) dataset containing daily numbers of trials (denoted trials_a and trials_b) and successes (denoted successes_a and successes_b) in two groups (A/B testing). The daily conversion rates are correlated, and I’d like to take that into account while sampling from the posterior beta distributions to obtain a more reliable estimate of the difference between conversion rates.

I also have some prior knowledge about the marginal betas, e.g. from analyzing data obtained before the start of the A/B test. Let’s consider prior_alpha=20 and prior_beta=4980 (implying a prior mean conversion rate of 20/5000 = 0.004), equal for both marginals.

The Python code for generating the dataset is as follows:

from scipy import stats
import numpy as np
import pandas as pd


DAYS = 30

# params for generating daily number of trials
TRIALS_MEAN = 5000
TRIALS_VAR = 200**2  # variance of the daily number of trials (std = 200)
TRIALS_COV_FACTOR = 0.999

# prior knowledge about probability of success in both groups
PRIOR_ALPHA = 20
PRIOR_BETA = TRIALS_MEAN - PRIOR_ALPHA

# (unknown) probability of success in group A should be a little less than in prior
PROB_A_ALPHA = PRIOR_ALPHA - 1
PROB_A_BETA = TRIALS_MEAN - PROB_A_ALPHA

# (unknown) probability of success in group B should be a little more than in prior
PROB_B_ALPHA = PRIOR_ALPHA + 1
PROB_B_BETA = TRIALS_MEAN - PROB_B_ALPHA

# correlation parameter for the normal copula linking the two probabilities
BETAS_COPULA_COV_FACTOR = 0.5

np.random.seed(0)


def make_data():
    def make_trials():
        """Creates strongly correlated daily number of trials"""
        mvnorm = stats.multivariate_normal(
            mean=[TRIALS_MEAN, TRIALS_MEAN],
            cov=[[TRIALS_VAR, TRIALS_COV_FACTOR*TRIALS_VAR],
                 [TRIALS_COV_FACTOR*TRIALS_VAR, TRIALS_VAR]])
        return mvnorm.rvs(DAYS).astype(int)

    def make_probs():
        """Creates correlated daily probs of success using normal copula"""

        def covariate(rv1: stats.rv_continuous, rv2: stats.rv_continuous, cov_factor: float, n: int):
            def make_copula():
                mvnorm = stats.multivariate_normal(mean=[0, 0], cov=[[1., cov_factor],
                                                                     [cov_factor, 1.]])
                # Generate random samples from multivariate normal with covariance
                x = mvnorm.rvs(n)
                return stats.norm.cdf(x=x)

            copula = make_copula()
            s1 = rv1.ppf(copula[:, 0])
            s2 = rv2.ppf(copula[:, 1])
            return np.vstack((s1, s2))

        beta_a = stats.beta(PROB_A_ALPHA, PROB_A_BETA)
        beta_b = stats.beta(PROB_B_ALPHA, PROB_B_BETA)
        return covariate(rv1=beta_a, rv2=beta_b, cov_factor=BETAS_COPULA_COV_FACTOR, n=DAYS).T

    trials = make_trials()
    probs = make_probs()
    # creates daily number of successes by sampling from binomial with corresponding number of trials
    # and prob of success

    successes = np.random.binomial(trials, probs)

    df = pd.DataFrame(
        np.hstack(
            (trials, probs, successes)
        ),
        columns=['trials_a', 'trials_b', 'prob_a', 'prob_b', 'successes_a', 'successes_b']
    )
    return df


df = make_data()
print(df.head())
# output:
#   trials_a  trials_b    prob_a    prob_b  successes_a  successes_b
#0    4645.0    4649.0  0.004429  0.004510         15.0         28.0  
#1    4794.0    4814.0  0.005251  0.003992         28.0         34.0  
#2    4630.0    4622.0  0.003775  0.003821         13.0         22.0  
#3    4810.0    4809.0  0.004845  0.005806         19.0         29.0  
#4    5018.0    5022.0  0.004427  0.004914         17.0         30.0 
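
Before modeling, it is worth confirming (my own check) that the generated dataset actually contains the correlations we hope to recover; the exact values depend on the random seed:

# Empirical correlations in the generated data
print(np.corrcoef(df['prob_a'], df['prob_b'])[0, 1])      # daily probs: positive, induced by the copula
print(np.corrcoef(df['trials_a'], df['trials_b'])[0, 1])  # daily trials: close to 1 by construction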

Model in theory

As mentioned above, I have some prior knowledge about the marginal distributions of betas:

X_{margin} = Y_{margin} = Beta(prior\_alpha, prior\_beta)

And we have their expressions in terms of \alpha_i from the paper:

X_{margin} = Beta(\alpha_1+\alpha_3, \alpha_4+\alpha_5); Y_{margin} = Beta(\alpha_2+\alpha_4, \alpha_3+\alpha_5)

Unfortunately, I don’t know anything about a prior for the covariance. Therefore my prior knowledge about the \alpha_i can be expressed like this (I am very unsure about the correctness of these lines):

some\_latent\_var \sim Uniform(0.001, \min(prior\_alpha, prior\_beta))

\alpha_3 = \alpha_4 = some\_latent\_var

\alpha_1 = \alpha_2 = prior\_alpha - some\_latent\_var

\alpha_5 = prior\_beta - some\_latent\_var
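
This parameterization can be verified with a quick check (my own addition, reusing PRIOR_ALPHA and PRIOR_BETA from the data-generation script): for any value of the latent variable, the marginal Beta parameters stay fixed at the prior values.

lam = 7.0  # arbitrary value of some_latent_var inside its support
a1 = a2 = PRIOR_ALPHA - lam
a3 = a4 = lam
a5 = PRIOR_BETA - lam
# X ~ Beta(a1 + a3, a4 + a5) and Y ~ Beta(a2 + a4, a3 + a5)
assert a1 + a3 == PRIOR_ALPHA and a4 + a5 == PRIOR_BETA  # X marginal is Beta(20, 4980)
assert a2 + a4 == PRIOR_ALPHA and a3 + a5 == PRIOR_BETA  # Y marginal is Beta(20, 4980)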

Then I plug these \alpha_i into the construction from the paper:

U_i \sim Gamma(\alpha_i, 1)

X = \frac{U_1+U_3}{U_1+U_3+U_4+U_5}; Y = \frac{U_2+U_4}{U_2+U_3+U_4+U_5}

And finally, in test groups a and b I have N (N = 30 in the example) independent and identically distributed binomial variables which generate the daily numbers of successes from the daily numbers of trials (the latter are considered non-random). The probabilities of success in groups a and b follow the X and Y distributions, respectively. Denoting the number of trials on day d in group k by trials_{k, d} and the number of successes by successes_{k, d}:

\forall d \in [1, N]: successes_{a, d} \sim Binomial(trials_{a, d}, X)

\forall d \in [1, N]: successes_{b, d} \sim Binomial(trials_{b, d}, Y)

I have both successes and trials in both groups in my dataset. Therefore I can construct this model in PyMC3 and sample from it to obtain posteriors for X and Y.

Model in practice

I tried to follow the “Model in theory” section as closely as I could:

import pymc3 as pm
import theano.tensor as tt

with pm.Model() as model_five_params:
    # hyperparam's priors (just like in "Model in theory")
    some_latent_var = pm.Uniform('some_latent_var', lower=0.001, upper=min(PRIOR_ALPHA, PRIOR_BETA))
    alpha_3 = alpha_4 = some_latent_var
    alpha_1 = alpha_2 = PRIOR_ALPHA - some_latent_var
    alpha_5 = PRIOR_BETA - some_latent_var
    
    # five Gammas
    u_1 = pm.Gamma('u_1', alpha=alpha_1, beta=1)
    u_2 = pm.Gamma('u_2', alpha=alpha_2, beta=1)
    u_3 = pm.Gamma('u_3', alpha=alpha_3, beta=1)
    u_4 = pm.Gamma('u_4', alpha=alpha_4, beta=1)
    u_5 = pm.Gamma('u_5', alpha=alpha_5, beta=1)
    
    # construction of X and Y from Gammas
    # pm.Deterministic doesn't do anything special;
    # it just does the arithmetic and labels the results
    # so that we can get traces for X and Y easily
    X = pm.Deterministic('X', (u_1+u_3)/(u_1+u_3+u_4+u_5))
    Y = pm.Deterministic('Y', (u_2+u_4)/(u_2+u_3+u_4+u_5))

    # I hoped this trick of stacking X and Y together would help
    # PyMC3 to understand that observations are linked too
    beta = tt.stack([X, Y])

    # Introduce the final step of Binomials to the model
    # and pass the observations to it
    # Numbers of trials and numbers of successes are passed in shapes (N, 2)
    obs = pm.Binomial(
        'obs',
        n=df[['trials_a', 'trials_b']].values,
        p=beta,
        observed=df[['successes_a', 'successes_b']].values
    )

with model_five_params:
    # sampling process of the model which results in a 'trace' object
    # 'trace' contains samples from posterior distributions of every
    # variable defined in model. We are especially interested in X and Y
    step = pm.Metropolis()
    trace_five_params = pm.sample(200000, tune=20000, step=step, init='advi')

Analyzing the results

The posterior traces of X and Y are very similar to the beta posteriors obtained analytically as follows:

\forall k, \forall d: fails_{k, d} = trials_{k, d} - successes_{k, d}
X = Beta(prior\_alpha + \sum_{d=1}^Nsuccesses_{a, d}, prior\_beta + \sum_{d=1}^Nfails_{a, d})
Y = Beta(prior\_alpha + \sum_{d=1}^Nsuccesses_{b, d}, prior\_beta + \sum_{d=1}^Nfails_{b, d})

beta_a_exact = stats.beta(
    a=PRIOR_ALPHA + df['successes_a'].sum(),
    b=PRIOR_BETA + df['trials_a'].sum() - df['successes_a'].sum()
)
beta_b_exact = stats.beta(
    a=PRIOR_ALPHA + df['successes_b'].sum(),
    b=PRIOR_BETA + df['trials_b'].sum() - df['successes_b'].sum()
)

The fact that the posterior traces of X and Y match the analytical posteriors can be illustrated with a plot combining the histograms of the traces and the PDFs of the analytically obtained Beta distributions.

import matplotlib.pyplot as plt

fig, ax = plt.subplots(nrows=1, ncols=1)

x_0 = np.linspace(beta_a_exact.ppf(0.00001), beta_a_exact.ppf(0.99999), 1000)
ax.plot(x_0, beta_a_exact.pdf(x_0), label='Analytical $X$')
ax.hist(trace_five_params.get_values('X', burn=50000), density=True, bins=50, alpha=0.5, label='Trace for $X$')

x_1 = np.linspace(beta_b_exact.ppf(0.00001), beta_b_exact.ppf(0.99999), 1000)
ax.plot(x_1, beta_b_exact.pdf(x_1), label='Analytical $Y$')
ax.hist(trace_five_params.get_values('Y', burn=50000), density=True, bins=50, alpha=0.5, label='Trace for $Y$')

ax.legend()
ax.set_xlabel('$p$')
ax.set_ylabel('$f(p)$')

plt.show()

(Image omitted due to the forum’s new-user limit: histograms of the X and Y traces overlaid with the analytical Beta PDFs.)

This is, so far, the desired output: the posterior traces were expected to match the analytical solutions for the marginals.

But, unfortunately, the model doesn’t show any correlation between two betas:

fig, ax = plt.subplots()
ax.hist2d(
    trace_five_params.get_values('X', burn=50000),
    trace_five_params.get_values('Y', burn=50000),
    bins=50
)
ax.set_xlabel('$X$ trace')
ax.set_ylabel('$Y$ trace')
plt.show()

(Image omitted: 2D histogram of the X and Y traces.)

The expected output: the 2D histogram shows correlation between X and Y, because the daily probabilities of success in the example dataframe are correlated and the model was supposed to capture this correlation.

What I got: the 2D histogram shows no correlation.

The questions are:

  1. Is the theoretical model correct, and should it actually induce correlation between X and Y?

  2. If the answer is “yes”, how do I modify my PyMC3 model so that it actually captures the correlation between conversion rates?


Thanks for the detailed question. The important point to realize here is that you need multiple “observations” in the latent space to capture correlation. This means that X and Y should be random variables of shape m (m being the number of observations) that are generated from a distribution.

Something like below might work:

with pm.Model() as m:
    # count is the (m, 2) array of observed success counts from the linked notebook;
    # one set of five Gammas is drawn per observation, so x and y have shape m
    us = pm.Gamma('u_i', np.ones(5), 1, shape=(len(count), 5))
    x = pm.Deterministic('x', (us[..., 0]+us[..., 2])/(us[..., 0]+us[..., 2]+us[..., 3]+us[..., 4]))
    y = pm.Deterministic('y', (us[..., 1]+us[..., 3])/(us[..., 1]+us[..., 2]+us[..., 3]+us[..., 4]))
    observed = pm.Binomial('obs', np.ones(2)*20, tt.stack([x, y]).T, observed=count)
    trace = pm.sample(1000, tune=1000)

import arviz as az

_, ax = plt.subplots(1, 1, figsize=(5, 5))
az.plot_kde(trace['x'].mean(axis=0),
            trace['y'].mean(axis=0),
            ax=ax)

See https://colab.research.google.com/drive/1iloMYZBYyV4AG0_7xXOSpkR7FH2TbDUn for the full code


Thank you for your answer.

It seems to me that your model assumes a different probability of success for each day in both groups.

My intention was to find the distributions of just two probabilities of success, each of which is static across all days: the first one (X) generates successes from trials in group a, and the second one (Y) does so in group b.
This approach was inspired by this series of posts and this post with a PyMC3 implementation.

The simplest model showing what I’m trying to achieve is the following:

with pm.Model() as model_simple:
    beta_a = pm.Beta('beta_a', alpha=PRIOR_ALPHA, beta=PRIOR_BETA)
    beta_b = pm.Beta('beta_b', alpha=PRIOR_ALPHA, beta=PRIOR_BETA)
    
    obs_a = pm.Binomial('obs_a', n=df['trials_a'].values, p=beta_a, observed=df['successes_a'].values)
    obs_b = pm.Binomial('obs_b', n=df['trials_b'].values, p=beta_b, observed=df['successes_b'].values)

    step = pm.Metropolis()
    trace_simple = pm.sample(200000, tune=20000, step=step, init='advi')

As you can see, there are only two betas, not one for each day in each group.
The only disadvantage of this model is that the betas are not correlated, which is why I’m trying to implement the model from the paper.
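
For completeness, the absence of posterior correlation in model_simple can be checked directly on the traces (my own check, reusing the burn-in value from the plots above):

# Correlation between the posterior samples of beta_a and beta_b is near zero
print(np.corrcoef(
    trace_simple.get_values('beta_a', burn=50000),
    trace_simple.get_values('beta_b', burn=50000)
)[0, 1])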

Exactly. The problem here is that it is difficult (I am tempted to say impossible) to recover a latent correlation from samples alone when the correlation is not modeled as a parameter. Imagine trying to model correlated Gaussian observations with an independent Gaussian prior - you will not be able to recover the correlation:

with pm.Model() as m:
    x = pm.Normal('x', 0., 100., shape=2)
    pm.Normal('y', x, 1.,
              observed=np.random.multivariate_normal(np.zeros(2), np.array([[1., .25], [.25, 2.]]), 1000))
    trace = pm.sample()
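
Indeed, inspecting the sampled trace (my addition) confirms that the posterior components of x stay essentially uncorrelated, despite the correlated observations:

# trace['x'] has shape (n_samples, 2); the off-diagonal entry is near zero
print(np.corrcoef(trace['x'].T)[0, 1])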

Yes, this is exactly why the model denoted model_simple in my previous reply is not able to capture the correlation. But isn’t the model from the paper designed to actually have parameters for the correlation in it?

Okay, maybe I should have started with a simpler model instead of a bivariate beta. A Beta can be approximated with a normal, so let’s replace the bivariate beta with a multivariate normal for simplicity:

prior_mean = PRIOR_ALPHA / (PRIOR_ALPHA + PRIOR_BETA)
with pm.Model() as normal_model:
    prob_of_success = pm.MvNormal(
        'prob_of_success',
        mu=np.array([prior_mean, prior_mean]), 
        cov=np.array([[0.002, 0.0005], [0.0005, 0.002]]),
        shape=2
    )
    
    obs = pm.Binomial(
        'obs',
        n=df[['trials_a', 'trials_b']].values,
        p=prob_of_success,
        observed=df[['successes_a', 'successes_b']].values
    )

    step = pm.Metropolis()
    trace_normal = pm.sample(20000, tune=20000, step=step, init='advi')

This model shows no correlation between the probabilities of success either. Why?

np.corrcoef(trace_normal.get_values('prob_of_success').T)
# array([[1.        , 0.00471091],
#        [0.00471091, 1.        ]])

I am really missing something…

I’ve run into this modeling problem before. It took me a long time to wrap my head around this, and I still get confused occasionally. It’s a problem I’ve been struggling with for a few months now, so I apologize if this is incorrect, too.

The model you’ve described in post #6 doesn’t have enough data to learn the correlations. Suppose you have 100 trials, and 30 successes for A and 70 for B. This says little about the probability of both trials succeeding; the following could all be the case:

Scenario 1:

            succeed A   fail A
succeed B       21        49
fail B           9        21

Scenario 2:

            succeed A   fail A
succeed B       30        40
fail B           0        30

Scenario 3:

            succeed A   fail A
succeed B        0        70
fail B          30         0

These all have different correlations. Using a binomial likelihood hides the correlation from the model, because all it sees are the aggregate counts (30 successes for A, 70 for B). The model does not have enough information to recover the (latent) correlation when just given the aggregates.
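
To make this concrete, here is a small check (my own addition) that computes the correlation (phi coefficient) implied by each joint table above; the margins, which are all a binomial likelihood ever sees, are identical in every scenario:

import numpy as np

# Each table is [[succeed A & succeed B, fail A & succeed B],
#                [succeed A & fail B,    fail A & fail B  ]]
tables = [
    np.array([[21, 49], [9, 21]]),  # scenario 1
    np.array([[30, 40], [0, 30]]),  # scenario 2
    np.array([[0, 70], [30, 0]]),   # scenario 3
]

for t in tables:
    n = t.sum()
    p_a = t[:, 0].sum() / n  # P(succeed A) = 0.3 in every scenario
    p_b = t[0, :].sum() / n  # P(succeed B) = 0.7 in every scenario
    p_ab = t[0, 0] / n       # P(both succeed) differs between scenarios
    phi = (p_ab - p_a * p_b) / np.sqrt(p_a * (1 - p_a) * p_b * (1 - p_b))
    print(p_a, p_b, phi)  # phi: ~0, ~0.43, and -1.0 respectively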

With all that said, I am still not totally satisfied by the above explanation. Intuitively, it seems to me that by drawing the (logit) probability p from an MvNormal, I should be specifying that there’s a possible correlation between p_a and p_b, and that the model should recover it. In practice, I think “possible” is the operative word in that sentence, and in the absence of more data the problem gets reduced to two independent Binomials.

If you have the individual trial data, you might try using that. This would be a model where you draw logit p from an MvNormal (or p from correlated Betas) and then pass those to a Bernoulli instead of a Binomial; a sketch follows. I’ve had trouble fitting this kind of model in practice, though, so if you have that data and succeed I’d love to hear about it.
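
A minimal sketch of such a model, assuming hypothetical per-trial 0/1 arrays outcome_a and outcome_b (placeholder names and data, not from this thread) and making the correlation an explicit parameter via an LKJ prior on the covariance of the logits (my own choice of prior):

import numpy as np
import pymc3 as pm

# Placeholder per-trial data: one 0/1 entry per individual trial in each group
outcome_a = np.random.binomial(1, 0.004, size=5000)
outcome_b = np.random.binomial(1, 0.004, size=5000)

with pm.Model() as bernoulli_model:
    # LKJ prior on the 2x2 covariance makes the correlation an explicit parameter
    chol, corr, stds = pm.LKJCholeskyCov(
        'chol', n=2, eta=1., sd_dist=pm.HalfNormal.dist(1.), compute_corr=True
    )
    # logit(0.004) is roughly -5.52, matching the prior mean conversion rate
    logit_p = pm.MvNormal('logit_p', mu=np.full(2, -5.52), chol=chol, shape=2)
    p = pm.Deterministic('p', pm.math.sigmoid(logit_p))

    pm.Bernoulli('obs_a', p=p[0], observed=outcome_a)
    pm.Bernoulli('obs_b', p=p[1], observed=outcome_b)

    trace = pm.sample(2000, tune=2000)

Note that with a single latent draw of logit_p the correlation may still be only weakly identified, which is consistent with the fitting trouble mentioned above.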

Late to this post, but this recent (2021) preprint seeks to extend the literature on modeling multivariate Beta distributions and is quite similar to the approach taken by Arnold and Ng. Specifically, section 2.4 discusses ways to constrain the parameters to “enable unambiguous parameter inference”. You may find it helpful.
