Beginner question - Comparing two posterior predictive distributions with different number of observed data


Hope you are having a nice day!

I am a PYMC beginner, and I am trying to wrap my head around using PYMC to compare two data groups, in the context of A/B testing.


I have two groups of data - group A and group B. Group A has 400 datapoints, group B has 390. Each datapoint is a single float number representing revenue from that particular datapoint. This data is stored in pandas DataFrames called data_A and data_B.

I want to build a simple model that would allow me to compare the posteriors of these two groups, and make claims like “Group A has higher revenue than Group B X% of the time, and distribution of the difference between the two groups looks like this”.

In order to do that, I define a simple model:

with pm.Model() as revenue_model:
    sigma_A = pm.HalfNormal("sigma_A", 1000)
    sigma_B= pm.HalfNormal("sigma_A", 1000)
    mean_A = pm.Normal("mean_A", mu=5000, sigma=1000)
    mean_B = pm.Normal("mean_B", mu=5000, sigma=1000)
    revenue_A= pm.Normal('revenue_A', mu=mean_A, sigma=sigma_A, observed = data_A['revenue'])
    revenue_B= pm.Normal('revenue_B', mu=mean_B, sigma=sigma_B, observed = data_B['revenue'])
    trace = pm.sample()

I can then plot the trace to see the posterior sigma and mean parameters for A and B. All four will have chains x draws shape (in my case with defaults it’s 4 x 1000).

But when I draw samples from my posterior predictive, I get revenue_A to have a shape of 4 x 1000 x 400, and revenue_B to have a shape of 4 x 1000 x 350. I understand that PYMC draws a sample from posterior predictive for each observed datapoint, hence why the posterior predictives have that shape.

However, I can’t compare these (e.g. take one from another to get a distribution of difference) because they are different shapes.

Two questions:

  • Is this the right approach to answer the question I am asking?
  • If so, how could I go about comparing posterior distribution of the revenue?

Thank you very much in advance for you advice and guidance!

edit Any suggestions about “best practices” and structuring my code differently are most welcome too!

Hi @NikitaK

I am not sure if I understand fully but I think what you’re trying to do is a comparison of the means between the two groups, but you’re encountering a situation where the posterior of the parameter has too many dimensions (it should be as you say 4 x 1000, assuming defaults in pm.sample. This may be related to the dimensions of the observed data, which have caused me problems in the past. Try something like this:

with pm.Model() as revenue_model:
    # Priors for variances
    sigma_A = pm.HalfNormal("sigma_A", 1000)
    sigma_B= pm.HalfNormal("sigma_A", 1000)
    # Priors for the means
    mean_A = pm.Normal("mean_A", mu=5000, sigma=1000)
    mean_B = pm.Normal("mean_B", mu=5000, sigma=1000)

    # Datasets
    # Casting to numpy forces Series to be a 1Dimensional array
    groupA = pm.MutableData('groupA', data_A['revenue'].to_numpy())
    groupB = pm.MutableData('groupB', data_B['revenue'].to_numpy())
    # Likelihoods
    revenue_A= pm.Normal('revenue_A', mu=mean_A, sigma=sigma_A, observed = groupA)
    revenue_B= pm.Normal('revenue_B', mu=mean_B, sigma=sigma_B, observed = groupB)

    # Compute the difference and store it in the trace
    mean_diff = pm.Deterministic('mean_diff', mean_A - mean_B)
    sigma_diff = pm.Deterministic('sigma_diff', sigma_A - sigma_B)

    idata = pm.sample()

Hopefully that works. The issue might be that the observed data - when passed as a Series - gets interpreted as N x 1 vector and so an extra dimension is broadcast into the posterior, when you should provide them as just N.

Once done you can obtain your desired probabilities easily with arviz, e.g.:

az.plot_posterior(idata, var_names=['mean_diff', 'sigma_diff'], ref_val=0)

this should tell you the probability that the mean of A is greater than B, or indeed if the SD of A is greater than B (implying A is more variable than B). hope that helps!

Thank you very much! I will try this an report back with any comments/questions :grin:

Re-reading your post above I think I misunderstood a little, in that you were trying to compare posterior predictive samples (which will indeed have the 4 x 1000 x N shape you describe) to one another, but I thought that something had happened with the shape of the posterior of the parameters (not the datapoints!). Either way I think the above is the way to go, as you can compare the means of the two distributions and make claims about how likely it is one is on average higher/lower.

You could also try generating the posterior predictive distribution for each group, and for each simulated dataset, take the average for each group, and compare the means within each draw, but see how you get on first!

Thank you very much for your advice!

Would this be similar to BEST notebook in PYMC examples gallery? Bayesian Estimation Supersedes the T-Test — PyMC example gallery

I was worried that by comparing the mean parameter and not the posterior sample distribution I would be ignoring (or at least decreasing) the impact the sigma parameter has on this.

But I think I agree with you, this is already very useful, thank you again for the advice :slight_smile:

Yes, its the same idea as the BEST notebook, just looking at the difference in the parameters! And of course its very easy to just look at the difference in the sigma parameter between the distributions too since it gets estimated, so easy to see whether one group is more variable than another. You are welcome!

I agree with this, see this video by Richard McElreath for a discussion. You can handle it in PyMC by using pm.MutableData. Once you have the estimated mu and sigma, the number of draws you take from your estimated distribution is completely arbitrary. When you do posterior predictive sampling, you get back the number of observations because PyMC works with statically-shaped objects*. Here’s a work-around:

import numpy as np
import pandas as pd
import pymc as pm
import arviz as az
import matplotlib.pyplot as plt

true_sigmas = abs(np.random.normal(scale=1000, size=2))
true_mus = np.random.normal(loc=5000, scale=1000, size=2)
df_A = pd.DataFrame(np.random.normal(loc=true_mus[0], scale=true_sigmas[0], size=1000), columns=['revenue'])
df_B = pd.DataFrame(np.random.normal(loc=true_mus[1], scale=true_sigmas[1], size=900), columns=['revenue'])

with pm.Model() as revenue_model:
    data_A = pm.MutableData('data_A', df_A.revenue.values)
    data_B = pm.MutableData('data_B', df_B.revenue.values)
    sigma_A = pm.HalfNormal("sigma_A", 1000)
    sigma_B= pm.HalfNormal("sigma_B", 1000)
    mean_A = pm.Normal("mean_A", mu=5000, sigma=1000)
    mean_B = pm.Normal("mean_B", mu=5000, sigma=1000)
    revenue_A= pm.Normal('revenue_A', mu=mean_A, sigma=sigma_A, observed = data_A)
    revenue_B= pm.Normal('revenue_B', mu=mean_B, sigma=sigma_B, observed = data_B)
    idata = pm.sample()

After sampling, you can use pm.set_data to update data_A and data_B to be whatever you want. Note that in this case, these quantities are not used at all to compute the predictive samples! The process for getting a predictive sample is 1) sample mu, 2) sample sigma, 3) sample y ~ N(mu, sigma), There only thing PyMC is going to take from the MutableData containers is the shape. So in this case, you can just pass an array of zeros for the new data, with the desired shape:

with revenue_model:
#     min_rev = min(df_A.revenue.min(), df_B.revenue.min())
#     max_rev = max(df_A.revenue.max(), df_B.revenue.max())
#     rev_grid = np.linspace(min_rev * 0.5, max_rev * 1.5, 1000)
    rev_grid = np.zeros(1000)
    pm.set_data({'data_A':rev_grid, 'data_B':rev_grid})
    idata = pm.sample_posterior_predictive(idata, extend_inferencedata=True, predictions=True,
                                               idata_kwargs = {'dims':{'revenue_A':['revenue'],

I left in some commented code that computes a grid of values around the observed values, you can run the code twice and convince yourself the observed values don’t matter. I also chose 1000, which is totally arbitrary, you can get as many or as few posterior predictive draws as you like. There’s also a trick with idata_kwargs. By default, arviz has no way to know that the new posterior predictive samples will have the same index (and we can’t set them the same in our model, because that would force their shapes to be the same during sampling). So the fix is to manually tell arviz they’re the same using the idata_kwargs argument. If you don’t do this, life gets a bit harder down the road.

Finally, I saved the draws in the predictions group, rather than the posterior_predictive group, by passing predictons=True. This is 100% entirely an aesthetic choice. It does not do anything special related to prediction.

So once you do that you can plot the two posterior predictive distributions together:

fig, ax = plt.subplots(figsize=(14,4), dpi=144)
                  var_names=['revenue_A', 'revenue_B'], 
                  ax=[ax, ax],
ax.set_title('Comparison of Posterior Predictive Distributions')

But it’s strongly recommended to plot the difference:

idata['predictions']['revenue_diff'] = idata['predictions'].revenue_A - idata['predictions'].revenue_B

fig, ax = plt.subplots(figsize=(6,4), dpi=144)
ax.set_title('Difference Between Posterior Predictive Samples,\nRevenues A and B ')

So as you intuited, even though the means are quite different, the difference between the posterior predicted distributions is quite spread, and includes zero in the 95% HDI.

1 Like

Ooo this is a very cool application - makes total sense to just draw equal sized samples. Obvious in hindsight! Thanks @jessegrabowski learned something here for the future!

Thank you very much @jessegrabowski , this is a great explanation!