Aggregation of trace in models? - Probabilities for max p of binomial dists


So I’m trying to get probabilities for which posterior variable is the max of a set. An example would be a multi-variant test of 3 different UIs for a website, where we want to predict the conversion rate of each.

My goal is, given 3 versions with unknown success probabilities p1, p2, and p3, and observations N1, obs1, … N3, obs3, where Nx is the number of trials and obsx is the number of successes, to derive, for each version, the probability that it has the max p.

Here is a gist of a .ipynb where I attempt this:

I set up a simple model to find the posteriors for each p, and then, using just the burned trace for the p’s, I calculate the relative frequencies of their argmax.

My question is then: is this correct methodology for calculating those probabilities from the resultant trace? Or should I have attempted to build those (relative frequencies of argmax) as variables in the model, to be computed during MCMC? Generalising: can one have aggregations of variables within a PyMC model?

My tentative predicted answer would be: you can’t have an aggregated variable (such as relative frequencies of argmax of a set of variables) in a model, because it relies on aggregating over numerous points on the posterior surface, whereas when running MCMC we can only output the values of variables that exist in the space we are sampling (the posterior surface).
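Concretely, the aggregation step I mean looks something like this (a self-contained sketch with made-up counts; since a Uniform(0, 1) prior on a binomial rate gives a conjugate Beta(1 + obs, 1 + N - obs) posterior, I draw exact posterior samples with NumPy here instead of running MCMC):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up observations for illustration: trials N and successes obs per version.
N = np.array([50, 50, 50])
obs = np.array([20, 25, 30])

# With a Uniform(0, 1) prior, the posterior for each p is Beta(1 + obs, 1 + N - obs),
# so we can draw exact posterior samples instead of a trace.
samples = rng.beta(1 + obs, 1 + N - obs, size=(10_000, 3))

# Relative frequency with which each version has the largest p across draws.
winner = np.argmax(samples, axis=1)
p_max = np.bincount(winner, minlength=3) / len(winner)
print(p_max)  # three probabilities summing to 1
```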


Your current solution is what I would do as well.

I think an aggregation is also possible if you put a hyper-prior on the rate p - you can then model each of the three specific rates using a Beta distribution parameterised with a mean (i.e. the hyper-prior) and an sd (whose maximum is sqrt(mu * (1 - mu))).

Doing so can regularise the prior, especially with unbalanced data (e.g. n = 10, 100, 100). However, I think the result would be more or less identical to what you are currently doing.

Glad to hear you would go for a similar approach. I’m still very interested in the alternative you described, but I’m struggling to understand it (I’m new to using PyMC).

Could you provide a bit of code showing what you mean regarding the hyper-prior, and where exactly the aggregation would happen? That would be super appreciated.

Thanks again for all the help!

What I meant is something like this:

import numpy as np
import pymc3 as pm
import theano.tensor as tt

# observations is a (2, 3) array: row 0 holds the trial counts N,
# row 1 the success counts obs, one column per version.
with pm.Model() as model:
    mu = pm.Uniform('mu', lower=0, upper=1)
    # the sd of a Beta cannot exceed sqrt(mu * (1 - mu))
    sd = pm.Uniform('sd', lower=0, upper=tt.sqrt(mu * (1 - mu)))
    p = pm.Beta('p', mu=mu, sd=sd, shape=3)
    obs = pm.Binomial('obs', observations[0, :], p, observed=observations[1, :])
    trace = pm.sample(1000, tune=1000, nuts_kwargs=dict(target_accept=.95))

and then aggregate the same way as you did:

res = np.unique(np.argmax(trace['p'], axis=1), return_counts=True)

However, this model returns some divergent samples, so it will still need some tweaking somewhere…

Hi again @junpenglao,

I think I understand your hyper-prior now - adding another level of RVs. But I don’t understand what the benefits of this would be, and specifically what it means to “regularise the prior”?

It seemed to me that the model was already taking unbalanced sample sizes into account, as can be seen in the wider distribution for p3 (with N3 and obs3) in the notebook above.

Thanks again.

You can think of the regularising effect of the hyper-prior as pulling all the posterior rates p towards a more concentrated space. The idea is that since you are measuring similar processes here (if they were completely unrelated to each other you would not be comparing them), it makes sense to think that they originate from some common “average process”, which can be encoded in the hyper-prior. More explanation of the hierarchical modelling idea can be found, e.g., here:
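To make the “pulling” concrete, here is a toy NumPy illustration (the counts are made up, and the shrinkage constant kappa is an arbitrary illustrative value standing in for the pooling strength a hyper-prior would learn from the data):

```python
import numpy as np

# Made-up unbalanced data (n = 10, 100, 100).
N = np.array([10, 100, 100])
obs = np.array([7, 50, 55])

raw = obs / N                  # per-version estimates, no pooling
pooled = obs.sum() / N.sum()   # the common "average process"

# Toy shrinkage: each estimate is pulled towards the pooled rate, with the
# pull strongest for the version with the least data. kappa is an arbitrary
# illustrative constant, not something PyMC computes.
kappa = 30
w = N / (N + kappa)
shrunk = w * raw + (1 - w) * pooled

print(raw)     # unpooled estimates
print(shrunk)  # the n=10 estimate moves much closer to the pooled rate
```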

(First, thanks again for the prompt and repeated responses - super helpful)

Ok, so this hyper-prior example has shown big gaps in my understanding of PyMC (at first I thought the posteriors for each p would be identical…!).

So, a few things to check my understanding, now that I’ve played with the hierarchical model and read about GLMs:

  • Even though the p random variables are “defined” by their parent variables (mu, sd), they are still stochastic and hence can differ from each other - right?
  • This is odd though, as when I run
for RV in model.basic_RVs:
    print(RV.name, RV.logp(model.test_point))

then I only get a single logp value for p, e.g. p_logodds__ -3.4341896575482 - why don’t I get three test-point values, one for each p?

  • I find it conceptually hard to interpret this probability density space and the effect of the hyper-priors (maybe this is normal and ok). Is it simply that each p shares a continually changing Beta prior (as mu changes during sampling) that influences its density, combined with the density contribution from the observed data for each p? This would serve to “pull them together”, adding in a sense a regularisation to the model?
  • If I’m actually agnostic about the underlying mechanics of what causes each p, and only care about the “best” prediction of each p for comparing them, would the non-hierarchical model then be better?

Thanks again for your responses, each one is making a jump in my understanding of PyMC and Bayesian modelling, and hopefully future readers will find them helpful too.

@DBCerigo you are welcome! I think hyper-priors and informative priors are magical things that are getting more and more appreciation! So a few more thoughts on your questions:

Yes. They are still stochastic nodes in the model; the difference is that now they are related to each other through a common mean (you can think of this hyper-mean as the average of the individual means of each binomial).

RV.logp(model.test_point) returns the sum of the logp of each element of p. To get the logp of each p separately you can use RV.logp_elemwise.
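The summing behaviour itself has nothing to do with PyMC specifically; you can reproduce it with scipy.stats (the values of p and the Beta parameters below are just illustrative, not the model’s actual test point):

```python
import numpy as np
from scipy import stats

p = np.array([0.2, 0.5, 0.7])    # a hypothetical test value for the vector p
dist = stats.beta(a=2.0, b=2.0)  # stand-in prior, same for all three components

elemwise = dist.logpdf(p)  # one log-density per component, shape (3,)
total = elemwise.sum()     # what a single summed-logp call reports

print(elemwise)
print(total)
```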

Indeed, this serves to “pull the values together”, and is sometimes called shrinkage. From a more frequentist point of view: in the non-hierarchical model you are inferring the binomial rate $p_i$ for each of the 3 cases using a separate estimator, and in principle you cannot generalise to a new case. To generalise to a new case you might pool the information from all 3 observed cases by computing $(p_1 + p_2 + p_3)/3$. But since the trial numbers differ across the 3 cases, you might want to give more weight to the cases with more trials. In the hierarchical model above, the hyper-prior can be thought of as a clean way to get this weighted average.
I would interpret the hyper-prior as the common average across cases.
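As a quick made-up numerical illustration of the unweighted versus trial-weighted average:

```python
import numpy as np

N = np.array([10, 100, 100])   # made-up trial counts
obs = np.array([7, 50, 55])    # made-up success counts

rates = obs / N
unweighted = rates.mean()                # (p1 + p2 + p3) / 3
weighted = np.average(rates, weights=N)  # equals obs.sum() / N.sum()
print(unweighted, weighted)  # the weighted average discounts the noisy n=10 case
```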

I think whether the model is hierarchical or not is not directly related to prediction performance; it is a way to incorporate information. If, mechanically, the three cases are unrelated to each other, it might not make sense for them to share a common hyper-mean.