How to use the posterior distribution of one model as a prior distribution for another model

Hi all,

If I want to use the posterior distribution of one model as a prior distribution for another model, how should I implement this in PyMC5?

I would be thankful if you could help me with this question.

Hi. The simplest way is to use the summary of your posterior as parameters for your new prior. For instance, if you have a Gaussian posterior with mean = 5 and standard deviation = 2, then your new informed prior could be a Normal(5, 2). There are some caveats to that, I guess; maybe it’s a bit too rudimentary. You can see some discussion on using informed priors here: Bayesian Analysis Reporting Guidelines | Nature Human Behaviour

I’m sure others may answer you with a more sophisticated approach than the one I mentioned above.

1 Like

There is an Interpolated distribution that allows you to use samples from arbitrary distributions as a prior. Here’s the tutorial, which is badly out of date, but should work – just make sure to import pymc (NOT pymc3) and pytensor.tensor as pt (NOT theano.tensor as tt).

There’s a reason this hasn’t been updated though – it’s because it doesn’t do what you think it does. It will set the marginal posteriors as priors, but it discards all the information about the correlations between parameters in the joint posterior. The more updated approach would be to use the prior_from_idata helper in pymc-experimental. This uses a multivariate normal approximation over the entire joint posterior in the provided samples, so it preserves the dependency structure between the variables. Much preferred to the marginal-only approach.

That said, I actually endorse @Simon’s approach, with the caveat that you shouldn’t look at the posteriors one by one. If you’re only interested in reusing a subset of the parameters, you can compute the mean and covariance of the samples in your posterior and use them to parameterize a multivariate normal. Less fancy than using prior_from_idata, but it would get the job done.


Based on Simon’s and Jesse’s answers, perhaps another way could be to calculate the correlations between your variables from your trace and then set your new priors to be a multivariate normal with the posterior mean and a non-trivial covariance matrix. This assumes that your joint posterior looks sufficiently like an MvNormal.


@Simon @jessegrabowski @iavicenna

Thank you for your quick response. All the content is very informative!!

I found an example of using histogram_approximation in a previous thread.
What is the difference between histogram_approximation and a multivariate normal approximation?
Would I use histogram_approximation to approximate a discrete distribution?

A histogram approximation lets you approximate any arbitrary distribution by slicing it up into little bins and computing the probability of seeing data in a bin. It doesn’t make any assumptions about the shape of the distribution, which is the difference between it and the MvN approximation.

I guess you could use it on data that is already discrete? But I also have no idea, I’ve never done anything like that.

I’d recommend starting with MvN though, as all well-behaved posteriors should be converging to something normalish anyway.
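To make the binning idea above concrete, a tiny numpy-only illustration (the samples are fabricated and deliberately non-normal):

```python
import numpy as np

# Slice arbitrary samples into bins and estimate per-bin probabilities,
# with no assumption about the distribution's shape
rng = np.random.default_rng(2)
samples = np.exp(rng.normal(size=10_000))  # long-tailed, clearly non-normal

counts, edges = np.histogram(samples, bins=30)
probs = counts / counts.sum()  # empirical probability of landing in each bin
```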

1 Like

Understood. Thank you for all your support!

1 Like

I have tried using prior_from_idata, but there is one point I don’t understand.
What is the difference between trace1.prior.trace_prior_ and trace1.prior.a etc?

Hi @Yamaguchi,

this example was actually a question (unanswered); I do not know whether it works or not. The point is that, apparently, you have the choice between:

  • an empirical approximation of the distribution in the 1-D case (using Interpolated), but with more than one dimension you lose the correlations (so this is generally not acceptable);
  • a normal approximation of the n-D distribution, which is also not acceptable in general (e.g. if one of your variables has a long tail…)

So this is pretty unsatisfying. @jessegrabowski pointed me to this histogram_approximation that looks promising, but we’d need someone with knowledge of the methods to comment on whether it would work or not (or just try and test it). Maybe @ricardoV94 ?

1 Like

@ferrine implemented prior_from_idata, perhaps he could help

1 Like

These approaches have proved quite useful for a federated use case I’m working on (where restrictions are very high, so no data sharing and no port openings are allowed, impeding anything like federated learning or sharing likelihoods [i.e. reconstruction risk], etc.). @perrette’s points regarding correlations are quite important, though for some cases where we’re interested only in estimates based on one or two parameters which parametrise the sampling distribution (e.g. mu and sigma of y = pm.Normal('y', mu, sigma, observed=obs)), I imagine lacking correlations is reasonable as long as the model/question is simple. What would be the most detrimental problem for inference in such cases?

The prior_from_idata from experimental comes in quite handy in this regard as well, though I’m a bit confused about how to report it. Would this be appropriate notation? (Based on what is here: pymc_experimental.utils.prior — pymc_experimental 0.0.18 documentation):

π_{p}^{(s)} = θ_{p}^{(s-1)} + L_{p}^{(s-1)}B_{p}
α^{(s)}, β^{(s)}, σ^{(s)} = π_{p}^{(s)}
μ^{(s)} = α^{(s)} + β^{(s)}
y_{i}^{(s)} ~ N(μ^{(s)}, σ^{(s)})

Where π_{p}^{(s)} are the new priors, θ_{p}^{(s-1)} is the mean vector taken from the joint posterior (p = 1…3, for the parameters α^{(s-1)}, β^{(s-1)}, σ^{(s-1)} of the previously sampled model (s-1)), L_{p}^{(s-1)} is the Cholesky factor of the covariance matrix of the same joint posterior, and B_{p} is a base normal distribution with standard deviation equal to one and mean equal to a vector of zeros of the same size as the joint posterior mean.

A bit of a silly example, but it’s just to illustrate the question. Also, how can I cite PyMC experimental? Just cite the general PyMC reference? Many thanks.

Well, I guess my first step would be to implement the two different approaches:

1- Approximate priors individually by univariate normals (uncorrelated)
2- Approximate priors collectively by a multivariate normal (where correlations are computed from the previous model's trace)

and see if there are any differences. This may already be obvious without fitting: if your parameters look correlated, the second approach is probably closer to what you want.

These are of course subject to the posteriors looking reasonably normal. I guess one can also check whether the collection of sampled parameters looks like a multivariate normal, right? Or, if not, try something else like a Student-t.

I am wondering: is there anything inherently wrong with taking your parameters from your trace, fitting a couple of multivariate distributions to them, and taking the best one?

I’d cite the new PyMC paper. If you want to credit a specific feature in experimental, you could use the bibtex format suggested in the r2d2m2 prior example

1 Like

FYI I added additional thoughts to my original thread: How best to use a posterior sample as a prior for a future integration in another model? - #14 by perrette

Basically, I take inspiration from two articles on copulas (here and there), and suggest a) transforming the marginal posterior distributions from modelA into normal distributions, and b) modeling the normalized parameters as an MvNormal distribution (and transforming back) in modelB. In the case of a long-tailed LogNormal distribution this is simple (it involves shifting and taking the logarithm); in the general case it is more complicated but doable, provided a parameterized distribution can fit the marginal (and the copula example can be applied: transform to uniform, to normal, and back with the cdf and inverse cdf).
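A numpy-only sketch of steps (a) and (b) for a long-tailed marginal (all data fabricated):

```python
import numpy as np

# Hypothetical joint posterior: a long-tailed (log-normal) parameter `a`
# correlated with a normal parameter `b`
rng = np.random.default_rng(3)
log_a = rng.normal(0.0, 0.3, size=5000)
a = np.exp(log_a)                                   # long-tailed marginal
b = 2.0 * log_a + rng.normal(0.0, 0.1, size=5000)   # correlated normal marginal

# (a) transform the long-tailed marginal into a normal one
z = np.column_stack([np.log(a), b])

# (b) model the transformed parameters as multivariate normal
z_mean = z.mean(axis=0)
z_cov = np.cov(z, rowvar=False)

# In the new model one would sample zs ~ MvNormal(z_mean, z_cov)
# and transform back, e.g. a = pm.math.exp(zs[0])
```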

I think the question is now “solved” for the common cases of Normal and LogNormal distributions. The more general approach with any kind of parameterizable distribution is one step away, via the intermediate step of transforming to a uniform distribution. I don’t need it for my work, so I leave it to others to implement.

@perrette that would be a great addition to pymc-experimental. I understand if you don’t have the time to do it yourself, but I would suggest opening an issue with the idea and links to your experimentations. Perhaps someone else will pick it up one day.

Hi @ricardoV94 ,
OK, I can do that, if you think it could be useful there. Will do ASAP, probably in a few days.

Thank you! This works great for one of my use cases.
However, I used it in a bit of an unconventional way. I noticed that the way you used the MvNormal is not too different from the prior_from_idata implementation, so it may be possible to use it this way:

    # fit a normal to each marginal to recover its mean and SD
    from scipy.stats import norm
    from pymc_experimental.utils.prior import prior_from_idata

    mu_m, mu_s = norm.fit(az.extract(idatas[i].posterior)['mu'].values)
    sigma_m, sigma_s = norm.fit(az.extract(idatas[i].posterior)['sigma'].values)
    prior = prior_from_idata(idatas[i], var_names=['mu', 'sigma'])
    mu = prior['mu'] * mu_s + mu_m
    sigma = prior['sigma'] * sigma_s + sigma_m

As prior_from_idata applies the MvNormal approximation to the whole joint posterior, we can then extract each marginal from it (i.e. prior['mu'], prior['sigma']). Then the mean and SD obtained as in your approach can be applied directly to the prior_from_idata output. Please correct me if I got this wrong. Maybe I got it working just by chance.

Hi @Simon, I guess you can do that, sure. Just do not forget to transform the trace before passing it to prior_from_idata. Note it is only necessary (and useful at all in the first place) if your data has long tails; otherwise you can just use prior_from_idata without any modification.

x = idatas[i].posterior['mu'].values  # assuming the data is log distributed
x[:] = (np.log(x - mu_loc) - mu_m)/mu_s

For the rest, it looks fine to me (provided prior is then transformed back to the log-normal distribution).

If your data is already normally distributed though (as seems to be the case in the example you shared), you won’t gain any advantage from doing that, as I think prior_from_idata already samples in a way that is normalized (related to the next paragraph). The function I wrote only adds value for posteriors with long tails. And it paves the way to extending the approach to any distribution, provided the marginal can be appropriately transformed (and, more critically, appropriately transformed back with PyMC tensors!).

Something else: one thing I see in prior_from_idata is that it samples from a Normal distribution and dot-multiplies the samples with the Cholesky factor, which might be better behaved than using MvNormal. That’s also what I am currently testing, as I am having convergence issues (the issues I am experiencing may be unrelated, but I read elsewhere that sampling from IID variables can help to avoid “funnels”). I am talking about replacing the line:

mv = pm.MvNormal(label+'_mv', mu=np.zeros(len(names)), chol=chol, dims=dim)

with:

coparams = pm.Normal(label+'_iid', dims=dim)
mv = chol @ coparams

(I’m still testing it)
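For what it’s worth, the equivalence can be sanity-checked with plain numpy: IID standard normals pushed through the Cholesky factor reproduce the target covariance (toy numbers below):

```python
import numpy as np

rng = np.random.default_rng(4)
cov = np.array([[1.0, 0.6], [0.6, 1.0]])
chol = np.linalg.cholesky(cov)  # lower-triangular factor

# IID standard normals, one row per draw
iid = rng.normal(size=(100_000, 2))
correlated = iid @ chol.T  # row-wise equivalent of chol @ coparams

emp_cov = np.cov(correlated, rowvar=False)  # should be close to cov
```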

PS: sorry for the many edits, to whoever read the reply immediately.

1 Like

I started an issue on pymc-experimental.

1 Like