How best to use a posterior sample as a prior for a future integration in another model?

I am modelling sea-level rise, which is made of various components. I have a model for glaciers contributions, and a model for thermal expansion of the water. I can tune these models separately, independently from each other, because there are observations of glacier retreat, and observations of ocean warming. Using MCMC, I have posterior distributions of the model parameters, and say, of 20th century contribution for both these sources.

with pm.Model() as glaciers_model:
    glacier_param1 = pm.Normal(...)
    glacier_param2 = pm.Normal(...)
    glaciers_slr = pymc3_wrapper_for_glaciers(glacier_param1, glacier_param2, ...)
    pm.Normal("glaciers_obs", mu=glaciers_slr, sigma=glaciers_std, observed=glaciers_obs)
    trace_glaciers = pm.sample(...)
with pm.Model() as thermal_model:
    thermal_param1 = pm.Normal(...)
    thermal_param2 = pm.Normal(...)
    thermal_slr = pymc3_wrapper_for_thermal(thermal_param1, thermal_param2, ...)
    pm.Normal("thermal_obs", mu=thermal_slr, sigma=thermal_std, observed=thermal_obs)
    trace_thermal = pm.sample(...)

Now I want to also use actual sea-level observations, which I model as being the sum of my two components: glaciers + thermal expansion. How best can I do that, in an efficient manner (model integration time is not negligeable)?

Of course, the simplest way would be to have one big model with glaciers and thermal and sum, and use observations for all three. But the integration time appears to be quite long and in one attempt I got a warning (effective ensemble size low for some parameters). That’s because in fact I have more components (four), with a total of 10 parameters for the full model, and 15 observations.

with pm.Model() as (waste?)full_model:
    glacier_param1 = pm.Normal(...)
    glacier_param2 = pm.Normal(...)
    thermal_param1 = pm.Normal(...)
    thermal_param2 = pm.Normal(...)
    glaciers_slr = pymc3_wrapper_for_glaciers(glacier_param1, glacier_param2, ...)
    pm.Normal("glaciers_obs", mu=glaciers_slr, sigma=glaciers_std, observed=glaciers_obs)
    thermal_slr = pymc3_wrapper_for_thermal(thermal_param1, thermal_param2, ...)
    pm.Normal("thermal_obs", mu=thermal_slr, sigma=thermal_std, observed=thermal_obs)
    total_slr = glaciers_slr + thermal_slr
    pm.Normal("slr_obs", mu=total_slr, sigma=slr_std, observed=slr_obs)
    trace_full = pm.sample(...)

For efficiency reasons, it would be better if I could re-use the posterior samples from the glaciers model, and from the thermal expansion model, and use these samples to apply the sum constraint. A classical bayesian update, so to say. But across models.

One approach I can think of, is to first tune my sub-models separately, and in a second step, have a “sum” model which is simply a categorical sampling for the ensemble members indices. In the glaciers + thermal cases, I’d have then:

with pm.Model() as total_resampling:
  glaciers_i = pm.Categorical(f"glaciers_index", np.arange(sample_size))
  thermal_i = pm.Categorical(f"thermal_index", np.arange(sample_size))
  
  slr = pymc3_wrapper_to_sample_posteriors_with_indices_and_sum_them(glaciers_i, thermal_i)
  pm.Deterministic("slr_total", slr)  # keep track for output
  pm.Normal("slr_obs", mu=slr, sigma=slr_obs_error, observed=slr_obs)
  pm.sample(...)

At the moment my wrapper function defines a theano Op to have free hands in indexing the posterior distributions of glaciers and thermal expansion parameters (1000 samples each). Possibly I could do without the wrapper, using thano.shared(np.asarray(trace_glaciers.posterior.glaciers_slr[0]))[glaciers_i].

I feel that should work OK, but that is not very elegant (some ensemble members for the sub-terms will appear several times). I guess there are other approaches out there that I cannot think of.

Later I will yet re-use thesse samples and update it with additional data at various locations, where new processes come into play…

I wonder if that is what people call a “hierarchical model”.

If I would work in pure numpy/scipy, I would fit a kde kernel to my posterior and resample from it. That would also work. Not sure if that can be done in pymc3.

Thanks for anyone coming across and considering the question!

Can you get the posterior mean mu and sigma (Normal) from your model 1 and then insert them as priors in model 2?

1 Like

Hmm. That is certainly the most straightforward answer to my question… in cases where the posterior distribution follows a normal law. I would prefer a more general solution, some models are non-linear, so there is no reason why the posterior should follow a normal distribution, despite normal prior and likelihood. Thanks anyway.

I see it is possible to use Kernel Density Estimation for the more general case:
https://docs.pymc.io/en/v3/pymc-examples/examples/pymc3_howto/updating_priors.html

I suppose this is it.

2 Likes

I have the same question. It’s surprising that there isn’t a more natural way to do this given how bayesian it is.

It’s probably because the NUTS sampler requires gradient calculation, and that can only be done from a parametric distribution, not a bunch of samples… Not convenient but when you think about it, what are the alternative, from the point of view of the solver? So to sum up, there are two possible ways:
a) put all models together in one big model (it can become computationally expansive)
OR
b) approximate the posterior in some way – the simplest way is a Normal distribution, the most flexible way is KDE – and feed that as prior in another model. I don’t have the links at hand but there are various threads around for the newer PyMC versions.

thanks for the quick response. You’re right. I was thinking more from an API standpoint. There could be a builtin method to learn a parametric distribution from a set of samples, perhaps there is.

I found this recommendation from @junpenlao

I want to look further into the comparative efficiency vs fidelity of these methods, but at first glance, the method with pm.Potential looks like the best option.

How does it relate to the posterior as prior question?

As far as I understand the potential is only an alternative way of defining the likelihood. The only way I could see a use here is if you could define an empirical, custom potential from your posterior samples… (you’d need to use an uninformative prior such as a broad enough Uniform distribution).
Say by interpolating from the empirical cumulative distribution.

By the way in the unidimensional case there is the Interpolated distribution that you can use directly as prior. But it does not work in multiple dimensions (Parameterize a multi-dimensional pymc.Interpolated distribution).

1 Like

You should be careful even using Interpolated distributions for uni-dimensional priors, because what you interpolate is the marginal posterior, discarding information about correlations in the joint posterior. See this blog post by @OriolAbril for an extended discussion.

There is a joint histogram prior in pymc-experimental that tries to solve this by building an approximate prior from the entire joint posterior.

3 Likes

Hi @jessegrabowski,
thanks for the links. Have I understood the histogram_approximation method right that you could do something like:

import pymc as pm
import pymc_experimental as pmx
import pytensor.tensor as pt
import arviz

with modelA:
    a = pm.Normal('a')
    b = pm.Normal('b')
    obs = pm.Normal('obs', a + b, obs_error, observed=obs_mean)
    traceA = pm.sample()

postA = arviz.extract(traceA.posterior)
postA_ab = np.stack([postA["a"], postA["b"]])

with modelB:
    a = pm.Uniform('a', -1000, 1000)  # ignorant priors
    b = pm.Uniform('b', -1000, 1000)  
    c = pt.concatenate([a, b])  # 2-dimensional random variable
    pot = pmx.distributions.histogram_approximation(
        "pot_model_A", c, observed=postA_ab.T) 
    
    new_obs = pm.Normal('new_obs', a + b, new_obs_error, observed=new_obs_mean)
    traceB = pm.sample()

And that would be equivalent to:

with modelAB:
    a = pm.Normal('a')
    b = pm.Normal('b')
    obs = pm.Normal('obs', a + b, obs_error, observed=obs_mean)
    new_obs = pm.Normal('new_obs', a + b, new_obs_error, observed=new_obs_mean)
    traceAB = pm.sample()

?

I haven’t experimented with it, but I shared the wrong link. I was relating Junpeng’s recommendation in this thread:

DensityDist could probably be an option but you still need to define the logp and possibly random to use it as prior (would be neater – and probably faster – than having to draw from a broad Uniform prior), and logp should be differentiable to work with NUTS. Here it is no different than using a Potential, as far as I can tell, except that the Potential cannot be used as prior (it cannot be sampled from, just enters in the likelihood calculation). Ultimately the question is and remain, how to define the logp in a differentiable way?

As we noted before, and as reminded by @jessegrabowski, Interpolated only works for univariate distributions. I assumed that a multivariate KDE distribution was provided but apparently it is not. I wonder how to use histogram_approximation (I’d be pleased but surprised if my example above worked). As far as I am concerned, my model was linear enough that a multivariate normal distribution was just fine to use as prior, so I did not experiment further.

PS: I actually just tried the example above (well I had to fixed a few typos), with the lastest 5.12.0 pymc version, and it does not work (shape mismatch). Looking at the latest doc, and the shapes in the error message, it is clear that it cannot work because pmx.distributions.histogram_approximation operates on the first dimension only → it returns univariate histograms for each obs. It does not account for correlations.

The issue is probably best described with copulas. However, the linked notebook does not seem to address the issue directly, but works to transform the data to a normal distribution first. It does not look like a general solution, but I may be mistaken.

PPS (EDITED to rule out Interpolated): the copula example (and the original post) does point to a rather general solution, provided one can find a pymc distribution that approximate the marginals well enough and has the icdf method implemented. Unfortunately, this rules out Interpolated since it does not implements icdf. So here is how it would go:

  1. After running modelA, identify marginal distributions that describe the posterior of modelA well – in the general case, that could be the Interpolated distribution, but it lacks the icdf method so you need to find another distribution such as LogNormal.
  2. transform the variables using the cdf method of that distriubtion, to obtain uniformly distributed variables
  3. transform again with the normal distribution’s inverse cdf method (ppf in scipy, icdf in pymc)
  4. fit a multivariate normal distribution to the result of 3 => the resulting mu and cov (or chol) will be your prior in modelB

Note the steps 1-4 need not be done in pymc.

And in modelB (needs to be done within a pymc model):
5. Define a MvNormal distribution with the mu and chol parameters obtained in 4
6. Transform the variables back to their original, custom shape by reverting 3. (use the normal distribution cdf method) and 4. (use the Interpolated distribution icdf method)

It feels a bit involved, and I wonder if 5) and 6) could cause convergence issues (assuming they are tracked in the pymc graph and NUTS has to calculated gradients on these) – the linked notebook above mentions convergence issues --, but it could be tried. In practical settings, it is probably best to apply the transformations to only these variables that have a long tail and are clearly not normally distributed, and leave the rest untransformed.

EDIT PPPS: in the case of a LogNormal distribution, it’s enough to just take the logarithm of that variable and use that in the sampling, without the intermediate steps of converting to a uniform distribution.

EDIT: use chol @ ... instead of MvNormal

FYI, here is an implementation that I tested and which seems to work. It is designed to deal with Normal and LogNormal distributions, which should already cover many practical cases.

import numpy as np
from scipy.stats import norm, lognorm
import pymc as pm
import arviz

def define_params_from_trace(trace, names, label, log_params=[]):
    """Define prior parameters from existing trace

    (must be called within a pymc.Model context)

    Parameters
    ----------
    trace: pymc.InferenceData from a previous sampling
    names: parameter names to be redefined in a correlated manner
    label: used to add a "{label}_params" dimension to the model, and name a new "{label}_iid" for an i.i.d Normal distribution used to generate the prior
    log_params: optional, list of parameter names (subset of `names`)
        this parameters are assumed to have a long tail and will be normalized with a log-normal assumption
    
    Returns
    -------
    list of named tensors defined in the function

    Aim
    ---
    Conserve both the marginal distribution and the covariance across parameters

    Method
    ------
    The specified parameters are extracted from the trace and are fitted 
    with a scipy.stats.norm or scipy.stats.lognorm distribution (as specified by `log_params`).
    They are then normalized to that their marginal posterior distribution follows ~N(0,1).
    Their joint posterior is approximated as MvNormal, and the parameters are subsequently 
    transformed back to their original mean and standard deviation 
    (and the exponent is taken in case of a log-normal parameter)
    """

    # normalize the parameters so that their marginal distribution follows N(0, 1)
    params = arviz.extract(trace.posterior[names])
    fits = []
    for name in names:
        # find distribution parameters and normalize the variable
        x = params[name].values
        if name in log_params:
            sigma, loc, exp_mu = lognorm.fit(x)
            mu = np.log(exp_mu)
            x[:] = (np.log(x - loc) - mu) / sigma
            fits.append((mu, sigma, loc))
        else:
            fitted = mu, sigma = norm.fit(x)
            x[:] = (x - mu) / sigma
            fits.append((mu, sigma))

    # add a model dimension 
    model = pm.modelcontext(None)
    dim = f'{label}_params'
    if dim not in model.coords:
        model.add_coord(dim, names)

    # Define a MvNormal
    # mu = params.mean(axis=0) # zero mean
    cov = np.cov([params[name] for name in names], rowvar=True)
    chol = np.linalg.cholesky(cov)
    # MvNormal may lead to convergence issues
    # mv = pm.MvNormal(label+'_mv', mu=np.zeros(len(names)), chol=chol, dims=dim)
    # Better to sample from an i.i.d R.V.
    coparams = pm.Normal(label+'_iid', dims=dim)
    # ... and introduce correlations by multiplication with the cholesky factor
    mv = chol @ coparams

    # Transform back to parameters with proper mu, scale and possibly log-normal
    named_tensors = []
    for i,(name, fitted) in enumerate(zip(names, fits)):
        if name in log_params:
            mu, sigma, loc = fitted
            tensor = pm.math.exp(mv[i] * sigma + mu) + loc
        else:
            mu, sigma = fitted
            tensor = mv[i] * sigma + mu
        named_tensors.append(pm.Deterministic(name, tensor))

    return named_tensors
2 Likes

And here the more general case with any distribution:

import numpy as np
import arviz
from scipy.interpolate import interp1d, InterpolatedUnivariateSpline
from scipy.stats import norm, lognorm
import pytensor.tensor as pt
import pymc as pm
from pymc.distributions.continuous import Interpolated, SplineWrapper

def compute_empirical_cdf(data):
    """Compute the empirical CDF of the given data"""
    sorted_data = np.sort(data)
    cdf = np.arange(.5, len(sorted_data)) / len(sorted_data)
    return sorted_data, cdf

def define_empirical_params_from_trace(trace, names, label):
    """Define prior parameters from existing trace

    (must be called from pymc.Model context)

    Parameters
    ----------
    trace: pymc.InferenceData from a previous sampling
    names: parameter names to be redefined in a correlated manner
    label: used to add a "{label}_params" dimension to the model, and name a new "{label}_mv" MvNormal distribution
    
    Returns
    -------
    list of named tensors defined in the function

    Aim
    ---
    Conserve both the marginal distribution and the covariance across parameters

    Method
    ------
    The specified parameters are extracted from the trace and resampled using normalization 
    of the marginal distribution to N(0,1) and a MvNormal distribution for the joint posterior.
    
    The parameters are subsequently transformed back:
    - Normal's cdf to transform from N(0,1) to U(0,1)
    - the inverse CDF of the interpolated distribution is applied from U(0, 1) to the original (empirical) distribution
    """

    # normalize the parameters so that their marginal distribution follows N(0, 1)
    params = arviz.extract(trace.posterior[names])

    fits = []

    for name in names:
        # find distribution parameters and normalize the variable
        x = params[name].values

        # First transform the marginal distribution to a standard normal
        sorted_data, sorted_cdf = compute_empirical_cdf(x)

        # Interpolate the inverse CDF
        cdf = interp1d(sorted_data, sorted_cdf, bounds_error=False, fill_value=(sorted_data[0], sorted_data[-1]))

        # Transform the data to a standard normal 
        x_uniform = cdf(x) # U(0,1) from the empirical distribution
        x_normal = norm.ppf(x_uniform) # N(0,1)

        # Replace the original value (in-place)
        x[:] = x_normal

        # Create the inverse CDF function with tensor values (thanks to SplineWrapper)
        icdf = SplineWrapper(InterpolatedUnivariateSpline(sorted_cdf, sorted_data, k=1, ext="const"))

        fits.append(icdf)

    # add a model dimension 
    model = pm.modelcontext(None)
    dim = f'{label}_coparams'
    if dim not in model.coords:
        model.add_coord(dim, names)

    # Define a MvNormal
    cov = np.cov([params[name] for name in names], rowvar=True)
    chol = np.linalg.cholesky(cov)
    coparams = pm.Normal(label+'_iid', dims=dim)
    mv = chol@coparams

    # Transform back to the empirical distribution
    named_tensors = []
    for i,(name, fitted) in enumerate(zip(names, fits)):
        icdf = fitted
        uniform_tensor = pm.math.exp(pm.logcdf(pm.Normal.dist(), mv[i]))  # Normal's cdf (IS THERE ANYTHING BETTER THAN THIS?)
        tensor = icdf(uniform_tensor)
        named_tensors.append(pm.Deterministic(name, tensor))

    return named_tensors

This seems to work reasonably, however in my test every new parameter I add seems to take a toll on the sampling speed, i.e. 3 min when a log-normal distribution is fitted to two parameters like in the function I shared previously versus 5 min with the new empirical distribution (and 6 min when fitted to three parameters). I wonder if that can be made more efficient. Also I think the empirical distribution could handle the bounds better, by adding 0 and 1 instead of currently 0.5/n and 1-0.5/n. I’d be glad to have feedback how to improve this function.

Below a comparison between the parameter distributions resampled from the function above, and the resulting time-series. The three parameters are a, aT0 and q and the time-series as follows: a*df['tas']-aT0 + q*df['tas2'], where df['tas'] and df['tas2'] are two numpy arrays (time-series). The blue color (scatter plot) or left-hand side (time-series) is the posterior of a first sampling, which introduces strong correlations between a and q especially, and the orange color (scatter plot) or right-hand side (time-series) is the prior sampling from a new simulation that uses the function above (before applying new constraints). (EDIT: a previous version indicated differences in the upper percentile, but that was due to the sample size of 4000 vs 500.)

So it looks pretty similar, even though details|500x500 might differ.

For now I consider this a problem nearly solved.

@ricardoV94 @Yamaguchi

3 Likes