Updating multivariate priors

Dear all,

There’s a very helpful notebook on Updating Priors in which the prior distributions of three variables are updated as new data becomes available. Importantly, these prior distributions are assumed to be independent. In practice, however, priors are often correlated.

How would one handle such a case? From what I’ve seen, the “Interpolated” distribution is only applicable to univariate problems, and I have not seen any examples of multivariate distributions being updated as in the above notebook.
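For reference, the univariate update in that notebook is built around a helper roughly like the following (a sketch from memory, with imports added; it is called inside a model context):

import numpy as np
import pymc3 as pm
from scipy import stats

def from_posterior(param, samples):
    # Approximate the posterior samples of one scalar parameter with a KDE
    # and wrap it as an Interpolated distribution to serve as the new prior.
    smin, smax = np.min(samples), np.max(samples)
    width = smax - smin
    x = np.linspace(smin, smax, 100)
    y = stats.gaussian_kde(samples)(x)
    # Extend the support a little so values just outside the old sample range
    # keep a small but non-zero prior probability.
    x = np.concatenate([[x[0] - 3 * width], x, [x[-1] + 3 * width]])
    y = np.concatenate([[0], y, [0]])
    return pm.Interpolated(param, x, y)

Each parameter gets its own Interpolated prior, so any posterior correlation between the parameters is discarded, which is exactly the limitation I would like to overcome.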

Any thoughts or inputs on this would be highly appreciated.

PS. I’m not attaching any concrete model as I believe the aforementioned notebook already contains a perfect MWE. All I’d like to change there is to consider correlation between e.g. alpha and beta0. Thanks a lot in advance.

The last time a similar question came up, there was no good solution: Updating Priors with Posterior for Vector-Valued Distributions, Updating priors for Gaussian Mixture Model

I think one thing worth exploring is to use some kind of approximation (an MvNormal, for example). It would be an extension of: Performance speedup for updating posterior with new data and the code here: https://github.com/junpenglao/Planet_Sakaar_Data_Science/blob/master/PyMC3QnA/Update%20prior%20with%20Interpolation.ipynb (see last cell)
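Something along these lines, for example (just a sketch; alpha and beta0 are placeholders for whatever correlated parameters you have, and trace is the trace of the original model):

import numpy as np
import pymc3 as pm

# Stack the posterior samples of the correlated parameters column-wise.
samples = np.column_stack([trace['alpha'], trace['beta0']])
post_mu = samples.mean(axis=0)
post_cov = np.cov(samples, rowvar=False)

with pm.Model() as updated_model:
    # A joint MvNormal prior keeps the posterior correlation between the parameters.
    theta = pm.MvNormal('theta', mu=post_mu, cov=post_cov, shape=2)
    alpha, beta0 = theta[0], theta[1]
    # ... likelihood on the new batch of data goes here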

Would this be different from full-rank ADVI? (How easy is it to pull the parameters out into a new pm.MvNormal?)

In a way - but it would also involve expressing your priors as an MvNormal.
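Roughly like this, for example (a sketch; I am assuming the Approximation returned by pm.fit exposes mean and cov, and note that both live in the unconstrained, transformed parameter space):

import pymc3 as pm

with model:
    approx = pm.fit(method='fullrank_advi')

# Mean vector and covariance matrix of the full-rank Gaussian approximation
# (in the model's transformed/unconstrained space).
mu_q = approx.mean.eval()
cov_q = approx.cov.eval()

with pm.Model() as next_model:
    theta = pm.MvNormal('theta', mu=mu_q, cov=cov_q, shape=len(mu_q))
    # ... slice theta back into the individual parameters and add the likelihood

For constrained parameters you would additionally have to undo the automatic transforms (e.g. log for positive parameters) when mapping theta back.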

First of all, many thanks for the prompt replies, I really appreciate it. It makes sense to me to use an approximation. Ultimately, I’ll be interested in non-negativity-constrained parameters, and thus in something like a Dirichlet-distribution approximation (as I read that the Wishart distribution is not recommended within PyMC).

As a first step, I have tried to implement a receding-horizon estimator based on a multivariate normal distribution, as suggested by @junpenglao. Based on other examples I’ve found across the web, I think I got the overall structure right, but I’m not sure exactly how to update the parameters that characterize the distribution of the covariance matrix.

Below is an MWE in which two lines are tagged with a FIXME note (the same lines appear again a bit further down). Any help with correctly specifying these update statements would be highly appreciated.

import numpy as np
import pymc3 as pm
import theano
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

μ_actual = np.array([1, -2])
Σ_actual = np.array([[0.5, -0.3],
                     [-0.3, 1.]])

def getNewObservations(N=100):
    return np.random.multivariate_normal(μ_actual, Σ_actual, size=N)

def printResults(trace):
    #mean
    print("deviation in μ: ", np.mean(trace["μ"],0) / μ_actual - 1)

    #covariance
    L = np.mean(trace["L"],0)
    print('deviation in Σ: ', L.dot(L.T) / Σ_actual - 1)

initialObservations = getNewObservations(1000)

with pm.Model() as model:
    packed_L = pm.LKJCholeskyCov('packed_L', n=2, eta=2., sd_dist=pm.HalfCauchy.dist(2.5))
    L = pm.Deterministic("L", pm.expand_packed_triangular(2, packed_L))

    μ = pm.Normal('μ', mu=0., sd=10., shape=2, testval=initialObservations.mean(axis=0))
    obs = pm.MvNormal('obs', μ, chol=L, observed=initialObservations)

    stepNUTS = pm.NUTS(vars=[packed_L, μ])
    trace = pm.sample(cores=1, step=[stepNUTS])

printResults(trace) #DEBUG

#### rolling-horizon model
traces = [trace]

#FIXME
eta_cov = theano.shared(???, name="eta_cov")
#FIXME
sd_dist_cov_beta = theano.shared(???, name="sd_dist_cov_beta")

mu_mean = theano.shared(trace['μ'].mean(axis=0), name="mu_mean")
sd_mean = theano.shared(trace['μ'].std(axis=0), name="sd_mean")

new_observations = getNewObservations()
observations_shared = theano.shared(new_observations)

#iteration 0
with pm.Model() as update_model:
    packed_L = pm.LKJCholeskyCov('packed_L', n=2, eta=eta_cov, sd_dist=pm.HalfCauchy.dist(sd_dist_cov_beta))
    L = pm.Deterministic("L", pm.expand_packed_triangular(2, packed_L))
    μ = pm.Normal('μ', mu=mu_mean, sd=sd_mean, shape=2, testval=observations_shared.mean(axis=0))

    obs = pm.MvNormal('obs', μ, chol=L, observed=observations_shared)

    stepNUTS = pm.NUTS(vars=[packed_L, μ])
    trace = pm.sample(cores=1, step=[stepNUTS])

    traces.append(trace)

    printResults(trace) #DEBUG

#iterations 1,2,...
for _ in range(1,3):
    #retrieve new observations
    observations_shared.set_value(getNewObservations())

    #FIXME
    eta_cov.set_value(?)
    sd_dist_cov_beta.set_value(?)

    mu_mean.set_value(trace['μ'].mean(axis=0))
    sd_mean.set_value(trace['μ'].std(axis=0))

    with update_model:
        stepNUTS = pm.NUTS(vars=[packed_L, μ])
        trace = pm.sample(cores=1, step=[stepNUTS])
        traces.append(trace)

    printResults(trace) #DEBUG

If there’s anything else wrong, I’d equally appreciate any comments. Many thanks in advance.

Hi flurinus-

I don’t think the parameters are fully capturing what you want. In particular, updating the eta parameter of the LKJ distribution specifies the concentration of the correlations – but not the form of the matrix. That is, in pymc3’s implementation, each entry of L is treated as exchangeable by the LKJ prior (there are other parametrizations which take in both eta and a covariance matrix). After observing some data, it is almost surely the case that the entries of L are not exchangeable, in which case you would want to model their joint distribution more explicitly.

A simple way would be a multivariate approximation to L itself:

L_μ = trace['packed_L'].mean(axis=0)
L_Σ = np.cov(trace['packed_L'], rowvar=False) + 1e-3 * np.diag(np.ones(L_μ.shape[0])) # a little regularization (rowvar=False: rows are posterior samples)
L_cholL = np.linalg.cholesky(L_Σ)
with pm.Model() as update_model:
    packed_L = pm.MvNormal('packed_L', mu=L_μ, chol=L_cholL)
    L = pm.Deterministic("L", pm.expand_packed_triangular(2, packed_L))
    ...

Dear Chris,

Many thanks for your prompt reply. Your suggestion to directly approximate L makes a lot of sense to me. I have tried to implement it in my MWE as follows:

import numpy as np
import pymc3 as pm
import theano
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

μ_actual = np.array([1, -2])
Σ_actual = np.array([[0.5, -0.3],
                     [-0.3, 1.]])


def getNewObservations(N=100):
    return np.random.multivariate_normal(μ_actual, Σ_actual, size=N)


def printResults(trace):
    # mean
    print("deviation in μ: ", np.mean(trace["μ"], 0) / μ_actual - 1)

    # covariance
    L = np.mean(trace["L"], 0)
    print('deviation in Σ: ', L.dot(L.T) / Σ_actual - 1)


initialObservations = getNewObservations(1000)

with pm.Model() as model:
    packed_L = pm.LKJCholeskyCov('packed_L', n=2, eta=2., sd_dist=pm.HalfCauchy.dist(2.5))
    L = pm.Deterministic("L", pm.expand_packed_triangular(2, packed_L))

    μ = pm.Normal('μ', mu=0., sd=10., shape=2, testval=initialObservations.mean(axis=0))
    obs = pm.MvNormal('obs', μ, chol=L, observed=initialObservations)

    stepNUTS = pm.NUTS(vars=[packed_L, μ])
    trace = pm.sample(cores=1, step=[stepNUTS])

# printResults(trace) #DEBUG


#### rolling-horizon model

traces = [trace]

mu_mean = theano.shared(trace['μ'].mean(axis=0), name="mu_mean")
mu_std = theano.shared(trace['μ'].std(axis=0), name="mu_std")

packed_L_mean = theano.shared(trace['packed_L'].mean(axis=0), name="packed_L_mean")

packed_L_cov = theano.shared(np.cov(trace['packed_L'], rowvar=False) + 1e-3 * np.diag(np.ones(packed_L_mean.get_value().shape[0])), name="packed_L_cov") # a little regularization
packed_L_cholL = theano.shared(np.linalg.cholesky(packed_L_cov.get_value()), name="packed_L_cholL")

new_observations = getNewObservations()
observations_shared = theano.shared(new_observations)

#iteration 0
with pm.Model() as update_model:
    packed_L = pm.MvNormal('packed_L', mu=packed_L_mean, chol=packed_L_cholL)
    L = pm.Deterministic("L", pm.expand_packed_triangular(2, packed_L))

    μ = pm.Normal('μ', mu=mu_mean, sd=mu_std, shape=2, testval=observations_shared.mean(axis=0))
    obs = pm.MvNormal('obs', μ, chol=L, observed=observations_shared)

    stepNUTS = pm.NUTS(vars=[packed_L, μ])
    trace = pm.sample(cores=1, step=[stepNUTS])

    traces.append(trace)

    printResults(trace) #DEBUG

#iteration 1+
for _ in range(1,3):
    #retrieve new observations

    mu_mean.set_value(trace['μ'].mean(axis=0))
    mu_std.set_value(trace['μ'].std(axis=0))

    packed_L_mean.set_value(trace['packed_L'].mean(axis=0))
    packed_L_cov.set_value(np.cov(trace['packed_L'], rowvar=False) + 1e-3 * np.diag(np.ones(packed_L_mean.get_value().shape[0])))
    packed_L_cholL.set_value(np.linalg.cholesky(packed_L_cov.get_value()))

    observations_shared.set_value(getNewObservations())

    with update_model:
        stepNUTS = pm.NUTS(vars=[packed_L, μ])
        trace = pm.sample(cores=1, step=[stepNUTS])
        traces.append(trace)

    printResults(trace) #DEBUG

The first line inside the update_model context,

packed_L = pm.MvNormal('packed_L', mu=packed_L_mean, chol=packed_L_cholL)

yields the following error:

TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

Do you have any idea where this error comes from, and how I could solve it? I am still a bit confused about when to work with L and when with packed_L, but I don’t think that is the issue here.

Many thanks in advance!

I don’t know about that specific error, but your sampling is going to get killed without a non-centered parametrization, and I think switching to one will likely fix your problem too. I have some data:

with pm.Model() as model_init:
    lvec = pm.LKJCholeskyCov('lvec', n=d, eta=1, sd_dist=pm.HalfCauchy.dist(10.))
    m = pm.Cauchy('mu', 0., d, shape=(d,))
    Yhat = pm.MvNormal('Yhat', m, chol=pm.expand_packed_triangular(d, lvec), 
                       observed=Ycopy[:,:25].T)
    init = pm.sample(250, cores=1, chains=3, tune=500)


pm.traceplot(init);

I want to approximate posteriors as (block) multivariate normal, use these as new priors, and update based on new samples:

Lvec_marginal = (np.mean(init['lvec'], axis=0), np.cov(init['lvec'].T))
m_marginal = (np.mean(init['mu'], axis=0), np.cov(init['mu'].T))
chol_m = np.linalg.cholesky(m_marginal[1] + np.diag(np.ones(4))*1e-3)
Lvec_chol = np.linalg.cholesky(Lvec_marginal[1] + np.diag(np.ones(10))*1e-4)
with pm.Model() as update:
    # non-centered parametrization
    mean_err = pm.Normal('mean_err', 0, 1, shape=4)
    m = pm.Deterministic('mu', m_marginal[0] + tt.dot(chol_m, mean_err))
    lvec_err = pm.Normal('lvec_err', 0, 1, shape=10)
    lvec = pm.Deterministic('lvec', Lvec_marginal[0] + tt.dot(Lvec_chol, lvec_err))
    Yhat = pm.MvNormal('Yhat', m, chol=pm.expand_packed_triangular(d, lvec), 
                       observed=Ycopy[:,50:75].T)
    updated = pm.sample(250, tune=550, chains=4, cores=1)
    
pm.traceplot(updated)

Non-centered:

100%|██████████| 800/800 [00:04<00:00, 193.50it/s]
100%|██████████| 800/800 [00:03<00:00, 232.90it/s]
100%|██████████| 800/800 [00:02<00:00, 317.78it/s]
100%|██████████| 800/800 [00:02<00:00, 338.01it/s]

Centered (MvNormal) [note: this even has a “Bad initial energy” problem]:

100%|██████████| 800/800 [00:26<00:00, 36.55it/s]
100%|██████████| 800/800 [00:29<00:00, 27.56it/s]
100%|██████████| 800/800 [00:28<00:00, 21.51it/s]
100%|██████████| 800/800 [00:31<00:00, 24.25it/s]

so a 10x speedup even on a toy example.

Many thanks once more for your helpful reply :slight_smile:. I’ve managed to get the MWE up and running (see below).

Two things are still unclear to me:

  • You write that you approximate posteriors as “block multivariate normal.” What do you mean by “block”? That only variables in the same block are correlated? (But then I only see a single block.)
  • What is the rationale behind setting the regularization for chol_m and Lvec_chol to 1e-3 and 1e-4, respectively?

Thanks a lot! :blush:

    import numpy as np
    import pymc3 as pm
    import theano
    import theano.tensor as tt
    import matplotlib as mpl
    import matplotlib.pyplot as plt
    import warnings
    from scipy import stats
    warnings.simplefilter(action='ignore', category=FutureWarning)

    dimension = 2
    numIterations = 8

    packedMatrixDimension = {2:3}

    np.random.seed(30)

    μ_actual = np.random.normal(0, 1, size=(dimension,))
    randMatrix = np.random.normal(0, 1, size=(dimension, dimension))
    Σ_actual = (randMatrix + randMatrix.T) / 2 + dimension * np.diag(np.ones(dimension)) # symmetric, and positive definite thanks to the added diagonal

    def getNewObservations(N=1000):
        return np.random.multivariate_normal(μ_actual, Σ_actual, size=N)

    #initial model
    with pm.Model() as init_model:
        L_packed = pm.LKJCholeskyCov('L_packed', n=dimension, eta=2, sd_dist=pm.HalfCauchy.dist(2.5))
        mu = pm.Cauchy('mu', 0., dimension, shape=(dimension,))
        obs = pm.MvNormal('obs', mu, chol=pm.expand_packed_triangular(dimension, L_packed),
                          observed=getNewObservations())
        trace = pm.sample(cores=1)

    traces = [trace]

    mu_marginal_mean = theano.shared(np.mean(trace['mu'], axis=0), name="mu_marginal_mean")
    mu_cholesky = theano.shared(np.linalg.cholesky(np.cov(trace['mu'].T) + np.diag(np.ones(dimension)) * 1e-3), name="mu_cholesky")

    L_packed_marginal_mean = theano.shared(np.mean(trace['L_packed'], axis=0), name="L_packed_marginal_mean")
    L_packed_cholesky = theano.shared(np.linalg.cholesky(np.cov(trace['L_packed'].T) + np.diag(np.ones(packedMatrixDimension[dimension])) * 1e-4), name="L_packed_cholesky")

    observations = theano.shared(getNewObservations(), name="observations")

    with pm.Model() as update_model:
        # non-centered parametrization
        mean_error = pm.Normal('mean_error', 0, 1, shape=dimension)
        mu = pm.Deterministic('mu', mu_marginal_mean + tt.dot(mu_cholesky, mean_error))

        L_packed_error = pm.Normal('L_packed_error', 0, 1, shape=packedMatrixDimension[dimension])
        L_packed = pm.Deterministic('L_packed', L_packed_marginal_mean + tt.dot(L_packed_cholesky, L_packed_error))

        obs = pm.MvNormal('obs', mu, chol=pm.expand_packed_triangular(dimension, L_packed), observed=observations)

    #iteration 1
    with update_model:
        trace = pm.sample(cores=1)
        traces.append(trace)


    #iterations 2+
    for _ in range(2, numIterations):

        #update model parameters based on previous trace
        mu_marginal_mean.set_value(np.mean(trace['mu'], axis=0))
        mu_cholesky.set_value(np.linalg.cholesky(np.cov(trace['mu'].T) + np.diag(np.ones(dimension)) * 1e-3))

        L_packed_marginal_mean.set_value(np.mean(trace['L_packed'], axis=0))
        L_packed_cholesky.set_value(np.linalg.cholesky(np.cov(trace['L_packed'].T) + np.diag(np.ones(packedMatrixDimension[dimension])) * 1e-4))

        #update observations
        observations.set_value(getNewObservations())

        with update_model:
            trace = pm.sample(cores=1)
            traces.append(trace)

    cmap = mpl.cm.autumn

    params = ['mu', 'L_packed']

    plt.figure(figsize=(20,10))

    for paramId, param in enumerate(params):
        plt.subplot(len(params), 1, paramId+1)

        for update_i, trace in enumerate(traces):
            samples = trace[param]

            for scalar in zip(*samples):
                smin, smax = np.min(scalar), np.max(scalar)
                x = np.linspace(smin, smax, 100)
                y = stats.gaussian_kde(scalar)(x)
                plt.plot(x, y, color=cmap(1 - update_i / len(traces)))

        if param == "mu":
            for mu_scalar in μ_actual:
                plt.axvline(mu_scalar, c="k")

        elif param == "L_packed":
            for L_scalar in np.linalg.cholesky(Σ_actual)[np.tril_indices(dimension)]:
                plt.axvline(L_scalar, c="k")

        plt.ylabel('Frequency')
        plt.title(param)

    plt.show()

[Figure 1: kernel density estimates of the mu and L_packed posteriors across updates, with the true values marked by vertical lines]

There are two blocks: mu and Lvec. The full multivariate approximation would allow for correlations between \mu_i and L_{jk}, and there may be useful relationships to model there (full-rank ADVI is precisely this). For simplicity I broke these parameters into blocks.
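If you wanted the full joint approximation without going through ADVI, you could stack the posterior samples of both blocks and approximate them with a single non-centered MvNormal, roughly like this (a sketch reusing the names from my code above, with d = 4; Y_new stands for a new batch of observations):

import numpy as np
import pymc3 as pm
import theano.tensor as tt

# Joint posterior samples: the 4 entries of mu followed by the 10 entries of lvec.
joint_samples = np.concatenate([init['mu'], init['lvec']], axis=1)
joint_mean = joint_samples.mean(axis=0)
joint_chol = np.linalg.cholesky(np.cov(joint_samples.T)
                                + 1e-4 * np.eye(joint_samples.shape[1]))

with pm.Model() as update_joint:
    # One non-centered block for all 14 parameters, split afterwards.
    err = pm.Normal('err', 0., 1., shape=joint_samples.shape[1])
    theta = joint_mean + tt.dot(joint_chol, err)
    m = pm.Deterministic('mu', theta[:4])
    lvec = pm.Deterministic('lvec', theta[4:])
    Yhat = pm.MvNormal('Yhat', m, chol=pm.expand_packed_triangular(4, lvec),
                       observed=Y_new)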

The overall rationale is just to guard against the slight chance of numerical issues in the decomposition (for instance, if the number of posterior samples is fewer than the number of variables). The specific values are entirely arbitrary, but small with respect to the diagonal of cov(init['mu']) (etc.).
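For example, rather than hard-coding the value, you could tie the jitter to the scale of the empirical covariance (a sketch):

import numpy as np

emp_cov = np.cov(init['mu'].T)
# A few orders of magnitude below the typical diagonal entry is plenty
# to keep the Cholesky factorization numerically stable.
jitter = 1e-3 * np.mean(np.diag(emp_cov))
chol_m = np.linalg.cholesky(emp_cov + jitter * np.eye(emp_cov.shape[0]))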

Many thanks once more for the clarifying reply! :blush: