Calculating conditional posterior predictive samples in high-dimensional observation spaces

Question: assuming that I have a rather high-dimensional posterior predictive sample, because my observational vector is a vector of k continuous detector observations where k can get quite big, is there an efficient way to calculate the conditional posterior predictive in a case where one of these detectors is not working, i.e. where one of the elements of the observational vector becomes the unknown?

Details: I model a network of k spatially distributed detectors. It assumes that the observations, which are in the reals, made by all detectors at a given time t can be described as a multivariate normal with mean vector mu and covariance matrix cov. The goal is to infer both the elements of mu, and of the scale vector and correlation matrix that together form cov. All of this works fine, I get convergence and the posterior predictive plot nicely matches my observations.

Following this, I was motivated to try and estimate the most likely signal value at a given detector should that detector not be working. My thought - based on this discussion (https://discourse.pymc.io/t/samples-from-conditional-posterior-distribution/9138) - was to keep sampling the posterior predictive and then calculate the conditional probability of one unknown detector at time t given that I know all but one detector signal at time t. However, it takes a lot of processing time to obtain the posterior predictive conditional on k-1 detector signals, which I use to predict the unobserved detector. Mainly because I need to sample a lot to obtain sufficient statistics in k dimensions, which was not a problem in the linked discussion above.

So my question boils down to: does it make sense to predict in this way or not? Or is there a better way to go about this within the context of PyMC? At heart, I am a phycisist and not a statistician; and I’ve also just started out in this field. So if this question speaks to some missing basic knowledge I would be happily pointed in the right direction:)

1 Like

It makes perfect sense to ask about missing data. Can we get some more details about your model? It kind of sounds like a kalman filter setup, which has a natural method for dealing with missing values/disabled sensors (use the process model to fill in the missing value). Is this a counter-factual exercise (i.e. the sensor is actually working for a given datapoint, and you would like to know what if it were not) or a missing data problem (you have datapoints for which there is missing sensor data and you want to fill them in)?

I guess the model is expensive to sample? Have you tried using an alternative sampler (nutpie or numpyro)? That might give you enough speed boost that you can do things the “hard” way.

Is this a counter-factual exercise (i.e. the sensor is actually working for a given datapoint, and you would like to know what if it were not) or a missing data problem (you have datapoints for which there is missing sensor data and you want to fill them in)?

It is a counter-factual exercise at present because I test on the same that that I trained on, but in the end the model might also be used in test situations where there are missing data.

Can we get some more details about your model?

There’s little else to the PyMC model itself. The likelihood is MvNormal(mu,chol, observed=X), where I define an exponential prior for mu, and a k x k LKJPrior(eta=1) on the covariance matrix. My observations X are an N x k matrix where N is the number of time steps. I use the default NUTS sampler via idata=pm.sample() to obtain the posterior and then compute the posterior predictive via pm.sample_posterior_predictive(idata). I then construct the prediction as follows.

I guess the model is expensive to sample?

The main problem in my current formulation is not in making the posterior predictive sample, which can be done very quickly. The problem is the following: say I want to predict a series of values for detector k, D[k], based on observations by detectors 1…k-1, obs[1]…obs[k-1]. The process of going through the entire posterior predictive sample afterwards to find the samples which allow to build up the distribution

f( D[k] | D[1]=obs[1]+/-delta,…,D[k-1]=obs[k-1]+/-delta)

becomes very painful for large k. In other words: the joint posterior predictive distribution becomes so big that a naive sampling approach becomes very slow as k becomes big, because the required sample size grows with the power of k and it takes forever to evaluate the statement above.

do things the “hard” way

So I guess, really, what I ultimately wonder is the following: is this simply a hard problem (where ‘hard’ relates to the dimensionality) or is the way that I construct my model ill-considered?

So are there any modeled connections between time steps or sensors? Are the sensor sequential? I am looking at this equation:

f(D_k | D_1) = y_1 + \delta

Where does the conditioning on D_1 come in? I think you should be building these connections between the sensors into the PyMC model, not computing them from samples afterwards. Is \delta \sim N(0, \sigma)?

I think you should be building these connections between the sensors into the PyMC model

The connections (in space) between sensors are captured by the elements of the covariance matrix, while consecutive time steps are assumed uncorrelated. So these connections, to my understanding, are already in the model? In short, I assume:

\vec{D}=[D_1,...,D_k]^T \sim \mathcal{N}_k(\vec{\mu},\mathbf{\Sigma})

Where does the conditioning on D_1 come in?

As per the above D_1, during training, is part of the observation vector so the way that it relates to D_2 through D_k is inferred during the initial sampling.

But is there any structure built into the covariance matrix based on the (relative?) locations of the sensors? If \Sigma is totally unstructured, as in \Sigma \sim LKJ(\eta), then each element is free to be whatever it needs to be, there’s not really any specifically spatial information being encoded. I was naively expecting some kind of covariance function that would explicitly encode spatial information into the prior.

Am I right to say the workflow is something like:

  1. initialize a D_0
  2. Model D_1, obtain samples
  3. Replace D_0 with samples (or sufficient statistics) from D_1
  4. Model D_2, obtain samples
  5. etc?

So there are N of these D vectors (each one is the rows of X)?

Sorry for so many questions, I’m just trying to wrap my head around what needs to be done to solve the actual question you asked. Sharing model code would help me a lot, if you’re able to do that.

No need to apologise, I’m very glad you’re invested :grinning: If anything, I should be the one to say sorry for being unclear.

Forget that I called it spatial covariance, by the way. That makes sense from the underlying physics but I agree that that has sort of been idealised out in this model.

This would be minimal example of my model (with synthetic observations for three detectors).

import numpy as np
import pymc as pm
import matplotlib.pyplot as plt

# 1. Set up synthetic observations

ndetectors = 3
nsamples = 1000

mu_true = np.array([50, 200, 170])
cov_true = np.array([[3,5,1],[5,10,3],[1, 3, 10]])

obs = np.random.multivariate_normal(mu_true, cov_true,size=nsamples)
obs_mean = mu_true

# 2. Construct PyMC model, sample posterior and construct posterior predictive

with pm.Model() as model:
    sd_dist = pm.Exponential.dist(1/0.1,  shape = ndetectors)
    chol, corr, sigmas = pm.LKJCholeskyCov('chol', eta=1, n=ndetectors, sd_dist=sd_dist,compute_corr=True)
    mu = pm.Exponential('mu',lam=1/obs_mean, shape=(ndetectors,))
    D = pm.MvNormal('D',mu=mu, chol=chol, observed=obs)
    idata = pm.sample(1000,cores=1,chains=4)

    pm.sample_posterior_predictive(idata, extend_inferencedata=True)

D = idata.posterior_predictive['D'].to_numpy()
D_list = np.reshape(D,[-1, ndetectors]) #this generates an array of size: all_samples_in_posterior_predictive x ndetectors

# 3. Predict signal on detector 0 based on new observations for detector 1 (d1=195) and detector 2 (d2=166)

delta = 0.1
d1 = 195
d2 = 166

mask_in_range_det1 = (D_list[:,1]> d1 - delta)*(D_list[:,1]< d1 + delta)
mask_in_range_det2 = (D_list[:,2]> d2 - delta)*(D_list[:,2]< d2 + delta)
det_0_pred = D_list[mask_in_range_det1*mask_in_range_det2][:,0]

# 4. Plot probability distribution for detector 0 conditional on d1=195 and d2=166
plt.figure()
plt.hist(det_0_pred,bins=10)

This is my workflow. You need to repeat this for each individual set of observations {d1,d2} that you want to analyse, and it works nicely if there are not too many detectors. For many detectors, calculating the Boolean masks becomes quite expensive (because you need to go through millions of samples to have enough statistics). And you need to repeat this for each individual set of observations.

And then I return to my question: If we accept for the moment that this Bayesian model correctly captures my system (i.e. consider steps 1 and 2 okay), does the way that I make predictions in step 3 make sense?

I’m reading this thread in diagonal, very fast, but isn’t Gaussian Processes the tool you want to use here? Getting the conditional probability of some normally-distributed missing data from normally-distributed data at other positions is pretty much what it does.

https://www.pymc.io/projects/docs/en/stable/api/gp/generated/classmethods/pymc.gp.Marginal.conditional.html#pymc.gp.Marginal.conditional

1 Like

I think GP would be the tool if my assumptions about spatial dependence were right, but they’re not :wink:

Thanks for sharing the model. The reason why I was asking so many questions about modeled dependencies between the detectors is because after the model is fit, new “observed” data doesn’t matter, unless you’re updating the estimated model parameters. If you’re just making a prediction, you need to have the information about the detectors somewhere in the model.

An illustration with linear regression. Suppose you have a univariate regression, y_i \sim N(\alpha + \beta \cdot x_i, \sigma). You provide y and X as data, and estimate posterior distributions for \alpha, \beta, \sigma. Now you can make predictions for unseen y_i using \hat y_i = \alpha + \beta \cdot x_i.

In your case, you have y_i \sim N(\mu, \Sigma), and no corresponding x_i. So for any unseen y_i, your estimate is the corresponding component of \mu. As an illustration, I use the pm.set_data API to do posterior prediction using different observed data values. Here is your model, adjusted to use labeled dimensions:

# 2. Construct PyMC model, sample posterior and construct posterior predictive
coords = {'detector':np.arange(ndetectors, dtype='int'),
          'detector_aux':np.arange(ndetectors, dtype='int')}

with pm.Model(coords=coords) as model:
    model.add_coord('time', np.arange(nsamples, dtype='int'), mutable=True)
    
    obs_data = pm.MutableData('obs_data', obs, dims=['time', 'detector'])
    sd_dist = pm.Exponential.dist(1/0.1)
    chol, corr, sigmas = pm.LKJCholeskyCov('chol', eta=1, n=ndetectors, sd_dist=sd_dist, compute_corr=True)
    
    # This will be useful later, maybe
    cov = pm.Deterministic('cov', chol @ chol.T, dims=['detector', 'detector_aux'])
    
    mu = pm.Exponential('mu',lam=1/obs_mean, dims=['detector'])
    D = pm.MvNormal('D',mu=mu, chol=chol, observed=obs_data, dims=['time', 'detector'])
    idata = pm.sample(idata_kwargs={"dims": {"chol_stds": ["detector"], "chol_corr": ["detector", "detector_aux"]}})

    pm.sample_posterior_predictive(idata, extend_inferencedata=True)

Now we can do posterior_predictive_sampling with any values of obs we desire. I set the value of all the detectors to one billion, for comedic effect:

with model:
    pm.set_data({'obs_data':np.array([[1e9, 1e9, 1e9]])}, coords={'time':[0]})
    missing_idata = pm.sample_posterior_predictive(idata)
    
fig, ax = plt.subplots(1, 3, figsize=(14,4), dpi=144)
for d in coords['detector']:
    missing_idata.posterior_predictive.D.stack(sample=['chain', 'draw']).sel(detector=d, time=0).plot.hist(bins=25, ax=fig.axes[d])

You can verify for yourself that you could plug anything into the data and you will get the same posterior predictive. This is because the posterior predictive is independent of the observed data, now that the model is fit.

I think all this might be missing the point, but I want to cover all my bases. What you might want to do is us the properties of the multivariate normal to predict the marginal distribution of the unobserved detector, conditioned on the observed. Specifically, if you have:

D = \begin{bmatrix} d_1 \\ d_2 \\ d_3 \end{bmatrix}

And suppose that d_1 is unobserved. So partition D into observed and unobserved parts: so D = \begin{bmatrix} d_u \\ d_o \end{bmatrix} (u for unobserved and o for observed). Similarly partition your estimated mu and sigma so that \mu = \begin{bmatrix} \mu_u \\ \mu_o \end{bmatrix} and \Sigma = \begin{bmatrix} \Sigma_{uu} & \Sigma_{uo} \\ \Sigma_{ou} & \Sigma_{oo} \end{bmatrix}, where each \Sigma_{xx} is a block of the estimated covariance matrix, it is known that:

d_u \sim N(\hat \mu_{u}, \hat \Sigma_{uu}), with

\begin{align} \hat \mu_{u} &= \mu_u + \beta (d_o - \mu_o)\\ \hat \Sigma_{uu} &= \Sigma_{uu} - \beta \Sigma_{oo} \beta^T \\ \beta &= \Sigma_{uo} \Sigma_{oo}^{-1} \end{align}

So all you have to do is use the elements of your posterior (not the posterior predictive!) to compute these quantities. Let me preface this next code block by saying that it’s probably not the most intuitive thing in the world if you’re not used to working with labeled dimension (which I’m not, it took me a bit of googling to figure it all out). The important thing is just that you implement those equations. For me, np.einsum would be a bit more intuitive, but I wanted to make an effort to use xarray.

import arviz as az
import xarray as xr

post = az.extract(idata, 'posterior')

# In the general case you need to sort these into observed and unobserved, 
# but I assume that the 0th element is unobserved and the others are observed.
Σ = post.cov
μ = post.mu

# Rename all the dims to make it (slightly) more obvious how the broadcasting works with xarray 
μ_u = μ.sel(detector=0).rename({'detector':'detector_unobs'})
μ_o = μ.sel(detector=[1,2]).rename({'detector':'detector_obs'})

Σ_uu = Σ.sel(detector=[0], detector_aux=[0]).rename({'detector':'detector_unobs', 'detector_aux':'detector_unobs_aux'})
Σ_ou = Σ.sel(detector=[1,2], detector_aux=[0]).rename({'detector':'detector_obs', 'detector_aux':'detector_unobs'})
Σ_uo = Σ.sel(detector=[0], detector_aux=[1,2]).rename({'detector':'detector_unobs', 'detector_aux':'detector_obs'})
Σ_oo = Σ.sel(detector=[1,2], detector_aux=[1,2]).rename({'detector':'detector_obs', 'detector_aux':'detector_obs_aux'})

# This is not at all obvious how to do, I followed this:
# https://github.com/pydata/xarray/discussions/7503
Σ_oo_inv = xr.apply_ufunc(np.linalg.inv,
                          Σ_oo, 
                        input_core_dims=[["detector_obs", "detector_obs_aux"]],
                        output_core_dims=[["detector_obs", "detector_obs_aux"]],
                        exclude_dims=set(("detector_obs", "detector_obs_aux")))

β = Σ_uo.dot(Σ_oo_inv, dims=['detector_obs']).rename({'detector_obs_aux':'detector_obs'})

new_obs = np.array([195, 166])[:, None]
μ_u_hat = μ_u + β.dot(new_obs - μ_o, dims=['detector_obs'])
Σ_u_hat = Σ_uu - β.dot(Σ_oo, dims=['detector_obs']).dot(β.rename({'detector_obs':'detector_obs_aux'}), dims=['detector_obs_aux'])


# Since the unobserved part is 1d, I use np.random.normal. In the general case, use np.random.multivariate_normal, and concatenate the results
d_u = np.array([np.random.normal(loc=mu, scale=sigma) for mu, sigma in zip(μ_u_hat.values.squeeze(), Σ_u_hat.values.squeeze())])
plt.hist(d_u, bins=100);

This gives you the following histogram:
image

Evidently, seeing the values 195 and 166 on detectors 1 and 2 cause the expected value of detector 0 to go down by about 3 units (compare this to the posterior predictive shown above). I would say that is expected, because the values 195 and 166 are both below the mean for detectors 1 and 2 (again, comapre to above), and looking at the (expected) covariance matrix, detector 0 has positive covariance with both 1 and 2. Thus, they should all go up or down together:

import seaborn as sns
sns.heatmap(post.cov.mean(dim=['sample']).to_dataframe().unstack('detector_aux')['cov'], annot=True)

image

Now, is all this necessary? By that I mean, “can’t PyMC just do this for us?”. Fair question, and I’m not sure. As @julien noted, this marginalization of unobserved components is exactly what Gaussian Processes do, but that’s more of a special case. I’m also not positive if this is conjugate updating of the posterior or something else. But I think it might be what you have in mind?

2 Likes

What you might want to do is us the properties of the multivariate normal to predict the marginal distribution of the unobserved detector, conditioned on the observed.

That scratches an itch. I think I saw that formula on Wikipedia at some point, but I hadn’t appreciated that it applies here.

But I think it might be what you have in mind?

Very much so!

Jesse, thank you very much for your time, valuable insights and clear coding example! I’ll mark this as the solution.

For future projects, I’ll also certainly entertain @julien 's notion of Gaussian Processes, which sounds - if not exactly what I needed now - certainly like a valuable framework to get acquainted with.

1 Like

OK I had lunch and the answer to “can PyMC do this for you” is an emphatic “YES”. Special thanks to this blog post by @ricardoV94 and @tcapretto , which shows how you can use pm.sample_posterior_predictive to extend and recycle parts of your model in new and interesting ways.

Here’s the much improved code, then I’ll go over it:

import pytensor.tensor as pt
# I wish we could just use np.nan, but pm.MutableData doesn't support it
# So pick a number here that will never be observed in reality
MISSING_VALUE = -1

with pm.Model(coords=coords) as marginal_model:
    obs_data = pm.MutableData('obs_data', np.array([MISSING_VALUE, 195, 166]), dims=['detector'])
    missing_idx = pt.nonzero(pt.eq(obs_data, MISSING_VALUE), return_matrix=False)[0]
    obs_idx = pt.nonzero(pt.neq(obs_data, MISSING_VALUE), return_matrix=False)[0]
    sorted_idx  = pt.concatenate([missing_idx, obs_idx])
    reverse_idx = pt.argsort(sorted_idx)
    
    n_missing = missing_idx.shape[0]
    n_obs = obs_idx.shape[0]
    
    data_sorted = obs_data[sorted_idx]
    
    # Declare model variables to be the same as the old model
    sd_dist = pm.Exponential.dist(1/0.1)
    chol, corr, sigmas = pm.LKJCholeskyCov('chol', eta=1, n=ndetectors, sd_dist=sd_dist, compute_corr=True)
    cov = pm.Deterministic('cov', chol @ chol.T, dims=['detector', 'detector_aux'])
    mu = pm.Exponential('mu',lam=1/obs_mean, dims=['detector'])
    
    # Do the marginalization math
    mu_sorted = mu[sorted_idx]
    cov_sorted = cov[sorted_idx, :][:, sorted_idx]
    
    mu_u = mu_sorted[:n_missing]
    mu_o = mu_sorted[n_missing:]
    
    cov_uu = cov_sorted[:n_missing, :][:, :n_missing]
    cov_uo = cov_sorted[:n_missing, :][:, n_missing:]
    cov_oo = cov_sorted[n_missing:, :][:, n_missing:]
    
    cov_oo_inv = pt.linalg.solve(cov_oo, pt.eye(n_obs))
    beta = cov_uo @ cov_oo_inv
    
    mu_missing_hat = mu_sorted[:n_missing] + beta @ (data_sorted[n_missing:] - mu[obs_idx])
    Sigma_missing_hat = cov_uu - beta @ cov_oo @ beta.T    
    
    pm.Deterministic('mu_missing_hat', mu_missing_hat)
    pm.Deterministic('Sigma_missing_hat', Sigma_missing_hat)
    
    # Make a random variable 
    marginal_missing = pm.MvNormal('marginal_missing', mu=mu_missing_hat, cov=Sigma_missing_hat)
    
    # Sample the new random variable, but note that we ***pass the OLD idata***!!!
    idata_marginal = pm.sample_posterior_predictive(idata, var_names=['marginal_missing'])

Basically, what we want to do is take the computational graph we had before (the one that estimated \mu and \Sigma in the first place) and extend it to include some new math. We are absolutely allowed to do this in PyMC, provided we follow a couple rules:

  1. Give the new model a new name: with pm.Model(coords=coords) as marginal_model
  2. Recycle the variable names of anything we want to save. Note that kept the names of the random variables cov, mu, etc. This is because when we forward sample the model, we want PyMC to use draws of the original model for these values. More on this later.
  3. Don’t call pm.sample, but instead call pm.sample_posterior_predictive, providing the OLD idata !

(3) is the key. Everything we add is going to just be deterministic computations of the random variables that are already living in idata. So, when PyMC does pm.sample_posterior_predictive, it will go look for the names of the variables living in idata, and replace random variables in our graph with those samples. Since we followed rule (2) and recycled the names, it will find and replace exactly what we want. All the quantities downstream of those nodes will be computed as we want, no need to juggle named dimensions. By the way, following rule (1) lets us recycle names in the first place (otherwise it will throw an error that the variable was already declared).

Why is this code so much better? Well, I think you agree it’s much easier to read. But we also don’t have to do any numpy sampling, it’s all automatic. Here’s the same plot again, as a one-liner:

idata_marginal.posterior_predictive.marginal_missing.plot.hist(bins=100);

image

But what’s even better is that we can now get conditional predictions for any combination of observed/missing we want, without making a new model, by using pm.set_data. Here’s an example:

with marginal_model:
    pm.set_data({'obs_data':[MISSING_VALUE, MISSING_VALUE, 500]})

    # Still (always!) passing the original idata
    idata_marginal_2 = pm.sample_posterior_predictive(idata, var_names=['marginal_missing'])

missing_idx = idata_marginal_2.posterior_predictive.marginal_missing_dim_2.values
fig, ax = plt.subplots(1, missing_idx.shape[0], figsize=(14,4), dpi=144)
for d in missing_idx:
    idata_marginal_2.posterior_predictive.marginal_missing.sel(marginal_missing_dim_2=d).plot.hist(bins=100, ax=fig.axes[d]);

So very quickly we can get the conditional predicted values for sensors 1 and 2, given that sensor 3 observes 500. Fast and easy.

3 Likes

Apologies for the off-topic, @jessegrabowski is this kind of conditional multivariate something we could support in general along these lines: Implement some predictive model transformations by ricardoV94 · Pull Request #141 · pymc-devs/pymc-experimental · GitHub or is it very model specific/ ambiguous how to automate?

We could definitely support it, we could just need to figure out an API for how to indicate which elements of the distribution are observed (and should thus be marginalized to form the conditional predictive distribution). The equations for the conditional distributions are always the same (given multivariate normal likelihood), and it’s a common enough procedure in the literature (I guess, since it has its own section in the wiki on MVN).

I’m also curious if it’s possible to dynamically re-assign coords based on which components are observed? It would be nice if (referencing the code above) I could take missing_idx and use it to index the SharedVariable associted with the detector coord, then set this as the coords for marginal_missing.

Do we lose any functionality if it always grows to the “left”? In that case we would only need to know the size of the previous RV/ observed data.

This sounds a lot like what’s done by the GP conditional method. @bwengals and I were thinking about this sometime ago, when checking if we could implement GPs directly as PyTensor objects.