Sampling plausible posterior x parameter values in linear model for given y

Hi,

I am fitting a model with a few parameters to explain a single outcome variable, somewhat similar to the introductory linear regression example here:
https://www.pymc.io/projects/docs/en/stable/learn/core_notebooks/GLM_linear.html#glm-linear

(I have more than just a single x, but I think the structure is similar enough).

Sampling seems to work fine. I can then use
pm.set_data({...})
to update x and sample from the posterior with pm.sample_posterior_predictive(...) and plot samples for y using az.plot_posterior(...).

What I am interested in, however, is to specify y and then sample values of x that are likely to have produced that specific y. Just using pm.set_data({...}) to set y does not seem to change the output at all.

I suspect that I need to actually change the structure of my model somehow, but I am not sure where to start and would appreciate any pointers. I am more of a hacker than a statistician, so I also want to apologise for my lack of more precise language :wink:

Thanks,
Fabian

Your intuition is right. If you want to learn something about x, you need to model some structure behind it. @junpenglao had a nice talk on this topic that you might find useful as a starting point: Partial Missing Multivariate Observation and What to Do With Them by Junpeng Lao

I have created a simplified example notebook where I show what I am trying to do:

I also started to study the blog post Out of model predictions with PyMC - PyMC Labs but I am still working on understanding what exactly is going on in the sections following “Making predictions on different models”.

@ricardoV94 thanks for the pointer! I watched the tutorial and found it very interesting.

However, I believe there should be something simpler than keeping the original data around, adding an observation with the target y and a missing x, and then running a full .sample() again.
I am struggling to understand how to do something similar using .sample_posterior_predictive() instead.

I added a gist with a notebook and a link to the blogpost “Out of model predictions with PyMC” above. Would be great to know if I am on the right track there.

You’re going to have to put some more structure into the model to be able to back out p(x | y) (to answer the question in the gist: yes, that’s the thing you’re looking for).

In a normal regression, you take the right-hand variables to be exogenous, so you can’t really say anything about them (by construction). To say something about x, we have to model x and y jointly. Maybe the simplest way is to just say that:

\begin{align} y &= \alpha_y + \beta_y x + \varepsilon_y \\ x &= \alpha_x + \beta_x y + \varepsilon_x \end{align}

We can write this as a single equation by defining z = \begin{bmatrix} x \\ y \end{bmatrix}, a = \begin{bmatrix} \alpha_x \\ \alpha_y \end{bmatrix} and B = \begin{bmatrix} 0 & \beta_x \\ \beta_y & 0 \end{bmatrix}, and \epsilon = \begin{bmatrix} \varepsilon_x \\ \varepsilon_y\end{bmatrix} then:

z = a + Bz + \epsilon

We will assume \epsilon \sim N(0, \Sigma). We do this 1) because it makes the math easier, but more importantly 2) because the multivariate normal admits clean conditional distributions for the unobserved components when you observe some of the others. That is, if you observe x, we can work out the conditional distribution of y (and vice versa). Details in a minute.

We need to know the mean of the distribution given the structural equations we’ve defined. Call the mean \mu = \mathbb E[z], and take expectations of both sides of the equation:

\begin{align} \mathbb E[z] &= \mathbb E[a + Bz + \epsilon] \\ \mu &= \mathbb E[a] + \mathbb E[Bz] + \mathbb E[\epsilon] \\ \mu &= a + B\mathbb E[z] \\ \mu &= a + B\mu \\ \mu - B \mu &= a \\ (I - B) \mu &= a \\ \mu &= (I - B)^{-1} a \end{align}

Where I is the identity matrix. So this gives us a multivariate normal model in which we can model both x and y simultaneously. We just need to put priors on \alpha_x, \alpha_y, \beta_x, \beta_y, \Sigma. Well, sort of. Actually, the model isn’t identified – you have two equations and four unknowns. In general, it would be better if you had extra exogenous variables Z, so that the mean model would be \mu = (I - B)^{-1}(a + \Gamma Z) (derivation left as an exercise to the reader); then you could put some zeros in the \Gamma matrix (i.e. some of Z act on x but not y).

I was lazy, so I resolved the identification problem by setting \alpha_y = 0 and \beta_x = 0.

Anyway, here’s a PyMC model:

import numpy as np
import pymc as pm
import pytensor.tensor as pt

coords = {'variable':['x', 'y'], 
           'variable_bis': ['x', 'y'],
           'obs_idx':np.arange(100).astype(int)}

with pm.Model(coords=coords) as m:
    alpha_x = pm.Normal('alpha_x', sigma=10)
    alpha_y = 0.0
    alpha = pm.Deterministic('alpha', pt.stack([alpha_x, alpha_y]), dims=['variable'])
    
    beta_x = 0.0
    beta_y = pm.Normal('beta_y')
    beta = pm.Deterministic('beta', pt.stack([beta_x, beta_y]), dims=['variable'])
    B = pt.stack([pt.stack([0.0, beta_x]), pt.stack([beta_y, 0.0])])
    
    mu = pm.Deterministic('mu', 
                          pt.linalg.solve(pt.eye(2) - B, alpha, check_finite=False),
                          dims=['variable'])

    sd_dist = pm.Exponential.dist(0.5)
    chol, *_ = pm.LKJCholeskyCov('chol', eta=1, n=2, sd_dist=sd_dist)
    cov = pm.Deterministic('cov', chol @ chol.T, dims=['variable', 'variable_bis'])
    
    # Pass your stacked (x, y) data via `observed=` to condition the model; then
    # idata = pm.sample() gives the posterior that the prediction model below consumes
    obs = pm.MvNormal('obs', mu=mu, chol=chol, dims=['obs_idx', 'variable'])
    prior_idata = pm.sample_prior_predictive()

This should run fine and give you a posterior. But the real magic is in the multivariate normal distribution, which has a closed-form conditional distribution when only some components are observed. If we know \mu and \Sigma (which we do after sampling), then observing y gives you the conditional distribution of x, and vice versa. See here for discussion, or read the quite nice wiki.
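
For reference, this is the standard partitioned-Gaussian result, and it is exactly what the prediction model below implements. Split z into an unobserved block z_u and an observed block z_o:

\begin{align} \begin{bmatrix} z_u \\ z_o \end{bmatrix} &\sim N\left(\begin{bmatrix} \mu_u \\ \mu_o \end{bmatrix}, \begin{bmatrix} \Sigma_{uu} & \Sigma_{uo} \\ \Sigma_{ou} & \Sigma_{oo} \end{bmatrix}\right) \\ z_u \mid z_o &\sim N\left(\mu_u + \Sigma_{uo}\Sigma_{oo}^{-1}(z_o - \mu_o),\; \Sigma_{uu} - \Sigma_{uo}\Sigma_{oo}^{-1}\Sigma_{ou}\right) \end{align}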

Suppose you observe that x = 23, then you can get the conditional predictions for y as follows:

MISSING_VALUE = -9999

with pm.Model(coords=coords) as prediction_model:
    obs_data = pm.Data('obs_data', np.array([23.0, MISSING_VALUE]), dims=['variable'])
    
    # This seems like overkill, but it allows you to generalize the code to any number of 
    # variables, and any combination of missing/observed. It will also let us reuse
    # the model, changing what is observed and unobserved -- see below
    missing_idx = pt.nonzero(pt.eq(obs_data, MISSING_VALUE), return_matrix=False)[0]
    obs_idx = pt.nonzero(pt.neq(obs_data, MISSING_VALUE), return_matrix=False)[0]
    sorted_idx  = pt.concatenate([missing_idx, obs_idx])
    reverse_idx = pt.argsort(sorted_idx)
    
    n_missing = missing_idx.shape[0]
    n_obs = obs_idx.shape[0]
    
    data_sorted = obs_data[sorted_idx]
    
    # Dummy flat priors so we get an error if the model tries to sample them
    # (they should be sampled from the posterior we provide)
    mu = pm.Flat('mu', dims=['variable'])
    cov = pm.Flat('cov', dims=['variable', 'variable_bis'])
    
    # Do the marginalization math
    mu_sorted = mu[sorted_idx]
    cov_sorted = cov[sorted_idx, :][:, sorted_idx]
    
    mu_u = mu_sorted[:n_missing]
    mu_o = mu_sorted[n_missing:]
    
    cov_uu = cov_sorted[:n_missing, :][:, :n_missing]
    cov_uo = cov_sorted[:n_missing, :][:, n_missing:]
    cov_oo = cov_sorted[n_missing:, :][:, n_missing:]
    
    cov_oo_inv = pt.linalg.inv(cov_oo)
    beta = cov_uo @ cov_oo_inv
    
    mu_missing_hat = mu_u + beta @ (data_sorted[n_missing:] - mu_o)
    Sigma_missing_hat = cov_uu - beta @ cov_oo @ beta.T    
    
    pm.Deterministic('mu_missing_hat', mu_missing_hat)
    pm.Deterministic('Sigma_missing_hat', Sigma_missing_hat)
    
    # Make a random variable 
    marginal_missing = pm.MvNormal('marginal_missing', mu=mu_missing_hat, cov=Sigma_missing_hat)
    
    # Sample the new random variable, but note that we ***pass the OLD idata***!!!
    idata_marginal = pm.sample_posterior_predictive(idata, 
                                                    var_names=['marginal_missing'])

Make a plot:

import arviz as az
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
az.plot_posterior(idata_marginal.posterior_predictive, var_names=['marginal_missing'], ax=ax)
ax.set_title(r'Predicted $y$ given $x = 23$')
plt.show()

[Plot: posterior of marginal_missing, i.e. the predicted y given x = 23]

It’s so simple! :fearful: But now you could reuse the prediction_model with pm.set_data to change which variable is not observed (set it to the MISSING_VALUE constant) or change the observed values. For example, we can quickly get the conditional distribution for y given x=1000:

with prediction_model:
    pm.set_data({'obs_data': np.array([MISSING_VALUE, 1000.0])})
    idata_2 = pm.sample_posterior_predictive(idata, var_names=['marginal_missing'])
    
fig, ax = plt.subplots()
az.plot_posterior(idata_2.posterior_predictive, var_names=['marginal_missing'], ax=ax)
ax.set_title(r'Predicted $x$ given $y = 1000$')
plt.show()

[Plot: posterior of marginal_missing, i.e. the predicted x given y = 1000]
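
If you want to sanity-check the conditioning math outside of PyMC, here is a quick numpy version of the same formulas. The numbers for \mu and \Sigma below are made up purely for illustration; they are not from the notebook:

import numpy as np

# Made-up values standing in for the posterior means of mu and cov (illustration only)
mu = np.array([1.0, 2.0])            # [mu_x, mu_y]
cov = np.array([[2.0, 0.8],
                [0.8, 1.5]])         # [[var_x, cov_xy], [cov_xy, var_y]]

x_obs = 23.0  # observe x (index 0), condition on it to predict y (index 1)

# Same partitioned-Gaussian formulas the prediction model uses
beta = cov[1, 0] / cov[0, 0]
mu_y_given_x = mu[1] + beta * (x_obs - mu[0])
var_y_given_x = cov[1, 1] - beta * cov[0, 1]

print(mu_y_given_x, var_y_given_x)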

Thanks a lot for the help and the detailed answer!

I added a copy of the notebook fitting my example data with your code to the gist. I made a small adjustment to actually pass in the observed data, and indeed it samples; inference is possible for both p(x|y) and p(y|x), very cool!

Sampling at first had divergences, but after I adjusted the priors for alpha_x and beta_y those went away and sampling got faster. :grinning:

Now how do I continue from here?
My real problem has a handful of predictors with added structure (interactions, nonlinear transformations). I guess I would need to extend this to a higher-dimensional B and then use the output of the MvNormal as priors for the rest of the model?

I also don’t fully understand (and it seems like it might be related):


It depends – do you want to make predictions for all of the covariates? If yes, then yes, you will need to grow the B matrix. For example, if we have a third equation we care to make predictions for, the system becomes:

\begin{align} x &= \alpha_x + \beta_{x,y} y_i + \beta_{x,z} z_i + \varepsilon_x \\ y &= \alpha_y + \beta_{y,x} x_i+ \beta_{y,z} z_i + \varepsilon_y \\ z &= \alpha_z + \beta_{z,x} x_i+ \beta_{z,y} y_i + \varepsilon_z \\ \end{align}

And then B = \begin{bmatrix} 0 & \beta_{x,y} & \beta_{x,z} \\ \beta_{y,x} & 0 & \beta_{y,z} \\ \beta_{z,x} & \beta_{z,y} & 0 \end{bmatrix}
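
A minimal sketch of how that 3x3 B could be built (not from your notebook; the variable names and priors are just illustrative). The zero diagonal is enforced by only filling the off-diagonal entries, and the mean uses the same solve as before. Remember you still need enough zero restrictions (in the alphas, the betas, or a \Gamma matrix) to identify the model:

import numpy as np
import pymc as pm
import pytensor.tensor as pt

coords_3 = {'variable': ['x', 'y', 'z'], 'variable_bis': ['x', 'y', 'z']}

with pm.Model(coords=coords_3) as m3:
    alpha = pm.Normal('alpha', sigma=10, dims=['variable'])

    # One coefficient per off-diagonal entry of B, in row-major order:
    # beta_{x,y}, beta_{x,z}, beta_{y,x}, beta_{y,z}, beta_{z,x}, beta_{z,y}
    off_diag = pm.Normal('off_diag', shape=6)
    rows, cols = np.nonzero(~np.eye(3, dtype=bool))
    B = pt.set_subtensor(pt.zeros((3, 3))[rows, cols], off_diag)

    # mu = (I - B)^{-1} alpha, exactly as in the 2-variable model
    mu = pm.Deterministic('mu',
                          pt.linalg.solve(pt.eye(3) - B, alpha, check_finite=False),
                          dims=['variable'])

    sd_dist = pm.Exponential.dist(0.5)
    chol, *_ = pm.LKJCholeskyCov('chol', eta=1, n=3, sd_dist=sd_dist)
    # obs = pm.MvNormal('obs', mu=mu, chol=chol, observed=..., dims=['obs_idx', 'variable'])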

On the other hand, suppose you have some variables you don’t want conditional distributions for; then you don’t need to grow the number of equations in the system, just add more variables:

\begin{align} x &= \alpha_x + \beta_{x,y} y_i + \gamma_{x,1} z_{1,i} + \gamma_{x,2} z_{2,i} + \varepsilon_x \\ y &= \alpha_y + \beta_{y,x} x_i+ \gamma_{y,1} z_{1,i} + \gamma_{y,2} z_{2,i} + \varepsilon_y \\ \end{align}

Now with some abuse of notation (sorry), we can define a new vector z = \begin{bmatrix} z_{1,i} \\ z_{2,i} \end{bmatrix} and x = \begin{bmatrix} x_i \\ y_i \end{bmatrix}, so we now have:

x = a + Bx + \Gamma z + \epsilon

Where B is defined as before, and \Gamma = \begin{bmatrix} \gamma_{x,1} & \gamma_{x,2} \\ \gamma_{y, 1} & \gamma_{y,2} \end{bmatrix}. As before, solve for \mathbb E[x] = \mu and you should get \mu = (I - B)^{-1}(a + \Gamma z). z will contain all of the extra features you don’t care about (except insofar as they explain x), and \Gamma is a matrix of linear coefficients linking those features to the things you do care about. You will also put your identifying restrictions into the \Gamma matrix (instead of the silly stuff I did) to make the model more interpretable (or maybe not!)
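
For completeness, the same expectation trick as before gives that formula (treating z as exogenous, with \mathbb E[\epsilon] = 0):

\begin{align} \mathbb E[x] &= \mathbb E[a + Bx + \Gamma z + \epsilon] \\ \mu &= a + B\mu + \Gamma z \\ (I - B)\mu &= a + \Gamma z \\ \mu &= (I - B)^{-1}(a + \Gamma z) \end{align}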

Of course you can also grow in both directions at the same time; that’s the beauty of the setup. But remember that for each new equation you need one more identifying restriction.