Solving inverse problem for a Bayesian linear regression?

Hi, I have a training dataset X,Y with n rows, j features and k targets, i.e. One thing to note is that these are simulation results, where each of the variables in X are i.i.d. and pulled from uniform random distributions over each of the features.

So we have:

f(\mathbf{x}):\mathbb{R}^j \to \mathbb{R}^k

X \sim U(0,1)^{n\times j}

Y \in \mathbb{R}^{n\times k} =f(X)

We are fitting a simple linear model for now:

Y=AX+b+\epsilon

I additionally have a dataset Y' with m rows where only the targets are known, and not the features. The goal is to get distributions/posterior samples over what the corresponding X' should look like.

So far, I’ve been able to set up the problem without any issues really by just specifying a prior over the unknown features and solving the linear regression and the inverse design simultaneously:

import numpy as np
import pandas as pd
import pymc as pm

n = 4000
n_features = 5
n_targets = 3
n_unseen = 1

# Generate synthetic data
X = np.random.uniform(low=0, high=1, size=(n, n_features))
coeffs = np.random.uniform(low=-10, high=10, size=(n_features, n_targets))
intercepts = np.random.uniform(low=-10, high=10, size=n_targets)
noise_mag = np.random.uniform(low=0, high=0.3, size=n_targets)
noise = np.random.normal(0, noise_mag, size=(n, n_targets))
X_lin = np.dot(X, coeffs)
Y = X_lin + intercepts + noise


with pm.Model() as m:
    # Priors for intercepts
    b = pm.Normal("b", 0, 10, shape=(n_targets,)) 

    # Priors for coefficient matrix
    a = pm.Normal("a", 0, 10, shape=(n_targets, n_features))

    # Priors for measurement noise standard deviations
    eps = pm.Exponential("eps", 1, shape=n_targets)

    # assume that the last n_unseen rows are the ones we want to solve the inverse problem for
    x_seen = X[:-n_unseen]
    # prior for the unseen features
    X_unseen = pm.Uniform("X_unseen", lower=0, upper=1, shape=(n_unseen, n_features))

    # compute the expected values for the seen and unseen
    x = pm.math.concatenate([x_seen, X_unseen])
    mu = pm.math.dot(x, a.T) + b

    # Likelihoods for observed data
    for i in range(n_targets):
        y_obs = pm.Normal(f"y{i}_obs", mu=mu[:, i], sigma=eps[i], observed=Y[:, i])
    # y_obs = pm.Normal("y_obs", mu=mu, sigma=eps, observed=Y)

    # Sampling
    trace = pm.sample(1000, chains=2, tune=1000, return_inferencedata=True)

This is working pretty much as intended. It’s a little slow once n gets large or n_features gets large, but that seems expected. However, because I need to be able to use this model in production to solve the inverse design problem many times, being slow (i.e. > 2min) is a non-starter. For context, the real datasets I will be training these models on have about 20k-60k rows in it.

It seems reasonable to me that I should be able to split this up into two steps.

  1. Train the bayesian linear regression model (which can take arbitrarily long since it only needs to done once)
  2. When it’s time to solve an inverse design problem, use the trained bayesian linear regression model to discover the posteriors for the unknown features.

I haven’t quite been able to figure out the right way to architect that with PyMC. Should it be two separate models, where one of them somehow ingests the posterior from the first model? Should it be done with a single model using data containers, where X_seen is and Y_seen are immutable and Y_unseen is mutable? If I set it up this way, will it avoid trying to relearn the posteriors for all the coefficients and intercepts/noise magnitudes?

I couldn’t find a good clear example of setting up a re-usable model for this.

I know one trivial solution is to just fit a regular linear regression model, and then it’s easy to set up the inverse design problem in a very re-usable way, but I was having trouble figuring out the right way to split up the Bayesian linear regression training and the inverse design. Thanks in advance!

It looks like no one else is going to answer, so I’ll do the best I can to help! There are a couple of comments I have looking at your code. In short, I don’t think you can split the inference and inverse design without making some approximations. If you can get it to work, using your existing model will probably be easiest, and I think there could be ways to speed it up. If that doesn’t work for you, I have an idea how you could splits the model up.

Speed up Existing Model

  1. You’re using a for-loop in your likelihood. You could probably get some speed up by replacing that for-loop with vectorized expression. You might want to use PyMC’s excellent support for named dimensions to help with this. See this example: Distribution Dimensionality — PyMC 5.18.2 documentation for more info. I made an example to show what I mean
coords = {'targets': np.arange(n_targets), 'features': np.arange(n_features), 'obs': np.arange(n), 'unseen': np.arange(n_unseen)}


with pm.Model(coords=coords) as m:
    # Priors for intercepts
    b = pm.Normal("b", 0, 10, dims='targets')

    # Priors for coefficient matrix
    a = pm.Normal("a", 0, 10, dims=('targets', 'features'))

    # Priors for measurement noise standard deviations
    eps = pm.Exponential("eps", 1, dims='targets')

    # assume that the last n_unseen rows are the ones we want to solve the inverse problem for
    x_seen = X[:-n_unseen]
    # prior for the unseen features
    X_unseen = pm.Uniform("X_unseen", lower=0, upper=1, dims=('unseen', 'features'))

    # compute the expected values for the seen and unseen
    x = pm.math.concatenate([x_seen, X_unseen])
    mu = pm.math.dot(x, a.T) + b

    # Likelihoods for observed data
    y_obs = pm.Normal("y_obs", mu=mu, sigma=eps, observed=Y, dims=('obs', 'targets'))

    # Sampling
    trace = pm.sample(1000, chains=4, tune=1000, return_inferencedata=True)
  1. You seem to be using the default PyMC sampler. You could likely get a nice speed boost with one of the other samplers: Faster Sampling with JAX and Numba — PyMC example gallery.
  2. You could also try ADVI which will be a lot faster than NUTS sampling at the cost of some accuracy: Introduction to Variational Inference with PyMC — PyMC example gallery.

Separate Training and Inverse Design

If the above aren’t good enough. I think the best way to split up the models is to approximate the posterior distribution and use those in the 2nd model. To approximate the distributions, you could try PreliZ which has support for this kind of thing: Direct elicitation in 1D — PreliZ 0.12.0 documentation.

Once you have the approximated distributions you want to set up the inverse problem. From the case above, you would have

\textbf{X} = \textbf{A}^{-1}(\textbf{Y} - b)

I think you could then do something like this:

with pm.Model() as inv_m:

    inv_a = # some distribution you fitted
    b = # some other distribution you fitted
    Y = # your observed data 
    
    sigma = # some model error
    
    X = pm.Normal('inverse design', mu=inv_a * (Y - b), sigma=sigma)
    inverse_design = pm.sample_prior_predictive()

I think that you could actually get away with just doing prior sampling because you’ve inferred all the parameters you would want.

Hope this helps, I’d be happy to answer any questions!

3 Likes

What you’re talking about is often called “amortized inference” and it’s usually done with a neural network to give you enough power to learn the mapping from data to posterior estimates and variances.

If you stick with a simple linear model and assume an inverse gamma prior on variance (eps**2) and a normal prior on the regression coefficients, then the posterior is analytical. If you only care about posterior means, you can also work those out analytically.

assume that the last n_unseen rows are the ones we want to solve

Technically, the inverse problem only involves x_seen—that gives you a posterior over parameters a, b, eps which you can then plug-in to solve for Y_unseen directly—you don’t need to use MCMC for this as your solution will be doing. In fact, if you stick to the linear case, everything will come out analytical here.

In math, what you’re trying to do is sample from the posterior predictive distribution:

\displaystyle p(y^\text{unseen} \mid y^\text{seen}) = \int_{\mathbb{R}^3} p(y^\text{unseen} \mid a, b, \epsilon) \cdot p(a, b, \epsilon \mid y^\text{seen}) \, \text{d}(a, b, \epsilon).

You can do this by taking draws a^{m}, b^{m}, \epsilon^m \sim p(a, b, \epsilon \mid y^\text{seen}) and using the plug-in estimator

\displaystyle \hat{a}, \hat{b}, \hat{\epsilon} = \frac{1}{M} \sum_{m=1}^M p(y^\text{unseen} \mid a^m, b^m, \epsilon^m).

The last formula should have had p(y^\text{unseen} \mid y^\text{seen}) on the left hand side as the estimated and an approximately relation, so the full last line should be

\displaystyle p(y^\text{unseen} \mid y^\text{seen}) \approx \frac{1}{M} \sum_{m=1}^M p(y^\text{unseen} \mid a^m, b^m, \epsilon^m).

Hi @bob-carpenter,

Reading through your solution, I think you may have misinterpreted my goal, due to my admittedly poor naming convention.

To be more explicit: (X_seen, Y_seen) are known ahead of time, and can be thought of as the training data which determines the regression model’s posterior. It seems like you thought I was saying (X_unseen, Y_unseen) were both unknown, which is sensible interpretation given the naming… in actuality, Y_unseen is observed; it was solely named that to indicate that even though we observed these Y_unseen, we did not get to observe the corresponding X_unseen values that generated them. The goal is purely to find the posteriors for X_unseen, ie what were the plausible values for X_unseen that would generate the Y_unseen data.

A better naming convention might have been:

  • X_seen (observed)
  • Y_xseen (observed
  • X_unseen (unobserved)
  • Y_xunseen (observed)

In the code snippet I originally posted, you can see the logic, where:

X_seen comes from fixed data
X_unseen comes from priors

X is the concatenation of those two

Y comes from fixed data and contains Y_seen and Y_unseen in the corresponding order as X

Sorry about that! As they say in CS, there are only two things that are hard, naming, memory locality, and off by one errors.

Following Andrew Gelman et al.'s notation from Bayesian Data Analysis, I’d be tempted to use (x, y) for observed data and (\tilde{x}, \tilde{y}) for held out data. Then you can split \tilde{x} into \tilde{x}^\text{obs} and \tilde{x}^\text{miss}.

How precise do your estimates need to be and what ESS are you getting after 2 min? It’s possible you could shave a lot of that down by using fewer iterations, and if possible, warm starting if you’re solving similar regressions multiple times.

You can use the same model structure in PyMC (this is something you cannot do in Stan, where you’d have to rewrite everything). The difference is that you’ll have one instance with some variables observed (x, y) and you’ll be inferring, for example \alpha, \beta, \sigma. Then you plug in the estimates \widehat{\alpha}, \widehat{\beta}, \widehat{\sigma} and \tilde{x}^\text{obs}, \tilde{y} as data in the second instance and then impute \tilde{x}^\text{miss}.

I’m not sure if this is easy in PyMC, but you can also treat this like a “cut”, where you get posterior draws rather than estimates, then plug them in. There’s a paper by Martyn Plummer (developer of JAGS) on this: Cuts in Bayesian graphical models. One way to do it is to run MCMC for each of the draws, but that’ll get expensive again.

The model you’re laying down for covariates (uniform in a hypercube) means that you can’t really learn any conditional information about them. Another way to do this would be to set up a multivariate normal and then transform with inverse logit. That will let you learn dependencies among the covariates. Again, no idea how easy this would be in PyMC.