Hi, I have a training dataset X, Y with n rows, j features, and k targets. One thing to note is that these are simulation results, where each variable in X is i.i.d. and drawn from a uniform distribution over each feature.
So we have:
f(\mathbf{x}):\mathbb{R}^j \to \mathbb{R}^k
X \sim U(0,1)^{n\times j}
Y = f(X) \in \mathbb{R}^{n\times k}
We are fitting a simple linear model for now:
Y = XA^\top + b + \epsilon
I additionally have a dataset Y' with m rows where only the targets are known, and not the features. The goal is to get distributions/posterior samples over what the corresponding X' should look like.
So far, I’ve been able to set up the problem without any issues really by just specifying a prior over the unknown features and solving the linear regression and the inverse design simultaneously:
import numpy as np
import pandas as pd
import pymc as pm
n = 4000
n_features = 5
n_targets = 3
n_unseen = 1
# Generate synthetic data
X = np.random.uniform(low=0, high=1, size=(n, n_features))
coeffs = np.random.uniform(low=-10, high=10, size=(n_features, n_targets))
intercepts = np.random.uniform(low=-10, high=10, size=n_targets)
noise_mag = np.random.uniform(low=0, high=0.3, size=n_targets)
noise = np.random.normal(0, noise_mag, size=(n, n_targets))
X_lin = np.dot(X, coeffs)
Y = X_lin + intercepts + noise
with pm.Model() as m:
    # Priors for intercepts
    b = pm.Normal("b", 0, 10, shape=(n_targets,))
    # Priors for coefficient matrix
    a = pm.Normal("a", 0, 10, shape=(n_targets, n_features))
    # Priors for measurement noise standard deviations
    eps = pm.Exponential("eps", 1, shape=n_targets)
    # Assume that the last n_unseen rows are the ones we want to solve the inverse problem for
    x_seen = X[:-n_unseen]
    # Prior for the unseen features
    X_unseen = pm.Uniform("X_unseen", lower=0, upper=1, shape=(n_unseen, n_features))
    # Compute the expected values for the seen and unseen rows
    x = pm.math.concatenate([x_seen, X_unseen])
    mu = pm.math.dot(x, a.T) + b
    # Likelihoods for observed data, one per target
    for i in range(n_targets):
        y_obs = pm.Normal(f"y{i}_obs", mu=mu[:, i], sigma=eps[i], observed=Y[:, i])
    # y_obs = pm.Normal("y_obs", mu=mu, sigma=eps, observed=Y)
    # Sampling
    trace = pm.sample(1000, chains=2, tune=1000, return_inferencedata=True)
This is working pretty much as intended. It's a little slow once n or n_features gets large, but that seems expected. However, because I need to be able to use this model in production to solve the inverse design problem many times, being slow (i.e. > 2 min) is a non-starter. For context, the real datasets I will be training these models on have about 20k-60k rows.
It seems reasonable to me that I should be able to split this up into two steps.
- Train the Bayesian linear regression model (which can take arbitrarily long, since it only needs to be done once)
- When it's time to solve an inverse design problem, use the trained Bayesian linear regression model to recover posteriors for the unknown features (rough sketch of what I had in mind below)
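Something like this is what I had in mind for the two-model version, but I'm not sure it's the idiomatic approach. (The names regression_model, inverse_model, a_mean, etc. are just mine, and moment-matching the posterior with Normals is my guess at how to carry it into the second model, not something I've seen recommended.)

# Step 1: fit the forward regression once, on the fully observed rows only
with pm.Model() as regression_model:
    b = pm.Normal("b", 0, 10, shape=(n_targets,))
    a = pm.Normal("a", 0, 10, shape=(n_targets, n_features))
    eps = pm.Exponential("eps", 1, shape=n_targets)
    mu = pm.math.dot(X[:-n_unseen], a.T) + b
    pm.Normal("y_obs", mu=mu, sigma=eps, observed=Y[:-n_unseen])
    reg_trace = pm.sample(1000, chains=2, tune=1000, return_inferencedata=True)

# Moment-match the regression posteriors with Normals so they can act as
# priors in the second model (my own approximation)
post = reg_trace.posterior
a_mean, a_std = post["a"].mean(("chain", "draw")).values, post["a"].std(("chain", "draw")).values
b_mean, b_std = post["b"].mean(("chain", "draw")).values, post["b"].std(("chain", "draw")).values
eps_mean = post["eps"].mean(("chain", "draw")).values

# Step 2: a small model per inverse-design query, where only X_unseen is free
y_new = Y[-n_unseen:]  # stand-in for a new batch of targets with unknown features
with pm.Model() as inverse_model:
    a2 = pm.Normal("a", mu=a_mean, sigma=a_std)
    b2 = pm.Normal("b", mu=b_mean, sigma=b_std)
    X_unseen = pm.Uniform("X_unseen", lower=0, upper=1, shape=(n_unseen, n_features))
    mu_new = pm.math.dot(X_unseen, a2.T) + b2
    # noise fixed at its posterior mean for simplicity
    pm.Normal("y_new", mu=mu_new, sigma=eps_mean, observed=y_new)
    inv_trace = pm.sample(1000, chains=2, tune=1000, return_inferencedata=True)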
I haven't quite been able to figure out the right way to architect this with PyMC. Is something like the two-model sketch above, where the second model somehow ingests the posterior from the first, the way to go? Or should it be done with a single model using data containers, where X_seen and Y_seen are immutable and Y_unseen is mutable, something like the second sketch below? If I set it up that way, will sampling avoid trying to relearn the posteriors for all the coefficients, intercepts, and noise magnitudes?
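This is roughly what I was picturing for the data-container version. (Depending on the PyMC version this might need pm.MutableData instead of pm.Data, and Y_new is just a placeholder for a fresh batch of targets.)

with pm.Model() as joint_model:
    # Training data stays fixed; the inverse-design targets get swapped out per query
    y_unseen_data = pm.Data("y_unseen_data", Y[-n_unseen:])

    b = pm.Normal("b", 0, 10, shape=(n_targets,))
    a = pm.Normal("a", 0, 10, shape=(n_targets, n_features))
    eps = pm.Exponential("eps", 1, shape=n_targets)
    X_unseen = pm.Uniform("X_unseen", lower=0, upper=1, shape=(n_unseen, n_features))

    pm.Normal("y_seen_obs", mu=pm.math.dot(X[:-n_unseen], a.T) + b, sigma=eps,
              observed=Y[:-n_unseen])
    pm.Normal("y_unseen_obs", mu=pm.math.dot(X_unseen, a.T) + b, sigma=eps,
              observed=y_unseen_data)

# Later, for each new inverse-design query:
with joint_model:
    pm.set_data({"y_unseen_data": Y_new})  # Y_new: placeholder, shape (n_unseen, n_targets)
    query_trace = pm.sample(1000, chains=2, tune=1000, return_inferencedata=True)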
I couldn’t find a good clear example of setting up a re-usable model for this.
I know one trivial solution is to just fit a regular (non-Bayesian) linear regression model, and then it's easy to set up the inverse design problem in a very re-usable way, but I was having trouble figuring out the right way to split up the Bayesian linear regression training and the inverse design.
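For reference, the trivial version I mean is roughly the following, with plain least squares for the forward model and the point estimates plugged into a small inversion model (resid_std here is just a crude per-target noise estimate I made up):

from sklearn.linear_model import LinearRegression

# Fit the forward model once with plain least squares
reg = LinearRegression().fit(X, Y)
A_hat = reg.coef_                             # (n_targets, n_features)
b_hat = reg.intercept_                        # (n_targets,)
resid_std = (Y - reg.predict(X)).std(axis=0)  # crude per-target noise estimate

# Re-usable inverse-design model: only X_unseen is a random variable
y_new = Y[-n_unseen:]  # stand-in for targets whose features are unknown
with pm.Model() as point_estimate_inverse:
    X_unseen = pm.Uniform("X_unseen", lower=0, upper=1, shape=(n_unseen, n_features))
    mu = pm.math.dot(X_unseen, A_hat.T) + b_hat
    pm.Normal("y_new", mu=mu, sigma=resid_std, observed=y_new)
    trace = pm.sample(1000, chains=2, tune=1000, return_inferencedata=True)

Thanks in advance!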