I have several models that all work decently well independently. In my actual workflow, the output of the first model should be a predictor in my second, and likewise for a third model; I have reason to believe these are all sequential processes. The main thing is that I want to correctly propagate the uncertainty (rather than just computing the likelihood mean and passing that to the next model). Additionally, I would like to be able to perturb the model by changing one variable while holding all other inputs constant (or at least at some specified value).
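To make concrete what I mean by propagating uncertainty, this is the plug-in shortcut I want to avoid (just a sketch, using names from the models below; m1_ppc stands for a posterior-predictive result from the first model):

# The shortcut I want to avoid: collapse model 1's predictive distribution for c
# to a point estimate and hand that to model 2 as if it were observed data
c_hat = m1_ppc.posterior_predictive["like1"].mean(dim=("chain", "draw"))
train_plugin = train.assign(c=c_hat.values)  # all uncertainty in c is discarded here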
I have a first model that is a basic linear regression, c ~ a + b:
import numpy as np
import pymc as pm
import pymc_bart as pmb

# Define features and target
m1_features = ["a", "b"]
m1_target = "c"
coords = {"features": m1_features}

with pm.Model(coords=coords) as m1:
    # Define X data
    m1X = pm.MutableData("m1X", train[m1_features].values)
    m1y = train[m1_target].values
    # Define weights
    beta = pm.StudentT("beta", mu=0, sigma=10, nu=7, dims="features")
    # Define intercept
    alpha = pm.StudentT("alpha", mu=0, sigma=10, nu=7)
    # Define linear model
    lm = alpha + (m1X * beta).sum(axis=-1)
    # Define model error
    err1 = pm.Exponential("err1", 0.5)
    # Define likelihood for model 1
    like1 = pm.StudentT("like1", mu=lm, sigma=err1, nu=7, observed=m1y, shape=m1X.shape[0])
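For reference, this is roughly how I fit and predict with each model on its own (shown for model 1; test and RANDOM_SEED are defined at the end of the post):

with m1:
    m1_idata = pm.sample(random_seed=RANDOM_SEED)
    # Swap in the test predictors; shape=m1X.shape[0] on the likelihood
    # lets the posterior predictive resize to the new data
    pm.set_data({"m1X": test[m1_features].values})
    m1_ppc = pm.sample_posterior_predictive(m1_idata, random_seed=RANDOM_SEED)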
I then have a second model (BART) that takes the output of model 1 as an input: e ~ Normal(mu, sigma), where mu ~ BART(c, d) (probably not the correct nomenclature, but hopefully the idea is clear).
m2_zscore_features = ["c", "d"]
m2_target = "e"

with pm.Model() as m2:
    # Define X data
    m2X = pm.MutableData("m2X", train[m2_zscore_features].values)
    m2y = train[m2_target].to_numpy()
    # BART prior on the (log-scale) mean
    mu = pmb.BART("mu", X=m2X, Y=np.log(m2y + 0.01), m=100)
    # Define model error
    err2 = pm.Exponential("err2", 0.5)
    # Define likelihood for model 2
    like2 = pm.Normal("like2", mu=pm.math.exp(mu), sigma=err2, observed=m2y, shape=mu.shape[0])
Finally, the output of model 2 is passed into a third model, which uses an HSGP.
m3_features = "e"
m3_target = "f"

train = train.sort_values(by=m3_features, ascending=True)
coords = {"x": train[m3_features].reset_index().index.tolist()}

with pm.Model(coords=coords) as m3:
    # Define data
    m3X = pm.MutableData("m3X", train[m3_features].to_numpy().reshape(-1, 1))
    m3y = train[m3_target].to_numpy()
    # Set priors on GP hyperparameters
    eta = pm.Exponential("eta", lam=1)
    ell = pm.InverseGamma("ell", alpha=2, beta=20)
    # Specify covariance function and GP
    cov = eta**2 * pm.gp.cov.Matern52(1, ls=ell)
    # HSGP
    gp = pm.gp.HSGP(m=[20], c=3, cov_func=cov)
    # Function prior
    ff = gp.prior("ff", X=m3X, dims="x")
    # Model error
    err3 = pm.HalfNormal("err3", sigma=3)
    like3 = pm.Normal("like3", mu=ff, sigma=err3, observed=m3y)
I started to try to combine them into a single combined model, but I can’t figure out how to get the model to “see” that the likelihood of a prior model represents one of the inputs to the next. It gets further complicated because I need any input data to be in a mutable data container so I can make predictions, and I know that BART is fussy about data containers.
# Model 1 features and target
m1_features = ["a", "b"]
m1_target = "c"
# Model 2 features and target
m2_zscore_features = ["c", "d"]
m2_target = "e"
# Model 3 features and target
m3_features = "e"
m3_target = "f"

coords = {
    "features": m1_features,                                # for model 1
    "x": train[m3_features].reset_index().index.tolist(),   # for the GP
}

with pm.Model(coords=coords) as m_combined:
    # Define data
    m1X = pm.MutableData("m1X", train[m1_features].values)
    m1y = train[m1_target].values
    m2X = pm.MutableData("m2X", train[m2_zscore_features].values)
    m2y = train[m2_target].to_numpy()
    m3X = pm.MutableData("m3X", train[m3_features].to_numpy().reshape(-1, 1))
    m3y = train[m3_target].to_numpy()

    # --------- Model 1 ---------- #
    # Define weights
    beta = pm.StudentT("beta", mu=0, sigma=10, nu=7, dims="features")
    # Define intercept
    alpha = pm.StudentT("alpha", mu=0, sigma=10, nu=7)
    # Define linear model
    lm = alpha + (m1X * beta).sum(axis=-1)
    # Define model error
    err1 = pm.Exponential("err1", 0.5)
    # Define likelihood for model 1
    like1 = pm.StudentT("like1", mu=lm, sigma=err1, nu=7, observed=m1y, shape=m1X.shape[0])

    # --------- Model 2 ---------- #
    # BART prior on the (log-scale) mean
    mu = pmb.BART("mu", X=m2X, Y=np.log(m2y + 0.01), m=100)
    # Define model error
    err2 = pm.Exponential("err2", 0.5)
    # Define likelihood for model 2
    like2 = pm.Normal("like2", mu=pm.math.exp(mu), sigma=err2, observed=m2y, shape=mu.shape[0])

    # --------- Model 3 ---------- #
    # Set priors on GP hyperparameters
    eta = pm.Exponential("eta", lam=1)
    ell = pm.InverseGamma("ell", alpha=2, beta=20)
    # Specify covariance function and GP
    cov = eta**2 * pm.gp.cov.Matern52(1, ls=ell)
    # HSGP
    gp = pm.gp.HSGP(m=[20], c=3, cov_func=cov)
    # Function prior
    ff = gp.prior("ff", X=m3X, dims="x")
    # Model error
    err3 = pm.HalfNormal("err3", sigma=3)
    # Define likelihood for model 3
    like3 = pm.Normal("like3", mu=ff, sigma=err3, observed=m3y)
My questions are:
- How do I pass the estimate of one model as an input to the next model?
- How do I manage the data containers in a way that allows all of the values that represent predictors to be set with set_data()? I know both BART and HSGPs allow this for prediction, so it seems like it should be possible. (See the sketch after this list for the kind of call I’m aiming for.)
- Third, and less important for this question: do the inputs to GPs have to be ordered? It makes sense when the x axis represents time or another sequential axis, but what about when the space is just a set of values between two points? When I initially modeled the data I did not sort it by the values on the x axis (e in my dataframe), and I was having problems with divergences and weird posterior predictions. When I ordered the data it cleared things up…
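To spell out what I’m after in the second question, this is roughly the call I’d like to be able to make against the combined model once the wiring is sorted out (hypothetical: combined_idata would be the trace from sampling it, and ideally the c column of m2X and the e values in m3X would come from the upstream models rather than from the test frame):

with m_combined:
    pm.set_data({
        "m1X": test[m1_features].values,
        "m2X": test[m2_zscore_features].values,              # c should really come from model 1
        "m3X": test[m3_features].to_numpy().reshape(-1, 1),  # e should really come from model 2
    })
    ppc = pm.sample_posterior_predictive(combined_idata)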
I saw this related post, but it doesn’t look like it was ever addressed.
I’ve attached a copy of the data to help if needed. Just FYI, all the variables are standardized except e, which is normalized between 0 and 1.
RANDOM_SEED = 101
rng = np.random.default_rng(RANDOM_SEED)

# Sample some portion of the data as training
train_idx = data.sample(frac=0.7, random_state=RANDOM_SEED, replace=False).index.tolist()
# Assign to train set
data = data.assign(is_training_set=0)
data.loc[train_idx, "is_training_set"] = 1
# Split train and test data
train = data.loc[data.is_training_set == 1, :]
test = data.loc[data.is_training_set == 0, :]
data.csv (22.7 KB)