I’d like to implement a linear regression model where the response variable follows a log-normal distribution.
- What works: I can build the model and obtain correct posterior estimates of the regression coefficients.
- The problem: sampling the posterior predictive produces predictions that are much lower than the ground truth.
Below is an MWE:
import numpy as np
import pymc as pm
import arviz as az
from matplotlib import pyplot as plt
# generate the training data, with 5 features
np.random.seed(42)
num_features = 5
coef_values = np.random.random(num_features)
X = np.random.random((5000, num_features))
y = np.exp(np.dot(X, coef_values))
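# sanity check (added for clarity): no observation noise is included above,
# so log(y) is an exact linear function of X
assert np.allclose(np.log(y), np.dot(X, coef_values))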
# the function to create the linear regression model with log-normally distributed response
def get_model(X_arr, y):
    coords = {
        "example": np.arange(X_arr.shape[0]),
        "feature": [f"feature_{i}" for i in range(X_arr.shape[1])],
    }
    # pass coords to the model so the "example"/"feature" dims are registered
    with pm.Model(coords=coords) as model:
        x = pm.Data("x", X_arr, dims=["example", "feature"])
        observed = pm.Data("observed", y, dims="example")
        sigma = pm.HalfNormal("sigma", 1)
        coef = pm.Normal("coef", dims="feature")
        pm.Lognormal(
            "y",
            mu=pm.math.dot(x, coef),
            sigma=sigma,
            observed=observed,
        )
    return model
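# for reference: with mu and sigma the mean/std of log(y), exp(mu) is the
# *median* of a log-normal, while its mean is exp(mu + sigma**2 / 2);
# a quick numerical check (illustrative only, not part of the model):
rng = np.random.default_rng(0)
demo = rng.lognormal(mean=1.0, sigma=0.5, size=1_000_000)
print(np.median(demo), np.exp(1.0))           # median ~= exp(mu)
print(demo.mean(), np.exp(1.0 + 0.5**2 / 2))  # mean ~= exp(mu + sigma^2/2)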
# fitting the model
with get_model(X, y) as model:
    trace = pm.sample(target_accept=0.95)
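# optional sanity check on the fit: r_hat should be close to 1.0
print(az.summary(trace, var_names=["coef", "sigma"]))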
# checking the fitted coefficients
# which match the ground truth
print(coef_values)
_ = az.plot_posterior(trace)
# prediction and evaluation
test_X = np.random.random((1000, num_features))
test_y = np.exp(np.dot(test_X, coef_values))
with get_model(test_X, test_y) as model:
    # set_data is redundant here since the model was rebuilt with test_X,
    # but it is what one would use when reusing the training model
    pm.set_data({"x": test_X})
    pm.sample_posterior_predictive(trace, extend_inferencedata=True)
pred_y = (
    trace.posterior_predictive.mean(["chain", "draw"])
    .to_array()
    .squeeze()
    .to_numpy()
    .reshape(-1)
)
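# equivalent, arguably clearer extraction (the predictive variable is named
# "y" in get_model):
pred_y = trace.posterior_predictive["y"].mean(dim=["chain", "draw"]).to_numpy()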
# plot the prediction against the true values
fig, ax = plt.subplots(1, 1, figsize=(6, 5))
ax.scatter(pred_y, test_y)
ax.set_xlabel("pred")
ax.set_ylabel("test")
ax.set_xscale("log")
ax.set_yscale("log")
ax.plot([0.1, 15], [0.1, 15], c="orange")  # y = x reference line (0 is invalid on a log axis)
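# quantify the gap in log space: a positive mean log-ratio means the
# predictions sit below the true values
print("mean log(test_y / pred_y):", np.mean(np.log(test_y) - np.log(pred_y)))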
The underlying coefficients are:
array([0.37454012, 0.95071431, 0.73199394, 0.59865848, 0.15601864])
And the posterior (from az.plot_posterior) matches it.
However, the predictions are systematically lower than the true values!
What am I missing?