Linear regression with log-normal model: posterior predictive is quite off

I’d like to implement a linear regression model where the response variable is modeled with a log-normal distribution.

  • What works: I can build the model and obtain correct posterior estimates of the regression coefficients.
  • The problem: sampling the posterior predictive yields predictions that are much lower than the ground truth.

Below is an MWE:

import numpy as np
import seaborn as sns
import pymc as pm
import arviz as az

from matplotlib import pyplot as plt

# generate the training data, with 5 features
np.random.seed(42)
num_features = 5

coef_values = np.random.random(num_features)

X = np.random.random((5000, num_features))
y = np.exp(np.dot(X, coef_values))

# the function to create the linear regression model with log-normally distributed response
def get_model(X_arr, y):
    coords = {
        "example": np.arange(X_arr.shape[0]),
        "feature": [f"feature_{i}" for i in range(X_arr.shape[1])],
    }
    with pm.Model(coords=coords) as model:
        x = pm.Data("x", X_arr, dims=["example", "feature"])
        observed = pm.Data("observed", y, dims="example")

        sigma = pm.HalfNormal("sigma", 1)
        coef = pm.Normal("coef", dims="feature")

        pm.Lognormal(
            "y",
            mu=pm.math.dot(x, coef),
            sigma=sigma,
            observed=observed,
        )
    return model

# fitting the model
with get_model(X, y) as model:
    trace = pm.sample(target_accept=0.95)

# checking the fitted coefficients
# which match the ground truth
print(coef_values)
_ = az.plot_posterior(trace)

# prediction and evaluation
test_X = np.random.random((1000, num_features))
test_y = np.exp(np.dot(test_X, coef_values))

with get_model(test_X, test_y) as model:
    pm.set_data({"x": test_X})

    pm.sample_posterior_predictive(
        trace, extend_inferencedata=True
    )

# point prediction: mean of the posterior predictive draws per test example
pred_y = (
    trace.posterior_predictive.mean(["chain", "draw"])
    .to_array()
    .squeeze()
    .to_numpy()
    .reshape(-1)
)

# plot the prediction against the true values
fig, ax = plt.subplots(1, 1, figsize=(6, 5))
ax.scatter(pred_y, test_y)
ax.set_xlabel("pred")
ax.set_ylabel("test")
ax.set_xscale("log")
ax.set_yscale("log")
ax.plot([0, 15], [0, 15], c="orange")

The underlying coefficients are:

array([0.37454012, 0.95071431, 0.73199394, 0.59865848, 0.15601864])

And the posterior (from az.plot_posterior) matches it.

However, the predictions are systematically lower than the truth!

What am I missing?

Meanwhile, using the mean of the posterior of the coefficients to make predictions works fine.

# get the mean of coefficients
coef_est = trace.posterior.coef.mean(["chain", "draw"]).to_numpy().reshape(-1)

# predict
pred_y = np.exp(np.dot(test_X, coef_est))

# plot
fig, ax = plt.subplots(1, 1, figsize=(6, 5))
ax.scatter(pred_y, test_y)
ax.set_xlabel("prediction")
ax.set_ylabel("truth")
ax.set_xscale("log")
ax.set_yscale("log")
ax.set_title("prediction vs truth")
ax.plot([0, 15], [0, 15], c="orange")

Maybe something is wrong with my use of trace.posterior_predictive.

One workaround is to work with the coefficients sampled from the posterior directly.

# coefficient samples transposed to shape (chain, feature, draw)
pred_y_log_samples = np.matmul(
    test_X, np.transpose(trace.posterior.coef.to_numpy(), [0, 2, 1])
)  # -> log-scale predictions with shape (chain, example, draw)
# average over chains, then over draws, and map back to the original scale
pred_y_log = pred_y_log_samples.mean(axis=0).mean(axis=1)
pred_y = np.exp(pred_y_log)

Your sigma parameter doesn’t seem to have converged, and you are ignoring it when you compute the expectation from the coefficients directly.
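
For instance, a quick sketch of one way to check that, reusing the trace and the az import from the original post:

# r_hat and effective sample size for sigma, plus its trace plot
print(az.summary(trace, var_names=["sigma"]))
az.plot_trace(trace, var_names=["sigma"])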

Your simulated data doesn’t have any noise in it; you have to add some, or the posterior will be ill-defined and nearly impossible to sample.
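
For example, a minimal sketch of a noisy version of the data-generating step above (noise_sigma and the 0.1 scale are arbitrary choices for illustration):

# add Gaussian noise on the log scale, so the log-normal likelihood has a
# non-degenerate sigma to recover
noise_sigma = 0.1
y = np.exp(np.dot(X, coef_values) + np.random.normal(0, noise_sigma, size=X.shape[0]))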


Thanks for your reply!

I updated the above example by adding random noise to the observed responses.

Notebook

However, the aforementioned issue persists:

  1. predictions tend to be lower than the observed values when trace.posterior_predictive is used for prediction
  2. meanwhile, predictions align better with the observed values when we operate on the posterior samples directly and do the calculation outside of the model


You’re still comparing an expectation with posterior predictive draws (which have noise around them). It’s perhaps surprising that the expectation aligns so much better than the posterior predictive draws; that may be a consequence of the model.

It should not indicate a bug, unless you get different results when you also add noise manually to your predictions.
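
For example, here is a sketch of that check, assuming pred_y_log_samples from the workaround above (with shape chain × example × draw) and the posterior samples of sigma; the variable names are just for illustration:

# add log-scale noise drawn with the posterior sigma samples, then compare the
# mean of the noisy draws against the posterior-predictive mean
rng = np.random.default_rng(0)
sigma_samples = trace.posterior.sigma.to_numpy()  # (chain, draw)
noise = rng.normal(0.0, sigma_samples[:, None, :], size=pred_y_log_samples.shape)
pred_y_noisy_mean = np.exp(pred_y_log_samples + noise).mean(axis=(0, 2))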

I see. A somewhat unrelated question: what are reasonable ways to obtain point estimates of the predictions? I understand this is less Bayesian, but such a format is helpful when point-based evaluation metrics (e.g., mean squared error) are used.

You can take the mean of the posterior predictive, or you can look at the mu parameter directly if you wrap it in a pm.Deterministic. You would get the same thing that you computed manually.
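
A minimal sketch of what that could look like, based on the get_model function from the original post (the name "mu" is arbitrary):

def get_model(X_arr, y):
    coords = {
        "example": np.arange(X_arr.shape[0]),
        "feature": [f"feature_{i}" for i in range(X_arr.shape[1])],
    }
    with pm.Model(coords=coords) as model:
        x = pm.Data("x", X_arr, dims=["example", "feature"])
        observed = pm.Data("observed", y, dims="example")

        sigma = pm.HalfNormal("sigma", 1)
        coef = pm.Normal("coef", dims="feature")

        # track the log-scale mean explicitly so it is stored in the trace
        mu = pm.Deterministic("mu", pm.math.dot(x, coef), dims="example")

        pm.Lognormal("y", mu=mu, sigma=sigma, observed=observed)
    return model

After sampling, np.exp(trace.posterior["mu"].mean(["chain", "draw"])) reproduces the point estimate you computed manually above; for new data, pass var_names=["mu", "y"] to pm.sample_posterior_predictive so that mu is recomputed for the test inputs.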