I’d like to implement a linear regression model where the response variable follows a log-normal distribution.
- What works: I can build the model and obtain correct posterior estimates of the regression coefficients.
- The problem: sampling the posterior predictive produces predictions that are much lower than the ground truth.
Below is an MWE:
import numpy as np
import pymc as pm
import arviz as az
from matplotlib import pyplot as plt
# generate the training data, with 5 features
np.random.seed(42)
num_features = 5
coef_values = np.random.random(num_features)
X = np.random.random((5000, num_features))
y = np.exp(np.dot(X, coef_values))
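# sanity check (added for clarity): no observation noise is included above,
# so log(y) is an exact linear function of X
assert np.allclose(np.log(y), np.dot(X, coef_values))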
# the function to create the linear regression model with log-normally distributed response
def get_model(X_arr, y):
    coords = {
        "example": np.arange(X_arr.shape[0]),
        "feature": [f"feature_{i}" for i in range(X_arr.shape[1])],
    }
    # pass coords to the model so the "example"/"feature" dims are registered
    with pm.Model(coords=coords) as model:
        x = pm.Data("x", X_arr, dims=["example", "feature"])
        observed = pm.Data("observed", y, dims="example")
        sigma = pm.HalfNormal("sigma", 1)
        coef = pm.Normal("coef", dims="feature")
        pm.Lognormal(
            "y",
            mu=pm.math.dot(x, coef),
            sigma=sigma,
            observed=observed,
        )
    return model
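# for reference: with mu and sigma the mean/std of log(y), exp(mu) is the
# *median* of a log-normal, while its mean is exp(mu + sigma**2 / 2);
# a quick numerical check (illustrative only, not part of the model):
rng = np.random.default_rng(0)
demo = rng.lognormal(mean=1.0, sigma=0.5, size=1_000_000)
print(np.median(demo), np.exp(1.0))           # median ~= exp(mu)
print(demo.mean(), np.exp(1.0 + 0.5**2 / 2))  # mean ~= exp(mu + sigma^2/2)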
# fitting the model
with get_model(X, y) as model:
    trace = pm.sample(target_accept=0.95)
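# optional sanity check on the fit: r_hat should be close to 1.0
print(az.summary(trace, var_names=["coef", "sigma"]))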
# checking the fitted coefficients
# which match the ground truth
print(coef_values)
_ = az.plot_posterior(trace)
# prediction and evaluation
test_X = np.random.random((1000, num_features))
test_y = np.exp(np.dot(test_X, coef_values))
with get_model(test_X, test_y) as model:
    # set_data is redundant here since the model was rebuilt with test_X,
    # but it is what one would use when reusing the training model
    pm.set_data({"x": test_X})
    pm.sample_posterior_predictive(trace, extend_inferencedata=True)
pred_y = (
    trace.posterior_predictive.mean(["chain", "draw"])
    .to_array()
    .squeeze()
    .to_numpy()
    .reshape(-1)
)
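# equivalent, arguably clearer extraction (the predictive variable is named
# "y" in get_model):
pred_y = trace.posterior_predictive["y"].mean(dim=["chain", "draw"]).to_numpy()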
# plot the prediction against the true values
fig, ax = plt.subplots(1, 1, figsize=(6, 5))
ax.scatter(pred_y, test_y)
ax.set_xlabel("pred")
ax.set_ylabel("test")
ax.set_xscale("log")
ax.set_yscale("log")
ax.plot([0.1, 15], [0.1, 15], c="orange")  # y = x reference line (0 is invalid on a log axis)
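# quantify the gap in log space: a positive mean log-ratio means the
# predictions sit below the true values
print("mean log(test_y / pred_y):", np.mean(np.log(test_y) - np.log(pred_y)))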
The underlying coefficients are:
array([0.37454012, 0.95071431, 0.73199394, 0.59865848, 0.15601864])
And the posterior (from az.plot_posterior) matches it.
However, the predictions are systematically lower than the true values!
What am I missing?