Bayesian Regression: Inconsistent Prior Predictive Checks

I’m working on a Bayesian regression problem using PyMC and have encountered an issue with my prior predictive checks. I have some observed x, y data points and have constructed a Bayesian regression model with informative priors. Here’s a version of my model:

with pm.Model() as linear_model_s_t:

    # 0 - Observed data
    y_obs = pm.MutableData('Y', y_prior, dims='observation')
    X = pm.MutableData('X', x_prior, dims='observation')

    # 1 - Prior knowledge about the model parameters

    # Intercept
    intercept = pm.LogNormal('Intercept', mu=mu_B, sigma=std_B)

    # Slope
    slope = pm.Normal('Slope', mu=mu_A, sigma=sigma_A)

    # Standard deviation: observational noise
    # Opt to use the std from y_obs
    sigma = pm.HalfNormal('sigma', sigma=170)

    # 2 - Linear model: Y = A*x + B, so y ~ N(mu, sigma**2)
    y_model = pm.Deterministic('y_model', slope * X + intercept, dims='observation')

    # 3 - Likelihood
    y = pm.Normal('y', mu=y_model, sigma=sigma, observed=y_obs, shape=X.shape[0], dims='observation')

    # 4 - Sampling
    trace = pm.sample(2500, tune=2000, chains=4, cores=2, random_seed=rng,
                      idata_kwargs={"log_likelihood": True})

    # 5 - Prior and posterior predictions
    prior_pred = pm.sample_prior_predictive(samples=1000, random_seed=rng)
    post_pred = pm.sample_posterior_predictive(trace, extend_inferencedata=True, random_seed=rng)

I’ve checked all the convergence diagnostics for my model, and it appears to be converging well.



However, when I perform prior predictive checks, I notice that the prior predictive mean does not follow the distribution of my observed data. It appears flat, even though I’ve created informative priors for the model parameters.

  • Why is the prior predictive mean flat and not following the distribution of my data?
  • What could be the possible reasons behind this behavior, and how can I address it to ensure my model is more aligned with my prior knowledge?

I just want to check what the priors actually are: what values are mu_A, sigma_A, mu_B, and std_B?

Also, is there a reason why your data doesn’t ever go below zero? Sometimes our measurements are constrained to be non-negative, like with count data. If that is the case with your data, a Poisson or logistic regression might be a more plausible model and would prevent you from getting prior predictions below zero.

Hello!!
To provide a clearer understanding, let me explain the context:

  1. The variables x and y represent laboratory data, and it’s essential to note that they can only take positive values. This constraint matters because the intercept of the regression holds physical significance, and consequently the parameter derived from the intercept must also be positive.
  2. I have divided my dataset into subsets, each containing four data points, and applied a regression analysis using SciPy. During this analysis, I enforced a constraint to ensure that the parameter I am interested in remains positive.
  3. Subsequently, I delved into examining the distribution of two key parameters: A and B, representing the slope and intercept, respectively. I discovered that parameter A follows a normal distribution, while parameter B exhibits a skew-normal or log-normal distribution. These observations served as the basis for my prior beliefs.
  4. Using SciPy once more, I conducted fittings for normal and log-normal distributions to estimate the parameters of A and B. This process allowed me to calculate the respective values of sigma and mu for these distributions.
  5. Finally, I constructed my model using PyMC, incorporating the insights gained from the SciPy fitting procedures.

Is that approach ok? Or am I missing something? Should I use a Poisson? I guess logistic regression may not be suitable for this case, because it is primarily used for binary classification problems…

Hi. As @daniel-saunders-phil mentioned, it depends on the type of data you have. Logistic regression is not necessarily about classification problems, but it is definitely for binary data. For instance, if you have aggregated yes=1 and no=0 responses to an experiment, you could use a Binomial likelihood, but if you need to work directly with the 0/1 output, you can use a Bernoulli likelihood.

    intercept = pm.Normal('Intercept', mu=mu_B, sigma=std_B)
    slope = pm.Normal('Slope', mu=mu_A, sigma=sigma_A)
    # inv-logit maps the linear predictor into (0, 1), a valid Bernoulli probability
    y_model = pm.Deterministic('y_model', pm.math.invlogit(slope * X + intercept), dims='observation')
    y = pm.Bernoulli('y', p=y_model, observed=y_obs)
    

If you have, for instance, count data, then you could work with a Poisson or NegativeBinomial likelihood, maybe trying something like this:

    intercept = pm.Normal('Intercept', mu=mu_B, sigma=std_B)
    slope = pm.Normal('Slope', mu=mu_A, sigma=sigma_A)
    # exp maps the linear predictor into (0, inf), a valid Poisson rate
    y_model = pm.Deterministic('y_model', pm.math.exp(slope * X + intercept), dims='observation')
    y = pm.Poisson('y', mu=y_model, observed=y_obs)

Note that intercept and slope are now Gaussians, as it is the job of the link functions (sigmoid/inv-logit or exponential) to transform the linear model into an appropriate input for either Bernoulli or Poisson. Sorry if I misunderstood something, hope this helps.
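As a rough illustration of what the link functions do (the numbers here are arbitrary placeholders, not the priors above), you can draw a Gaussian linear predictor and push it through both transformations:

    import numpy as np
    import pymc as pm

    # Draw a Gaussian linear predictor from an arbitrary, illustrative prior
    eta = pm.draw(pm.Normal.dist(mu=0.0, sigma=1.5), draws=1_000, random_seed=1)

    p = 1 / (1 + np.exp(-eta))  # inv-logit: any real number -> (0, 1), a valid Bernoulli probability
    lam = np.exp(eta)           # exp: any real number -> (0, inf), a valid Poisson rate

    print(p.min(), p.max())     # stays strictly inside (0, 1)
    print(lam.min())            # stays strictly positive

Whatever the linear predictor does, the transformed values can never leave the range that the Bernoulli or Poisson likelihood expects.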

As an additional note, unless you’re doing some sort of prior updating (where the data often changes with every update), it is generally not considered appropriate to fit a model and derive parameters from it for a subsequent model on the same data (that’s usually called double-dipping). The usual best-practice recommendation is to derive priors from theory and previous literature (i.e. expert knowledge) if more informative priors are required. So I think steps 2 to 4 would not be recommended by most; you can directly try your PyMC model with less informative priors, or a hierarchical model, or try to make your priors more informative based on theoretical/expert knowledge.

Thank you for your explanation @Simon !
My data doesn’t follow a binomial distribution at all; it consists of continuous data representing various soil parameters. What I’m attempting to achieve is a simulation of our “professional experience”. To put it differently, in my field of geotechnics, we conduct hundreds of tests, but each time we embark on a new project in the same location, we tend to overlook all the tests conducted previously. Given that we possess this historical data, my objective is to leverage this old data to gain insights into my parameters (I’ve conducted exploratory analysis and fitting using SciPy with this historical data). Subsequently, I intend to make out-of-sample predictions using the new data collected specifically for the current project. Usually, for the ‘current project,’ we only have a few samples (4 or 8 points for the regression), so introducing our prior knowledge would improve the confidence in the estimated parameters. Naturally, in the subsequent project, the “current new data” will be incorporated to update the historical data. In this scenario, where I have two sets of data (old data and current data), am I essentially double-dipping into my model?

Ideally, to avoid double-dipping, we have two options (that I know of; I may be missing other alternatives): 1) combine the new and old data and sample from a single model, where you could optionally add a hierarchical level to account for the moment of sampling (e.g., old vs. new); 2) if the data cannot be combined, you can sample a model from the old data (model 1) and use the posteriors from model 1 as priors for the new-data model (model 2), i.e. prior updating (a rough sketch follows below).
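A minimal sketch of option 2, assuming the old-data posteriors can be summarised by simple moment matching; `x_old`, `y_old`, `x_new`, and `y_new` are placeholder arrays, not real measurements:

    import numpy as np
    import pymc as pm
    import arviz as az

    rng = np.random.default_rng(42)

    # Placeholder "historical" data (stand-ins for the real old measurements)
    x_old = rng.lognormal(1.0, 0.5, size=100)
    y_old = 5.0 + 2.0 * x_old + rng.normal(0, 1.0, size=100)

    # Model 1: fit the old data with weakly informative priors
    with pm.Model() as model_old:
        intercept = pm.HalfNormal("intercept", sigma=10)
        slope = pm.Normal("slope", mu=0, sigma=10)
        sigma = pm.HalfNormal("sigma", sigma=10)
        pm.Normal("y", mu=intercept + slope * x_old, sigma=sigma, observed=y_old)
        idata_old = pm.sample(random_seed=42)

    # Crude moment matching: summarise model 1's posteriors as new prior parameters
    post = az.extract(idata_old)
    slope_mu, slope_sd = float(post["slope"].mean()), float(post["slope"].std())
    int_mu, int_sd = float(post["intercept"].mean()), float(post["intercept"].std())

    # Placeholder "current project" data: only a handful of new points
    x_new = rng.lognormal(1.0, 0.5, size=4)
    y_new = 5.0 + 2.0 * x_new + rng.normal(0, 1.0, size=4)

    # Model 2: priors centred on model 1's posteriors (prior updating)
    with pm.Model() as model_new:
        intercept = pm.TruncatedNormal("intercept", mu=int_mu, sigma=int_sd, lower=0)
        slope = pm.Normal("slope", mu=slope_mu, sigma=slope_sd)
        sigma = pm.HalfNormal("sigma", sigma=10)
        pm.Normal("y", mu=intercept + slope * x_new, sigma=sigma, observed=y_new)
        idata_new = pm.sample(random_seed=42)

PyMC’s `Interpolated` distribution (used in the “Updating Priors” example in the PyMC documentation) is a less lossy way to turn posterior samples into priors, but moment matching is often enough when the posteriors are roughly Gaussian.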

Could you give an example of your data? For deciding on your sampling distribution (likelihood), the main focus is on your observed data y. What type of soil parameter are those data representing (e.g. albedo, moisture, concentration of some minerals)? I have no experience with your field, but it seems like many of those measures are concentrations of some sort; often a LogNormal distribution is used for those cases (it could be a truncated Gaussian or a Wald distribution as well, depending on the measure and your theoretical motivation). Another option is standardising your observed data ((y_obs - y_obs.mean()) / y_obs.std()) and using a Gaussian likelihood with Gaussian priors, if that’s appropriate for your data/inference (sometimes it isn’t).
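For clarity, that standardisation is just the following (the `y_obs` values here are made-up placeholders):

    import numpy as np

    y_obs = np.array([12.0, 35.5, 80.2, 150.0])    # placeholder values, not real measurements
    y_std = (y_obs - y_obs.mean()) / y_obs.std()   # standardised: mean 0, standard deviation 1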

I think option 2) is what I’m looking for, thank you for the tip! :blush:
I can’t standardize my data, as I would lose physical information.
My soil parameter is resistance, measured in the lab. This is what the real x,y data looks like:


And this is the distribution I obtained from Scipy for the regression parameters, A and B (y = Ax + B):

In that case, I would go for the second option with a model like this:

import pymc as pm
import arviz as az
import numpy as np
import matplotlib.pyplot as plt

#really simplistic simulation of data distributed as two correlated LogNormals
r = 0.8

x0 = np.sort(pm.draw(pm.LogNormal.dist(1, 3), draws=100, random_seed=27))
x = np.sort(pm.draw(pm.LogNormal.dist(1, 3), draws=100, random_seed=77))

y = r*x + np.sqrt(1-r**2)*x0  

plt.scatter(x,y)
plt.ylabel("y")
plt.xlabel("x")

with pm.Model() as mod:
    alpha = pm.HalfNormal("alpha", 1) #intercept
    beta = pm.HalfNormal("beta", 1) #slope
    mu = pm.Deterministic("mu", alpha + beta*x) #location
    sigma = pm.HalfNormal("sigma", 1) #sd
    y_hat = pm.LogNormal("y", mu=mu, sigma=sigma, observed=y)
    idata = pm.sample()
    
az.plot_trace(idata)
plt.tight_layout()

# take the estimated means from a LogNormal
pos_m = az.extract(idata.posterior)['mu'].values.mean(axis=0)
pos_s = az.extract(idata.posterior)['sigma'].values
means = np.exp(pos_m + (pos_s**2)/2)

az.plot_posterior(means)

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [alpha, beta, sigma]
 |████████████| 100.00% [8000/8000 00:03<00:00 Sampling 4 chains, 0 divergences]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 19 seconds.

[az.plot_trace and az.plot_posterior output omitted]

Note that alpha approaches 1 and sigma approaches 3, which are the parameters defined for the simulated pm.LogNormal.dist(1, 3), so the model retrieves the parameters reasonably well. Please take this with a spoon of salt: though it may work as a very general framework for your data, you may well have a better generative model than two correlated LogNormals. One option is creating several models with likelihoods that may be theoretically appropriate (e.g. replacing the LogNormal with a Wald or a Gamma) and comparing the models (a sketch of that comparison follows below).
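A minimal sketch of that comparison step, reusing `x` and `y` from the simulation above and only swapping the likelihood. The Wald and Gamma parameterisations here are illustrative assumptions, and note that `mu` means different things under each likelihood (it is on the log scale for the LogNormal, on the data scale for Wald and Gamma):

    import pymc as pm
    import arviz as az

    def fit(likelihood):
        """Fit the same linear model with a different positive-valued likelihood."""
        with pm.Model():
            alpha = pm.HalfNormal("alpha", 1)   # intercept
            beta = pm.HalfNormal("beta", 1)     # slope
            mu = pm.Deterministic("mu", alpha + beta * x)
            sigma = pm.HalfNormal("sigma", 1)
            if likelihood == "lognormal":
                pm.LogNormal("y", mu=mu, sigma=sigma, observed=y)
            elif likelihood == "wald":
                pm.Wald("y", mu=mu, lam=sigma, observed=y)
            elif likelihood == "gamma":
                pm.Gamma("y", mu=mu, sigma=sigma, observed=y)
            return pm.sample(idata_kwargs={"log_likelihood": True}, random_seed=1)

    idatas = {name: fit(name) for name in ["lognormal", "wald", "gamma"]}
    print(az.compare(idatas))  # ranks models by expected log pointwise predictive density (LOO)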

@Simon, I really appreciate your effort in helping me. However, when I plotted the prior predictive distribution, I still observed a flat prior predictive mean. I made some modifications to your code so that I could visualize both the prior and posterior predictive distributions:

with pm.Model() as mod:
    alpha = pm.HalfNormal("alpha", 1) #intercept
    beta = pm.HalfNormal("beta", 1) #slope
    mu = pm.Deterministic("mu", alpha + beta*x) #location
    sigma = pm.HalfNormal("sigma", 1) #sd
    y_hat = pm.LogNormal("y", mu=mu, sigma=sigma, observed=y)
    trace = pm.sample(idata_kwargs={"log_likelihood": True})
    prior_pred = pm.sample_prior_predictive(samples=10) 
    post_pred = pm.sample_posterior_predictive(trace,extend_inferencedata=True)


As you can see, the prior predictive mean remains a horizontal line. My primary concern is as follows: In cases where I lack the time or resources to conduct new laboratory tests, or when the available data is limited in size, will my prior information be sufficiently robust to yield reasonable parameter estimates? Consequently, in my real-world scenario, the prior distribution plays a pivotal role.
I made slight adjustments to your example to make the data more closely resemble my actual data:

import pymc as pm
import arviz as az
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

#really simplistic simulation of data distributed as two correlated LogNormals
r = 0.8
alpha = 4
sigma = 0.8

x0 = np.sort(pm.draw(pm.LogNormal.dist(alpha, sigma), draws=100, random_seed=27))
x = np.sort(pm.draw(pm.LogNormal.dist(alpha, sigma), draws=100, random_seed=77))

y = r*x + np.sqrt(1-r**2)*x0  

# Plot the data
_, ax = plt.subplots(1,2, figsize=(8, 4))
ax[0].plot(x, y, 'C0.')
ax[0].set_xlabel('x')
ax[0].set_ylabel('y', rotation=0)
az.plot_kde(y, ax=ax[1])
ax[1].set_xlabel('y')
ax[1].set_ylabel('p(y)')
plt.title('Real Data')
plt.tight_layout()

# Create the model

with pm.Model() as mod:
    alpha_m = pm.HalfNormal("alpha", alpha) #intercept
    beta_m = pm.HalfNormal("beta", alpha) #slope
    mu = pm.Deterministic("mu", alpha_m + beta_m*x) #location
    sigma = pm.HalfNormal("sigma", 170) #sd
    y_hat = pm.LogNormal("y", mu=mu, sigma=sigma, observed=y)
    trace = pm.sample(idata_kwargs={"log_likelihood": True})
    prior_pred = pm.sample_prior_predictive(samples=10) 
    post_pred = pm.sample_posterior_predictive(trace,extend_inferencedata=True)

# Plot results

y_pred_post = post_pred.posterior_predictive['y'].mean(dim=["chain", "draw"])
_, ax = plt.subplots(4, 2, figsize=(10, 12))
plt.subplots_adjust(hspace=0.5)

# row 0
az.plot_bpv(post_pred,kind="p_value",ax=ax[0,0])
az.plot_bpv(post_pred, kind="u_value", ax=ax[0,1])

# row 1
ax[1,0].set_title("mean")
az.plot_bpv(post_pred, kind="t_stat", t_stat="mean",ax=ax[1,0])
ax[1,1].set_title("standard deviation")
az.plot_bpv(post_pred, kind="t_stat", t_stat="std", ax=ax[1,1])

# row 2
ax[2,0].set_title("Prior Predictions Distribution")
az.plot_ppc(prior_pred, ax=ax[2,0],alpha=0.01,colors=["gray","blue",'red'], mean=True, legend=True,group='prior')
# ax[2,0].set_xlim([-1,1000])
ax[2,1].set_title("Observed Data Distribution")
ax[2,1].hist(y, bins=20,color='gray',alpha=0.7, density=True)
sns.kdeplot(y, ax=ax[2, 1], color='blue', linestyle='--')


# row 3
ax[3,0].set_title("Posterior Predictions Distribution")
az.plot_ppc(post_pred, ax=ax[3,0],alpha=0.01,colors=["gray","blue",'red'], mean=True, legend=True,group='posterior')
ax[3,1].set_title("Observed x Predicted Data Distribution")
ax[3,1].scatter(x,y,alpha=0.4,color='blue',label='Observed')
ax[3,1].scatter(x,y_pred_post,alpha=0.4,color='gray',label='Predicted')
ax[3,1].legend()

# plot trace

az.plot_trace(trace)
plt.tight_layout()

# plot parameters

def plot_pretty_posteriors(trace):
    fig, axes = plt.subplots(1, 3, figsize=(15, 5))

    # Plot for 'Intercept'
    az.plot_posterior(trace, var_names='alpha', color="#7A68A6",kind='hist',hdi_prob=0.95,bins=25, ax=axes[0], textsize=12)
    axes[0].set_title('Intercept', fontsize=16)

    # Plot for 'Slope'
    az.plot_posterior(trace, var_names='beta',  color="#A60628",kind='hist',hdi_prob=0.95,bins=25, ax=axes[1], textsize=12)
    axes[1].set_title('Slope', fontsize=16)

    # Plot for 'sigma'
    az.plot_posterior(trace, var_names='sigma',  color="#348ABD",kind='hist',hdi_prob=0.95,bins=25, ax=axes[2], textsize=12)
    axes[2].set_title('sigma', fontsize=16)

    plt.tight_layout()
    plt.show()


plot_pretty_posteriors(trace)

Then, I compared it with my previous model posted at the beginning of this discussion using leave-one-out cross-validation (LOO), and it seems the first model is performing better.

I just need the prior predictive mean to follow a HalfNormal rather than a horizontal line, but I haven’t figured out how yet :cry:

That’s the point of prior predictive checks: they let you calibrate your priors before running the model. The problem is that your approach goes the opposite way: you’re trying to make your distributions fit pre-assigned values of the prior parameters (obtained from point estimates). That’s not going to work. LogNormal distributions have very long tails and pile up a great number of values close to zero, which will push your prior predictive mean towards zero. That does not necessarily mean that the model is worse at inference (the same goes for WAIC or LOO model comparison, which are more indicative of predictive capacity). However, a more constrained model could be better for theoretical reasons (again, that’s what you must decide according to your domain knowledge).

A Normal distribution is definitely inappropriate, as it allows for negative values, but maybe a TruncatedNormal could work better. It’s also better to start the prior predictive checks from less informative priors, unless they come from very clear theoretical knowledge, or from a previous Bayesian model (prior update).

import pymc as pm
import arviz as az
import numpy as np
import matplotlib.pyplot as plt

r = 0.8
alpha = 4
sigma = 0.8

x0 = np.sort(pm.draw(pm.LogNormal.dist(alpha, sigma), draws=100, random_seed=27))
x = np.sort(pm.draw(pm.LogNormal.dist(alpha, sigma), draws=100, random_seed=77))

y = r*x + np.sqrt(1-r**2)*x0  

with pm.Model() as mod:
    alpha_m = pm.HalfNormal("alpha", 1) #intercept
    beta_m = pm.HalfNormal("beta", 1) #slope
    mu = pm.Deterministic("mu", alpha_m + beta_m*x) #mean
    sigma = pm.HalfNormal("sigma", 10) #sd
    y_hat = pm.TruncatedNormal("y", mu=mu, sigma=sigma, lower=0, observed=y)

with mod:
    prior_pred = pm.sample_prior_predictive() 

az.plot_ppc(prior_pred, group="prior")
az.plot_ppc(prior_pred, group="prior", kind="cumulative")

[prior predictive density and cumulative plots omitted]

Which definitely looks better than the LogNormal, and allows you to have a more appropriate likelihood (not negative values). Now, it’s important to keep in mind that the simulated data comes from a LogNormal distribution, and that’s why my model in the previous example can recover the parameters with reasonable precision. The TruncatedNormal model does a reasonable job for the mean (alpha), but it’s a bit off for the sd (sigma).

Comparing the 3 models seems to be useful as well (though often not recommended as a method for model selection); note that now the Normal model has 0 weight and the LogNormal gives a warning (something may be mis-specified):

| model | rank | elpd_loo | p_loo | elpd_diff | weight | se | dse | warning | scale |
|---|---|---|---|---|---|---|---|---|---|
| TruncatedNormal | 0 | -362.1440699 | 9.12993812 | 0 | 0.962512852 | 15.11492559 | 0 | FALSE | log |
| Normal | 1 | -362.8778718 | 9.6356368 | 0.733801882 | 0 | 15.43134132 | 0.521035978 | FALSE | log |
| LogNormal | 2 | -480.1654312 | 6.444936509 | 118.0213612 | 0.037487148 | 12.77992679 | 14.71650608 | TRUE | log |

An important note is that the point of assessing your models with toy data (or generative-model simulations) is to test whether you can retrieve the original parameters by using relatively uninformative priors in your model. So, you should not use the same alpha for your HalfNormal priors. Also, beta_m is a prior for the slope, so using alpha there does not seem too appropriate. Additionally, HalfNormal’s default parameter is a scale parameter (sigma), so it’s not the same as the location parameter alpha you used for the LogNormal. All these details seem a bit nitpicky, but ultimately knowing them very well can greatly help with prior selection. Hope this helps.
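A quick way to see that last point (the numbers are purely illustrative): the single parameter of HalfNormal is a scale, so plugging the LogNormal’s location alpha into it gives a very different prior than intended.

    import pymc as pm

    hn = pm.draw(pm.HalfNormal.dist(sigma=4), draws=50_000, random_seed=1)
    ln = pm.draw(pm.LogNormal.dist(mu=4, sigma=0.8), draws=50_000, random_seed=1)

    print(hn.mean())  # about 4 * sqrt(2 / pi) ≈ 3.2: sigma=4 acts as a scale, not a mean
    print(ln.mean())  # about exp(4 + 0.8**2 / 2) ≈ 75: mu=4 is a location on the log scale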


“The problem is that your approach goes the opposite way: you’re trying to make your distributions fit pre-assigned values of the prior parameters (obtained from point estimates). That’s not going to work.”

I don’t understand why it’s not going to work. If I’m providing all the information to the model (the pre-assigned values), shouldn’t the model perform even better? There shouldn’t be any issues with recovering what it already knows.

“LogNormal distributions have very long tails and pile up a great number of values close to zero, which will push your prior predictive mean towards zero.”

That makes sense; I need to find a distribution with a shorter tail or less weight in the tail. I’m currently searching for such a distribution.

“It’s also better to start the prior predictive checks from less informative priors, unless they come from very clear theoretical knowledge, or from a previous Bayesian model (prior update).”

Well, I suppose what I’m attempting to do is a “Bayesian prior update,” but I’m clearly going about it the wrong way. I’ve also tried simulating this update by considering a scenario where I only have four samples and then gradually increasing the sample size to 100 to see how it improves the model. The conclusion I’ve drawn is that uncertainty decreases and the R-squared value (calculated from the mean of the predicted data against the actual observations) increases; nothing new there. Which other metrics could I use to evaluate the sample size effect in my model?

“An important note is that the point of assessing your models with toy data (or generative model simulations) is to test whether you can retrieve the original parameters by using relatively uninformative priors in your model. So, you should not use the same alpha for your HalfNormal priors. Also, beta_m is a prior for the slope, so using alpha there does not seem too appropriate. Additionally, HalfNormal’s default parameter is a scale parameter (sigma), so it’s not the same as the location parameter alpha you used for the LogNormal. All these details may seem a bit nitpicky, but understanding them well can greatly help with prior selection. Hope this helps.”

Okay, now I see why it’s not advisable to provide pre-assigned values to the model, as mentioned in the first paragraph. These details you’ve shared are extremely useful.
Thank you for these great instructions!

P.S. I’m writing all my thoughts here because they might be useful for other beginners in Bayesian approaches.


Maybe some measure of change between the prior and posterior distributions, like KL-divergence, Wasserstein distance, or a Kolmogorov–Smirnov test? I’ve never seen anyone do that, but it’s an idea I’ve wanted to try for a while.
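For instance, a rough sketch of that idea, assuming `trace` and `prior_pred` come from one of the models above (which have a `beta` slope parameter):

    from scipy.stats import ks_2samp, wasserstein_distance

    # Flatten the prior and posterior samples of the slope into 1-D arrays
    prior_beta = prior_pred.prior["beta"].values.ravel()
    post_beta = trace.posterior["beta"].values.ravel()

    print(wasserstein_distance(prior_beta, post_beta))  # how far the posterior moved from the prior
    print(ks_2samp(prior_beta, post_beta))              # KS statistic and p-value for the two samples

As more data is added, one would expect the distance between prior and posterior to stabilise once the likelihood dominates.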


I’m glad I could help. Other possibly useful metrics and approaches may depend on the type of model you’re using, for instance RMSEA (https://journals.sagepub.com/doi/pdf/10.1177/0013164417709314). Others may be used to assess how close your model’s posterior or posterior-predictive distributions get to the target distribution (the simulated distribution). These are usually similarity indices, such as SDI or H2 (similarly, @jessegrabowski mentioned the Wasserstein distance above as well). These latter ones can be very handy when using Bayesian workflows for a precision analysis, such as for estimating sample sizes (https://link.springer.com/article/10.3758/s13423-016-1221-4). Also, they can be very well complemented with LOO model comparisons, which give you information about predictive fit (https://arxiv.org/pdf/1507.04544.pdf). If useful, I have tried something along those lines here: https://psyarxiv.com/6an4w/.
