Baysian hirerchical Linear Regression(Partial Pooling) model using PYMC

Dataset:-


Dataset is of US states , showing mortality rate due to Cardiovascular disease dependency of different features ex. AQI,Obesity etc.
Goal:-
I want to predict Mortality rate(Data_value) using given features.

Model:-
using correlation matrix i found that mortality rate is more correlated with AQI , Obesity , Temperature and Previous year data values .
So i define model as given below

import pymc as pm
import numpy as np

# Preparing the data
cleaned_data['LocationDesc_encoded'] = cleaned_data['LocationDesc'].astype('category').cat.codes

# Define the model
with pm.Model() as hierarchical_model:
    # Hyperpriors for group nodes
    mu_a = pm.Normal('mu_a', mu=0, sigma=1)
    sigma_a = pm.HalfCauchy('sigma_a', beta=1)

    # Priors for individual intercepts
    a = pm.Normal('a', mu=mu_a, sigma=sigma_a, shape=len(cleaned_data['LocationDesc_encoded'].unique()))

    # Hyperpriors for group slopes
    mu_b = pm.Normal('mu_b', mu=0, sigma=1)
    sigma_b = pm.HalfCauchy('sigma_b', beta=1)

    # Priors for individual slopes
    b = pm.Normal('b', mu=mu_b, sigma=sigma_b, shape=(len(cleaned_data['LocationDesc_encoded'].unique()), 4))

    # Model error
    sigma = pm.HalfCauchy('sigma', beta=1)

    # Expected value
    mu = a[cleaned_data['LocationDesc_encoded'].values] + \
         b[cleaned_data['LocationDesc_encoded'].values, 0] * cleaned_data['obesity_Prevalence'] + \
         b[cleaned_data['LocationDesc_encoded'].values, 1] * cleaned_data['data_value_py'] + \
         b[cleaned_data['LocationDesc_encoded'].values, 2] * cleaned_data['Avg Temp(°F)'] + \
         b[cleaned_data['LocationDesc_encoded'].values, 3] * cleaned_data['AQI']

    # Likelihood
    Y_obs = pm.Normal('Y_obs', mu=mu, sigma=sigma, observed=cleaned_data['Data_Value'])

    # Sampling from the posterior distribution
    trace = pm.sample(5000, tune=1000, return_inferencedata=True)

I am new to this topic So there are many doubts can someone help me to resolve these :slight_smile:

  1. Is my model correctly defined? if there is any issue then please mention and suggest correction.

  2. As i learned that we calculate accuracy on test data , So how can i predict data values (Mortality rate ) for test data so that i can calculate RMSE.If you can provide any code snippet that will be very helpful to me.

  3. various resources calculate RMSE on the data which they used for finding
    posterior distribution , Is is correct way to Evaluate model.

ThankYou :slightly_smiling_face:

Your model looks mostly correct to me, though you may want to check that your predictors are on the same scale and if not standardise them (your model does not seem to be built to deal with this since priors for slopes seem to have the same sigma for every predictor). If I am not mistaken, you also defined a separate intercept for every predictor. In my experience the more common way is to define a single intercept and multiple slopes (which would be multivariate linear regression basically). See for instance Chapter 18 in Kruscheke talks about these. There is a notebook which has translated the code there to pymc:

https://nbviewer.org/github/cluhmann/DBDA-python/tree/master/Notebooks/

For doing predictions see the following link (I think it has linear regression examples, though not hierarchical, but should easily generalize to your model)

For checking the model, usually I would do some checks roughly in the following order (some of them would be only useful if you do 4 or more chains):

1- You can construct a diagram of your model:
https://www.pymc.io/projects/docs/en/v5.6.1/api/generated/pymc.model_to_graphviz.html
This would make it easier to eyeball your model and see if it is defined the way you imagined it.

2- Prior and posterior checks: Real test of performance for Bayesian models is the posterior predictive check:

https://www.pymc.io/projects/docs/en/stable/learn/core_notebooks/posterior_predictive.html

Prior checks can also be used, to see if the priors you have chosen give results within reasonable values. This is also shown in the link above.

Note that when you do

sample_posterior_predictive(trace, extend_inference=True, predictions=True)

then your trace object will contain sampled predictions in trace[“predictions”][“Y_obs”]. This will have the shape number of draws x number of chains x Y_obs.shape. If you want to you can take the mean across the first two dimensions and do a residual plot maybe with error bands. However see 4 below.

3- Use arviz to produce posterior plots and trace plots for your parameters. Your posteriors from different chains should produce similar results. And also I think in a simple linear regression setting one would expect posteriors to look like unimodal normal distributions.

https://python.arviz.org/en/latest/api/generated/arviz.plot_posterior.html
https://python.arviz.org/en/latest/api/generated/arviz.plot_trace.html

4- loo_pit, This is again some manner of comparing your observed distribution to predictive samples. This will in the end produce a transformed distribution which tells you on (weighted) average how well your observation looks like the sampled predictions. weights come from the log probability of each sample. So I would say this is more akin to perhaps a distribution-like version of RMSE over multiple samples. Though, in this case. you want your transformed distribution to be as close to a uniform distribution as possible but using ecdf=True, you can get something which tells your model is good if the resulting statistics is close to 0. A link, again doing this for linear regression, is here:

Here it also actually presents a way to transform your posterior predictive samples for linear regression so posterior predictive plots make more sense.

5- If you want to test different priors and do model selection amongst them:

https://www.pymc.io/projects/docs/en/stable/learn/core_notebooks/model_comparison.html

1 Like

As taking reference from above answers I was doing predictions as

with hierarchical_model:
    pm.set_data({
        "LocationDesc_encoded": test_data['LocationDesc_encoded'].values,
        "obesity_Prevalence": test_data['obesity_Prevalence'].values,
        "data_value_py": test_data['data_value_py'].values,
        "Avg Temp(°F)": test_data['Avg Temp(°F)'].values,
        "AQI": test_data['AQI'].values
    })

    # Sample from the posterior predictive distribution
    posterior_predictive = pm.sample_posterior_predictive(idata, predictions=True,random_seed=rng)

it is giving error as

So I searched for this and found that Data variable are not mutable so to make variable mutable Chatgpt suggest me code as

with hierarchical_model:
        LocationDesc_encoded_shared = pm.Data('LocationDesc_encoded', training_data['LocationDesc_encoded'].values, mutable=True)
        obesity_Prevalence_shared = pm.Data('obesity_Prevalence', training_data['obesity_Prevalence'].values, mutable=True)
        data_value_py_shared = pm.Data('data_value_py', training_data['data_value_py'].values, mutable=True)
        Avg_Temp_data = pm.Data('Avg Temp(°F)', training_data['Avg Temp(°F)'].values, mutable=True)
        AQI_data = pm.Data('AQI', training_data['AQI'].values, mutable=True)

after this my model architecture is as follows


is this correct way of define model with mutable variables ?
Or someone please provide some reference to read about it ?

Here’s a tutorial on data containers. It’s slightly out of date - there is no longer any pm.MutableData and pm.ConstantData, just pm.Data (ditto mutable_coords, there’s just coords now). Otherwise it should be helpful to you.

@iavicenna i have doubt can you please verify this below once
I want to do prior_predictive_check and check does my prior are good or not below is my model code and with the last line for prior_predictive_check

As the Number of rows in my data set is 398 so I took 150 samples

import pymc as pm
import numpy as np

# Preparing the data
cleaned_data['LocationDesc_encoded'] = cleaned_data['LocationDesc'].astype('category').cat.codes

# Define the model
with pm.Model() as hierarchical_model:
    # Hyperpriors for group nodes
    mu_a = pm.Normal('mu_a', 0.0, 0.5)
    sigma_a = pm.Exponential('sigma_a', 1.0)

    # Priors for individual intercepts
    a = pm.Normal('a', mu=mu_a, sigma=sigma_a, shape=len(cleaned_data['LocationDesc_encoded'].unique()))

    # Hyperpriors for group slopes
    mu_b = pm.Normal('mu_b', 0.0, 0.5)
    sigma_b = pm.Exponential('sigma_b', 1.0)

    # Priors for individual slopes
    b = pm.Normal('b', mu=mu_b, sigma=sigma_b, shape=(len(cleaned_data['LocationDesc_encoded'].unique()), 4))

    # Model error
    sigma = pm.Exponential('sigma', 1.0)

    # Expected value
    mu = a[cleaned_data['LocationDesc_encoded'].values] + \
         b[cleaned_data['LocationDesc_encoded'].values, 0] * cleaned_data['obesity_Prevalence'] + \
         b[cleaned_data['LocationDesc_encoded'].values, 1] * cleaned_data['data_value_py'] + \
         b[cleaned_data['LocationDesc_encoded'].values, 2] * cleaned_data['Avg Temp(°F)'] + \
         b[cleaned_data['LocationDesc_encoded'].values, 3] * cleaned_data['AQI']

    # Likelihood
    Y_obs = pm.Normal('Y_obs', mu=mu, sigma=sigma, observed=cleaned_data['Data_Value'])
    idata = pm.sample_prior_predictive(samples=150, random_seed=42)

From the below reference, I read how to do a prior_predictice check and wrote the below code.
https://www.pymc.io/projects/docs/en/stable/learn/core_notebooks/posterior_predictive.html

import numpy as np
import xarray as xr

# Example dimensions
num_locations = 49
num_samples_per_location = 150

# Generate random values for each location within the specified range
x = xr.DataArray(
    np.random.uniform(low=-2, high=2, size=(num_locations, num_samples_per_location)),
    dims=['location', 'sample']
)
# Print shapes to verify
print("Shape of prior_a:", prior_a.shape)  # Should be (1, 49, 150)
print("Shape of prior_b:", prior_b.shape)  # Should be (1, 49, 150, 4)
print("Shape of x:", x.shape)              # Should be (49, 150)

I wrote a model for each Location so here below calculating y for each location as the target is dependent on 4 predictors, So I used 4 predictors so here also calculating for all 4 predictors by running loop (k) from 0 to 3.

# Initialize y with zeros
y = np.zeros((num_locations, num_samples_per_location))

# Iterate over each location and sample
for location in range(num_locations):
      for k in range(4):
        # Compute y for the current location and sample
        y[location, :] = prior_a[0, :, location] + \
                              np.sum(prior_b[0, :, location, :][k] * x[location, :])

# y now contains the computed values for each location and sample
print("Shape of y:", y.shape)  #(49,150)

It’s always hard to tell on paper – the best is to plot their implication on the outcome scale, like that:

# Plotting the graph
plt.figure(figsize=(10, 6))
num_locations = 4
for location in range(num_locations):
    plt.scatter(x[location, :],y[location,:], label=f"Location {location}" )

plt.xlabel('X')
plt.ylabel('Y')
plt.title('Y vs X for different locations')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', fontsize='small', ncol=2)
plt.show()


I uploaded the results I am getting from the above code and did only for 4 locations now, the questions I have are mentioned below.

Q-1) all the above things I did was correct or not if wrong how can I correct them, if possible can you provide code snippets

Q-2) if correct than what can I interpret from the above graphs or result

It is hard to tell whether your prior prediction matches the model exactly when I don’t have a fully working code however what you are doing is in the right direction, you extract the samples for your prior and then recombine them exactly as you did in the model. I won’t be able to tell exactly though, I don’t have the running code and variable names are different between the model and the prior prediction. On the other hand I noticed one thing in the prior prediction:

prior_b[0, :, location, :][k]

prior_b has shape 1,49,150,4 as per your comment where second dimension is normally indexed by ‘LocationDesc_encoded’ in your model. However in the prediction you are indexing location into the third dimension which is the samples dimension (unless you have transposed prior_b somewhere and have not shown the code). Moreover with your notation, you index k into the first dimension of the remaining 2D matrix where as you should really index it into the last (predictor variables dimension). You may want to recheck this. Because you are doing matrix multiplications via for loops, you are less likely to catch shape errors (you will basically catch them if you have index out of bounds but not for shape incompatibility) so you may want to vectorise your multiplications.

Also, if you don’t need to change cleaned_data, then you can do the prior_predictive_check in a more simpler manner. After doing sample_prior_predictive, you can check idata[“prior_predictive”][“Y_obs”] to get the sample counts (using prior distributions) and check if they make sense directly. This will prevent any errors and perhaps better for starters to get acquainted to the results you obtain.

As for what to make of the results you obtain, your case is a bit different than what is in the page because you are doing multivariate regression (or something similar to it, since I still don’t understand why you have a separate intercept for each predictor variable). I am not sure scatter plotting x values against y is the right thing to do. If it was two predictor variables for instance, what you would have done is plot hyperplanes using coordinates x1,x2. In this case you seem to have four predictor variables so I don’t even understand what a scatter of x vs y does but what I would only do in this case is either

1- fix some of the other three variables and plot some lines using the remaining predictor and do this multiple times
2- plot an histogram values of the predicted values to see if they are within reasonable range.

The most important task of prior prediction is to see if the predictions using the default priors are sensible and ultimately it is you who can only tell since you know what you want to model. Its’ requirements are not as strict as posterior predictive checks. One can even say, it is more a debugging tool if your model behaves unexpectedly (for instance if you have more complicated models, you may have chosen a prior that easily creates divergences in your observed values and prior predictive checks may reveal this).

1 Like

@jessegrabowski @iavicenna

_, ax = plt.subplots()

x = xr.DataArray(np.linspace(-2, 2, 50), dims=["plot_dim"])
prior = idata.prior
y = prior["a"] + prior["b"] * x

ax.plot(x, y.stack(sample=("chain", "draw")), c="k", alpha=0.4)

ax.set_xlabel("Predictor (stdz)")
ax.set_ylabel("Mean Outcome (stdz)")
ax.set_title("Prior predictive checks -- Flat priors");

why they are generating x , and what this code actual use and meaning

the reason I am using different intercepts is that a different location(here US states) have different mortality rates and I want to make a hierarchical model for each location(US State) so for each location I have a separate regression equation

image shows my actual model equation

Screenshot 2024-07-08 at 11.43.28 PM

@iavicenna
Also, if you don’t need to change cleaned_data, then you can do the prior_predictive_check in a more simpler manner. After doing sample_prior_predictive, you can check idata[“prior_predictive”][“Y_obs”] to get the sample counts (using prior distributions) and check if they make sense directly. This will prevent any errors and perhaps better for starters to get acquainted to the results you obtain.

you mean below thing right?

import xarray as xr
import arviz as az
def plot_xY(x, Y, ax):
    quantiles = Y.quantile((0.025, 0.25, 0.5, 0.75, 0.975), dim=("chain", "draw")).transpose()

    az.plot_hdi(
        x,
        hdi_data=quantiles.sel(quantile=[0.025, 0.975]),
        fill_kwargs={"alpha": 0.25},
        smooth=False,
        ax=ax,
    )
    az.plot_hdi(
        x,
        hdi_data=quantiles.sel(quantile=[0.25, 0.75]),
        fill_kwargs={"alpha": 0.5},
        smooth=False,
        ax=ax,
    )
    ax.plot(x, quantiles.sel(quantile=0.5), color="C1", lw=3)

figsize = (10, 5)

fig, ax = plt.subplots(figsize=figsize)

plot_xY(range(len(cleaned_data['Data_Value'])), idata.prior_predictive["Y_obs"], ax)
# format_x_axis(ax)
ax.plot(range(len(cleaned_data['Data_Value'])), cleaned_data['Data_Value'], label="observed")
ax.set(title="Prior predictive distribution")
plt.legend();

output is

i updated code as


# Define the model
with pm.Model() as hierarchical_model:
    #Define the data
    Location = pm.Data('LocationDesc_encoded', training_data['LocationDesc_encoded'],mutable=True)
    Obesity = pm.Data('obesity_Prevalence', training_data['obesity_Prevalence'],mutable=True)
    Data_val_py = pm.Data('data_value_py', training_data['data_value_py'],mutable=True)
    Avg_Temp = pm.Data('Avg Temp(°F)', training_data['Avg Temp(°F)'],mutable=True)
    AQI = pm.Data('AQI', training_data['AQI'],mutable=True)


    
    # Hyperpriors for group nodes
    mu_a = pm.Normal('mu_a', mu=0, sigma=0.5)
    sigma_a = pm.HalfCauchy('sigma_a', beta=1)

    # Priors for individual intercepts
    a = pm.Normal('a', mu=mu_a, sigma=sigma_a, shape=len(training_data['LocationDesc_encoded'].unique()))

    # Hyperpriors for group slopes
    mu_b = pm.Normal('mu_b', mu=0, sigma=1)
    sigma_b = pm.HalfCauchy('sigma_b', beta=1)

    # Priors for individual slopes
    b = pm.Normal('b', mu=mu_b, sigma=sigma_b, shape=(len(training_data['LocationDesc_encoded'].unique()), 4))

    # Model error
    sigma = pm.HalfCauchy('sigma', beta=1)

    # Expected value
    mu = a[Location] + \
         b[Location, 0] * Obesity + \
         b[Location, 1] * Data_val_py + \
         b[Location, 2] * Avg_Temp + \
         b[Location, 3] * AQI

    # Likelihood
    Y_obs = pm.Normal('Y_obs', mu=mu, sigma=sigma, observed=training_data['Data_Value'])

with hierarchical_model:
    idata = pm.sample(100, chains=2, target_accept=0.9,random_seed = rng)
with  hierarchical_model:
  pm.set_data({
        'LocationDesc_encoded': test_data['LocationDesc_encoded'],
        'obesity_Prevalence': test_data['obesity_Prevalence'],
        'data_value_py': test_data['data_value_py'],
        'Avg Temp(°F)': test_data['Avg Temp(°F)'],
        'AQI': test_data['AQI']
    })
  posterior_predictive = pm.sample_posterior_predictive(idata, random_seed=rng)

Now it is giving shape mismatch error because training data is (392 x 8) and test data is of (97 x 8).

To resolve this I tried

with  hierarchical_model:
  pm.Model.set_dim('LocationDesc_encoded', len(set(test_data['LocationDesc_encoded'])))

but this is giving error as

TypeError: Model.set_dim() missing 1 required positional argument: 'new_length'

Please help to resolve issue!

Short answer: You should link the shape of Y_obs to one of the shape of the exogenous data, say Location, so pytensor knows internally to update the shape of Y_obs when the shape of Location changes. This was explained in the linked example. You do it as follows:

Y_obs = pm.Normal('Y_obs', mu=mu, sigma=sigma, observed=training_data['Data_Value'], shape=Location.shape[0])

Beyond that, some suggestions:

  1. You should use coords and dims, and shown in the example already linked. This will make your code significantly easier to reason about a read.
  2. Somewhat related, pd.factorize is the suggested way to handle label encoding for PyMC models. See the notes below.
  3. There’s no reason to split up your data. It should be much easier to keep in a matrix, and work with this directly.
  4. Your slopes seem to be wrong. You have a single mean and sigma hyperprior for all feature/location pairs. This encodes that the estimated effect for, say, obesity in location A, is drawn from the same distribution as the effect for the average temperature in region B. I suspect this is not what you intend.
  5. Code inside parenthesis can include line breaks; there’s no need to use this \ notation

My suggested corrections:

import numpy as np
import pandas as pd
import pymc as pm

from string import ascii_uppercase
from sklearn.model_selection import train_test_split

# Generate random data
df = pd.DataFrame(np.random.normal(size=(500, 4)), columns=['obesity_Prevalence', 'data_value_py', 'Avg Temp(°F)', 'AQI'])
df['location'] = np.random.choice(list(ascii_uppercase), size=500, replace=True)

# Stratify by location so all locations are in both the train and test set
df_train, df_test = train_test_split(df, test_size=0.2, stratify=df.location.values)

# Sort doesn't sort the rows of the data, it sorts the indices being assign to each location. When 
# sort = False, label 0 belongs to the first class encountered. When True, it belongs to the sorted
# first class. Its important to sort here so that class 0 is the same thing in both the train and 
# test data (although the location in the first row of data might be different in each)
location_idx_train, locations = pd.factorize(df_train.location, sort=True)
location_idx_test, _ = pd.factorize(df_test.location, sort=True)
features = ['obesity_Prevalence', 'Avg Temp(°F)', 'AQI']

coords = {
    'location':locations,
    'feature': features,
}

# Define the model
with pm.Model(coords=coords) as hierarchical_model:
    #Define the data
    X = pm.Data('X', df_train[features])
    y = pm.Data('y', df_train['data_value_py'])
    location_idx_pt = pm.Data('location_idx', location_idx_train)
    
    # Hyperpriors for group nodes
    mu_a = pm.Normal('mu_a', mu=0, sigma=0.5)
    sigma_a = pm.HalfCauchy('sigma_a', beta=1)
    a = pm.Normal('a', mu=mu_a, sigma=sigma_a, dims=['location'])

    # Hyperpriors for group slopes
    mu_b = pm.Normal('mu_b', mu=0, sigma=1, dims=['feature'])
    sigma_b = pm.HalfCauchy('sigma_b', beta=1, dims=['feature'])
    b = pm.Normal('b', mu=mu_b, sigma=sigma_b, dims=['location', 'feature'])

    # Model error
    sigma = pm.HalfCauchy('sigma', beta=1)

    # Expected value
    mu = a[location_idx_pt] + (X * b[location_idx_pt]).sum(axis=-1)
    
    # Likelihood
    Y_obs = pm.Normal('Y_obs', mu=mu, sigma=sigma, observed=y, size=X.shape[0])
    idata = pm.sample(nuts_sampler='nutpie')

Out of sample prediction:

with hierarchical_model:
    pm.set_data({'X':df_test[features], 'location_idx':location_idx_test})
    idata = pm.sample_posterior_predictive(idata, extend_inferencedata=True, predictions=True)
1 Like

Imagine you have a model for a linear regression

y = a*x + b

where y are observed, x are independent variables. You setup some priors for a and b and want to determine whether if these are reasonable to start with. You may either want them to be reasonable for the given x, or you may want them to be reasonable for all possible x within some range. So if you know that in reality x can come from anywhere between say [-b ,b] you also regenerate x (instead of using the given x) and then sample from your model using the generated x and priors for a and b. Extending the range of x from what is observed would probably be more crucial for generalized linear regression. There it may quite be possible that the priors you set for your parameters may produce linear predictors which are fine for your link function but that as soon as you extend the range of your x, your start getting linear predictors that make the link function blow-up. This could mean that your priors, the range of x you have set or your link function aren’t realistic.

That being said I think by using set_data (see link below), you can perhaps do prior sample prediction without having to rewrite your model:

https://www.pymc.io/projects/docs/en/latest/api/model/generated/pymc.model.core.set_data.html#pymc.model.core.set_data

So you write a function that returns the model where x is stored with pm.Data and then you use pm.set_data to change x and do sample_prior_predictive. Perhaps try it on the simpler linear regression model in the links above first to understand how it works, before trying it on a more complicated model like yours.

ps: I just saw one of your latter replies, you do in fact seem to be going in this direction. If you are not acquaintanted with setting data etc, perhaps first try it on a simpler toy linear regression model from the links above.

1 Like

Yes for

idata.prior_predictive["Y_obs"]

This would give the predictions of your model using the given x and the priors (without determining the posteriors). I am not sure what is the best representation for comparing it to your observed data though. I was really just imagining doing a histogram of data.prior_predictive[“Y_obs”] flattened and histogram of cleaned_data[‘Data_Value’] and comparing them. The point of prior predictive checks is really just to see if your prior based predictions for data.prior_predictive[“Y_obs”] are within reasonable range, you don’t need to compare them one by one to your observables. If all your observed data are in say range [-10,10] and with your chosen priors your model is producing results like 10000 quite often, then you may for instance need to change the scales of priors. It does not need to produce your observable distribution faithfully at all, it may have biases, it may have large variation etc. It just needs to be “reasonable” (and do not blow-up in more complicated problems).

If I now understand your model correctly by looking at the equation, you have multiple locations (each data coming from one location) and you are basically doing separate multivariate linear regressions for each. If that is the case then you could do a separate prior predictive check for each location (use of coordinates and a working knowledge of xarrays would really help for this matter). You can look at data.prior_predictive[“Y_obs”] for each location separately and either plot a histogram of sampled values vs observed values. Or you can extract a and b samples for each location as you did and plot collections of lines where for each line you fix a randomly selected three of the four predictors and vary the other. Doing it separately for each location may help in the case of unlikely but possible event of data from different locations are given in different scales etc (though I suppose normalizing your data for each location prior to sampling would be more sensical in such a case).

I mean at the end of the day this is a relatively straightforward linear regression model and I think the most important thing you need to check is whether or not with your prior distributions the model produces predictions in a reasonable a range. I am not an expert though so more experienced people may come with alternative suggestions.

1 Like