How to handle non-normalized data in a GLM?

Hello,

I am currently facing a normalization issue in a simple GLM. The model itself is straightforward, but I am struggling to fit two countries whose outcome variables live on very different scales.

To clarify, my model looks like this:

\text{income} = \beta_{0,\text{country}} + \beta_{1,\text{country}} \cdot \text{years\_of\_education}

The code looks like this:

import numpy as np
import pymc as pm
from scipy import stats
import matplotlib.pyplot as plt
import arviz as az

np.random.seed(0)
n_samples = 1000
education_years = np.random.normal(10, 2, (n_samples, 2))  # predictor, one column per country
b1_real = np.array([-1, 5])
b0_real = np.array([5_000, 5])  # intercepts on very different scales
y = b0_real + b1_real * education_years + stats.halfnorm.rvs(0, 20, (n_samples, 2))  # simulated income

with pm.Model() as model_country:
    b0 = pm.Normal("b0", mu=0, sigma=10, shape=2)
    b1 = pm.Normal("b1", mu=0, sigma=10, shape=2)
    sigma = pm.HalfNormal("sigma", sigma=10, shape=2)
    
    mu = b0 + b1 * education_years
    # Likelihood; the shape is inferred from the observed data
    inc_mu = pm.Normal("inc_mu", mu=mu, sigma=sigma, observed=y)

    trace = pm.sample(draws=500, cores=4, target_accept=0.80)
    posterior_predictive = pm.sample_posterior_predictive(trace, var_names=['b0', 'b1', 'inc_mu', 'sigma'])
    map_estimate = pm.find_MAP()  # renamed to avoid shadowing the built-in map()

az.plot_forest(trace, combined=True, hdi_prob=0.95)

The problem is that the sampler struggles to recover the original parameters (b0_real, b1_real). I intentionally chose values on very different scales because the parameters also differ significantly in the "real" example that I'm working on.

I have learned that normalization is typically a recommended step in data modeling. However, in my specific case I want to avoid it, because certain experimental values and constraints make results on the raw scale easier to interpret. In other words, I would like to keep the betas unnormalized.

Is there any way to transform the model so that it can handle non-normalized datasets such as the one in my example? Would changing the NUTS sampler (or its settings) be the best alternative? Or would adding informative priors help resolve the issue, perhaps by performing a linear regression first and using those estimates to inform the priors? I would greatly appreciate any guidance you can provide.

Thanks in advance!

Knowing what your priors should be is much easier when working with standardized data; it's one of the main reasons I prefer standardizing. But if you want to keep your data in raw form, you need to craft your priors accordingly. Here is one easy way to begin:

b0 = pm.Normal("b0", mu=np.mean(y, axis=0), sigma=10, shape=2)

This uses the observed means to "move" the priors on the two intercept terms into the vicinity of the data.
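Building on that idea, here is a minimal sketch that also scales the slope and noise priors from the raw data, so every prior lives in the units of the problem. The scaling heuristic (sd of y over sd of x for the slope) and the model_raw name are my own suggestions, not anything built into PyMC:

# Per-country empirical scales (axis 0 indexes the samples)
y_mean = y.mean(axis=0)
y_sd = y.std(axis=0)
slope_scale = y_sd / education_years.std(axis=0)  # rough "income units per year of education"

with pm.Model() as model_raw:
    # Intercepts centered on the observed outcome means, widths in outcome units
    b0 = pm.Normal("b0", mu=y_mean, sigma=y_sd, shape=2)
    # Slopes with a scale that matches the units of the raw data
    b1 = pm.Normal("b1", mu=0, sigma=slope_scale, shape=2)
    sigma = pm.HalfNormal("sigma", sigma=y_sd, shape=2)

    mu = b0 + b1 * education_years
    income = pm.Normal("income", mu=mu, sigma=sigma, observed=y)

The exact scale factors are a judgment call; the point is only that priors of width 10 are extremely informative (in the wrong place) when the intercept is around 5,000.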

A more general approach is to tweak the priors and then use prior predictive sampling to inspect the implied data and check whether it matches your intuition/knowledge.
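For example, reusing your model_country from above, something like this draws simulated datasets from the priors alone and compares them against the observed incomes:

with model_country:
    # Forward-simulate from the priors, with no conditioning on y
    prior_pred = pm.sample_prior_predictive()

# Overlay the prior-implied incomes on the observed data
az.plot_ppc(prior_pred, group="prior")
plt.show()

If the prior-implied incomes are nowhere near the observed range (as they will be with Normal(0, 10) intercepts and incomes around 5,000), that is a strong hint the priors need rescaling before blaming the sampler.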
