How to handle non-normalized data in a GLM?

Hello,

I am currently facing a normalization issue in a simple GLM. The model itself is straightforward, but I am struggling to fit two countries whose outcome variables live on very different scales.

To clarify, my model looks like this:

\text{income} = \beta_{0,\text{country}} + \beta_{1,\text{country}} \cdot \text{years\_of\_education}

The code looks like this:

import numpy as np
import pymc as pm
from scipy import stats
import matplotlib.pyplot as plt
import arviz as az

np.random.seed(0)
n_samples = 1000
education_years = np.random.normal(10, 2, (n_samples, 2))  # predictor, one column per country
b1_real = np.array([-1, 5])
b0_real = np.array([5_000, 5])  # intercepts on very different scales
y = b0_real + b1_real * education_years + stats.halfnorm.rvs(0, 20, (n_samples, 2))  # simulated income

with pm.Model() as model_country:
    b0 = pm.Normal("b0", mu=0, sigma=10, shape=2)
    b1 = pm.Normal("b1", mu=0, sigma=10, shape=2)
    sigma = pm.HalfNormal("sigma", sigma=10, shape=2)
    
    mu = b0 + b1 * education_years
    # Likelihood; the shape is inferred from the observed data
    inc_mu = pm.Normal("inc_mu", mu=mu, sigma=sigma, observed=y)

    trace = pm.sample(draws=500, cores=4, target_accept=0.80)
    posterior_predictive = pm.sample_posterior_predictive(trace, var_names=['b0', 'b1', 'inc_mu', 'sigma'])
    map_estimate = pm.find_MAP()  # renamed to avoid shadowing the built-in map()

az.plot_forest(trace, combined=True, hdi_prob=0.95)

The problem is that the sampler struggles to recover the original parameters (b0_real, b1_real). I intentionally chose values on very different scales because the parameters also differ significantly in the "real" example that I'm working on.

I have learned that normalization is typically a recommended step in data modeling. However, in my specific case I want to avoid it, because certain experimental values and constraints make results on the raw scale easier to interpret. In other words, I would like to keep the betas unnormalized.

Is there any way to transform the model so that it can handle non-normalized datasets such as the one in my example? Would changing the NUTS sampler (or its settings) be the best alternative? Or would adding informative priors help resolve the issue, perhaps by performing a linear regression first and using those estimates to inform the priors? I would greatly appreciate any guidance you can provide.

Thanks in advance!

Knowing what your priors should be is much easier when working with standardized data; it's one of the main reasons I prefer standardizing. But if you want to keep your data in raw form, you need to craft your priors accordingly. Here is one easy way to begin:

b0 = pm.Normal("b0", mu=np.mean(y, axis=0), sigma=10, shape=2)

This uses the observed means to "move" the priors on the two intercept terms into the vicinity of the data.
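Building on that idea, here is a minimal sketch that also scales the slope and noise priors from the raw data, so every prior lives in the units of the problem. The scaling heuristic (sd of y over sd of x for the slope) and the model_raw name are my own suggestions, not anything built into PyMC:

# Per-country empirical scales (axis 0 indexes the samples)
y_mean = y.mean(axis=0)
y_sd = y.std(axis=0)
slope_scale = y_sd / education_years.std(axis=0)  # rough "income units per year of education"

with pm.Model() as model_raw:
    # Intercepts centered on the observed outcome means, widths in outcome units
    b0 = pm.Normal("b0", mu=y_mean, sigma=y_sd, shape=2)
    # Slopes with a scale that matches the units of the raw data
    b1 = pm.Normal("b1", mu=0, sigma=slope_scale, shape=2)
    sigma = pm.HalfNormal("sigma", sigma=y_sd, shape=2)

    mu = b0 + b1 * education_years
    income = pm.Normal("income", mu=mu, sigma=sigma, observed=y)

The exact scale factors are a judgment call; the point is only that priors of width 10 are extremely informative (in the wrong place) when the intercept is around 5,000.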

A more general approach is to tweak the priors and then use prior predictive sampling to inspect the implied data and check whether it matches your intuition/knowledge.
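For example, reusing your model_country from above, something like this draws simulated datasets from the priors alone and compares them against the observed incomes:

with model_country:
    # Forward-simulate from the priors, with no conditioning on y
    prior_pred = pm.sample_prior_predictive()

# Overlay the prior-implied incomes on the observed data
az.plot_ppc(prior_pred, group="prior")
plt.show()

If the prior-implied incomes are nowhere near the observed range (as they will be with Normal(0, 10) intercepts and incomes around 5,000), that is a strong hint the priors need rescaling before blaming the sampler.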
