Hello,
I am currently facing a normalization issue with a simple GLM modeling problem. My linear model is fairly simple, but I am struggling to fit two different countries that have a significant difference in the outcome variable.
To clarify, my model looks like this:
\text{income} = \beta_{0,country} + \beta_{1, country} * \text{years_of_education}
The code looks like this
import numpy as np
import pymc as pm
from scipy import stats
import matplotlib.pyplot as plt
import arviz as az
np.random.seed(0)
n_samples = 1000
education_years = np.random.normal(10, 2, (n_samples,2))
b1_real = np.array([-1,5])
b0_real = np.array([5_000, 5])
y = b0_real + b1_real*education_years + stats.halfnorm.rvs(0, 20,(n_samples,2)) #income
with pm.Model() as model_country:
b0 = pm.Normal("b0", mu=0, sigma=10, shape=2)
b1 = pm.Normal("b1", mu=0, sigma=10, shape=2)
sigma = pm.HalfNormal("sigma", sigma=10, shape=2)
mu = b0 + b1 * education_years
inc_mu = pm.Normal("inc_mu", mu=mu, sigma=sigma, observed=y, shape=y.shape)
trace = pm.sample(draws=500, cores=4, target_accept=0.80, )
posterior_predictive = pm.sample_posterior_predictive(trace, var_names=['b0', 'b1', 'inc_mu', 'sigma'])
map = pm.find_MAP()
az.plot_forest(trace, combined=True, hdi_prob=0.95)
The problem is that the sampler struggles a lot to find the original parameters (b0_real
, b1_real
). I intentionally selected numbers that are fairly different because they differ significantly on the “real” example that I’m working on
I have learned that normalization is typically a necessary step in data modeling. However, in my specific case, I want to avoid normalization due to certain experimental values and constraints that would affect the simplicity of my results. Instead, I would like to keep the betas unnormalized.
Is there any way to transform the model so that it can handle non-normalized datasets, such as the one in my example? Perhaps changing the NUTS sampler would be the best alternative?. Alternatively, would adding informative priors help resolve the issue? Maybe performing a linear regression first and using those parameters to inform the prior? I would greatly appreciate any guidance you can provide.
Thanks in advance!