Hi everyone! I am writing again because I am going crazy with this model. I am trying to test the performance of a hierarchical model on a simulated dataset. I create a set of measurements with a hurdle distribution, already described here. Here is the code that builds the simulated dataset:
import numpy as np
import pandas as pd
import pymc3 as pm
import arviz as az
import matplotlib.pyplot as plt
import seaborn as sns

def simulate_mixed_data(pi, mean, sigma, tot_size=1000, rs=1, shuffle=True):
    """Simulate a hurdle-lognormal sample: a fraction pi of exact zeros,
    the rest drawn from LogNormal(mean, sigma)."""
    n_zeros = int(tot_size * pi)
    zeros = np.zeros(n_zeros)
    np.random.seed(rs)
    non_zeros = np.random.lognormal(mean=mean, sigma=sigma, size=tot_size - n_zeros)
    mix = np.concatenate((zeros, non_zeros))
    if shuffle:
        np.random.shuffle(mix)
    return mix

def extract_bias(rev_n, biases):
    """Draw one bias value for a data point rated by rater rev_n."""
    return np.random.choice(biases[rev_n])

# true parameters of the real (unbiased) values
pi_real_data = 0.3
mu_real_data = 1.3
sigma_real_data = 0.8
n_raters = 3
n_data = 2000
rs = 2
np.random.seed(rs)
real_values = simulate_mixed_data(pi_real_data, mu_real_data, sigma_real_data, n_data, rs=rs)

# true parameters of each rater's multiplicative bias
mu_bias_data = np.array([0.5, 0.8, 1.2])
sigma_bias_data = np.array([0.8, 0.1, 0.3])
pi_bias_data = np.array([0.08, 0.1, 0.02])
biases = np.array([simulate_mixed_data(pi=pi_bias_data[i], mean=mu_bias_data[i],
                                        sigma=sigma_bias_data[i], tot_size=n_data, rs=rs)
                   for i in range(n_raters)])

# assign a random rater to each real value and apply that rater's bias
df = pd.DataFrame(columns=['real'], data=real_values)
df.loc[:, 'rater'] = np.random.randint(0, n_raters, size=len(real_values))
df.loc[:, 'bias'] = df.loc[:, 'rater'].apply(extract_bias, biases=biases)
df.loc[:, 'biased'] = df.bias * df.real

raters_idxs = df.rater.values
obs_values = df.biased.values
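As a quick (optional) sanity check of the simulation, the zero fraction and the log-moments of real_values should roughly match the parameters above:

print('zero fraction: %.2f (expected %.2f)' % ((real_values == 0).mean(), pi_real_data))
nz_real = real_values[real_values > 0]
print('log-mean: %.2f (expected %.2f), log-sd: %.2f (expected %.2f)'
      % (np.log(nz_real).mean(), mu_real_data, np.log(nz_real).std(), sigma_real_data))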
So basically I have a distribution of real_values rated by 3 different raters, each of which "biases" the set of real_values. I also assume that this bias follows the same kind of hurdle distribution and that it is multiplicative, so observations = real_values * bias. I build the model using the analytical relationship between the variables: the observed values are the product of two hurdle-lognormal variables, each of which is 0 with probability \pi and, with probability 1-\pi, follows a lognormal distribution with log-mean \mu and log-sd \sigma. So, if
\begin{equation}
real\_values \sim HLN(\pi_r, \mu_r, \sigma_r), \qquad bias \sim HLN(\pi_\beta, \mu_\beta, \sigma_\beta)
\end{equation}
then obs (= real\_values \times bias) \sim HLN(\pi_o, \mu_o, \sigma_o)
with
\begin{equation}
\pi_o = 1-(1-\pi_r)(1-\pi_\beta), \qquad
\mu_o = \mu_r + \mu_\beta, \qquad
\sigma_o = \sqrt{\sigma_r^2+\sigma_\beta^2}
\end{equation}
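As a quick numerical check of these relations on the simulated data (using the true parameters and the data frame defined above):

pi_o_true = 1 - (1 - pi_real_data) * (1 - pi_bias_data)
mu_o_true = mu_real_data + mu_bias_data
sigma_o_true = np.sqrt(sigma_real_data**2 + sigma_bias_data**2)
for r in range(n_raters):
    nz = df.loc[(df.rater == r) & (df.biased > 0), 'biased'].values
    zero_frac = (df.loc[df.rater == r, 'biased'] == 0).mean()
    print('rater %d: analytic pi=%.2f mu=%.2f sigma=%.2f | empirical pi=%.2f mu=%.2f sigma=%.2f'
          % (r, pi_o_true[r], mu_o_true[r], sigma_o_true[r],
             zero_frac, np.log(nz).mean(), np.log(nz).std()))

The model below encodes exactly these relationships between the priors and the observation-level parameters.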
with pm.Model() as bias_model:
    positiveNormal = pm.Bound(pm.Normal, lower=0.)
    confinedNormal = pm.Bound(pm.Normal, lower=0.001, upper=.999)
    # priors on real values
    pi_real = pm.Beta('pi_real', alpha=1.5, beta=3., shape=1)
    mu_real = positiveNormal('mu_real', mu=2., sigma=1., shape=1)
    sigma_real = pm.HalfNormal('sigma_real', sigma=1., shape=1)
    # priors on raters' bias
    pi_bias = pm.Beta('pi_bias', alpha=1.5, beta=3., shape=n_raters)
    mu_bias = positiveNormal('mu_bias', mu=2., sigma=1., shape=n_raters)
    sigma_bias = pm.HalfNormal('sigma_bias', sigma=1., shape=n_raters)
    # combination of parameters (product of lognormally distributed variables)
    pi_obs = confinedNormal('pi_obs', mu=1 - ((1 - pi_real) * (1 - pi_bias)), sigma=1., shape=n_raters)
    mu_obs = positiveNormal('mu_obs', mu=mu_real + mu_bias, sigma=1., shape=n_raters)
    sigma_obs = positiveNormal('sigma_obs', mu=pm.math.sqrt(pm.math.sqr(sigma_real) + pm.math.sqr(sigma_bias)),
                               sigma=1., shape=n_raters)
    # Likelihood
    obs = custom_dists.HurdleLognormal('obs', pi=pi_obs[raters_idxs], mu=mu_obs[raters_idxs],
                                       sigma=sigma_obs[raters_idxs], observed=obs_values)
    data = pm.sample(500, chains=3, tune=2000, return_inferencedata=True)
(the custom_dists.HurdleLognormal distribution is defined in my previous post)
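For completeness, such a hurdle-lognormal likelihood can be sketched with pm.DensityDist roughly as follows (the implementation in my previous post may differ in the details; the zero-guard on value only keeps the unused branch of the switch finite):

import theano.tensor as tt

def hurdle_lognormal_logp(pi, mu, sigma):
    # element-wise log-likelihood: P(value = 0) = pi,
    # positive values get weight (1 - pi) times a LogNormal(mu, sigma) density
    def logp(value):
        is_zero = tt.eq(value, 0)
        safe_value = tt.switch(is_zero, 1., value)  # avoid log(0) in the unused branch
        logp_pos = (tt.log1p(-pi)
                    - tt.log(safe_value * sigma * np.sqrt(2 * np.pi))
                    - 0.5 * ((tt.log(safe_value) - mu) / sigma) ** 2)
        return tt.switch(is_zero, tt.log(pi), logp_pos)
    return logp

# usage, in place of custom_dists.HurdleLognormal:
# obs = pm.DensityDist('obs',
#                      hurdle_lognormal_logp(pi_obs[raters_idxs], mu_obs[raters_idxs],
#                                            sigma_obs[raters_idxs]),
#                      observed=obs_values)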
Here is the issue: the model infers the combined parameters (pi_obs, mu_obs, sigma_obs) quite well, even though it seems to learn nothing about their parents (pi/mu/sigma_real and pi/mu/sigma_bias), as you can see from the traceplots.
pi_obs = 1 - ((1 - pi_real_data) * (1 - pi_bias_data))
mu_obs = mu_real_data + mu_bias_data
sigma_obs = np.sqrt(sigma_real_data**2 + sigma_bias_data**2)

lines = [('pi_obs', {}, pi_obs), ('mu_obs', {}, mu_obs), ('sigma_obs', {}, sigma_obs)]
with bias_model:
    az.plot_trace(data, compact=True, var_names=['pi_obs', 'mu_obs', 'sigma_obs'],
                  legend=True, lines=lines)

lines = [('pi_bias', {}, pi_bias_data), ('mu_bias', {}, mu_bias_data), ('sigma_bias', {}, sigma_bias_data)]
with bias_model:
    az.plot_trace(data, compact=True, var_names=['pi_bias', 'mu_bias', 'sigma_bias'],
                  legend=True, lines=lines)

lines = [('pi_real', {}, pi_real_data), ('mu_real', {}, mu_real_data), ('sigma_real', {}, sigma_real_data)]
with bias_model:
    az.plot_trace(data, compact=True, var_names=['pi_real', 'mu_real', 'sigma_real'],
                  legend=True, lines=lines)
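One rough way to check whether these marginals simply mirror the priors is to compare the posterior summaries against the analytical prior moments, e.g. for the Beta(1.5, 3) priors on pi_real and pi_bias:

a, b = 1.5, 3.
print('Beta(1.5, 3) prior: mean=%.2f, sd=%.2f'
      % (a / (a + b), np.sqrt(a * b / ((a + b)**2 * (a + b + 1)))))
print(az.summary(data, var_names=['pi_real', 'pi_bias'], kind='stats'))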
To me it looks like the traces for the real and bias parameters essentially reproduce the priors I set: they are hardly updated at all. Despite this, the model seems to solve the problem, because a posterior predictive check shows that it reproduces the distribution of the observed values quite well, also rater by rater:
with bias_model:
    ppc = pm.sample_posterior_predictive(data, samples=1)

for n_rater in range(n_raters):
    print('rev:%d, pi=%.2f mu=%.2f sigma=%.2f' % (n_rater, 1 - (1 - pi_real_data) * (1 - pi_bias_data[n_rater]),
                                                  mu_real_data + mu_bias_data[n_rater],
                                                  np.sqrt(sigma_real_data**2 + sigma_bias_data[n_rater]**2)))
    idxs = np.where(raters_idxs == n_rater)
    observed = obs_values[idxs]
    reproduced = ppc['obs'][0, idxs].flatten()
    fig, ax = plt.subplots(figsize=(12, 8))
    sns.histplot(observed, label='observed', stat='density')
    sns.histplot(reproduced, label='reproduced', color='orange', alpha=0.4, stat='density')
    print('reproduced proportion of zeros: %.2f' % (len(reproduced[reproduced == 0]) / len(reproduced)))
    non_zeros = reproduced[reproduced > 0]
    print('reproduced mean and sigma of non zeros: mu=%.2f sigma=%.2f' % (np.mean(np.log(non_zeros)),
                                                                          np.std(np.log(non_zeros))))
    plt.xlabel('rates')
    plt.legend()
    plt.title('rater %d' % n_rater)
    plt.show()
Given that I am mostly interested in the bias parameters, and that I'd like to introduce another source of error later, this is a non-negligible issue for me.
Do you have any idea about:
- how it is possible that the model infers the obs parameters with high accuracy while learning nothing about their "parents" (the real and bias parameters) in the hierarchical model?
- how I could solve this issue and get reliable estimates of the parameters I am actually interested in?
Thank you so much if anyone can help!!