Feedback on a Hierarchical Model for Learning Shared Coefficient of Variation

Hi all,

I am attempting to estimate the distributions of several grouped random variables that have different, uncorrelated means but similar coefficients of variation (CoV = stddev/mean). Based on some data exploration and general intuition, I believe a hierarchical model is appropriate here given the grouped nature of the data. However, I’m a little uncertain how to specify and structure such a model so as to maintain correct supports for the prior distributions and to prevent double counting of mean uncertainty in the posterior predictive model, since the standard deviation will be a function of the mean and CoV estimates. I’ve implemented a toy example of my model below as a reference, and I was wondering if I could get some feedback or advice on how I might improve the modeling approach to avoid these issues. I haven’t come across another example of a hierarchical model for CoV estimates, so I’m a bit in the dark. Thanks in advance!

import numpy as np
import matplotlib.pyplot as plt
import pymc as pm

# True parameters for the simulated data
means = [20, 150, 78, 65]
CoVs = [0.05, 0.06, 0.055, 0.048]
samples = 6

group_vals = np.zeros((samples, len(means)))
group_labels = np.zeros((samples, len(means)), dtype=int)

# Simulate each group with sd = mean * CoV and plot the draws
for i in range(4):
    group_vals[:, i] = np.random.normal(means[i], means[i] * CoVs[i], size=samples)
    plt.scatter(np.ones(samples) * i, group_vals[:, i], color="black", alpha=0.5)
    group_labels[:, i] = i

group = group_labels.T.flatten()
val = group_vals.T.flatten()
coords = {"groups": [0, 1, 2, 3]}


with pm.Model(coords=coords) as HCoV_model:
    group_idx = pm.MutableData("group_idx", group, dims="obs_id")

    # global CoV shared across groups; group CoVs are partially pooled toward it
    CoV_global = pm.LogNormal("global_CoV", mu=-3.0, sigma=0.459983)
    group_CoV = pm.LogNormal("group_CoV", mu=np.log(CoV_global), sigma=0.25, dims="groups")
    group_mean = pm.Normal("group_mean", mu=80, sigma=50, dims="groups")

    # implied sd per group; abs() guards against negative draws from the Normal mean prior
    sigma = pm.Deterministic("sigma", abs(group_mean * group_CoV))

    y_obs = group_mean[group_idx]
    sigma_obs = sigma[group_idx]

    y = pm.Normal("y", mu=y_obs, sigma=sigma_obs, observed=val, dims="obs_id")

    trace = pm.sample(tune=5000, draws=5000, random_seed=42, target_accept=0.95)

with HCoV_model:
    # predict at one observation per group
    pm.set_data(new_data={"group_idx": [0, 1, 2, 3]})
    ppc = pm.sample_posterior_predictive(trace, predictions=True)

preds = ppc.predictions.y.values

Your approach looks reasonable to me. I am not an expert, but one use of hierarchical models is to remedy biases in subgroups where you have, by bad luck, drawn a particularly unrepresentative sample (i.e., a model more robust to outliers). So its effect should be more pronounced at lower sample sizes. There is one thing that looks like it could be modelled more efficiently, though:

np.random.normal(means[i], means[i] * CoVs[i])

I assume you use this because each group’s data has a different natural scale, but presumably you could try rescaling each group by subtracting the sample mean and dividing by the sample standard deviation, and then applying the inverse transformation to your estimates. Another alternative could be using a lognormal instead of a normal likelihood, where the amount of sd naturally depends on mu. Also, are you generating the data like this because that is what you suspect the model is like, or do you have good reason to think it works that way? If the former, you might just leave the log sigma as a free parameter, such as:

log_group_cov = pm.Exponential("log_group_cov", 1, dims="groups")
group_cov = pm.math.exp(log_group_cov)

And if you still want to recover the CoV on the natural scale from sigma, you can compute it on the fly with something like:

natural_group_cov = pm.Deterministic("natural_group_cov", group_cov/group_mean)

Perhaps other folks could have some more suggestions on this. When you say

correct supports for the prior distributions and prevent double counting of mean uncertainty in the posterior predictive model as the standard deviation will be a function of mean and CoV estimate

by double counting of mean, do you mean using the mean in both the mean and the sd? In that case, some of the suggestions above could be helpful.

Anyway, you can try some of the suggestions above, but I also took a shot at your model. I modified your code slightly so that the simulated CoV values also come from a distribution that matches your model. I also changed the prior for the means to a lognormal (assuming they are always positive), so it can be expressed on a more natural scale and you won’t need the abs() around group_mean*group_CoV. Finally, I added an extra parameter, cov_sigma. The model is here:


import numpy as np
import pymc as pm
import arviz as az

seed=10
rng = np.random.default_rng(seed)

cov_sigma_real = 1
cov_mu_real = 0.05
means_real = [20, 150, 78, 65]
CoVs = rng.lognormal(np.log(cov_mu_real), cov_sigma_real, size=4)
samples = 50

group_vals = np.zeros((samples, len(means_real)))
group_labels = np.zeros((samples, len(means_real)), dtype=int)

group = group_labels.T.flatten()
val = group_vals.T.flatten()
coords = {"groups": [0,1,2,3]}


with pm.Model(coords=coords) as HCoV_model:
    group_idx = pm.MutableData("group_idx", group, dims='obs_id')
    cov_sigma = pm.Exponential("cov_sigma", 1)
    CoV_global = pm.LogNormal("global_CoV", mu=-3.0, sigma=1)

    group_CoV = pm.LogNormal("group_CoV", mu=np.log(CoV_global),
                             sigma=cov_sigma,
                             dims="groups")

    group_mean = pm.LogNormal("group_mean", mu=np.log(80), sigma=2, dims="groups")

    sigma = pm.Deterministic("sigma", group_mean*group_CoV)

    y_obs = group_mean[group_idx]
    sigma_obs = sigma[group_idx]


    y = pm.Normal("y", mu=y_obs, sigma=sigma_obs, observed=val, dims="obs_id")

    trace = pm.sample(tune=5000, draws=5000, random_seed=42, target_accept=0.95)

with HCoV_model:
    ppc = pm.sample_posterior_predictive(trace)


az.plot_ppc(ppc)

ax = az.plot_posterior(trace, var_names = ["cov_sigma", "global_CoV",
                                           "group_CoV", "group_mean"])
ax = ax.flatten()
ax[0].scatter(cov_sigma_real, 0, marker='o', color='red')
ax[1].scatter(cov_mu_real, 0, marker='o', color='red')

for i in range(4):
  ax[i+2].scatter(CoVs[i], 0, marker='o', color='red')
  ax[i+6].scatter(means_real[i], 0, marker='o', color='red')

ax[0].get_figure().tight_layout()

The posteriors and PPC look fine (posterior estimates become more efficient as N increases, but either way the HDIs seem to cover the true parameters). The posterior predictive plot also looks reasonable. One thing I have noticed is that you do get about ~200 divergences in 40000 samples even with target_accept=0.95, so perhaps reparametrizing could help there too.
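One common option is a non-centered parameterization of group_CoV: sample standard normal offsets and build the group CoVs deterministically from them, which often helps with divergences in hierarchical models. A rough, untested sketch (same priors as above, only group_CoV changes):

with pm.Model(coords=coords) as HCoV_noncentered:
    group_idx = pm.MutableData("group_idx", group, dims="obs_id")

    cov_sigma = pm.Exponential("cov_sigma", 1)
    CoV_global = pm.LogNormal("global_CoV", mu=-3.0, sigma=1)

    # non-centered: LogNormal(log(CoV_global), cov_sigma) expressed as a
    # deterministic transform of standard normal offsets
    z_CoV = pm.Normal("z_CoV", 0, 1, dims="groups")
    group_CoV = pm.Deterministic(
        "group_CoV", pm.math.exp(pm.math.log(CoV_global) + cov_sigma * z_CoV),
        dims="groups")

    group_mean = pm.LogNormal("group_mean", mu=np.log(80), sigma=2, dims="groups")
    sigma = pm.Deterministic("sigma", group_mean * group_CoV)

    y = pm.Normal("y", mu=group_mean[group_idx], sigma=sigma[group_idx],
                  observed=val, dims="obs_id")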


Also, if you want to see more examples of hierarchical models, chapter 9 of John Kruschke's Doing Bayesian Data Analysis is good. There are also examples like this one:

https://allendowney.github.io/BayesianInferencePyMC/04_hierarchical.html


Chapter 9 of the Kruschke book can be found here (and the rest of the chapters are here).


Hi iavicenna,

Thank you for the detailed response! I’m glad to hear that my approach at least seems reasonable from a modeling perspective. Also, thank you for the model corrections; shifting to a lognormal group mean seems like a better option. I have a couple of questions regarding your implementation and suggestions.

by double counting of mean, do you also mean using mean in both mean and sd? In that case some of the suggestions I said above could be helpful.

Yes, if the likelihood sd is specified as a function of the mean, so that the mean's inherent uncertainty gets multiplied by the CoV uncertainty, then I would imagine the posterior variance would, in some sense, be double counting the mean's variance. I’m still new to Bayesian modeling, so perhaps this isn’t the case, or it isn’t really an issue.

Another alternative could be using lognormal instead of normal where the amount of sd will naturally depend on mu.

I’m not sure I understand this correction, could you expand on this a little?

[A]re you generating the data like this because that is what you suspect the model is like or do you have good reason to think it works that way?

Yes, looking at historical data, I have reason to believe that each group shares a common coefficient of variation, with the potential for some mild deviations from it. Under a limited sampling scenario it is easy to under- or overestimate the variance, so my hope is to borrow predictive power from similar groups when specifying the variation of a group with few samples. I specified the group-level CoV variance as a fixed prior scale (rather than a learned variable) so as to impart domain knowledge about deviation from the common CoV.
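In code, the difference is just whether the group-level scale is a constant or gets its own hyperprior:

# fixed scale, encoding domain knowledge about deviation from the common CoV
group_CoV = pm.LogNormal("group_CoV", mu=np.log(CoV_global), sigma=0.25, dims="groups")

# learned scale, as in your model
cov_sigma = pm.Exponential("cov_sigma", 1)
group_CoV = pm.LogNormal("group_CoV", mu=np.log(CoV_global), sigma=cov_sigma, dims="groups")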

One thing I have notice is you do get about ~200 divergences in 40000 samples even with target_accept=0.95. So perhaps re parametrizing could help there too.

Reparameterizing in what sense? Forgive my ignorance.

And thanks for the reference as well, I’ll be sure to give that read!

Reparameterizing in what sense? Forgive my ignorance.

Reparametrizing basically means changing your model. Sometimes you get divergences because the choice of priors is not suitable for the model, etc. I played around with the priors a bit and could bring the divergences down to a level where they do not happen at all, or are very few (1 in 20000 in one instance, 24 in another), even when I remove the target_accept. So perhaps a drastic change to the model is not necessary. In the previous code I posted I forgot to include the data generation step, so here is the final model I have used (with slightly changed variable names so I can keep them in my head more easily):

import numpy as np
import pymc as pm
import arviz as az

seed=5
rng = np.random.default_rng(seed)

cov_sigma_real = 1
cov_mu_real = 0.05
group_CoV_real = rng.lognormal(np.log(cov_mu_real), cov_sigma_real, size=4)
samples = 50

means_real = [20, 150, 78, 65]
group_vals = np.zeros((samples, len(means_real)))
group_labels = np.zeros((samples, len(means_real)), dtype=int)


for i in range(4):
    # use the seeded rng for reproducibility
    group_vals[:, i] = rng.normal(means_real[i], means_real[i] * group_CoV_real[i],
                                  size=samples)
    group_labels[:, i] = i



group_idx = group_labels.T.flatten()
val = group_vals.T.flatten()
coords = {"groups": [0,1,2,3],
          "obs":range(len(group_idx))}


with pm.Model(coords=coords) as HCoV_model:

    cov_sigma = pm.HalfNormal("cov_sigma", 1)
    cov_mu = pm.HalfNormal("global_CoV", 0.1)
    group_CoV = pm.LogNormal("group_CoV", mu=np.log(cov_mu),
                             sigma=cov_sigma, dims="groups")

    group_mean = pm.LogNormal("group_mean", mu=np.log(80), sigma=1,
                              dims="groups")

    sigma = pm.Deterministic("sigma", group_mean*group_CoV)

    y_obs = group_mean[group_idx]
    sigma_obs = sigma[group_idx]

    y = pm.Normal("y", mu=y_obs, sigma=sigma_obs, observed=val, dims="obs")

    trace = pm.sample(tune=5000, draws=5000, random_seed=rng)

with HCoV_model:
    ppc = pm.sample_posterior_predictive(trace)


az.plot_ppc(ppc)

ax = az.plot_posterior(trace, var_names = ["cov_sigma", "global_CoV",
                                           "group_CoV", "group_mean"])
ax = ax.flatten()
ax[0].scatter(cov_sigma_real, 0, marker='o', color='red')
ax[1].scatter(cov_mu_real, 0, marker='o', color='red')

for i in range(4):
  ax[i+2].scatter(group_CoV_real[i], 0, marker='o', color='red')
  ax[i+6].scatter(means_real[i], 0, marker='o', color='red')

ax[0].get_figure().tight_layout()
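
By the way, you can count the divergences directly on the returned trace (an InferenceData object):

n_div = int(trace.sample_stats["diverging"].sum())
n_draws = trace.posterior.sizes["chain"] * trace.posterior.sizes["draw"]
print(f"{n_div} divergent transitions out of {n_draws} draws")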

posterior variance would be in some sense double counting mean variance.

So if both the group mean and the CoV are involved in calculating the observation sigma, then obviously this creates more wiggle room for these parameters: by scaling one down and the other up you get the same product, so there is an identifiability problem. However, there are other places where the scale of the group means gets pinned down, which probably remedies the identifiability problem: for instance, the group mean also appears as the mean of your likelihood, so any attempt to scale it up or down away from its expected value would decrease the likelihood there. So as long as the group mean is identified reasonably well by being the mean of your likelihood, I think the sd will be fine too. Consider for instance trying to solve the following set of equations (where x0 and x1 are given, and a and b are to be determined):

a*b = 10
x0 < a < x1

The more precise the second condition is, the more precise your b is going to be. If you look at the posteriors, the group means are identified quite precisely, so I wouldn’t worry too much about it.

I’m not sure I understand this correction, could you expand on this a little?

Simply using

y = pm.LogNormal("y", mu=pm.math.log(y_obs), sigma=group_CoV[group_idx], 
                     observed=val, dims="obs")

for your likelihood also seems to give reasonable results, the full model being:

with pm.Model(coords=coords) as HCoV_model:

    cov_sigma = pm.HalfNormal("cov_sigma", 1)
    cov_mu = pm.HalfNormal("global_CoV", 0.1)
    group_CoV = pm.LogNormal("group_CoV", mu=np.log(cov_mu),
                             sigma=cov_sigma, dims="groups")

    group_mean = pm.LogNormal("group_mean", mu=np.log(80), sigma=1,
                              dims="groups")


    y_obs = group_mean[group_idx]
    y = pm.LogNormal("y", mu=pm.math.log(y_obs), sigma=group_CoV[group_idx], 
                     observed=val, dims="obs")

    trace = pm.sample(tune=5000, draws=5000, random_seed=rng)

So this might be worth a shot too. Though note this assumes your observation noise is lognormally distributed as LogNormal(log(mu), sd), which is obviously different from Normal(mu, sd*mu). In the end, though, it might fit your data better, who knows. The LogNormal as parameterized here has mean exp(log(mu) + sd^2/2) ≈ mu (since sd is small compared to mu in your case) and variance (exp(sd^2) - 1) * exp(2*log(mu) + sd^2) ≈ sd^2 * mu^2, so sd plays exactly the role of a CoV. So it serves your purposes well.
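
As a quick numerical sanity check of those moments (plain numpy, with illustrative values):

import numpy as np

rng = np.random.default_rng(0)
mu, cov = 80.0, 0.05  # roughly the scale of one of your groups

draws = rng.lognormal(mean=np.log(mu), sigma=cov, size=1_000_000)
print(draws.mean())  # ~ mu * exp(cov**2 / 2) ≈ 80.1
print(draws.std())   # ~ mu * cov ≈ 4.0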

I have not tried scaling the observed data (and then inverse-scaling your parameters at the end), but I would be curious to know if that works too.
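
The mechanics would be something like this (untested; mu_scaled and sigma_scaled are placeholder names for the per-group location and scale parameters of the rescaled model):

import numpy as np

# per-group standardization (group_vals as simulated above)
grp_means = group_vals.mean(axis=0)
grp_stds = group_vals.std(axis=0, ddof=1)
scaled_vals = ((group_vals - grp_means) / grp_stds).T.flatten()

# ... fit a model to scaled_vals with per-group mu_scaled / sigma_scaled ...

# inverse affine transform: if y_scaled = (y - m) / s, then
#   mu = mu_scaled * s + m  and  sigma = sigma_scaled * s
# mu_orig = trace.posterior["mu_scaled"] * grp_stds + grp_means
# sigma_orig = trace.posterior["sigma_scaled"] * grp_stds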