Your approach looks reasonable to me. I am not an expert, but one use of hierarchical models is to remedy biases in subgroups when, by bad luck, you have ended up with a particularly bad sample for that subgroup (i.e., a model more robust against outliers), so its effect can be more significant at lower sample sizes. There is one thing that looks like it could be modelled more efficiently though:
np.random.normal(means[i], means[i] * CoVs[i])
I assume you do this because each group’s data has a different natural scale, but you could instead try rescaling each group by subtracting its sample mean and dividing by its sample std, and then applying the inverse transformation to your estimates. Another alternative is a lognormal instead of a normal, where the sd naturally scales with mu. Also, are you generating the data like this because that is what you suspect the underlying process looks like, or do you have good reason to think it works that way? If the former, you might just leave the log sigma as a free parameter, such as
log_group_cov = pm.Exponential("log_group_cov", 1, dims="groups")
group_cov = pm.math.exp(log_group_cov)
And if you still want to recover the CoV on the natural scale, you can compute it on the fly with something like:
natural_group_cov = pm.Deterministic("natural_group_cov", group_cov/group_mean)
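To make the rescaling idea above concrete, here is a minimal sketch (the data and names are made up for illustration): standardize one group's data before fitting, then map an estimate back to the natural scale with the inverse transform.

```python
import numpy as np

rng = np.random.default_rng(0)
# toy stand-in for one group's data (mean 150, CoV 0.05)
group_data = rng.normal(150, 150 * 0.05, size=50)

# standardize before fitting
m, s = group_data.mean(), group_data.std()
z = (group_data - m) / s  # mean ~ 0, sd ~ 1 regardless of the group's scale

# fit the model on z, then map an estimate (here just the sample mean
# of z as a placeholder for a posterior estimate) back to natural scale
mu_z = z.mean()
mu_natural = mu_z * s + m
```

After this transform every group lives on the same scale, so a single prior can serve all groups.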
Perhaps other folks will have more suggestions on this. When you say

> correct supports for the prior distributions and prevent double counting of mean uncertainty in the posterior predictive model as the standard deviation will be a function of mean and CoV estimate

by double counting of the mean, do you mean using the mean in both the mean and the sd? If so, some of the suggestions above could help.
Anyway, you can try some of the suggestions above, but I also took a shot at your model. I have modified your code slightly so that the simulated CoV values also come from a distribution matching your model. I also changed the prior on the means to a lognormal (assuming they are always positive), so it can be expressed on a more natural scale and you won’t need the abs() for group_mean*group_CoV. Finally, I added an extra parameter, cov_sigma. The model is here:
import numpy as np
import pymc as pm
import arviz as az
seed=10
rng = np.random.default_rng(seed)
cov_sigma_real = 1
cov_mu_real = 0.05
means_real = [20, 150, 78, 65]
CoVs = rng.lognormal(np.log(cov_mu_real), cov_sigma_real, size=4)
samples = 50
group_vals = np.zeros((samples, len(means_real)))
group_labels = np.zeros((samples, len(means_real)), dtype=int)
for i, m in enumerate(means_real):
    # each group's sd is mean * CoV, as in your original setup
    group_vals[:, i] = rng.normal(m, m * CoVs[i], size=samples)
    group_labels[:, i] = i
group = group_labels.T.flatten()
val = group_vals.T.flatten()
coords = {"groups": [0,1,2,3]}
with pm.Model(coords=coords) as HCoV_model:
    group_idx = pm.MutableData("group_idx", group, dims='obs_id')
    cov_sigma = pm.Exponential("cov_sigma", 1)
    CoV_global = pm.LogNormal("global_CoV", mu=-3.0, sigma=1)
    group_CoV = pm.LogNormal("group_CoV", mu=pm.math.log(CoV_global),
                             sigma=cov_sigma,
                             dims="groups")
    group_mean = pm.LogNormal("group_mean", mu=np.log(80), sigma=2, dims="groups")
    sigma = pm.Deterministic("sigma", group_mean*group_CoV)
    y_obs = group_mean[group_idx]
    sigma_obs = sigma[group_idx]
    y = pm.Normal("y", mu=y_obs, sigma=sigma_obs, observed=val, dims="obs_id")
    trace = pm.sample(tune=5000, draws=5000, random_seed=42, target_accept=0.95)
with HCoV_model:
    ppc = pm.sample_posterior_predictive(trace)#, predictions=True)
az.plot_ppc(ppc)
ax = az.plot_posterior(trace, var_names=["cov_sigma", "global_CoV",
                                         "group_CoV", "group_mean"])
ax = ax.flatten()
ax[0].scatter(cov_sigma_real, 0, marker='o', color='red')
ax[1].scatter(cov_mu_real, 0, marker='o', color='red')
for i in range(4):
    ax[i+2].scatter(CoVs[i], 0, marker='o', color='red')
    ax[i+6].scatter(means_real[i], 0, marker='o', color='red')
ax[0].get_figure().tight_layout()
The posteriors and PPC look fine (posterior estimates become more efficient as N increases, but in any case the HDIs seem to cover the true parameters). The posterior predictive plot also looks reasonable. One thing I have noticed is that you get about ~200 divergences in 40000 samples even with target_accept=0.95, so perhaps reparametrizing could help there too.

Also, if you want to see more examples of hierarchical models, chapter 9 of John Kruschke's *Doing Bayesian Data Analysis* is good. There is also this PyMC-based tutorial:
https://allendowney.github.io/BayesianInferencePyMC/04_hierarchical.html
