Eight school problem with student t distribution for treatment effects

Hello,

I am trying to reproduce the model fitted in this paper https://arxiv.org/pdf/1710.03410.pdf i.e. fitting a hierarchical model on results from past experiments.

In essence, it is the same as the eight school problems.

I am specifically interested in the shape of the distribution of treatment effects and in particular its tails, and thus would like to use a student distribution rather than a normal distribution.

Reproducing the non-centered parametrization of the eight school case study and only adding nu and slightly modifying theta_tilde gives :

import numpy as np
import pymc3 as pm
import arviz as az

J = 8
y = np.array([28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0])
sigma = np.array([15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 18.0])

with pm.Model() as NonCentered_eight:
    mu = pm.Normal("mu", mu=0, sigma=5)
    tau = pm.HalfCauchy("tau", beta=5)
    nu = pm.Gamma("nu", 2,0.1)
    theta_tilde = pm.StudentT("theta_t", mu=0, sigma=1,nu=nu, shape=J)
    theta = pm.Deterministic("theta", mu + tau * theta_tilde)
    obs = pm.Normal("obs", mu=theta, sigma=sigma, observed=y)
    prior = pm.sample_prior_predictive(samples=10000)
    trace = pm.sample(tune=15000,target_accept=0.99)
    posterior_predictive = pm.sample_posterior_predictive(trace)

inf_data=az.from_pymc3(prior=prior,trace=trace,posterior_predictive=posterior_predictive)
model_summary=az.summary(trace)
model_summary  

In the eight school problem, it works like a charm.

In my use case, it does not work.

Using priors matching close to those from the paper (p6) (I have also tried many other things on the prior, but it never really worked), what I have is :

#note : in my dataset, a 1% lift is coded as y=0.01 while in the paper it is coded as y=1
with pm.Model() as NonCentered_eight:
    mu = pm.Normal("mu", mu=0, sigma=0.025)
    tau = pm.HalfCauchy("tau", beta=1/100)
    nu = pm.Uniform("nu", lower=1.01,upper=4)
    theta_tilde = pm.StudentT("theta_t", mu=0, sigma=1,nu=nu, shape=df.shape[0])
    theta = pm.Deterministic("theta", mu + tau * theta_tilde)
    obs = pm.Normal("obs", mu=theta, sigma=df['lift_in_perc_standard_error_delta_method'].to_numpy(), observed=df['lift_in_perc_point_estimate'].to_numpy())
    prior = pm.sample_prior_predictive(samples=10000)
    trace = pm.sample(tune=60000,target_accept=0.99)
    posterior_predictive = pm.sample_posterior_predictive(trace)   
    inf_data=az.from_pymc3(prior=prior,trace=trace,posterior_predictive=posterior_predictive)
    model_summary=az.summary(trace)

The warning I have is “the estimated number of effective samples is smaller than 200 for some parameters” ; tau, mu and some thetas have a 1.05 > r_hat > 1.01.

I don’t know if it’s relevant, but :
a°) my number of experiments is high (around 1000)
b°) my posterior checks are good, it does really looks like my empirical distribution.
c°) tau and nu are correlated across samples

I was wondering if there was :
1°) something plain wrong already noticeable in what I am doing ?
2°) a known reparametrization / reformulation when you want to model random effects coming from student t distribution ?
3°) a good case study with a model where the distribution of random effects is not normal ?

I hope I haven’t missed something already posted somewhere.

In any case, I wish you a very good week-end.

Thanks a lot for your help,

Timothée

1 Like

Welcome!

This is expected. Requiring thin tails (large nu) forces the scale to be increased (tau to be decreased) in order to accommodate outlying observations.

As for the poor sampling, it could be many things. It could be the amount of data you have per level/category in the hierarchy (e.g., per school), it could be something about the distribution of data with a level category, it could be the distribution of means/SDs across categories, etc. Have you tried reducing target_accept? That may improve mixing.

As for hierarchical models that use non-normal “random effects” here is one example.

Hello,

Thanks a lot for your help.

What did the trick for me was reparametrizing the student as explained here.

All the main parameters now have a rhat at 1, and only a few experiments (among 600) have a rhat at 1.01.

The code translated into eight schools coordinates now looks like :

import numpy as np
import pymc as pm
import arviz as az

J = 8
y = np.array([28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0])
sigma = np.array([15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 18.0])

with pm.Model() as NonCentered_eight:
    #base parameters
    mu = pm.Normal("mu", mu=0, sigma=5)
    tau = pm.HalfCauchy("tau", beta=5)
    #student reparametrization for theta_tilde
    nu = pm.Gamma("nu", 2,0.1)
    half_nu = pm.Deterministic("half_nu",0.5 * nu)
    tau_prime= pm.Gamma("tau_prime",half_nu, half_nu)
    alpha= pm.Normal("alpha", 0, 1,shape=y.shape)
    theta_tilde = pm.Deterministic("beta", alpha / np.sqrt(tau_prime))
    #non-centered parametrization
    theta = pm.Deterministic("theta", mu + tau * theta_tilde)
    obs = pm.Normal("obs", mu=theta, sigma=sigma, observed=y)
    prior = pm.sample_prior_predictive(samples=10000)
    trace = pm.sampling_jax.sample_numpyro_nuts(tune=40000,target_accept=0.99)
    posterior_predictive = pm.sample_posterior_predictive(trace)

model_summary=az.summary(trace)
model_summary

However, posterior predictive checks (on my real life usecase) show that my model under estimate the mass at or around zero.

Indeed, my actual distribution has both a peak around zero (like a normal distribution) and fat tails (like a student with low degree of freedom), while there is a trade-off in the student between those dimensions (i.e. the fater the tail the lower the mass around zero).

Thus, I may try a mixture model (point normal or student) in the coming days.

That’ll be another topic.

Best regards,

Timothée