Hello everyone. As this is my first post, I wanted to start by thanking you all for the great contributions in this forum and in blogs, which have helped me a lot in learning Bayesian stats and using PyMC3.
Now to the question. I’m trying to fit a simple hierarchical linear regression, with a varying intercept and a varying slope, where the observed data are 4,847 reaction times (RTs) in milliseconds (ms). In addition, there is a matrix of indices (a patsy dmatrix) from the interaction between a 2-level and a 3-level factor. Finally, the slope is multiplied by a variable containing psychometric scores from a worry scale (worry scores ranging from 33 to 67). I have attempted three models, but I’m quite unsure which one is the most appropriate. The first model uses a StudentT distribution for the likelihood and looks like this:
**Model 1**
```python
import numpy as np
import pymc3 as pm
import patsy as pt

RT = data['RT'].values
Worry = data['worry'].values
mat = pt.dmatrix('0 + ear:type', data=data, return_type='dataframe')
cond_names = mat.columns
slope = np.asarray(mat)
inter = np.asarray(mat)
conds = len(cond_names)

with pm.Model() as mod:
    # non-centred parametrisation for the varying intercepts
    alpha_m = pm.Normal('alpha_m', 0, 1)
    alpha_s = pm.HalfNormal('alpha_s', 1)
    alpha_t = pm.Normal('alpha_t', 0, 1, shape=conds)
    alpha = pm.Deterministic('alpha', alpha_m + alpha_t*alpha_s)
    # non-centred parametrisation for the varying slopes
    beta_m = pm.Normal('beta_m', 0, 1)
    beta_s = pm.HalfNormal('beta_s', 1)
    beta_t = pm.Normal('beta_t', 0, 1, shape=conds)
    beta = pm.Deterministic('beta', beta_m + beta_t*beta_s)
    mu = pm.math.dot(inter, alpha) + pm.math.dot(slope, beta)*Worry
    sigma = pm.HalfNormal('sigma', 10)
    y = pm.StudentT('RT', mu=mu, sigma=sigma, nu=6, observed=RT)

with mod:
    trace = pm.sample(1000, tune=1000, chains=4, cores=7)
```
I understand that a StudentT likelihood may be inappropriate, as RTs cannot take negative values. Even so, it samples really well (between ~80 and ~200 draws/s), and ESS, R-hat, etc. check out; but I am unsure about the posterior predictive checks (PPCs) and the regression fit (images below). Note that, for brevity, I added the image of only one condition. (Also note that I did not use a hyperprior for nu, because when I did the model pivoted the intercept a lot and estimated implausibly large increases in RTs.)
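One way to gauge how much the unbounded StudentT actually matters in practice is to check how much probability mass it places below zero. A minimal sketch, using made-up values for mu and sigma (not the actual posterior estimates), just to illustrate the check:

```python
import numpy as np
from scipy import stats

# Illustrative values only (not from the real posterior): a plausible
# condition mean and residual scale for RT data, in ms.
mu, sigma, nu = 450.0, 120.0, 6

# Probability mass a StudentT(nu, mu, sigma) places below 0 ms,
# i.e. on impossible negative reaction times.
p_negative = stats.t.cdf((0 - mu) / sigma, df=nu)
print(p_negative)
```

If this mass is negligible for all conditions, the unboundedness of the StudentT is mostly a theoretical rather than a practical problem.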
In the second model I standardised the RT and worry variables, so as to make the likelihood more appropriate:
**Model 2**
```python
RT = data['RT'].values
Worry = data['worry'].values
RTz = (RT - RT.mean())/RT.std()
Worryz = (Worry - Worry.mean())/Worry.std()
mat = pt.dmatrix('0 + ear:type', data=data, return_type='dataframe')
cond_names = mat.columns
slope = np.asarray(mat)
inter = np.asarray(mat)
conds = len(cond_names)

with pm.Model() as mod:
    alpha_m = pm.Normal('alpha_m', 0, 1)
    alpha_s = pm.HalfNormal('alpha_s', 1)
    alpha_t = pm.Normal('alpha_t', 0, 1, shape=conds)
    alpha = pm.Deterministic('alpha', alpha_m + alpha_t*alpha_s)
    beta_m = pm.Normal('beta_m', 0, 1)
    beta_s = pm.HalfNormal('beta_s', 1)
    beta_t = pm.Normal('beta_t', 0, 1, shape=conds)
    beta = pm.Deterministic('beta', beta_m + beta_t*beta_s)
    mu = pm.math.dot(inter, alpha) + pm.math.dot(slope, beta)*Worryz
    sigma = pm.HalfNormal('sigma', 1)
    y = pm.StudentT('RT', mu=mu, sigma=sigma, nu=6, observed=RTz)

with mod:
    trace = pm.sample(1000, tune=1000, chains=4, cores=7, target_accept=0.99)
```
Curiously, sampling was harder and produced more divergences. But after increasing target_accept the sampling was all right (R-hats, ESS, etc. check out). Results are broadly similar to before.
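For interpreting Model 2, the standardised estimates can be mapped back to ms. A minimal sketch, where the data are simulated stand-ins and alpha_z, beta_z are hypothetical posterior means (not real results), just to show the back-transformation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-ins for the actual data (not real values).
RT = rng.gamma(shape=9.0, scale=50.0, size=4847)        # fake RTs, ms
Worry = rng.integers(33, 68, size=4847).astype(float)   # fake worry scores

alpha_z = 0.10   # hypothetical intercept, standardised scale
beta_z = -0.05   # hypothetical slope, standardised scale

# RTz = (RT - mean) / sd  =>  RT = RTz * sd + mean, so:
alpha_ms = alpha_z * RT.std() + RT.mean()               # intercept in ms
beta_ms = beta_z * RT.std() / Worry.std()               # ms per worry point
```

This makes the standardised model just as interpretable as the raw-scale one, so interpretability alone need not decide between them.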
Finally, the third model uses a Lognormal likelihood, as this is theoretically more appropriate for RT data.
**Model 3**
```python
RT = data['RT'].values
Worry = data['worry'].values
mat = pt.dmatrix('0 + ear:type', data=data, return_type='dataframe')
cond_names = mat.columns
slope = np.asarray(mat)
inter = np.asarray(mat)
conds = len(cond_names)

with pm.Model() as mod:
    alpha_m = pm.Normal('alpha_m', 0, 1)
    alpha_s = pm.HalfNormal('alpha_s', 1)
    alpha_t = pm.Normal('alpha_t', 0, 1, shape=conds)
    alpha = pm.Deterministic('alpha', alpha_m + alpha_t*alpha_s)
    beta_m = pm.Normal('beta_m', 0, 1)
    beta_s = pm.HalfNormal('beta_s', 1)
    beta_t = pm.Normal('beta_t', 0, 1, shape=conds)
    beta = pm.Deterministic('beta', beta_m + beta_t*beta_s)
    mu = pm.math.dot(inter, alpha) + pm.math.dot(slope, beta)*Worry
    sigma = pm.HalfNormal('sigma', 10)
    y = pm.Lognormal('RT', mu=mu, sigma=sigma, observed=RT)

with mod:
    trace = pm.sample(1000, tune=1000, chains=4, cores=7, target_accept=0.99)
```
This one showed the toughest sampling: the beta posteriors were all nearly equal to one when exponentiated, and the intercepts were all around ~300 ms. I’m not sure whether this was caused by getting a transformation or parametrisation wrong. The output is as follows:
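In case it helps with interpreting Model 3: the Lognormal's mu is on the log scale, so exp(mu) is the median RT and an exponentiated slope acts multiplicatively, meaning exponentiated betas near 1 correspond to near-zero percentage change per worry point rather than necessarily being an error. A sketch with hypothetical log-scale values (not the real posterior), verified against simulated draws:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical log-scale parameters (not real posterior values).
mu_log = np.log(300.0)   # implies a median RT of ~300 ms
sigma_log = 0.3
beta_log = 0.002         # hypothetical log-scale slope per worry point

# For a Lognormal: exp(mu) is the median, exp(mu + sigma^2/2) the mean.
median_rt = np.exp(mu_log)
mean_rt = np.exp(mu_log + sigma_log**2 / 2)

# A log-scale slope is multiplicative once exponentiated: each worry point
# scales the median RT by exp(beta_log); values near 1 mean ~0% change.
pct_change_per_point = (np.exp(beta_log) - 1) * 100

# Sanity-check the formulas against simulated draws.
draws = rng.lognormal(mean=mu_log, sigma=sigma_log, size=200_000)
```

So intercepts around ~300 ms on the exponentiated scale and exponentiated betas near 1 can be internally consistent if the worry effect is small in multiplicative terms.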
As you can see, each model has some advantages but also some caveats. My intuition is that the correct model is the one using standardised variables (Model 2), but I’m not sure. I would really appreciate your guidance on how to determine which of these models is the most appropriate. Many thanks in advance for your help. My apologies if I was unclear or omitted relevant information above; I’ll try to improve the question if that’s the case. I’ll be looking forward to your comments.