Choosing an appropriate model for reaction times

Hello everyone. As this is my first post, I wanted to start by thanking you all for the great contributions in this forum and the blogs, which have helped me a lot in learning Bayesian stats and using PyMC3.

Now to the question. I’m trying to fit a simple hierarchical linear regression, with a varying intercept and a varying slope, where the observed data are 4,847 reaction times (RTs) in milliseconds (ms). In addition, there is a matrix of indices (a patsy design matrix) for the interaction between a 2-level and a 3-level factor. Finally, the slope is multiplied by a variable containing psychometric scores from a worry scale (worry scores ranging from 33 to 67). I have attempted three models, but I’m quite unsure which one is the most appropriate. The first model uses a StudentT distribution for the likelihood and looks like this:

**Model 1**

import numpy as np
import pymc3 as pm
import patsy as pt

RT = data['RT'].values
Worry = data['worry'].values

# Design matrix for the 2x3 interaction; used for both intercept and slope
mat = pt.dmatrix('0 + ear:type', data=data, return_type='dataframe')
cond_names = mat.columns
slope = np.asarray(mat)
inter = np.asarray(mat)
conds = len(cond_names)


with pm.Model() as mod:
    
    alpha_m = pm.Normal('alpha_m', 0, 1)
    alpha_s = pm.HalfNormal('alpha_s', 1)
    alpha_t = pm.Normal('alpha_t', 0, 1, shape=conds)
    alpha = pm.Deterministic('alpha', alpha_m + alpha_t*alpha_s)  
    
    beta_m = pm.Normal('beta_m', 0, 1)
    beta_s = pm.HalfNormal('beta_s', 1)
    beta_t = pm.Normal('beta_t', 0, 1, shape=conds)
    beta = pm.Deterministic('beta', beta_m + beta_t*beta_s)  

    mu = pm.math.dot(inter, alpha) + pm.math.dot(slope, beta)*Worry
    sigma = pm.HalfNormal('sigma', 10)

    y = pm.StudentT('RT', mu=mu, sigma=sigma, nu=6, observed=RT)


with mod:
    trace = pm.sample(1000, tune=1000, chains=4, cores=7)

I understand that a StudentT distribution may be inappropriate, as RTs cannot be negative. Even so, it samples really well (between ~80 and ~200 draws/s), and ESS, R_hat, etc. check out; but I am unsure about the PPCs and regression fit (images below). Note that, for brevity, I added the image of only one condition. (Also note that I did not use a hyperprior for nu, because when I did, the model pivoted the intercept a lot and estimated implausibly big increases in RTs.)


In the second model I standardised the RT and worry variables, so as to make the likelihood more appropriate:

**Model 2**

RT = data['RT'].values
Worry = data['worry'].values

RTz = (RT - RT.mean())/RT.std()
Worryz = (Worry - Worry.mean())/Worry.std()

mat = pt.dmatrix('0 + ear:type', data=data, return_type='dataframe')
cond_names = mat.columns
slope = np.asarray(mat)
inter = np.asarray(mat)
conds = len(cond_names)


with pm.Model() as mod:
    
    alpha_m = pm.Normal('alpha_m', 0, 1)
    alpha_s = pm.HalfNormal('alpha_s', 1)
    alpha_t = pm.Normal('alpha_t', 0, 1, shape=conds)
    alpha = pm.Deterministic('alpha', alpha_m + alpha_t*alpha_s)  
    
    beta_m = pm.Normal('beta_m', 0, 1)
    beta_s = pm.HalfNormal('beta_s', 1)
    beta_t = pm.Normal('beta_t', 0, 1, shape=conds)
    beta = pm.Deterministic('beta', beta_m + beta_t*beta_s)  

    mu = pm.math.dot(inter, alpha) + pm.math.dot(slope, beta)*Worryz
    sigma = pm.HalfNormal('sigma', 1)

    y = pm.StudentT('RT', mu=mu, sigma=sigma, nu=6, observed=RTz)


with mod:
    trace = pm.sample(1000, tune=1000, chains=4, cores=7, target_accept=0.99)

Curiously, sampling was harder, with more divergences. But after increasing target_accept, the sampling was all right (R_hats, ESS, etc. check out). Results are fairly similar to before.


Finally, the third model uses a Lognormal likelihood, as this is theoretically more appropriate for RT data.

**Model 3**

RT = data['RT'].values
Worry = data['worry'].values

mat = pt.dmatrix('0 + ear:type', data=data, return_type='dataframe')
cond_names = mat.columns
slope = np.asarray(mat)
inter = np.asarray(mat)
conds = len(cond_names)


with pm.Model() as mod:
    
    alpha_m = pm.Normal('alpha_m', 0, 1)
    alpha_s = pm.HalfNormal('alpha_s', 1)
    alpha_t = pm.Normal('alpha_t', 0, 1, shape=conds)
    alpha = pm.Deterministic('alpha', alpha_m + alpha_t*alpha_s)  
    
    beta_m = pm.Normal('beta_m', 0, 1)
    beta_s = pm.HalfNormal('beta_s', 1)
    beta_t = pm.Normal('beta_t', 0, 1, shape=conds)
    beta = pm.Deterministic('beta', beta_m + beta_t*beta_s)  

    mu = pm.math.dot(inter, alpha) + pm.math.dot(slope, beta)*Worry
    sigma = pm.HalfNormal('sigma', 10)

    y = pm.Lognormal('RT', mu=mu, sigma=sigma, observed=RT)


with mod:
    trace = pm.sample(1000, tune=1000, chains=4, cores=7, target_accept=0.99)

This one showed the toughest sampling: the beta posteriors were all nearly equal to one when exponentiated, and intercepts were all around ~300 ms. I’m not sure whether this was caused by a wrong transformation or parametrisation on my part. The output is as follows:


As you can see, each model seems to have some advantages, but also some caveats. My intuition is that the correct model is the one using standardised variables (Model 2), but I’m not sure. I would really appreciate your guidance on how to determine which of these models is the most appropriate. Many thanks in advance for your help. My apologies if I was unclear or omitted relevant information above; I’ll try to improve the question if that’s the case. I’ll be looking forward to your comments.

Why not Lognormal + Standardized data?

Standardization does not change your model; it only makes the coefficients (potentially) easier to interpret and their priors easier to define.

You can also model the log of your data with a Student-t if you want a more flexible version of the Lognormal which, like it, doesn’t have the issue of allowing for negative reaction times.
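As a minimal sketch (with made-up numbers, not the thread’s data) of why a Student-t on log(RT) cannot yield negative RTs on the natural scale:

```python
# Sketch with made-up numbers: draws from a Student-t on the log scale
# always map back to positive RTs through exp().
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical log-scale posterior-predictive draws (location ~ log 380 ms)
log_rt_draws = stats.t.rvs(df=6, loc=np.log(380), scale=0.4,
                           size=10_000, random_state=rng)

rt_draws = np.exp(log_rt_draws)  # back to milliseconds: strictly positive
assert (rt_draws > 0).all()
```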

Many thanks for the suggestions.

Do you mean a Lognormal with the standardised observed data (RTs)? If so, I don’t think that will work. A Lognormal only takes positive values, but standardised RTs have both negative and positive values (after all, the point of standardising is to place the mean of the distribution at zero). Hence, standardised RTs + Lognormal result in NaN values for the RT likelihood, as shown by mod.check_test_point():

Out[5]: 
alpha_m         -0.92
alpha_s_log__   -0.77
alpha_t         -5.51
beta_m          -0.92
beta_s_log__    -0.77
beta_t          -5.51
sigma_log__     -1.06
RT                NaN
Name: Log-probability of test_point, dtype: float64
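The same failure can be reproduced outside the model with scipy (a small sketch with made-up numbers): the lognormal density is zero for any non-positive value, so standardised RTs below the mean get a log-probability of -inf.

```python
import numpy as np
from scipy import stats

rt = np.array([250.0, 380.0, 900.0])        # raw RTs in ms, all positive
rtz = (rt - rt.mean()) / rt.std()           # standardised: some values < 0

logp_raw = stats.lognorm.logpdf(rt, s=1.0, scale=380.0)
logp_std = stats.lognorm.logpdf(rtz, s=1.0)

print(logp_raw)   # all finite
print(logp_std)   # -inf wherever rtz <= 0
```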

So the sampling would break down immediately. There’s a whole discussion about the consequences of standardisation as a data transformation procedure, but that’s beside the point. I don’t think any variable would change your model, if the model is thought of as the structure which holds variables and parameters. But data transformations may change results and thus inference, so we change the model (i.e. the distributions in it) to better adapt to the data (if I understand that correctly). But I guess that’s also beside the point. Back to the suggestion: if you mean standardising the worry-score variable, I don’t think that would change things much, as the main issue is about using theoretically ‘appropriate’ vs ‘inappropriate’ distributions for the observed data (RTs in this case).

Regarding your second suggestion: if I take the log of the RT variable, the tail of the distribution shifts towards the negative, but because RTs are generally skewed (in this case with median ~380 ms and SD ~310 ms), the tail should contain very few negative values. So, in practice, running a StudentT model on log(RT) gives basically the same results as the Lognormal model. I’m not sure I follow what you mean by

“…it also doesn’t have the issue of allowing for negative reaction times.”

As far as I understand, a StudentT distribution will always allow negative values (unless you bound, truncate or fold it). Actually, wouldn’t the point of taking the log of RTs be to allow negative values so the StudentT adapts better?

Thanks again.

I meant standardizing after taking the log (and using a normal likelihood on the log-data), which would be equivalent to using the lognormal likelihood on the natural data. I only suggested this because you seemed interested in standardizing your data; the only advantage is that it may help with choosing priors or comparing multiple predictors, which doesn’t seem to be relevant in your example.
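The equivalence can be checked numerically (a sketch with simulated RTs and arbitrary parameter values): the lognormal log-density of the raw data and the normal log-density of the standardised log-data differ only by a term that does not involve the parameters, so inference about mu and sigma is unaffected.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
rt = rng.lognormal(mean=np.log(380), sigma=0.4, size=1000)  # simulated RTs

log_rt = np.log(rt)
m, sd = log_rt.mean(), log_rt.std()
z = (log_rt - m) / sd                 # standardised log-RTs

mu, sigma = 5.9, 0.5                  # arbitrary log-scale parameters

lp_lognormal = stats.lognorm.logpdf(rt, s=sigma, scale=np.exp(mu))
lp_normal_z = stats.norm.logpdf(z, loc=(mu - m) / sd, scale=sigma / sd)

# Difference is -log(rt) - log(sd): constant with respect to mu and sigma
diff = lp_lognormal - lp_normal_z
assert np.allclose(diff, -log_rt - np.log(sd))
```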

In practice it should give very similar results to the lognormal, yes. Again, I suggested this because you were toying with the Student-t on the natural scale, not because I think it would be any better. I was just trying to understand why you were juxtaposing the options <lognormal + raw data> and <Student-t + standardized data>, which sounded to me like two orthogonal issues.

Perhaps to be more helpful: what do your log-residuals look like? Do they look like they could be well explained by a normal / Student-t? If so, then the log-normal (or log-Student-t) would seem sensible. If not, you may try other kinds of transformations or likelihoods.
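For instance (a sketch with simulated stand-in data, since the real fit isn’t shown here), the residual check amounts to something like:

```python
# Sketch with simulated stand-in data: subtract the fitted log-scale mean
# from log(RT) and inspect whether the residuals look Normal / Student-t.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
rt = rng.lognormal(mean=np.log(380), sigma=0.4, size=500)  # stand-in RTs
fitted_mu = np.full(rt.shape, np.log(380))                 # stand-in fitted mu

resid = np.log(rt) - fitted_mu

print('skew:', stats.skew(resid))
print('excess kurtosis:', stats.kurtosis(resid))
# For a visual check: stats.probplot(resid, plot=matplotlib.pyplot.gca())
```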

I understand your approach now. I was a bit wary of a double transformation, but mathematically your point makes sense: standardising after taking the log should be equivalent to a lognormal likelihood. However, I’m still puzzled about the difference in results. When I run the model with your suggested standardised log-transformed RTs, the results seem much more sensible than with the lognormal likelihood (maybe there’s some nuance I’m failing to spot). Here’s what I obtained, including some residual plots as you suggested:

Posterior predictive checks (PPCs):

Regression (one condition):

Residuals:

This seems like the way to go to me. The residuals present a one-sided long tail, but a StudentT may properly account for that, as suggested by the PPCs. The regression seems to fit much better as well.

Thank you very much (=


This can only come from your priors, I think. They are probably more sensible for the standardized data than for the raw data.

I’ll just throw out there that some (many?) people model reaction times using an ex-Gaussian distribution, which is the sum of a normal distribution (capturing the basic unimodality of the RT distribution) and an exponential distribution (capturing the long right tail). People like it, in part, because it allows the location of the normal component to be estimated while isolating it from the exponential tail component. Examples include this paper by our own @twiecki.
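The construction is easy to see by simulation (made-up parameter values): an ex-Gaussian draw is a normal draw plus an exponential draw, so its mean is mu + nu and the exponential component supplies the right skew.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, nu = 300.0, 40.0, 80.0      # illustrative values in ms

# ex-Gaussian = Normal component + Exponential tail component
samples = rng.normal(mu, sigma, 100_000) + rng.exponential(nu, 100_000)

assert abs(samples.mean() - (mu + nu)) < 2.0   # mean is mu + nu
assert samples.mean() > np.median(samples)     # right-skewed
```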


Thanks for the suggestion. Unfortunately, this approach doesn’t work for me. Maybe I’m parametrising something wrongly, but I cannot find an explicit parametrisation of the likelihood in the paper you sent. So, just to try it out, I applied pm.ExGaussian to the model:

with pm.Model() as mod:
    
    alpha_m = pm.Normal('alpha_m', 0, 1)
    alpha_s = pm.HalfNormal('alpha_s', 1)
    alpha_t = pm.Normal('alpha_t', 0, 1, shape=conds)
    alpha = pm.Deterministic('alpha', alpha_m + alpha_t*alpha_s)  
    
    beta_m = pm.Normal('beta_m', 0, 1)
    beta_s = pm.HalfNormal('beta_s', 1)
    beta_t = pm.Normal('beta_t', 0, 1, shape=conds)
    beta = pm.Deterministic('beta', beta_m + beta_t*beta_s)  

    mu = pm.math.dot(inter, alpha) + pm.math.dot(slope, beta)*Worry
    sigma = pm.Exponential('sigma', 5)

    y = pm.ExGaussian('RT', mu=mu, sigma=sigma, nu=6, observed=RT)

My guess was that this would probably break the model, as the interactions may cause some conditions to hit ceiling effects or have very few observations. Effectively, the model breaks due to bad energy issues. This happens with both untransformed and transformed RTs. But, again, I’m probably applying this wrongly.

Even so, my impression is that many of these approaches work very well in theory (e.g. in simulations), or with very ad-hoc data, in this case from a stop-signal paradigm. Instead, I’m using a stimulus recognition paradigm, where participants have to answer every trial (i.e. press left or right), so many miss or false-alarm responses are not included in the analysis, because the interest is only in RTs for correct responses (alternatives may be argued for). For instance, in the paper you linked, they have between 2,000 and 3,000 observations, without a complex interaction, and that seems to work neatly. They also point out insensitivity to priors. In my case, the models are very sensitive to priors, and I need over 9,000 to 10,000 observations for the models to become less sensitive (e.g. using an exponential hyperprior for the StudentT greatly shifts the intercept downwards, but a fixed prior does not).

In any case, it is a very interesting discussion. As far as I understand, the modelling of RTs is an ongoing issue. I’m not sure whether the ExGaussian approach is taken by many or few, but it’s certainly one among several approaches (they also mention this in the paper). Personally, I lack the expertise to address this issue theoretically in a way that also works in practice.

At the moment, @ricardoV94 ’s suggestion of using the standardised log of RTs with a StudentT likelihood is what works best for me. I also agree with his suggestion that the Lognormal on untransformed RTs may not be working due to issues with priors. However, I’m using weakly informative priors for both intercepts and slopes, i.e. very conventional normal distributions in a non-centred parametrisation, not too far from the aforementioned paper, where normal priors are the choice for the hierarchical model they implemented. So I remain slightly puzzled that the Lognormal and ExGaussian cause issues while the StudentT with standardised log RTs works so well. Thanks again for the suggestions.

I went back to your initial post, but I can’t quite figure out what’s going on. For example, I can’t figure out how many people are giving you your 4,847 reaction times. Regardless, my strategy for dealing with convergence issues, particularly when there is still uncertainty about the fundamentals of your model (e.g., how to model noise in your data), is to tackle a smaller-scale version of the problem, ensure that you can model it reasonably well, and only then begin to complicate things. So in this sort of case, I might take data from a single condition and a single individual and try to model that. Once you have that nailed down, you can try adding other factors. If things break at that point, you know it’s (probably) not because you picked the wrong noise function.
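A sketch of that strategy (with fabricated data and hypothetical column names, since the real dataset isn’t shown): subset to one participant and one factor level, and check that even a plain location/scale description of log(RT) is sensible before layering on the hierarchy.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
data = pd.DataFrame({
    'subject': np.repeat([1, 2], 100),             # hypothetical IDs
    'ear': np.tile(['left', 'right'], 100),        # hypothetical factor
    'RT': rng.lognormal(np.log(380), 0.4, 200),    # fabricated RTs (ms)
})

# One participant, one condition: the smallest version of the problem
sub = data[(data['subject'] == 1) & (data['ear'] == 'left')]

log_rt = np.log(sub['RT'].values)
print(len(sub), log_rt.mean(), log_rt.std())
```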

Thanks for the reply. Those observations came from around 50 participants. However, as I mentioned in the previous posts, the StudentT and Lognormal models do converge; my question is not about convergence issues. I just pointed out that the model does not converge when using an ExGaussian likelihood, at least as I implemented it in the previous post (maybe wrongly?). However, I decided to give the ExGaussian another try, this time with a smaller nu parameter (arbitrarily reduced to 2) and with standardised RTs. This model does converge well, but the PPCs are pretty bad:

I got very similar PPCs when using a Normal likelihood. In this sense, the models using StudentT distributions with standardised data have performed much better. And, as these models already converge with very good ESS, R_hats and BFMIs, I’m not sure how trying reduced versions of my data just to force a fit for an ExGaussian would help answer my question. But maybe I understood your point incorrectly; for instance, I don’t quite follow what you mean by picking the wrong noise function.

By “noise function”, I was referring to the distribution you are wrapping around your predicted mean. You’re predicting that RTs should be:

mu = pm.math.dot(inter, alpha) + pm.math.dot(slope, beta)*Worry

but then you wrap this in a Student t, a normal, or an ExGaussian. I, personally, would be hesitant to dive into a full-blown hierarchical model with several predictors without having a firm handle on how (predicted) means are connected to (noisy) observed data. The suggestion to try a smaller version of your problem wasn’t to “force a fit”, but to let you learn more about what your data are like.

Sorry for my misunderstanding, and many thanks for the clarification. However, I still don’t quite see how sampling from reduced versions of the data would help me understand the connection between the expectation/predicted mean and the likelihood distribution/noise function. Personally, I prefer to define a model as soon as I design an experimental task, as the model is tightly linked to the question I’m interested in answering. If that model fails, a different model can be tried on the basis that it can answer the same question. For instance, if I sample from one participant and one condition, wouldn’t that imply an inherently different data structure compared to the data of fifty participants and two conditions? If the structure is different, then it is not the structure required to answer the question at hand. But I may be wrong on this; a rebuttal on this point would be great for a deeper consideration of the topic. Thanks again.

Are you pooling the data from the 50 different participants? Because you are already working with a hierarchical model, I had assumed you were using partial pooling over participants, but I don’t see any participant-level parameters. If you are pooling completely, that complicates things further, because now you’re not working with a standard right-skewed RT distribution, but with a mixture of 50 such distributions. Without knowing how participants vary (e.g., the variability of individuals’ mean RTs), I wouldn’t know what to expect of such a distribution or how to go about modeling it (without acknowledging the existence of the individuals).
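That point can be illustrated with a quick simulation (made-up between- and within-subject spreads): completely pooled log-RTs are a mixture over participants and are necessarily more dispersed than any single participant’s data.

```python
import numpy as np

rng = np.random.default_rng(4)

subj_means = rng.normal(np.log(380), 0.3, size=50)  # between-subject spread
within_sd = 0.2                                     # within-subject spread

# Complete pooling: concatenate everyone's log-RTs into one sample
pooled = np.concatenate([rng.normal(m, within_sd, 100) for m in subj_means])
single = rng.normal(subj_means[0], within_sd, 100)  # one participant alone

assert pooled.std() > single.std()  # mixture is wider than any individual
```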

So let’s ignore the pooling issue. If I have data from a 2x3 design but can’t successfully characterize the data from 1 of those 6 conditions, then I’m not going to move on to the full model, because it’s not going to get any easier. That’s all I meant.

That’s an interesting discussion. I included partial pooling of participants in my first model, as a varying intercept. However, I’m not completely convinced of the appropriateness of that approach. The problem I see is that when you include participants in the model, you include them as a variable. The model, however, already has a variable measuring worry. If worry affects participants in such a way that they answer faster or slower, then we have a chained causal relationship: worry influences the participant, and the participant causes the RT. If I condition on participants, the effects of worry can get blocked or obscured (i.e. parameters pushed towards zero). It may be argued that the participant variable/levels may also capture effects of other unobserved variables, not necessarily related to worry. In that case, it is still uncertain what those relationships are, and many if not most of them would not be of interest for the present questions. If this reasoning makes sense, adding participants to the model would at best create additional ‘not very informative’ uncertainty, and at worst could create multicollinearity issues which end up obscuring the effect of interest. I may be wrong on this, but I cannot come up with a more compelling argument for keeping participants in the model.

Regarding your second point: although it does not answer the question about the appropriateness of the distribution, I think you’re right that it wouldn’t get any easier. However, trying to address the problem is still possible and may be worth it in some cases, even if only a partial solution is reached.