Correlation in Posterior for Slope Parameters (categorical)

Hello everyone :slight_smile:

I might have found an answer myself but I am looking for a sanity check.
I have the following simple linear regression:

y_{i} \sim N(\mu_{i}, \sigma)
\sigma \sim \text{HalfCauchy}(1)
\mu_{i} = \alpha + \beta_{j[i]}
\alpha \sim N(0, 1)
\beta_{j} \sim N(0, 1)

\beta_{j} is the coefficient for a categorical predictor with 4 levels
y has been standardized to have mean 0 and standard deviation 1
n=18,000
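
For concreteness, here is a minimal PyMC3 sketch of the model as written (the names y and cluster_idx are my own placeholders, not from the post):

```python
import numpy as np
import pymc3 as pm

# Hypothetical stand-ins: y standardized, cluster_idx in {0, 1, 2, 3}
rng = np.random.default_rng(0)
cluster_idx = rng.integers(0, 4, size=18_000)
y = rng.standard_normal(18_000)

with pm.Model() as model:
    alpha = pm.Normal("alpha", mu=0.0, sigma=1.0)
    beta = pm.Normal("beta", mu=0.0, sigma=1.0, shape=4)
    sigma = pm.HalfCauchy("sigma", beta=1.0)
    # mu is deterministic: shared intercept plus a level-specific offset
    pm.Normal("y_obs", mu=alpha + beta[cluster_idx], sigma=sigma, observed=y)
    trace = pm.sample()
```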

Issues I notice

  1. Sampling is painfully slow (given how simple the model is)
  2. The posteriors for the levels of the \beta parameter are almost perfectly correlated

I imagine point 2 is the cause of point 1, so I looked at the traceplots.

I noticed that the estimated parameters for the Cluster slopes (\beta) overlap completely and are centered on 0. My suspicion is that the Cluster predictor is completely uninformative with respect to the outcome (this was under investigation, so I couldn’t know it in advance), hence every level carries the same information, which gives rise to issue 2 and, from there, issue 1.

Am I correct in my intuition? Any extra word of wisdom on this?

Thank you

Hi!
You’re absolutely correct. Although I wouldn’t say they are correlated; rather, the posterior distributions of the groups overlap. Also, the coefficients \beta_j aren’t slopes: they’re conditional means of the outcome, or, in this case, the deviations from the fixed intercept term for each level.

The model you described above is also referred to as an ANOVA: https://online.stat.psu.edu/stat502/lesson/1

What the plot indicates is that the outcome is similar across levels. For instance, suppose the 4 levels are 4 different drugs and the outcome is reduction in blood pressure: if the posterior distributions of the 4 drugs looked like the above, there’d be very little evidence that any one of them reduces blood pressure more than the others.

I imagine the sampling is slow because of the number of observations.

Have you tried removing the intercept (\alpha)? It sounds like you might have an overparametrized model (5 de facto intercepts for 4 categories).
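
A minimal sketch of that suggestion, reusing the hypothetical names from the first post: drop \alpha and let each \beta_j act directly as the mean of its level.

```python
with pm.Model() as model_no_intercept:
    beta = pm.Normal("beta", mu=0.0, sigma=1.0, shape=4)
    sigma = pm.HalfCauchy("sigma", beta=1.0)
    # Without a shared intercept, beta[j] is simply the mean of y in level j
    pm.Normal("y_obs", mu=beta[cluster_idx], sigma=sigma, observed=y)
```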

Hello @dilsher_dhillon :slight_smile:

thank you very much for your answer, it is very reassuring!
Just a couple of additional points:

  1. Isn’t asserting that the posteriors for the levels of a categorical predictor overlap synonymous with saying they are correlated?

  2. I know categorical variables are indexed in PyMC3; the reason I call \beta_{j} a slope is that in my mind I see the linear model as a one-hot encoding, so \mu_{i} = \alpha + \sum_{j} x_{ij}\beta_{j} with x_{ij} the one-hot indicator (see the sketch at the end of this post). Also, I am making an effort to think of ANOVAs more in terms of linear models, as it seems to make things easier :rofl:

  3. I have run more complicated models with more data in a fraction of the time of the linear model specified above. I know NUTS tends to slow down considerably if parameters’ posteriors are correlated.

My suspicion was that since, as you said, the posteriors for the groups overlap (i.e., they appear perfectly correlated to the sampler), NUTS was having a hard time doing its job.
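
To make the one-hot view in point 2 concrete, here is a small illustrative numpy sketch (my own example values) showing that indexing \beta by level and multiplying a one-hot design matrix by \beta give the same \mu:

```python
import numpy as np

alpha = 0.5
beta = np.array([0.1, -0.2, 0.3, 0.0])
cluster_idx = np.array([0, 2, 1, 3, 2])

# Indexing view: pick out each observation's level coefficient
mu_indexed = alpha + beta[cluster_idx]

# One-hot view: each row of X has a single 1 marking the level
X = np.eye(4)[cluster_idx]
mu_onehot = alpha + X @ beta

assert np.allclose(mu_indexed, mu_onehot)
```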

Hi @ricardoV94 :slight_smile:

yes, I used to do that before, but it was occasionally giving me problems. I know that when you run a linear model like the one above in statsmodels, it automatically uses one of the categories as the intercept.

From previous experience in PyMC3, if you build the model in context-manager mode, that doesn’t happen (I might be disastrously wrong here), and overparametrizing seems to force one level of the category to become the intercept (that level’s parameter goes to zero and the intercept expresses the impact of that category).

Just to chime in, it’s often not possible to draw strong conclusions about the sampling or the posterior it generates without knowing more about a) how the model is implemented, b) the data you are using for inference, and c) the trace your sampling has produced. But I think folks have already laid out the relevant clues.

As @dilsher_dhillon mentioned, the similar-looking posteriors do not imply that the parameter estimates are correlated/dependent. Right now, you are plotting the marginal posteriors. To assess correlation/dependence, you would have to look at the joint posterior. So grab the trace object and crank out some scatter plots (e.g., plot_pair()).
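
For example, a minimal ArviZ sketch (assuming a trace like the one sampled above):

```python
import arviz as az

# Scatter plots of the joint posterior; tight diagonal bands
# between alpha and the betas indicate dependence
az.plot_pair(trace, var_names=["alpha", "beta"], kind="scatter")
```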

That immediately leads into @ricardoV94’s observation that the presence of the intercept is definitely going to cause some dependence (correlation) between the estimated value of \alpha and the estimated values of the \beta's. This always happens with regression-style models and is generally not that big a deal. The problem here, however, is that the “slopes” themselves are basically just intercepts. For example, \alpha=0, \beta_1=2, \beta_2=4, \beta_3=6, \beta_4=8 will give you exactly the same model as \alpha=5, \beta_1=-3, \beta_2=-1, \beta_3=1, \beta_4=3. And there are infinitely many such equivalent sets of parameter values.
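
A quick numeric check of that equivalence, just to illustrate the point:

```python
import numpy as np

beta_a = np.array([2.0, 4.0, 6.0, 8.0])   # with alpha = 0
beta_b = np.array([-3.0, -1.0, 1.0, 3.0])  # with alpha = 5

# Both parameter sets yield identical cell means, so the likelihood
# cannot tell them apart
assert np.allclose(0.0 + beta_a, 5.0 + beta_b)
```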

So inspect your trace, looking for dependence between the sampled values of \alpha and all the \beta's. Assuming you find strong dependence, omit \alpha and see how that goes. Or just drop \alpha right away and see how it goes (but then go back and check the original trace, because sampling diagnostics are important).

Hi again,

@cluhmann already described the issue of the overparametrization in detail.

I just wanted to clear up the confusion with statsmodels. If you run your model in statsmodels, indeed one of the categories is removed, so you end up with 1 intercept + 3 categories = 4 variables. It is perfectly fine to have a model with 0 intercept + 4 categories = 4 variables in PyMC3 (it would also be fine in statsmodels if that were an option there). But here you are actually going with 1 intercept + 4 categories = 5 variables. So you have more variables than you should, which leads to the issues that @cluhmann mentioned.
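
To see statsmodels’ default behavior, here is a small sketch (the DataFrame df with columns y and cluster is hypothetical, built from the placeholder data above):

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({"y": y, "cluster": cluster_idx.astype(str)})

# Default treatment coding: one level is absorbed into the intercept,
# leaving 1 intercept + 3 dummies = 4 parameters
fit = smf.ols("y ~ C(cluster)", data=df).fit()
print(fit.params)
```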

Hi @cluhmann :slight_smile:

I’ve been too hasty in my answer and left out important details:

  1. When I say perfectly correlated, I mean that this is the result coming from plot_pair(my_trace)

  2. Here is the plate diagram for the model; all the relevant information regarding the priors is specified in my original post

  3. As you and @ricardoV94 said, the intercept was causing the correlation problems!


I’ve been very silly.

Now I wonder, does this logic also apply to random intercepts?

This can happen if you have enough information (either from the prior or from the data) to identify the overparametrized model. But even if you can identify this more complicated model, what benefit do you get?
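
My reading of the random-intercepts case, as a hedged sketch (not code from the thread): a hierarchical prior whose group-level sd is learned shrinks the offsets toward zero, and that extra prior information can softly identify the otherwise overparametrized model.

```python
with pm.Model() as model_hierarchical:
    alpha = pm.Normal("alpha", mu=0.0, sigma=1.0)
    # Learned group-level sd pulls the offsets toward zero,
    # softly identifying the shared intercept
    sigma_b = pm.HalfNormal("sigma_b", sigma=1.0)
    beta = pm.Normal("beta", mu=0.0, sigma=sigma_b, shape=4)
    sigma = pm.HalfCauchy("sigma", beta=1.0)
    pm.Normal("y_obs", mu=alpha + beta[cluster_idx], sigma=sigma, observed=y)
```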

The posterior looks good. I have made the very same type of mistake before :slight_smile: