Logistic Regression with severe divergences

Hi all,

I am implementing (what I think is) a straightforward logistic regression, trying to predict a binary group label from a series of predictors. The predictors come from a principal components analysis retaining the PCs that explain 90% of the variance of a set of image pixels (256 * 256 images of faces) for 1,333 faces, with the final design matrix being 1333 * 148.

The PCs are all centred and scaled (though this may not be necessary), but the sampler runs very slowly and does not converge, ending with several thousand divergences. They seem to kick in around halfway through the sampling. I can’t get it to converge even with very tight priors on the coefficients, which is about the extent of my knowledge of fixing divergences.
Minimally reproducible example below:

# Just to show the PC and scaling procedure with sklearn #
# scaler = Pipeline([('scaler', StandardScaler()), ('PCA', PCA(n_components=.90)), ('rescaler', StandardScaler())])

# Scaling applied here to large pixel array #
# preds = scaler.fit_transform(pixel_arr)

import pandas as pd
import pymc3 as pm

# Read data
preds = pd.read_csv('https://osf.io/s53vm/download').set_index('Unnamed: 0')
label_data = pd.read_csv('https://osf.io/pe2n7/download').set_index('Unnamed: 0')
labels = label_data.squeeze()

# Set up model
with pm.Model() as classifier1:

    # Set data (for holdout classification later)
    Xdata = pm.Data('Xdata', preds)
    ydata = pm.Data('ydata', labels)

    # Priors
    β = pm.Normal('coefs', 0, 1, shape=preds.shape[1])
    β0 = pm.Normal('intercept', 0, 1)

    # Linear combination
    p = pm.math.invlogit(β0 + pm.math.dot(Xdata, β))
    proba = pm.Deterministic('proba', p)

    # Likelihood
    _ = pm.Bernoulli('likelihood', p=p, observed=ydata)

    # Inference button
    idata = pm.sample(1000, cores=8, return_inferencedata=True)


I am sure this is relatively simple but I can’t seem to stop making the sampler angry! Any help is greatly appreciated. Thank you!

This is just a hunch, but I would try:

    logit_p = β0 + pm.math.dot(Xdata, β)

    # Likelihood
    _ = pm.Bernoulli('likelihood', logit_p=logit_p, observed=ydata)


instead of computing p with the invlogit and then passing that in. The reasoning here is that providing the logit_p rather than p should allow pymc3 to avoid some numerical difficulties when the logit is very large or small. But I’m not confident that’s the issue here, just thought it might be worth a go!


Your hunch was right! Sampling is now quick and efficient, thank you so much. That is a great trick! Is there general advice on when to prefer the p versus logit_p parameterisation of Bernoulli?

Thanks again!


I guess the best approach is to use the parameterisation that is on the same scale as your computed parameter. In your case you had a logit and were inverting it to get onto the natural scale. Other times you may compute it in such a way that it is already on the natural scale (the output of a Beta distribution, for example). That would be my guess, but I could be wrong.


Thanks! That does make sense. It’s particularly interesting that for smaller models with a similar dataset (fewer than 10 predictors) the invlogit approach worked just fine, but for these larger sets of predictors it would really struggle. Very neat parameterisation nonetheless!

I agree with Ricardo that things might be different with other models, but for logistic regressions I would always try to use the logit if possible, basically because we can keep all probabilities on the log scale. You can see in the source that pymc3 does something special with the logits, whereas log(p) is taken if probabilities are given. So if your logit is very small, for example, it could happen that invlogit(x) ends up being zero, and then taking log(0) gives an infinity. On the other hand, the code using the logits should still work fine. If you’re curious, you could try the two versions in the source code for some extreme values of the logit, e.g. -40 or something and compare the results.
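To make the underflow concrete outside of pymc3, here is a small sketch using scipy’s expit as the inverse logit (the value -800 is just an arbitrarily extreme logit for illustration):

```python
import numpy as np
from scipy.special import expit  # inverse logit

logit = -800.0  # an extreme (but finite) logit

# Route 1: invert to a probability first, then take the log.
# expit(-800) underflows to exactly 0.0, so log(p) is -inf.
p = expit(logit)
with np.errstate(divide='ignore'):
    log_p_naive = np.log(p)

# Route 2: stay on the logit scale.
# log sigmoid(x) = -log(1 + exp(-x)), computed stably via logaddexp.
log_p_stable = -np.logaddexp(0.0, -logit)

print(p, log_p_naive, log_p_stable)  # 0.0 -inf -800.0
```

The stable route recovers the exact log probability even where the probability itself is unrepresentable, which is essentially what the logit_p parameterisation buys you.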
I do have some intuition for why things are fine with a few predictors. When you do X \beta in your logistic regression, each row of the result is going to be a sum of random variables: x_1 * \beta_1 + x_2 * \beta_2 ... x_n * \beta_n. Under your prior, the \beta_i are independent with variance 1, so you can compute the variance of the result (under the prior) as the sum of the individual variances. For a given (constant) set of x, that’ll end up being \sum x_i^2 Var(\beta_i) = \sum x_i^2. If you have only a few predictors, that sum is likely to be small. But if you have many, like 1000, you could end up with a variance of about that number, or a standard deviation of over 30, and logits that large could start causing trouble. If you’re curious, it might be interesting to see whether things work better with the invlogit if you make the prior on \beta tighter, e.g. \mathcal{N}(0, \sqrt{1 / p}) where p is your number of predictors. But that’s a stronger prior of course which you may not be comfortable with!
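The variance argument can be checked with a short numpy simulation; the dimensions below mirror the design matrix from the post, but the draws are synthetic standard-normal stand-ins for the scaled PC scores, not the actual data:

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_pred, n_draws = 1000, 148, 500  # dimensions roughly as in the post

# Standardised predictors stand in for the centred/scaled PC scores
X = rng.standard_normal((n_obs, n_pred))

# Prior draws of beta ~ N(0, 1): logits have std ~ sqrt(n_pred) ~ 12
logits_wide = X @ rng.standard_normal((n_pred, n_draws))

# Prior draws of beta ~ N(0, sqrt(1 / n_pred)): logits stay O(1)
logits_tight = X @ (rng.standard_normal((n_pred, n_draws)) / np.sqrt(n_pred))

print(logits_wide.std(), logits_tight.std())  # roughly 12.2 and 1.0
```

With the N(0, 1) prior and 148 predictors the prior-predictive logits routinely exceed ±20, exactly the regime where invlogit followed by log(p) starts to misbehave.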
That’s a great explanation, thank you so much! I think the log(p) was definitely causing issues there. I’ll stick to logit_p in future - and that’s a nice result about the variances I hadn’t thought of either. Lots to learn here, thanks again!