I’m trying to fit a simple logistic regression model. I have an output consisting of a 1D array of binary variables, and a 1D input array of the same length consisting of a single predictor variable. If the input is I_t_cut and the output is s_after_cut, I want to model:

P(s_after_cut = 1) = invlogit(z) = 1 / (1 + exp(-z))

where z = rv_qc * I_t_cut, and rv_qc is the parameter I want to estimate, with a vague prior N(0, 10000) (the 10000 being the standard deviation).
My code looks like this:
import pymc3 as pm
import matplotlib.pyplot as plt

logistic_model = pm.Model()
with logistic_model:
    # vague prior on the coefficient
    rv_qc = pm.Normal("rv_qc", mu=0, sd=10000)
    # success probability for each observation
    p = pm.math.invlogit(rv_qc * I_t_cut)
    s_obs = pm.Bernoulli("s_obs", p=p, observed=s_after_cut)
    trace = pm.sample(500, cores=4, chains=4, tune=1000, target_accept=0.95)

pm.traceplot(trace)
print("Likelihood sampled using MCMC NUTS:")
plt.show()
print(pm.summary(trace))
However, the results I get are weird. For one thing, sampling is quick at first (~100-200 samples/second), but after a certain point it drops right down to 1-2 samples/second and takes ages to finish. Secondly, the sampler reports many divergent transitions, and the posterior samples of the parameter look strange: a huge spike at particular values, and a small number of samples spread over a range that is barely visible in the histogram. I tried this with many different settings (more samples, a larger tune value, a higher target_accept; one such variant is sketched after the output below), but they all result in this strange behavior. Here is just one example:
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [rv_qc]
Sampling 4 chains: 100%|██████████| 6000/6000 [19:42<00:00, 1.51draws/s]
D:\Anaconda3\lib\site-packages\mkl_fft\_numpy_fft.py:1044: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
output = mkl_fft.rfftn_numpy(a, s, axes)
The chain reached the maximum tree depth. Increase max_treedepth, increase target_accept or reparameterize.
There were 233 divergences after tuning. Increase `target_accept` or reparameterize.
The chain reached the maximum tree depth. Increase max_treedepth, increase target_accept or reparameterize.
The chain reached the maximum tree depth. Increase max_treedepth, increase target_accept or reparameterize.
The gelman-rubin statistic is larger than 1.4 for some parameters. The sampler did not converge.
The estimated number of effective samples is smaller than 200 for some parameters.
Likelihood sampled using MCMC NUTS:
           mean        sd  mc_error   hpd_2.5  hpd_97.5     n_eff       Rhat
rv_qc  0.507397  0.149241  0.014921  0.257683  0.635606  2.003107  47.298419
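For reference, one of the variants I tried looked roughly like this (the exact values are illustrative; I changed them between runs):

# one of many retries: more draws, longer tuning, higher target_accept
trace = pm.sample(2000, cores=4, chains=4, tune=5000, target_accept=0.99)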
I have uploaded the two arrays as text files so you can reproduce what I did. Here is the one for I_t_cut, and here is the one for s_after_cut.
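In my script I read them in with something like this (the file names are placeholders for wherever you save the downloads):

import numpy as np

# placeholder paths; point these at the downloaded text files
I_t_cut = np.loadtxt("I_t_cut.txt")
s_after_cut = np.loadtxt("s_after_cut.txt").astype(int)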
Does anyone know what might be causing this? I’m new to this sort of analysis; I tried reading the information in the docs about diagnosing this kind of problem, but it went over my head.
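P.S. In case it helps with diagnosis, this is how I counted the divergences reported above (based on my reading of the sampler-stats docs, so I may be misusing it):

# 'diverging' is a per-draw NUTS sampler statistic stored in the trace
divergent = trace["diverging"]
print("Number of divergent transitions after tuning:", divergent.sum())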