I’m trying to fit a simple logistic regression model. I have an output consisting of a 1D array of binary variables, and a 1D input array of the same length consisting of a single predictor variable. If the input is I_t_cut and the output is s_after_cut, I want to model:

P(s_after_cut = 1) = invlogit(z) = 1 / (1 + exp(-z))

where z = rv_qc * I_t_cut, and rv_qc is the parameter I want to estimate, with a vague prior N(0, 10000) (the 10000 being the standard deviation).
My code looks like this:
import pymc3 as pm
import matplotlib.pyplot as plt

logistic_model = pm.Model()
with logistic_model:
    # vague prior on the coefficient
    rv_qc = pm.Normal("rv_qc", mu=0, sd=10000)
    # success probability for each observation
    p = pm.math.invlogit(rv_qc * I_t_cut)
    s_obs = pm.Bernoulli("s_obs", p=p, observed=s_after_cut)
    trace = pm.sample(500, cores=4, chains=4, tune=1000, target_accept=0.95)

pm.traceplot(trace)
print("Likelihood sampled using MCMC NUTS:")
plt.show()
print(pm.summary(trace))
However, the results I get are weird. For one thing, sampling is quick at first (~100-200 samples/second), but after a certain point it drops right down to 1-2 samples/second and takes ages to finish. Secondly, the sampler reports many divergent transitions, and the posterior samples of the parameter look strange: a huge spike at particular values, and a small number of samples spread over a range that is barely visible in the histogram. I tried this with many different settings (more samples, a larger tune value, a higher target_accept; one such variant is sketched after the output below), but they all result in this strange behavior. Here is just one example:
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [rv_qc]
Sampling 4 chains: 100%|██████████| 6000/6000 [19:42<00:00, 1.51draws/s]
D:\Anaconda3\lib\site-packages\mkl_fft\_numpy_fft.py:1044: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
output = mkl_fft.rfftn_numpy(a, s, axes)
The chain reached the maximum tree depth. Increase max_treedepth, increase target_accept or reparameterize.
There were 233 divergences after tuning. Increase `target_accept` or reparameterize.
The chain reached the maximum tree depth. Increase max_treedepth, increase target_accept or reparameterize.
The chain reached the maximum tree depth. Increase max_treedepth, increase target_accept or reparameterize.
The gelman-rubin statistic is larger than 1.4 for some parameters. The sampler did not converge.
The estimated number of effective samples is smaller than 200 for some parameters.
Likelihood sampled using MCMC NUTS:
           mean        sd  mc_error   hpd_2.5  hpd_97.5     n_eff       Rhat
rv_qc  0.507397  0.149241  0.014921  0.257683  0.635606  2.003107  47.298419
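For reference, one of the variants I tried looked roughly like this (the exact values are illustrative; I changed them between runs):

# one of many retries: more draws, longer tuning, higher target_accept
trace = pm.sample(2000, cores=4, chains=4, tune=5000, target_accept=0.99)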
I have uploaded the two arrays as text files so you can reproduce what I did. Here is the one for I_t_cut, and here is the one for s_after_cut.
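In my script I read them in with something like this (the file names are placeholders for wherever you save the downloads):

import numpy as np

# placeholder paths; point these at the downloaded text files
I_t_cut = np.loadtxt("I_t_cut.txt")
s_after_cut = np.loadtxt("s_after_cut.txt").astype(int)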
Does anyone know what might be causing this? I’m new to this sort of analysis; I tried reading the information in the docs about diagnosing this kind of problem, but it went over my head.
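P.S. In case it helps with diagnosis, this is how I counted the divergences reported above (based on my reading of the sampler-stats docs, so I may be misusing it):

# 'diverging' is a per-draw NUTS sampler statistic stored in the trace
divergent = trace["diverging"]
print("Number of divergent transitions after tuning:", divergent.sum())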