Logistic Regression with a very sparse matrix

Hi,

I’m building a logistic regression model for survey data in which 80%+ of the values may be zeros. I’m running into issues sampling even basic models, and I’m wondering whether there are ways to transform the data or specify the model differently to make sampling easier.

All data are booleans cast as ints. The relevant features of the dataset are described below:

        q1_0   q2_0   q3_0
count   100    100    100
mean    0.22   0.68   0.95
std     0.41   0.46   0.21
min     0      0      0
25%     0      0      1
50%     0      1      1
75%     0      1      1
max     1      1      1

I’m testing on random data where:

y = q1_0 + 2 * q2_0 - 2 * q3_0
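
Roughly, the relevant part of the test data looks like this (a simplified sketch; Bernoulli-sampling y from those log-odds is just to make the sketch concrete, the exact generator doesn't matter much):

import numpy as np

rng = np.random.default_rng(0)
n = 100

# Three 0/1 features with roughly the marginal frequencies shown in the
# describe() table above.
q1_0 = rng.binomial(1, 0.22, size=n)
q2_0 = rng.binomial(1, 0.68, size=n)
q3_0 = rng.binomial(1, 0.95, size=n)

# Treat the linear combination as log-odds and draw a binary outcome.
logits = q1_0 + 2 * q2_0 - 2 * q3_0
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logits)))
X = np.column_stack([q1_0, q2_0, q3_0])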

The model is:

import pymc3 as pm

def bernoulli(X, y):
    # X is a theano shared variable, so its shape has to be evaluated.
    n_features = X.shape.eval()[1]
    with pm.Model() as model:
        # Hyperpriors for the intercept and the coefficients.
        mu_alpha = pm.Normal('mu_alpha', mu=0, sd=100)
        sigma_alpha = pm.HalfNormal('sigma_alpha', sd=100)

        mu_beta = pm.Normal('mu_beta', mu=0, sd=100)
        sigma_beta = pm.HalfNormal('sigma_beta', sd=100)

        # Intercept and one coefficient per feature.
        alpha = pm.Normal('alpha', mu=mu_alpha, sd=sigma_alpha)
        beta = pm.Normal('beta', mu=mu_beta, sd=sigma_beta, shape=n_features)

        # Bernoulli likelihood on the inverse-logit of the linear predictor.
        p = pm.math.invlogit(alpha + pm.math.dot(X, beta))
        y_obs = pm.Bernoulli('y_obs', p, observed=y)
    return model
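
It's fit with the default NUTS sampler, roughly like this (X is passed in as a theano shared variable, which is why the shape is evaluated above; the exact sampling settings aren't important):

import theano
import pymc3 as pm

# Wrap the 0/1 design matrix in a shared variable and sample with defaults.
X_shared = theano.shared(X.astype('float64'))

model = bernoulli(X_shared, y)
with model:
    trace = pm.sample(chains=4)

print(pm.summary(trace))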

During sampling I got 52 divergences:

Sampling 4 chains, 52 divergences: 100%|██████████| 6000/6000 [03:31<00:00, 28.34draws/s]
There was 1 divergence after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.9978061165399938, but should be close to 0.8. Try to increase the number of tuning steps.
The chain reached the maximum tree depth. Increase max_treedepth, increase target_accept or reparameterize.
There was 1 divergence after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.996341172826582, but should be close to 0.8. Try to increase the number of tuning steps.
The chain reached the maximum tree depth. Increase max_treedepth, increase target_accept or reparameterize.
There were 26 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.9689980122205897, but should be close to 0.8. Try to increase the number of tuning steps.
The chain reached the maximum tree depth. Increase max_treedepth, increase target_accept or reparameterize.
There were 24 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.9708234101690537, but should be close to 0.8. Try to increase the number of tuning steps.
The chain reached the maximum tree depth. Increase max_treedepth, increase target_accept or reparameterize.
The rhat statistic is larger than 1.4 for some parameters. The sampler did not converge.
The estimated number of effective samples is smaller than 200 for some parameters.

summary:

	mean	sd	hpd_3%	hpd_97%	mcse_mean	mcse_sd	ess_mean	ess_sd	ess_bulk	ess_tail	r_hat
mu_alpha	0.239	45.384	-94.697	91.707	0.881	2.162	2656.0	221.0	1433.0	1429.0	1.21
mu_beta	-0.022	0.081	-0.218	0.086	0.036	0.027	5.0	5.0	6.0	13.0	2.67
alpha	-1.649	0.334	-2.252	-0.981	0.082	0.059	17.0	17.0	17.0	28.0	1.16
beta[0]	0.017	0.101	-0.189	0.267	0.033	0.024	9.0	9.0	10.0	28.0	2.13
beta[1]	-0.027	0.123	-0.398	0.151	0.051	0.038	6.0	6.0	8.0	12.0	2.56
beta[2]	-0.024	0.102	-0.246	0.155	0.032	0.023	10.0	10.0	14.0	19.0	1.82
beta[3]	-0.034	0.097	-0.262	0.101	0.031	0.022	10.0	10.0	14.0	17.0	1.87
beta[4]	-0.037	0.085	-0.232	0.058	0.024	0.017	13.0	13.0	18.0	56.0	1.81
beta[5]	-0.012	0.134	-0.353	0.236	0.047	0.035	8.0	8.0	9.0	15.0	2.37
beta[6]	-0.016	0.108	-0.303	0.159	0.021	0.026	26.0	9.0	24.0	25.0	2.24
beta[7]	0.037	0.132	-0.182	0.335	0.027	0.024	24.0	16.0	26.0	27.0	2.26
beta[8]	-0.025	0.088	-0.248	0.104	0.029	0.021	9.0	9.0	9.0	23.0	1.84
beta[9]	-0.044	0.129	-0.400	0.117	0.052	0.039	6.0	6.0	11.0	13.0	1.95
beta[10]	-0.040	0.135	-0.358	0.166	0.043	0.032	10.0	10.0	14.0	17.0	2.03
beta[11]	-0.001	0.084	-0.203	0.157	0.017	0.015	24.0	17.0	19.0	46.0	2.20
beta[12]	-0.024	0.098	-0.280	0.141	0.039	0.029	6.0	6.0	9.0	14.0	2.19
beta[13]	-0.038	0.153	-0.263	0.158	0.048	0.035	10.0	10.0	9.0	14.0	2.11
beta[14]	0.050	0.150	-0.216	0.383	0.034	0.032	20.0	11.0	18.0	17.0	2.35
beta[15]	-0.012	0.082	-0.206	0.126	0.014	0.014	36.0	18.0	34.0	28.0	2.16
beta[16]	-0.032	0.161	-0.399	0.287	0.066	0.049	6.0	6.0	8.0	11.0	2.64
beta[17]	-0.012	0.118	-0.278	0.214	0.035	0.025	11.0	11.0	10.0	34.0	2.24
beta[18]	-0.035	0.123	-0.332	0.164	0.048	0.035	7.0	7.0	10.0	14.0	2.13
beta[19]	-0.020	0.120	-0.229	0.232	0.032	0.023	14.0	14.0	15.0	24.0	2.30
beta[20]	-0.035	0.121	-0.275	0.137	0.042	0.031	8.0	8.0	10.0	15.0	2.00
beta[21]	-0.055	0.140	-0.428	0.121	0.059	0.044	6.0	6.0	9.0	13.0	2.02
sigma_alpha	58.925	50.717	0.044	151.255	6.262	4.448	66.0	66.0	45.0	655.0	1.06
sigma_beta	0.061	0.090	0.000	0.225	0.036	0.027	6.0	6.0	5.0	11.0	2.53

My sense is that the predominance of zeros, and the varying frequency of ones across the relevant features, is making this difficult to model. Any ideas what I can do to improve sampling?

I would start by reducing the scale of your priors, e.g. sigma=10 or even sigma=1.
You should also try a reparameterization based on a matrix factorization, such as the QR decomposition: https://betanalpha.github.io/assets/case_studies/qr_regression.html
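
In code, a QR version might look roughly like this. It's just a sketch, not tested on your data: it drops the hierarchical hyperpriors, folds in the tighter sigma=1 priors, and follows the sqrt(n - 1) scaling from the case study; bernoulli_qr and the variable names are mine.

import numpy as np
import pymc3 as pm

def bernoulli_qr(X, y):
    # X is assumed to be a plain (n, k) numpy array here, not a shared variable.
    n, k = X.shape

    # Thin QR decomposition, scaled so Q_ast has roughly unit-scale columns.
    Q, R = np.linalg.qr(X)
    Q_ast = Q * np.sqrt(n - 1)
    R_ast = R / np.sqrt(n - 1)
    R_ast_inv = np.linalg.inv(R_ast)

    with pm.Model() as model:
        alpha = pm.Normal('alpha', mu=0, sd=1)
        # Priors go on the transformed coefficients beta_tilde = R_ast @ beta.
        beta_tilde = pm.Normal('beta_tilde', mu=0, sd=1, shape=k)
        # Recover the original-scale coefficients for interpretation.
        beta = pm.Deterministic('beta', pm.math.dot(R_ast_inv, beta_tilde))

        p = pm.math.invlogit(alpha + pm.math.dot(Q_ast, beta_tilde))
        pm.Bernoulli('y_obs', p, observed=y)
    return model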


Hi, thanks for the suggestions. They have both helped a lot with speed and stability, and the linked blog post was very informative. I do have two questions about it, though, that I haven't been able to find an answer to.

First, what is the advantage of combining R and beta into beta_tilde?

I ran into an issue with this, because the R matrix for my data is often singular and therefore not invertible. Is this transformation actually necessary in the model, or can you simply model it as pm.math.dot(Q, pm.math.dot(R, beta))?
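
That is, something along these lines, where beta stays on the original scale and R is never inverted (names here follow the sketch above rather than the case study exactly):

import pymc3 as pm

def bernoulli_qr_direct(Q_ast, R_ast, y):
    # Same likelihood as before, but the linear predictor is built as
    # Q_ast @ (R_ast @ beta), so no inverse of R is ever needed.
    k = R_ast.shape[1]
    with pm.Model() as model:
        alpha = pm.Normal('alpha', mu=0, sd=1)
        beta = pm.Normal('beta', mu=0, sd=1, shape=k)
        p = pm.math.invlogit(alpha + pm.math.dot(Q_ast, pm.math.dot(R_ast, beta)))
        pm.Bernoulli('y_obs', p, observed=y)
    return model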

Second, are any special transformations needed to interpret the posteriors of the coefficients as log-odds, as in an ordinary logistic regression?