Issues with Poisson Regression using GLM method

Snejana · April 5, 2018, 8:15pm

Hi,

I was following an example described in https://docs.pymc.io/notebooks/GLM-poisson-regression.html.
The data I am using for this example is rather simple (badhealth download here https://vincentarelbundock.github.io/Rdatasets/doc/COUNT/badhealth.html)

This is how the dataset looks like:

   numvisit	badh	age
1	30	0	58
2	20	0	54
3	16	0	44
4	20	0	57
5	15	0	33

The code is super simple:

with pm.Model() as model:
    pm.glm.GLM.from_formula("numvisit ~ badh + age", df, family=pm.glm.families.Poisson())

After sampling from the trace, I am getting the “ValueError: Mass matrix contains zeros on the diagonal. Some derivatives might always be zero.” After browsing other questions related to this subject, I printed my test values which might suggest an overflow. Although I am not convinced in that.

Intercept 0.0
badh -7.82669381218681
age -7.82669381218681
mu_log__ -2.7641181592074506
y -3838.248213052888

After attempting to fix the problem with log(y) instead of y, I have got a different issue.“Bad initial energy: inf. The model might be misspecified.” I suspect this is due to the fact that some values are 0s and then the logs get’s messed up. I printed my test values and they support the assumption:

Intercept 0.0
badh -7.82669381218681
age -7.82669381218681
mu_log__ -2.7641181592074506
y -inf

I went with the more native PyMC3 approach, and implemented the model as follows:

with pm.Model() as model1:
  intercept = pm.Normal("intercept", mu=0, sd=2, testval=0)
  b_badh = pm.Normal("b_badh", mu=0, sd=2, testval=0)
  b_age = pm.Normal("b_age", mu=0, sd=2, testval=0)
  log_lam = intercept + b_badh * df.badh + b_age * df.age 
  numvisits = pm.Poisson("numvisits", mu=np.exp(log_lam), observed=df.numvisit.values)

And here I am back with the “ValueError: Mass matrix contains zeros on the diagonal. Some derivatives might always be zero.”

What other ideas can I try here to troubleshoot the issue. Note, I have experimented with various ranges of sd (from 1 to 1e4), also experimented with tau instead of sd (ranges 0.1 to 0.0001) and multiple test values.

I will appreciate hints on further troubleshooting. Thank you.

junpenglao · April 5, 2018, 9:09pm

With the raw input data range, you quickly ran into overflow problem. For example, if the coefficients are 1s (which is quite reasonable given the priors), you have a prediction of the first row being np.exp(50) --> 5.184705528587072e+21

A usual treatment is to standardize or scales your input to a more reasonable range, for example, divide the age by 100.

Snejana · April 5, 2018, 9:28pm

This is great! I added the following code to make this work according to the suggestion:

scaler = StandardScaler()
scaler.fit(df)
scaled_df = pd.DataFrame(scaler.transform(df), columns=df.columns)
# reset the binary column 
scaled_df['badh'] = df.badh

After this, BOTH methods worked great.

Also thank you for super prompt response!

Topic		Replies	Views
Poisson regression divide by zero Questions	1	536	December 3, 2021
Gamma GLM problem: ValueError: Mass matrix contains zeros on the diagonal Questions	3	741	November 30, 2018
ValueError: Mass matrix contains zeros on the diagonal. The derivative of RV `@@@`.ravel()[0] is zero v3 modeling	0	584	September 13, 2022
"ValueError: Mass matrix contains zeros on the diagonal" error, how to debug? Questions	7	1600	March 7, 2020
I get "Bad initial energy" even on one of the examples Questions	3	1147	March 6, 2019

Issues with Poisson Regression using GLM method

Related topics