Issues with Poisson Regression using GLM method

Hi,

I was following an example described in https://docs.pymc.io/notebooks/GLM-poisson-regression.html.
The data I am using for this example is rather simple (badhealth download here https://vincentarelbundock.github.io/Rdatasets/doc/COUNT/badhealth.html)

This is how the dataset looks like:

   numvisit	badh	age
1	30	0	58
2	20	0	54
3	16	0	44
4	20	0	57
5	15	0	33

The code is super simple:

with pm.Model() as model:
    pm.glm.GLM.from_formula("numvisit ~ badh + age", df, family=pm.glm.families.Poisson())

After sampling from the trace, I am getting the “ValueError: Mass matrix contains zeros on the diagonal. Some derivatives might always be zero.” After browsing other questions related to this subject, I printed my test values which might suggest an overflow. Although I am not convinced in that.

Intercept 0.0
badh -7.82669381218681
age -7.82669381218681
mu_log__ -2.7641181592074506
y -3838.248213052888

After attempting to fix the problem with log(y) instead of y, I have got a different issue.“Bad initial energy: inf. The model might be misspecified.” I suspect this is due to the fact that some values are 0s and then the logs get’s messed up. I printed my test values and they support the assumption:

Intercept 0.0
badh -7.82669381218681
age -7.82669381218681
mu_log__ -2.7641181592074506
y -inf

I went with the more native PyMC3 approach, and implemented the model as follows:

with pm.Model() as model1:
  intercept = pm.Normal("intercept", mu=0, sd=2, testval=0)
  b_badh = pm.Normal("b_badh", mu=0, sd=2, testval=0)
  b_age = pm.Normal("b_age", mu=0, sd=2, testval=0)
  log_lam = intercept + b_badh * df.badh + b_age * df.age 
  numvisits = pm.Poisson("numvisits", mu=np.exp(log_lam), observed=df.numvisit.values)

And here I am back with the “ValueError: Mass matrix contains zeros on the diagonal. Some derivatives might always be zero.”

What other ideas can I try here to troubleshoot the issue. Note, I have experimented with various ranges of sd (from 1 to 1e4), also experimented with tau instead of sd (ranges 0.1 to 0.0001) and multiple test values.

I will appreciate hints on further troubleshooting. Thank you.

With the raw input data range, you quickly ran into overflow problem. For example, if the coefficients are 1s (which is quite reasonable given the priors), you have a prediction of the first row being np.exp(50) --> 5.184705528587072e+21

A usual treatment is to standardize or scales your input to a more reasonable range, for example, divide the age by 100.

This is great! I added the following code to make this work according to the suggestion:

scaler = StandardScaler()
scaler.fit(df)
scaled_df = pd.DataFrame(scaler.transform(df), columns=df.columns)
# reset the binary column 
scaled_df['badh'] = df.badh

After this, BOTH methods worked great.

Also thank you for super prompt response!

1 Like