GLM logistic regression with custom prior in pymc3 (v. 3.6)

New pymc3 user here :slight_smile:

I’ve been trying to get a slightly modified version of this pymc3 GLM logistic regression tutorial to work, so far to no avail.

print(pm.__version__)
3.6

I want to use custom (non-default) priors for the GLM coefficients. Here’s my code:

with pm.Model() as logistic_model:
  
  # define somewhat better priors
  n_coeffs = 4
  my_priors = {"Regressor": pm.distributions.Normal('dummy', mu=0, tau=1e-2, shape=(n_coeffs,))}
  
  pm.glm.GLM.from_formula('income ~ age + age2 + educ + hours', data, family=pm.glm.families.Binomial(), priors=my_priors)
  
  trace_logistic_model = pm.sample(2000, chains=1, tune=1000)

The data come from a Pandas DataFrame with column names age, age2, etc.

I keep getting the following error; what gives?

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-23-471a1af6a7b1> in <module>()
      5   my_priors = {"Regressor": pm.distributions.Normal('dummy', mu=0, tau=1e-2, shape=(n_coeffs,))}
      6 
----> 7   pm.glm.GLM.from_formula('income ~ age + age2 + educ + hours', data, family=pm.glm.families.Binomial(), priors=my_priors)
      8 
      9   trace_logistic_model = pm.sample(2000, chains=1, tune=1000)

/usr/local/lib/python3.6/dist-packages/pymc3/glm/linear.py in from_formula(cls, formula, data, priors, vars, family, name, model, offset)
    147         return cls(np.asarray(x), np.asarray(y)[:, -1], intercept=False,
    148                    labels=labels, priors=priors, vars=vars, family=family,
--> 149                    name=name, model=model, offset=offset)
    150 
    151 

/usr/local/lib/python3.6/dist-packages/pymc3/model.py in __call__(cls, *args, **kwargs)
    282         instance = cls.__new__(cls, *args, **kwargs)
    283         with instance:  # appends context
--> 284             instance.__init__(*args, **kwargs)
    285         return instance
    286 

/usr/local/lib/python3.6/dist-packages/pymc3/glm/linear.py in __init__(self, x, y, intercept, labels, priors, vars, family, name, model, offset)
    122             x, y, intercept=intercept, labels=labels,
    123             priors=priors, vars=vars, name=name,
--> 124             model=model, offset=offset
    125         )
    126 

/usr/local/lib/python3.6/dist-packages/pymc3/glm/linear.py in __init__(self, x, y, intercept, labels, priors, vars, name, model, offset)
     75                             priors.get(
     76                                 'Regressor',
---> 77                                 self.default_regressor_prior
     78                             )
     79                         )

/usr/local/lib/python3.6/dist-packages/pymc3/model.py in Var(self, name, dist, data, total_size)
    807                 with self:
    808                     var = FreeRV(name=name, distribution=dist,
--> 809                                  total_size=total_size, model=self)
    810                 self.free_RVs.append(var)
    811             else:

/usr/local/lib/python3.6/dist-packages/pymc3/model.py in __init__(self, type, owner, index, name, distribution, total_size, model)
   1203         if distribution is not None:
   1204             self.dshape = tuple(distribution.shape)
-> 1205             self.dsize = int(np.prod(distribution.shape))
   1206             self.distribution = distribution
   1207             self.tag.test_value = np.ones(

/usr/local/lib/python3.6/dist-packages/numpy/core/fromnumeric.py in prod(a, axis, dtype, out, keepdims)
   2561             pass
   2562         else:
-> 2563             return prod(axis=axis, dtype=dtype, out=out, **kwargs)
   2564 
   2565     return _methods._prod(a, axis=axis, dtype=dtype,

TypeError: prod() got an unexpected keyword argument 'out'

That looks more like a numpy version problem. Try upgrading numpy and scipy, then run it again.

I’m running this on Google’s Colab platform:

print(np.__version__)
1.14.6

import scipy
print(scipy.__version__)
1.1.0

I updated both numpy and scipy to the latest versions, 1.16.1 and 1.2.0 respectively (with pip). I get the same error.

Also, your way of defining the priors is not correct; see Linear Regression With Positivity Constraint.
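
In short: the values in the priors dict should be unnamed distribution objects created with .dist(), not named random variables; the "Regressor" key (visible in the traceback above) is the fallback prior applied to every coefficient. A minimal sketch:

my_priors = {"Intercept": pm.Normal.dist(mu=0, tau=1e-2),  # unnamed distribution via .dist()
             "Regressor": pm.Normal.dist(mu=0, tau=1e-2)}  # fallback for all regressors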

Thank you! This got me a bit further:

with pm.Model() as logistic_model:
  
  my_priors = {"Intercept": pm.Normal.dist(mu=0, tau=1e-2),
               "age": pm.Uniform.dist(0,100),
               "age2": pm.Uniform.dist(0,10000), # statistically not correct, but whatever
               "educ": pm.Uniform.dist(0,20),
               "hours": pm.Uniform.dist(0,120)
              }
  
  pm.glm.GLM.from_formula('income ~ age + age2 + educ + hours', data, family=pm.glm.families.Binomial(), priors=my_priors)

  trace_logistic_model = pm.sample(2000, chains=1, tune=1000)

This code now crashes with a SamplingError: Bad initial energy :frowning:
It may be that there is something wrong with the priors I’ve set. Then again, the default (Gaussian) priors (as used in the tutorial I linked to) result in the same error…

Perhaps I should code up a simpler example (less data, fewer variables) and take it from there.
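
For instance, something like this (a toy run on simulated data; toy and toy_model are names I just made up):

import numpy as np
import pandas as pd
import pymc3 as pm

# simulate a tiny logistic regression problem: logit(p) = 0.5 + 1.5*x
rng = np.random.RandomState(0)
x = rng.normal(size=200)
p = 1.0 / (1.0 + np.exp(-(0.5 + 1.5 * x)))
toy = pd.DataFrame({'x': x, 'y': rng.binomial(1, p)})

with pm.Model() as toy_model:
    toy_priors = {"Intercept": pm.Normal.dist(mu=0, sd=10),
                  "x": pm.Normal.dist(mu=0, sd=10)}
    pm.glm.GLM.from_formula('y ~ x', toy,
                            family=pm.glm.families.Binomial(),
                            priors=toy_priors)
    toy_trace = pm.sample(1000, chains=2, tune=1000)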

For completeness, these are the data statistics:

                age          educ         hours          age2        income
count  32561.000000  32561.000000  32561.000000  32561.000000  32561.000000
mean      38.581647     10.080679     40.437456   1674.599152      0.240810
std       13.640433      2.572720     12.347429   1179.047521      0.427581
min       17.000000      1.000000      1.000000    289.000000      0.000000
25%       28.000000      9.000000     40.000000    784.000000      0.000000
50%       37.000000     10.000000     40.000000   1369.000000      0.000000
75%       48.000000     12.000000     45.000000   2304.000000      0.000000
max       90.000000     16.000000     99.000000   8100.000000      1.000000

The uniform prior for age2 is way too wide… but the Bad initial energy might be due to the start-point jitter in the default initialization ('jitter+adapt_diag'), which can push the starting point into a region of near-zero likelihood. Try:

trace_logistic_model = pm.sample(2000, chains=4, tune=1000, init='adapt_diag')

Ran with "age2": pm.Uniform.dist(0,3500) (age2 := age^2) and

trace_logistic_model = pm.sample(2000, chains=4, tune=1000, init='adapt_diag')

Same error…

In that case, check the input table (make sure there are no NaNs or Infs), then check the output of logistic_model.check_test_point() and make sure there is no NaN there.
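
A sketch of those checks (assuming the data and logistic_model objects from above):

import numpy as np

assert not data.isnull().values.any()      # no NaN anywhere in the input table
assert np.isfinite(data.values).all()      # and no Inf

print(logistic_model.check_test_point())   # a NaN or -inf in this output
                                           # points at the offending variable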

I think we may be onto something…

The check assert data.isnull().values.any() == False passes.

(data[['age','educ','hours','income']] >= 0).all(1).all() returns True, so there are no negative values in those data columns.

However, here is what logistic_model.check_test_point() returns:

Intercept           0.000000
age2_interval__    -1.390000
age_interval__     -1.390000
educ_interval__    -1.390000
hours_interval__   -1.390000
y                       -inf
Name: Log-probability of test_point, dtype: float64

Most likely, after the multiplication and the link-function transformation, the linear predictor ends up at the boundary (i.e., p becomes 0 or 1); try rescaling your predictors.
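
For example, standardizing each predictor (one possible rescaling, assuming the data frame from above):

# standardize the predictors so the linear predictor stays in a reasonable range
for col in ['age', 'age2', 'educ', 'hours']:
    data[col] = (data[col] - data[col].mean()) / data[col].std()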

All right! I scaled the predictors. Now the NUTS sampler seems to work.
I’ll have a look at the traces and then report back. Many thanks!