I was creating a logistic regression model and found that the observed 0/1 values didn’t match up with the probabilities which were supplied as parameters to a Bernoulli distribution. The code block below gives a minimal reproducible example:
import numpy as np import pymc3 as pm import scipy N = 100 with pm.Model() as minimal_example: X = pm.Normal('x',shape = N) intercept = pm.Normal('intercept') #intercept = 0 p = pm.Deterministic('p',pm.math.sigmoid(X + intercept)) Y = pm.Bernoulli('Y',p = p,shape = N) samples = pm.sample_prior_predictive(samples=1) print('Observed +1 values: ',np.sum(samples['Y'])) print('Expected +1 values: ',np.sum(samples['p'])) print('Corresponding binomial CDF:',scipy.stats.binom.cdf(np.sum(samples['Y']),np.sum(samples['p'])*2,0.5))
I ran this and got expected / observed counts of +1 values that were wildly incompatible (with corresponding binomial CDFs of nearly exactly 0 or 1). However, if I replace the intercept term above with the commented out line
intercept = 0 then the predicted and observed Y values line up as expected.
TL;DR: logistic regression probabilities and observations are mismatched when sampling from a prior.