Hi,
I was creating a logistic regression model and found that the observed 0/1 values didn’t match up with the probabilities which were supplied as parameters to a Bernoulli distribution. The code block below gives a minimal reproducible example:
import numpy as np
import pymc3 as pm
import scipy
N = 100
with pm.Model() as minimal_example:
X = pm.Normal('x',shape = N)
intercept = pm.Normal('intercept')
#intercept = 0
p = pm.Deterministic('p',pm.math.sigmoid(X + intercept))
Y = pm.Bernoulli('Y',p = p,shape = N)
samples = pm.sample_prior_predictive(samples=1)
print('Observed +1 values: ',np.sum(samples['Y']))
print('Expected +1 values: ',np.sum(samples['p']))
print('Corresponding binomial CDF:',scipy.stats.binom.cdf(np.sum(samples['Y']),np.sum(samples['p'])*2,0.5))
I ran this and got expected / observed counts of +1 values that were wildly incompatible (with corresponding binomial CDFs of nearly exactly 0 or 1). However, if I replace the intercept term above with the commented out line intercept = 0
then the predicted and observed Y values line up as expected.
TL;DR: logistic regression probabilities and observations are mismatched when sampling from a prior.