Samples from prior appear to have wrong distribution

Hi,

I was creating a logistic regression model and found that the observed 0/1 values didn’t match up with the probabilities which were supplied as parameters to a Bernoulli distribution. The code block below gives a minimal reproducible example:

import numpy as np
import pymc3 as pm
import scipy

N = 100
with pm.Model() as minimal_example:
    X         = pm.Normal('x',shape = N)
    intercept = pm.Normal('intercept')
    #intercept = 0
    
    p = pm.Deterministic('p',pm.math.sigmoid(X + intercept))
    Y = pm.Bernoulli('Y',p = p,shape = N)
    
    samples = pm.sample_prior_predictive(samples=1)

print('Observed +1 values: ',np.sum(samples['Y']))
print('Expected +1 values: ',np.sum(samples['p']))
print('Corresponding binomial CDF:',scipy.stats.binom.cdf(np.sum(samples['Y']),np.sum(samples['p'])*2,0.5))

I ran this and got expected / observed counts of +1 values that were wildly incompatible (with corresponding binomial CDFs of nearly exactly 0 or 1). However, if I replace the intercept term above with the commented out line intercept = 0 then the predicted and observed Y values line up as expected.

TL;DR: logistic regression probabilities and observations are mismatched when sampling from a prior.

1 Like

This is likely related to https://github.com/pymc-devs/pymc3/issues/3210, which we are currently trying to fix.

Sure, I’ll keep tabs on that issue page then. Thanks!