Hi everyone,
I am fitting a geometric distribution to the following data:
[40000, 600, 1500, 30000, 12000, 25000, 65000, 1500, 40000, 10000000, 25000, 2000, 2000, 500, 800, 1500, 30000, 850, 25000, 1000, 15000, 40000, 9000, 3000, 12000, 1000, 1000, 1500, 10000, 25000, 7000, 35000, 30000, 25000, 750, 20000, 7000, 1500, 50000, 1250, 17000, 300, 15000, 1000, 50000, 3000, 35000, 8000, 41000]
Initially I just wanted to find the MLE, so I used sp.stats and sp.optimize:
f_geom = lambda p: -np.sum(stats.geom.logpmf(k=data, p=p))
mle_geom_parameters = optimize.minimize_scalar(
f_geom,
bounds=(0,1),
method='bounded'
)
Which results in:
fun: 653.6801753844973
message: 'Solution found.'
nfev: 25
status: 0
success: True
x: 5.9608609865491405e-06
So far so good.
Then, I wanted to do model comparison with other distributions on the same data, so I thought I would find the full posterior with PyMC3. Here’s the code:
with pm.Model() as model:
p = pm.Uniform('p', lower=0, upper=1)
observed = pm.Geometric('obs', p=p, observed=data)
logpt = pm.Deterministic('logp', model.logpt)
trace_geom = pm.sample(
cores=1,
tune=2000,
# return_inferencedata=True
)
Just to check, I compared the predicted loglikelihoods on a point:
x = 0.1
print(model.logp({
'p_interval__': np.log(x / (1-x))
}))
print(np.sum(stats.geom.logpmf(k=data, p=x)))
Which prints out:
-1134795.4133526431
-1134793.005407033
Is this just numerical error? There is a pattern in the difference between the two, namely the closer to the boundaries of the p parameter, the greater the difference:
xs = np.linspace(0, 1, 1000)
logliks_geom_pymc3 = np.array([
model.logp({
'p': x,
'p_interval__': np.log(x / (1-x))
})
for x in xs
])
logliks_geom_np = np.array([
np.sum(stats.geom.logpmf(k=data, p=x))
for x in xs
])
plt.plot(xs, np.abs(logliks_geom_pymc3-logliks_geom_np), color='blue')
I was wondering what might be causing this difference? Thank you!