Sp.stats and PyMC3 logps different

juststarted · December 9, 2021, 9:06am

Hi everyone,

I am fitting a geometric distribution to the following data:

[40000, 600, 1500, 30000, 12000, 25000, 65000, 1500, 40000, 10000000, 25000, 2000, 2000, 500, 800, 1500, 30000, 850, 25000, 1000, 15000, 40000, 9000, 3000, 12000, 1000, 1000, 1500, 10000, 25000, 7000, 35000, 30000, 25000, 750, 20000, 7000, 1500, 50000, 1250, 17000, 300, 15000, 1000, 50000, 3000, 35000, 8000, 41000]

Initially I just wanted to find the MLE, so I used sp.stats and sp.optimize:

f_geom = lambda p: -np.sum(stats.geom.logpmf(k=data, p=p))
mle_geom_parameters = optimize.minimize_scalar(
    f_geom,
    bounds=(0,1),
    method='bounded'
)

Which results in:

     fun: 653.6801753844973
 message: 'Solution found.'
    nfev: 25
  status: 0
 success: True
       x: 5.9608609865491405e-06

So far so good.

Then, I wanted to do model comparison with other distributions on the same data, so I thought I would find the full posterior with PyMC3. Here’s the code:

with pm.Model() as model:
    p = pm.Uniform('p', lower=0, upper=1)
    observed = pm.Geometric('obs', p=p, observed=data)
    logpt = pm.Deterministic('logp', model.logpt)
    trace_geom = pm.sample(
        cores=1,
        tune=2000,
#         return_inferencedata=True
    )

Just to check, I compared the predicted loglikelihoods on a point:

x = 0.1
print(model.logp({
    'p_interval__': np.log(x / (1-x))
}))
print(np.sum(stats.geom.logpmf(k=data, p=x)))

Which prints out:

-1134795.4133526431
-1134793.005407033

Is this just numerical error? There is a pattern in the difference between the two, namely the closer to the boundaries of the p parameter, the greater the difference:

xs = np.linspace(0, 1, 1000)

logliks_geom_pymc3 = np.array([
    model.logp({
        'p': x,
        'p_interval__': np.log(x / (1-x))
    })
    for x in xs
])

logliks_geom_np = np.array([
    np.sum(stats.geom.logpmf(k=data, p=x))
    for x in xs
])
plt.plot(xs, np.abs(logliks_geom_pymc3-logliks_geom_np), color='blue')

I was wondering what might be causing this difference? Thank you!

ricardoV94 · December 9, 2021, 10:11am

The difference probably comes from the Jacobian term to account for the parameter transformation of p

You can try to use model.logp_nojac which should be similar to the scipy results

ricardoV94 · December 9, 2021, 10:16am

Or you can disable the transform of p by using

p = pm.Uniform('p', lower=0, upper=1, transform=None)

Topic		Replies	Views
GP and MC integral approximation - which logp to use? Questions	2	562	February 12, 2019
Log likelihood plot for pymc3 model vs scipy stats distributions v3 modeling	9	1034	June 7, 2022
How to limit a Geometric Discrete Distribution in a range Questions	11	896	March 30, 2020
Maximum likelihood estimation of a Bayesian model Questions	11	5444	April 12, 2018
Calculating marginal likelihood of GP Model Questions	6	1284	December 31, 2018

Sp.stats and PyMC3 logps different

Related topics