Log likelihood plot for pymc3 model vs scipy stats distributions

I am trying my best to understand how the backend of pymc3 works.
As part of this, I have created a comparison of the log-likelihood function of an array of observed values, computed with both scipy stats' logpdf and pymc3's logp functions. The methodology is based on @junpenglao's notebook “Code 3 - Combining_Likelihood.ipynb” from his course “Advanced Bayesian Modelling with PyMC3”.

I have attached the complete file below.

If I set the parameter b sufficiently high, for example b=5e13 or higher, the log-likelihood surface from the scipy stats logpdf implementation appears to have a single peak, as expected. I checked this by finding peaks in the plot (is this a valid approach?). The log-likelihood surface from the pymc3 logp function, however, looks very “weird”, with multiple peaks. Is this a numerical issue, or am I doing something wrong here? I would have assumed that floating-point effects from the very large b parameter would be handled by the auto-transformation of my uniform priors to alpha_interval__.

[Image: peaks in the log-likelihood function of the pymc3 model, based on the model's logp function.]
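(For reference, the peak check I mean is roughly along these lines; this is a sketch using scipy.ndimage.maximum_filter, not the exact code from the attached script, and it assumes a 2D log-likelihood grid like the ll array defined further down:)

import numpy as np
from scipy.ndimage import maximum_filter

def find_local_maxima(ll, size=5):
    """Indices of local maxima in a 2D log-likelihood grid: a point counts
    as a peak if it equals the maximum of its size-by-size neighbourhood."""
    return np.argwhere(ll == maximum_filter(ll, size=size))

# Toy check on a surface with a single known peak at the centre:
x, y = np.meshgrid(np.linspace(-3, 3, 101), np.linspace(-3, 3, 101))
toy = -(x**2 + y**2)
print(find_local_maxima(toy))  # should report only the centre point [[50, 50]]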

Any help is extremely welcome

BR

log_likelihood_demo.py (5.0 KB)

Could I ask you to present a minimal example to illustrate what you are seeing that is surprising? I checked the script you attached, but there are lots of plots generated and I'm not sure which ones are relevant.

Yes, sure!
I am sampling 100 random observed values from the following function:

import numpy as np
import scipy.stats as st

b = 5e11
a = 15

def get_random_numbers(a, b, number_obs):
    # Draw `number_obs` samples from a Beta(a, b) distribution.
    return st.beta.rvs(a=a, b=b, size=number_obs)

number_obs = 100
obs = get_random_numbers(a, b, number_obs)
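(As a side note, and not part of the original script: the observations generated this way are of order a / (a + b) ≈ 3e-11, which is presumably where the floating-point concerns come from. A quick check:)

# Rough scale check of the simulated data (my addition, not in the
# attached script): the Beta(a, b) mean is a / (a + b).
print("theoretical mean:", a / (a + b))    # ~3e-11 for a=15, b=5e11
print("sample mean:     ", obs.mean())
print("sample min/max:  ", obs.min(), obs.max())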

I want to infer the parameters a and b with a simple pymc3 model:

import pymc3 as pm

with pm.Model() as m:
    alpha = pm.Uniform("alpha", a / 4, 4 * a)
    # alpha = pm.Deterministic((_alpha + lower_a) * (upper_a - lower_a))
    beta = pm.Uniform("beta", b / 4, 4 * b)
    y = pm.Beta("y", alpha=alpha, beta=beta, observed=obs)

For a sanity check I use @junpenglao's example mentioned earlier (advance-bayesian-modelling-with-PyMC3/Code3 - Combining_Likelihood.ipynb at master · junpenglao/advance-bayesian-modelling-with-PyMC3 · GitHub) and compare the log-likelihood function over the 100 observed data points between scipy stats and pymc3. For scipy stats this can be done by:

# Evaluate the summed log-likelihood of all observations on a grid of
# (alpha, beta) values around the true parameters.
av = np.linspace(a / 3, a * 3, 100)
bv = np.linspace(b / 3, b * 3, 100)
av_, bv_ = np.meshgrid(av, bv)

ll = np.array([st.beta.logpdf(obs[i], a=av_, b=bv_) for i in range(len(obs))]).sum(axis=0)

and for pymc3 this can be done by:

logp_y = y.logp

lly = np.zeros_like(ll).flatten()

point = m.test_point

# To make the likelihood plots for y work, transform the alpha and beta
# grid values into the interval-transformed space used internally by pymc3.
a_ = alpha.transformation.forward(av_).eval().flatten()
b_ = beta.transformation.forward(bv_).eval().flatten()

# Evaluate the logp of the observed variable at every grid point.
for nbr, (i, j) in enumerate(zip(a_, b_)):
    point['alpha_interval__'] = i
    point['beta_interval__'] = j
    lly[nbr] = logp_y(point)

lly = lly.reshape(av_.shape)

Plotting the variables ll and lly with imshow yields the graph below:


At first glance the two look identical.
However, the graph on the left (from scipy stats) has one peak, as expected, while the one for lly has multiple peaks along a line.
Why are the two not identical?
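(Roughly, the side-by-side plot is produced like the sketch below; this is a reconstruction rather than the exact code from the attached script, and it assumes the av, bv, ll and lly arrays defined above:)

import matplotlib.pyplot as plt

# Show both log-likelihood surfaces on a shared colour scale so the two
# panels are directly comparable.
vmin = min(ll.min(), lly.min())
vmax = max(ll.max(), lly.max())
extent = [av.min(), av.max(), bv.min(), bv.max()]

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharex=True, sharey=True)
for ax, grid, title in zip(axes, [ll, lly], ["scipy logpdf", "pymc3 logp"]):
    im = ax.imshow(grid, origin="lower", extent=extent, aspect="auto",
                   vmin=vmin, vmax=vmax)
    ax.set_title(title)
    ax.set_xlabel("alpha")
    ax.set_ylabel("beta")
fig.colorbar(im, ax=axes.ravel().tolist())
plt.show()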

Below are the multiple peaks found in lly. This is under my possible misconception that multiple peaks mean multiple solutions for the sampler.


Why they are not identical is not clear to me. However, I would be hesitant to do too much fine-grained analysis of the model landscapes using just these images. In particular, norming the colormap seems highly misleading, because I suspect that the likelihoods you are viewing here are very small; but I don't know how small they are. If you are concerned about the multiple peaks, I would want to know what likelihoods these peaks represent and how they compare to the value at the peak found via scipy, etc.
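(Something along these lines, assuming the ll and lly grids from the snippets above; this is just a sketch, not code from the thread:)

import numpy as np

# Where is the scipy peak, and what do both surfaces say there?
i, j = np.unravel_index(np.argmax(ll), ll.shape)
print("scipy log-likelihood at its peak:", ll[i, j])
print("pymc3 log-likelihood at the same point:", lly[i, j])

# How big are the discrepancies, in absolute and relative terms?
diff = np.abs(ll - lly)
print("max |ll - lly|:", diff.max())
print("max |ll - lly| relative to max |ll|:", diff.max() / np.abs(ll).max())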

I think you might be missing the fact that pymc adds a Jacobian term for the interval transform. Try passing transform=None when you define the beta variable (and then don't transform the values when evaluating the logp).

Hmm, the beta is observed, so it shouldn't have a transform.

Anyway, you might just be looking at numerical precision issues for extreme values, possibly even from the uniform interval transforms (you can pass transform=None to rule that out).
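(A minimal sketch of what that would look like for the model above, assuming the a, b, obs, av_ and bv_ variables from the earlier snippets; with transform=None there are no alpha_interval__ / beta_interval__ variables, so the grid values go in untransformed:)

import numpy as np
import pymc3 as pm

with pm.Model() as m_untransformed:
    # Disable the automatic interval transform on the uniform priors.
    alpha = pm.Uniform("alpha", a / 4, 4 * a, transform=None)
    beta = pm.Uniform("beta", b / 4, 4 * b, transform=None)
    y = pm.Beta("y", alpha=alpha, beta=beta, observed=obs)

# With no transform, the grid values can be plugged in directly.
logp_y = y.logp
point = m_untransformed.test_point
lly_untransformed = np.empty(av_.size)
for nbr, (i, j) in enumerate(zip(av_.flatten(), bv_.flatten())):
    point["alpha"] = i
    point["beta"] = j
    lly_untransformed[nbr] = logp_y(point)
lly_untransformed = lly_untransformed.reshape(av_.shape)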

Also, we used to have some issues in the beta logp that were fixed some time ago; not sure if they would still have affected you: Good results but many divergent samples: what, me worry? - #14 by aseyboldt

I am facing something seemingly similar with the Student-T distribution. My code is very similar to @BjornHartmann's, although I also added the log-likelihood function for m:

[Images of the code.]

I wrote a small script to investigate the images:

When I run the code, I get a different number of peaks, and different peak locations, on each run (when comparing ll, lly and llm). See some examples below:



At some values of the parameters of obs I get a multitude of maxima, as in @BjornHartmann's case.

I also experimented with @BjornHartmann's model with transform=None added; it seems it did not make any difference.

I played around with the b value used to generate obs, and when I reached 1e15 I got the following log-likelihood graphs (ll, lly and llm):

Any ideas?

I don’t quite get the concerns in this thread. logp with extreme values may suffer from float-precision issues, and we may also use optimized routines that exchange precision for speed. PyMC does not really care about maxima, as it is built with MCMC in mind.

In any case, if you are concerned with precision in edge cases the best thing is to compile the theano / aesara function directly and inspect the generated C-code to see exactly what is being used.
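(Roughly like this; the exact attributes for digging into the compiled function differ between theano/aesara versions, so treat it as a starting point rather than a recipe:)

import numpy as np
import pymc3 as pm
import theano

# Toy StudentT model just to get at the logp graph (illustrative only,
# not the code from earlier in the thread).
data = np.random.standard_t(df=3, size=100)
with pm.Model() as m:
    nu = pm.HalfNormal("nu", 5)
    x = pm.StudentT("x", nu=nu, mu=0.0, sigma=1.0, observed=data)

# Symbolic graph of the model's log-probability, before optimization:
theano.printing.debugprint(m.logpt)

# Compiling shows which ops survive theano's graph optimizations; the C
# implementations behind them hang off f.maker.fgraph if you need to go deeper.
f = theano.function(m.free_RVs, m.logpt)
theano.printing.debugprint(f)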

For the studentT in PyMC3, the theano expression can be found here: pymc/continuous.py at ed74406735b2faf721e7ebfa156cc6828a5ae16e · pymc-devs/pymc · GitHub

Scipy expression can be found here: scipy/_continuous_distns.py at 56039d9a766edfa38afb653ccbba0b9cff559d7b · scipy/scipy · GitHub

Ignoring the Normal approximation for infinite df, it looks quite similar. Besides the fact that it does not account for loc and scale, the only differences I see are that PyMC uses log1p in the last line and the log expression in the second line. You can try replacing it with 0.5 * (tt.log(lam) - tt.log(nu * np.pi)) to see if that leads to more stable evaluations.
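(If it helps, a quick way to probe whether that rearrangement matters numerically at the parameter values you care about, outside of theano; the values here are purely illustrative:)

import numpy as np

# Compare the combined and split forms of the StudentT normalization
# term at a few extreme (lam, nu) values; lam = 1 / sigma**2.
for lam, nu in [(1e-30, 1e15), (1e30, 1e-10), (1e16, 3.0)]:
    combined = 0.5 * np.log(lam / (nu * np.pi))
    split = 0.5 * (np.log(lam) - np.log(nu * np.pi))
    print(f"lam={lam:.0e} nu={nu:.0e}  combined={combined:.15f}  "
          f"split={split:.15f}  diff={combined - split:.2e}")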


Just to go one step beyond @ricardoV94's point that MCMC doesn't care about maxima: it doesn't want maxima. Michael Betancourt discusses this in A Conceptual Introduction to Hamiltonian Monte Carlo, section 1.4. Quoting from this paper:

The neighborhood immediately around the mode features large densities, but in more than a few dimensions the small volume of that neighborhood prevents it from having much contribution to any expectation. On the other hand, the complementary neighborhood far away from the mode features a much larger volume, but the vanishing densities lead to similarly negligible contributions to expectations. The only significant contributions come from the neighborhood between these two extremes known as the typical set.

MCMC is concerned with identifying and exploring the typical set, which is the collection of points that contribute both density and volume to the expectations (integrals) we are approximating when we run MCMC. As he notes, while the probability density is highest at the mode, in high dimensions the volume there is essentially zero, so its contribution to the integral is zero.
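(A crude numerical illustration of that point, not from the original post: for a standard normal in d dimensions, draws concentrate at radius about sqrt(d), nowhere near the mode at the origin.)

import numpy as np

rng = np.random.default_rng(0)
for d in [1, 10, 100, 1000]:
    # Distance of 10,000 standard-normal draws from the mode at 0.
    r = np.linalg.norm(rng.standard_normal((10_000, d)), axis=1)
    print(f"d={d:5d}  mean radius={r.mean():7.2f}  sqrt(d)={np.sqrt(d):7.2f}")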

He also discusses this concept (and many others) in this video, if you prefer that format. I do, and I highly recommend the video.

I am worried this all started with my plots of likelihood functions in that other thread about parameter identification, which is a very different issue than just trying to find the mode of the likelihood function. Analysis of the gradients of the likelihood function with respect to parameters and data can yield information about (local) identification. Canonical citations for this idea in the frequentist literature would probably be Rothenberg (1971) and White (1982). These use the information matrix of a normal likelihood function (which is the outer product of the gradient of the likelihood, at least in the normal case) to draw conclusions about how well models are identified.

In a Bayesian context, however, note that 1) your priors can improve identification, 2) partial pooling can improve identification, and 3) parameterization can improve identification. So it's not just a matter of simply comparing the Jacobian and Hessian matrices of your likelihood function, as in the White test. In addition, NUTS is itself a huge identification test. If your model samples slowly or diverges, it is probably poorly identified. This is a huge advantage of using NUTS: it fails loudly, and for reasons that force you to do good research.

Once again, Betancourt gets the last word with a very nice blog post, which I've linked to before, about identification in models. I think that's the most accessible place to start if you're worried about this issue.