Strange looking Metropolis acceptance rate

Hello!
While trying to implement a specific model, I noticed that the Metropolis step seemed to never move.
I sneaked into the pymc3 source code to add some prints and saw that delta_logp was always very low.
I’m working with a black-box likelihood through pm.Potential, and when printing my log-likelihoods, I shouldn’t be getting such low delta_logp values, as far as I understand…
Here is a quick example of the output:

Sequential sampling (2 chains in 1 job)
Metropolis: [law]
Sampling chain 0, 0 divergences:   0%|          | 2/1100 [00:00<03:10,  5.77it/s]
returned log-likelihood: [[array(-29.67120048)]]
returned log-likelihood: [[array(-29.65408013)]]
delta_logp: -6.347108208972685
Accepted ? False
returned log-likelihood: [[array(-29.75667088)]]
returned log-likelihood: [[array(-29.65408013)]]
delta_logp: -9.220555196247574
Accepted ? False
Sampling chain 0, 0 divergences:   0%|          | 4/1100 [00:00<02:37,  6.94it/s]
returned log-likelihood: [[array(-29.97665752)]]
returned log-likelihood: [[array(-29.65408013)]]
delta_logp: -10.00953635258783
Accepted ? False
returned log-likelihood: [[array(-29.70574403)]]
returned log-likelihood: [[array(-29.65408013)]]
delta_logp: -8.004581846022162
Accepted ? False
Sampling chain 0, 0 divergences:   1%|          | 6/1100 [00:00<02:20,  7.76it/s]
returned log-likelihood: [[array(-29.62658475)]]
returned log-likelihood: [[array(-29.65408013)]]
delta_logp: -9.647356751319833
Accepted ? False

Do these numbers seem fine to you? If so, could you please explain how such small log-likelihood differences can produce such delta_logp values?
Thanks a lot in advance.

What is the shape of your variables in the Metropolis stepper? We fixed Metropolis in version 4 to work better with non-scalar variables: Make Metropolis cope better with multiple dimensions by ricardoV94 · Pull Request #5823 · pymc-devs/pymc · GitHub

Thanks for your quick answer!
My free variable has dimension 45. Could that be the problem?

What I really don’t get is how delta_logp can be around -8 to -10 while the log-likelihoods passed to pm.Potential are really close (-29.6 to -29.9).

Looking forward to your reply! :slight_smile:

Definitely, I suggest you try to update to v4 and see if it helps.

Perhaps due to the prior logp of the unobserved variables?
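
You can check this directly. With the v3 API, something along these lines prints the joint logp that delta_logp is built from, together with each free variable’s own prior term (note that a bounded Uniform is sampled on an unbounded transformed space, so its contribution is not constant across points):

with model:  # your model context
    pt = model.test_point
    # the joint model logp (prior terms + your Potential);
    # delta_logp is the difference of this quantity at two points
    print("joint logp:", model.logp(pt))
    # each free variable's prior term, evaluated on the transformed space
    for rv in model.free_RVs:
        print(rv.name, rv.logp(pt))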

Thanks again for your input.
I’ll give v4 a try whenever I can, but we are quite rigid in our work environment…
Using version 3.8, I get the output printed above with code as simple as this:

import pymc3 as pm
import theano.tensor as tt

with pm.Model() as model2:

    law = pm.Uniform("law", lower=bounds_low, upper=bounds_up, shape=45)

    var_tt = tt.as_tensor_variable(law)

    pm.Potential("likelihood", logl(var_tt))

    trace = pm.sample(1000,
                      step=pm.Metropolis(),
                      tune=100,
                      cores=1)
As you can see, my prior is Uniform, so I cannot see why delta_logp is so low that it prevents any posterior sampling.

Best regards! :slight_smile:

The problem is that Metropolis tries to update all the dimensions at once. The chance of proposing a 45-dimensional jump whose joint probability is not minuscule is very small. That’s why we patched it in the recent version to attempt one jump per dimension at a time.
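
To get an idea of the orders of magnitude (a purely illustrative sketch; the 0.2 per-coordinate figure is made up):

import numpy as np

# suppose each of the 45 jointly-proposed coordinates costs ~0.2 in logp
# on average (prior + likelihood combined)
per_dim_loss = 0.2
n_dims = 45

delta_logp = -per_dim_loss * n_dims  # ~ -9, the order of magnitude printed above
accept_prob = np.exp(delta_logp)
print(accept_prob)                   # ~1.2e-4: the proposal is essentially never accepted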

If you cannot upgrade to v4, you can try to 1) patch/recreate the Metropolis stepper to do one update per dimension, or 2) rewrite your model with independent scalar priors that are later stacked, so that sampling still happens dimension-wise:

# in pymc3 (v3) use tt.stack instead of at.stack; index the bounds
# per element if bounds_low/bounds_up are length-45 vectors
law = at.stack([
    pm.Uniform(f"law_{i}", bounds_low[i], bounds_up[i])
    for i in range(45)
])

By the way, what is the reason you need to use Metropolis? The best solution is usually not to use it :stuck_out_tongue:

Haha, sure!
But on this test case, defining everything within pymc3 results in “inf or nan in cholesky”, so I tried a black-box likelihood. From there, pm.Slice stayed stuck at 0 iterations, so I then tried pm.Metropolis, whose low acceptance rate got me confused.
This is the roughest of explanations, but if you are willing to give me some advice on the whole code/process, I can post it and would be utterly grateful :grin: I’m not sure I’m tackling this the right way…

It sounds like you have more fundamental problems with your model. I would suggest trying to investigate those instead of “downgrading” the sampler in the hope that the dumber one will complain less. Even if it does, it will probably not be doing a great job at sampling.


My goal here is to find an input law (of size 45) such that f_i(law) = ref_i for all 11 functions f_i simultaneously. However, these functions f_i are unknown and are estimated with Gaussian processes.
In short, given a database of \{law_k,f(law_k)\}, where f gathers all the f_i, we want to:

  • construct GP surrogates of these functions,
  • find the law whose output f is close to the reference, taking GP uncertainty into account.
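
Writing \mu_i and \sigma_i for the GP predictive mean and standard deviation, the posterior I am targeting is

p(law \mid ref) \;\propto\; p(law) \prod_{i=1}^{11} \mathcal{N}\big(ref_i \mid \mu_i(law), \sigma_i^2(law)\big),

which is exactly what the black-box log-likelihood below computes.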
Here is my code, with the model built in pymc3.

Note that when the “fully_bayesian” flag is False, the hyperparameters of a previously built GP are plugged in. Otherwise, I’m well aware that 11×45 lengthscales is a lot to manage…

import numpy as np
import pymc3 as pm
import arviz as az
import matplotlib.pyplot as plt

fully_bayesian = False

with pm.Model() as model:

    law = pm.Uniform("law", lower=b_low, upper=b_up, shape=45)

    # append the unknown law and the reference output to the training data
    laws_all = pm.math.concatenate((inputs, [law]))
    temp_all = np.concatenate((data, [ref]))

    for i, m in enumerate(GPs):

        if fully_bayesian:

            sigma = pm.HalfNormal(f"sigma_{i}", 1)
            l_scale = pm.Gamma(f"l_scale_{i}", alpha=1, beta=5, shape=45)
            s_n = pm.HalfNormal(f"s_n_{i}", 1)

        else:

            # hyperparameters taken from a previously fitted GP model
            sigma = float(m.kern.variance[0])
            l_scale = list(m.kern.lengthscale)
            s_n = float(m.Gaussian_noise.variance[0])

        cov = pm.gp.cov.Matern52(45, l_scale)

        gp = pm.gp.Marginal(cov_func=sigma * cov)

        y = gp.marginal_likelihood(f"y_{i}",
                                   X=laws_all,
                                   y=temp_all[:, i],
                                   noise=s_n)

    trace = pm.sample(1000,
                      cores=1,
                      tune=100)

    az.plot_trace(trace)
    plt.show()

However, this produces an error, either “inf or nan in cholesky” or “bad initial energy”, even when I limit the scope to a single function f_i.
So I tried solving this with a black-box likelihood, defined here:

import numpy as np
from scipy.stats import norm
import theano.tensor as tt

def log_likelihood(x):
    # GP(x): surrogate predictive means, GP.std(x): predictive standard deviations
    return np.sum(norm(GP(x), GP.std(x)).logpdf(ref), axis=-1)

class LogLike(tt.Op):

    itypes = [tt.dvector]  # input: a vector of parameter values
    otypes = [tt.dscalar]  # output: a scalar log-likelihood

    def __init__(self, loglike):
        self.loglikelihood = loglike

    def perform(self, node, inputs, outputs):
        (variables,) = inputs

        logl = self.loglikelihood(variables)

        outputs[0][0] = np.array(logl)


logl = LogLike(log_likelihood)
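
As a quick sanity check, the Op can be evaluated outside of any model (x0 being some hypothetical admissible point):

x0 = np.full(45, 0.5)
print(logl(tt.as_tensor_variable(x0)).eval())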
 
with pm.Model() as model2:

    law = pm.Uniform("law", lower=b_low, upper=b_up, shape=45)

    var_tt = tt.as_tensor_variable(law)

    pm.Potential("likelihood", logl(var_tt))

    trace = pm.sample(1000,
                      tune=100,
                      cores=1)

    az.plot_trace(trace)
    plt.show()

But as said above, the Slice sampler doesn’t move and the Metropolis sampler never accepts any proposal, regardless of how close the likelihoods are (I tried making the bounds very close to each other to get a nearly constant likelihood, and Metropolis still didn’t move).

Does the process seem fine to you? Am I missing something fundamental?

Again, thanks a lot to you and all the team for all the great work and your quick answers :slight_smile:


My experience with problems that have many parameters is that sampling was very difficult unless I started with initial values relatively close to the modal values. When I let the sampler choose initial values from the independent priors, I started far away from the mode in a very large space, and it was quite difficult for the sampler to find the correct direction, because everything was so flat in those parts of the space that noise overwhelmed any true gradient.
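
For instance, with pymc3, something along these lines (just a sketch):

with model:  # your model
    start = pm.find_MAP()  # start the chains near the posterior mode
    trace = pm.sample(1000, tune=100, cores=1, start=start)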

Opher

Hello! Thanks for your input :slight_smile:
I have also experienced this kind of behavior, and here I tried starting at the MAP. Yet it did not seem to improve much, even when I shrank the prior Uniform bounds by a lot (which is where I got a zero acceptance rate with Metropolis while the posterior was basically flat).
On my side, I still don’t understand what’s happening here…
@ricardoV94, I tried with 45 separate variables, but it did not change the behavior. I don’t understand why you mention a “minuscule joint probability” for the proposal. If the proposal is symmetric, it shouldn’t appear in the acceptance rate calculation, right?

Thanks again for your invaluable help!

The proposed point is the one with a minuscule joint probability, not the proposal itself (which, as you said, is not even included in delta_logp, due to the cancellation).
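
Schematically, for a symmetric proposal the acceptance rule reduces to this (a sketch, not pymc’s actual code):

import numpy as np

rng = np.random.default_rng()

def metropolis_accept(delta_logp):
    # delta_logp = joint model logp (priors + Potential) at the proposed point
    # minus the same quantity at the current point; the symmetric proposal
    # density has already cancelled out of the ratio
    return np.log(rng.uniform()) < delta_logp

print(metropolis_accept(-9.0))  # True with probability exp(-9), so almost always False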