Decreasing rhat for Black-Box Likelihood Model

What are some ways to improve convergence for a model with, say, 4 continuous random variables with uniform priors (with reasonable, engineering-judgment-based lower and upper bounds) and a black-box likelihood? I tried increasing the number of samples, but I’m still getting an rhat statistic larger than 1.4. Here’s the log from one of the runs:

Multiprocess sampling (4 chains in 4 jobs)
DEMetropolisZ: [param6_1, param1_1, scalar1, param2_1]

Sampling 4 chains for 50_000 tune and 50_000 draw iterations (200_000 + 200_000 draws total) took 3808 seconds.
The rhat statistic is larger than 1.4 for some parameters. The sampler did not converge.
The estimated number of effective samples is smaller than 200 for some parameters.
/utrc/home/username/.../python3.8/site-packages/pymc3/sampling.py:1944: UserWarning: The effect of Potentials on other parameters is ignored during prior predictive sampling. This is likely to lead to invalid or biased predictive samples.
  warnings.warn(
            mean     sd  hdi_2.5%  ...  ess_bulk  ess_tail  r_hat
param6_1  -0.226  0.007    -0.239  ...      11.0      20.0   1.29
param1_1   0.231  0.007     0.220  ...      51.0      92.0   1.08
scalar1   26.274  4.587    17.813  ...       7.0      22.0   1.49
param2_1  -1.544  5.160    -9.949  ...      10.0      24.0   1.34
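
For reference, the model setup is roughly like the sketch below (the bounds and the stand-in log-likelihood are only illustrative; the real one calls the external model and scores its output against the observation distributions):

import numpy as np
import pymc3 as pm
import theano.tensor as tt


def external_loglike(theta):
    # Stand-in for the real computation: run the external model at `theta`
    # and score its outputs against the observation PDFs
    return -0.5 * np.sum((theta - 1.0) ** 2)


class LogLike(tt.Op):
    itypes = [tt.dvector]  # parameter vector in
    otypes = [tt.dscalar]  # scalar log-likelihood out

    def perform(self, node, inputs, outputs):
        (theta,) = inputs
        outputs[0][0] = np.array(external_loglike(theta))


loglike_op = LogLike()

with pm.Model():
    # Uniform priors with engineering-judgment bounds (illustrative values)
    param6_1 = pm.Uniform("param6_1", lower=-1.0, upper=1.0)
    param1_1 = pm.Uniform("param1_1", lower=-1.0, upper=1.0)
    scalar1 = pm.Uniform("scalar1", lower=0.0, upper=50.0)
    param2_1 = pm.Uniform("param2_1", lower=-20.0, upper=20.0)

    theta = tt.as_tensor_variable([param6_1, param1_1, scalar1, param2_1])
    pm.Potential("loglike", loglike_op(theta))

    trace = pm.sample(
        draws=50_000, tune=50_000, step=pm.DEMetropolisZ(), chains=4, cores=4
    )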

I’ve sampled hundreds of thousands of draws with Metropolis and not gotten convergence. If it’s a hard problem to sample, it’s just going to be rough going with Metropolis; there’s no way around it.

Can you say anything more about what you’re trying to do? With no other information, I guess I could suggest using (much) more informative priors than just uniforms, but it’s hard to give useful advice unconditionally.


I can try to use more informative priors (like a normal distribution with a not-so-small standard deviation, which can be tricky to guess), but I’m hoping to avoid biasing the posterior too much with (very) informative priors. That’s a real risk, right?
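
For example, I suppose something like a truncated normal centered on the engineering guess would be a weakly informative alternative to a flat box (the numbers here are purely illustrative):

import pymc3 as pm

with pm.Model():
    # Weakly informative alternative to Uniform(0, 50): centered on the
    # best engineering guess, with a generous standard deviation
    scalar1 = pm.TruncatedNormal("scalar1", mu=25.0, sigma=10.0, lower=0.0, upper=50.0)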

The probabilistic model attempts to “adjust” parameters of an external model, from which I can’t get analytical derivatives, by using data observations. Another aggravating factor in my case may be that I have very few observations (like 8). The PyMC model computes a log-likelihood that attempts to measure how good the output of the external model is compared to the observed value (well, actually I have a specified distribution for each observed value, so I essentially use its PDF with some modifications for my purposes). See Sampler recommendation for black-box likelihood and Observed Variable as a Range for some background on why I’m doing what I’m doing. :slight_smile:

When you say “its PDF”, do you mean you have a PDF for the external model? Or do you mean you have an assumed distribution over the data, and the external model acts like a (very complicated) deterministic transformation of the parameter samples?

If it’s the latter, it might be nice to look into the simulator API (pm.Simulator) if you haven’t. It does likelihood-free parameter calibration (approximate Bayesian computation) by matching summary statistics of the simulated data to those of the observed data.
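
Very roughly, the setup looks something like this sketch (PyMC v4-style API; the PyMC3 version differs slightly, and the bounds, data, and toy simulator here are just placeholders):

import numpy as np
import pymc as pm  # PyMC v4-style imports

# Placeholder for the handful of observed values
observed = np.array([1.2, 0.9, 1.1, 1.4, 0.8, 1.0, 1.3, 1.1])


def simulate(rng, a, b, size=None):
    # Stand-in for the external model: must return an array shaped like `observed`
    return a + b * np.linspace(0.0, 1.0, observed.size)


with pm.Model():
    a = pm.Uniform("a", lower=0.0, upper=2.0)
    b = pm.Uniform("b", lower=-1.0, upper=1.0)

    pm.Simulator(
        "sim",
        simulate,
        params=(a, b),
        distance="gaussian",  # distance between summary statistics
        sum_stat="identity",  # compare the raw outputs directly
        epsilon=0.1,          # tolerance; needs tuning
        observed=observed,
    )

    idata = pm.sample_smc(draws=2000, chains=4)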

If it’s the former, is it possible to use approximate numerical derivatives? Statsmodels uses this approach to avoid trying to differentiate through the Kalman filter in its Bayesian example (example code in PyMC3 unfortunately – there are some changes to the way you would implement this now, especially w.r.t. the use of DensityDist). The way they do it in the example is a bit opaque, but the model.score function calls numdifftools. NUTS + approximate gradients might do better than Metropolis. It’s worth a try at any rate.
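
Untested, but the rough idea would be something like the sketch below: wrap the black-box log-likelihood in an Op whose grad uses numdifftools, so NUTS has gradients to work with. external_loglike is a stand-in for your simulator-based log-likelihood, and the imports follow PyMC3/Theano naming to match your log.

import numpy as np
import numdifftools as nd
import pymc3 as pm
import theano.tensor as tt


def external_loglike(theta):
    # Stand-in: run the external model at `theta` and score its outputs
    # against the observation PDFs
    return -0.5 * np.sum((theta - 1.0) ** 2)


class LogLikeGrad(tt.Op):
    itypes = [tt.dvector]
    otypes = [tt.dvector]

    def perform(self, node, inputs, outputs):
        (theta,) = inputs
        # Finite-difference gradient of the black-box log-likelihood
        outputs[0][0] = nd.Gradient(external_loglike)(theta)


class LogLikeWithGrad(tt.Op):
    itypes = [tt.dvector]
    otypes = [tt.dscalar]

    def __init__(self):
        self.grad_op = LogLikeGrad()

    def perform(self, node, inputs, outputs):
        (theta,) = inputs
        outputs[0][0] = np.array(external_loglike(theta))

    def grad(self, inputs, output_grads):
        (theta,) = inputs
        # Chain rule: upstream scalar gradient times the numerical gradient
        return [output_grads[0] * self.grad_op(theta)]


loglike = LogLikeWithGrad()

with pm.Model():
    theta = pm.Uniform("theta", lower=-10.0, upper=10.0, shape=4)
    pm.Potential("loglike", loglike(theta))
    trace = pm.sample(1000, tune=1000, cores=4)  # NUTS, now that gradients exist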

I recently had a highly nonlinear model I couldn’t get nice gradients for, and I had some success using the emcee package to do ensemble sampling. You can use PyMC to compile a likelihood function and then hand it to emcee to generate samples globally without gradients. The tradeoff is that you have to use a ton of chains (walkers).
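
A bare-bones sketch of that workflow, skipping the PyMC compilation step and writing the log-probability directly in NumPy (the bounds and the stand-in log-likelihood are placeholders):

import numpy as np
import emcee

# Hypothetical box bounds for the 4 parameters (same ones as the uniform priors)
lower = np.array([-1.0, -1.0, 0.0, -20.0])
upper = np.array([1.0, 1.0, 50.0, 20.0])


def external_loglike(theta):
    # Stand-in: run the external model and score it against the observation PDFs
    return -0.5 * np.sum((theta - 0.5 * (lower + upper)) ** 2)


def log_prob(theta):
    # Flat prior inside the box, -inf outside, plus the black-box log-likelihood
    if np.any(theta < lower) or np.any(theta > upper):
        return -np.inf
    return external_loglike(theta)


ndim, nwalkers = 4, 32  # emcee wants many walkers (well over 2 * ndim)
rng = np.random.default_rng(0)
p0 = rng.uniform(lower, upper, size=(nwalkers, ndim))

sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob)
sampler.run_mcmc(p0, 20_000, progress=True)
samples = sampler.get_chain(discard=5_000, thin=10, flat=True)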

Just some thoughts, but they might not be helpful.

Thanks for the suggestions!

I defined PDFs for each observation point. That’s why I use pm.Potential: I defined an Op that calls the external model (the simulator) and computes a log-likelihood of its output values using the PDF of each observation point.
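
Roughly, the scoring inside the Op’s perform looks like the sketch below (the observation distributions and the external-model stand-in here are only illustrative):

import numpy as np
from scipy import stats

# One distribution per observation point (illustrative parameters)
obs_dists = [
    stats.norm(loc=m, scale=s)
    for m, s in [(1.0, 0.1), (1.2, 0.1), (0.9, 0.2), (1.1, 0.1),
                 (1.4, 0.2), (0.8, 0.1), (1.0, 0.1), (1.3, 0.2)]
]


def external_model(theta):
    # Stand-in for the simulator: one output per observation point
    return np.full(len(obs_dists), np.mean(theta))


def external_loglike(theta):
    # Sum of log-densities of each simulated output under its observation PDF;
    # this scalar is what the Op returns and pm.Potential adds to the model logp
    y = external_model(theta)
    return float(sum(d.logpdf(yi) for d, yi in zip(obs_dists, y)))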

I didn’t know about pm.Simulator; I can investigate that. I’ll have to think about which distance metric to use so I can still make use of the PDF information on the observations.

Quick update: I managed to get “better” results with pm.Simulator. By better I mean that I got rhat values that are either 1.0 or very close to it, without needing an excessive number of samples. Thanks also to Aesara’s tensor switch and and_ functions, which let me model some conditions efficiently.
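
For what it’s worth, the kind of thing I mean is along these lines (toy example only):

import aesara
import aesara.tensor as at

x = at.dscalar("x")

# Apply the quadratic term only when 0 < x < 1, without a Python `if`
inside = at.and_(at.gt(x, 0.0), at.lt(x, 1.0))
y = at.switch(inside, x ** 2, 0.0)

f = aesara.function([x], y)
print(f(0.5), f(2.0))  # 0.25 0.0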


Are the estimates you get reasonable?

Well, I’m estimating 4 parameters. The posterior density plots of 3 of them show two or more peaks, i.e., they’re not concentrated on a single value each. I’m again using uniform priors.