My students and I are trying to use PyMC to explore the parameter space that could produce a particular data set (in astrophysics, if it matters). One complication is that each observed data point can actually rule out a region of parameter space, i.e., \ln(\ell(\vec{\Theta})) \to -\infty for certain values of the parameters \vec{\Theta}. These “forbidden” values of \vec{\Theta} depend on the observed data and are somewhat complicated, so I can’t simply use a truncated prior for \vec{\Theta} (or at least I haven’t figured out a way to do so easily).
The Metropolis-Hastings sampler in PyMC works fine for this model, but we’re plagued with divergences when we try to use the NUTS sampler. My understanding is that this happens because of the rapid variation of \ln(\ell) near the “boundaries” of the allowed region of parameter space, which causes the Hamiltonian integrator to fail to track the trajectories to sufficient accuracy. Increasing the `target_accept` value reduces the number of divergences but does not eliminate them (at `target_accept = 0.9999`, I still get about 4% divergences), and increasing `tune` doesn’t seem to change things either.
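For reference, we’re passing those settings straight to `pm.sample`, roughly like this (trivial stand-in model, just to show where the settings go):

```python
import pymc as pm

# Minimal stand-in model; the real model has the likelihood described below.
with pm.Model():
    pm.Normal("theta", 0.0, 1.0)
    # The settings we have been varying while chasing the divergences.
    idata = pm.sample(tune=2000, target_accept=0.9999)
```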
Is it a fool’s errand to try to use Hamiltonian MCMC methods in a situation where the potential is actually divergent? Or are there “tricks” that can be used to avoid divergences when applying Hamiltonian MCMC to cases where there are “forbidden” regions of parameter space that are not known a priori? If there are, is there a good way to implement these “tricks” in PyMC?
Apologies if this is a basic question, or if I’m using the wrong terminology to talk about any of this. I’m a physicist, not a statistician or a programmer, which means I know just enough about this stuff to get myself into trouble.
Can you tell a bit more about the structure of the likelihood?
If the density really is -inf at some point, then I don’t think HMC will work properly. I’m pretty sure a posterior like that breaks some assumptions. (Whether or not it might still give you something useful is a different question that I don’t know the answer to, or even really how I’d go about answering it…)
But maybe you can instead approximate this by making sure the density just becomes small enough? I wouldn’t be surprised if that still leads to a pretty nasty posterior geometry, though.
If so, it might be worth trying out the normalizing flow adaptation (nf-adapt – Nutpie), but keep in mind that that’s still a bit experimental.
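Something like this (untested sketch; the `transform_adapt` flag and the JAX backend requirement are assumptions on my part, so double-check them against the nutpie docs since the feature is still experimental):

```python
import pymc as pm
import nutpie

# Stand-in model just to show the calls; swap in your actual model.
with pm.Model() as model:
    pm.Normal("theta", 0.0, 1.0)

# Assumption: the normalizing-flow adaptation is switched on via
# `transform_adapt=True` and needs the JAX backend.
compiled = nutpie.compile_pymc_model(model, backend="jax", gradient_backend="jax")
trace = nutpie.sample(compiled, transform_adapt=True)
```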
There are multiple models that we’re hoping to analyze, but right now we’re working on the simplest one, which involves a single parameter \Theta and no independent variables x_i. In this model, the likelihood function for a given observation y_i is
\ell = \begin{cases} \frac{\Theta^2 \sqrt{\Theta^2 - T_i^2} + (1 - \Theta^2 - T_i^2)\tan^{-1}\sqrt{\Theta^2 - T_i^2}}{\Theta^2 + 1 - T_i^2} & \Theta \geq T_i \\ 0 & \Theta < T_i \end{cases}
where T_i^2 = 1 - y_i^2. For \Theta \gtrsim T_i, we have \ell \approx \sqrt{2 T_i}\,(1 - T_i^2) \sqrt{\Theta - T_i} + \mathcal{O}\!\left((\Theta - T_i)^{3/2}\right), so \ln \ell has a logarithmic divergence as \Theta \to T_i from above.
In the simple one-parameter case, you can actually impose a truncated prior to cure this: just figure out what the largest value of T_i is among all the observations y_i, and truncate the prior for \Theta there. But we’ll eventually want to expand our analysis to cases where there are more parameters and the observations depend on independent variables x_i. In these cases, \Theta is replaced in the above likelihood function by a quantity f(\vec{\Theta}, x_i) that will depend on both the parameters and the independent variables. This means that the constraints on parameter space will be much more complicated, which is why I’m loath to just truncate the prior and call it a day.
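For concreteness, here is a minimal sketch of that one-parameter, truncated-prior version (the data values, the upper prior bound, and the variable names are just placeholders, not our actual analysis):

```python
import numpy as np
import pymc as pm
import pytensor.tensor as pt

# Placeholder data; in the real analysis y would be the observed values.
y = np.array([0.30, 0.55, 0.80])
T = np.sqrt(1.0 - y**2)   # T_i^2 = 1 - y_i^2
T_max = T.max()

with pm.Model() as one_param_model:
    # Truncating the prior at max(T_i) keeps Theta out of the forbidden region;
    # the upper bound is a placeholder.
    theta = pm.Uniform("theta", lower=T_max, upper=2.0)

    # Likelihood from above, valid because theta >= T_i for every observation.
    s = pt.sqrt(theta**2 - T**2)
    ell = (theta**2 * s + (1.0 - theta**2 - T**2) * pt.arctan(s)) / (theta**2 + 1.0 - T**2)
    pm.Potential("loglike", pt.sum(pt.log(ell)))

    idata = pm.sample()
```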
(I’m also having one of my students try adding artificial “observational noise” to the model, which would probably cure the problem by making any “impossible” observations y_i for a given value of \Theta merely “very very unlikely”. He’s making progress on it, but if we stall out on that idea I might post another thread here about how to implement it.)
Can you say more about how we would “make sure the density becomes small enough”? We did consider having the log-likelihood function return a large negative value below some critical value of \Theta via some simple case-checking. But we were concerned that if we did that, the gradient of \ln \ell would be zero in that region of parameter space and the Hamiltonian techniques wouldn’t be able to push the system back towards the typical region.
You could try adding a penalty term via a `pm.Potential`? Something like `1 / softplus(c + b * (Theta - T))`. When `Theta >> T`, this converges to zero, but when `Theta` is close to `T`, you can get an arbitrarily large likelihood penalty. `c` and `b` are free parameters to help control the size of the penalty term at zero (`c`) and the speed at which it crashes out to zero (`b`).
The advantage of something like this over something like `ell = pt.switch(pt.ge(theta, T), ell, 0)` is that it’s differentiable, so it might end up with better sampling? I’d try the switch too (assuming you haven’t).
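A rough sketch of what I mean (the prior on `theta`, the value of `T`, and the `b`/`c` constants are all placeholders you’d tune for your model):

```python
import pymc as pm
import pytensor.tensor as pt

T = 0.8            # placeholder boundary value
b, c = 50.0, 1e-3  # placeholder steepness / offset

with pm.Model() as penalized_model:
    theta = pm.Normal("theta", 0.0, 1.0)  # placeholder prior

    # Smooth, differentiable penalty: ~0 for theta >> T, and arbitrarily large
    # as theta drops toward (and below) T.
    penalty = 1.0 / pt.softplus(c + b * (theta - T))

    # pm.Potential adds its argument to the model log-probability,
    # so the penalty goes in with a minus sign.
    pm.Potential("boundary_penalty", -penalty)

    idata = pm.sample()
```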
I mean something like what @jessegrabowski proposed, or what you mentioned here:
> (I’m also having one of my students try adding artificial “observational noise” to the model, which would probably cure the problem by making any “impossible” observations y_i for a given value of \Theta merely “very very unlikely”. He’s making progress on it, but if we stall out on that idea I might post another thread here about how to implement it.)
I think the “artificial noise” approach is easier to interpret, so I’d try that one first?
I wish we had a better solution. More complicated constraints on the posterior come up from time to time, and they are mathematically well defined, but unfortunately I’m not aware of any algorithms that can really deal with them well.