I'm building a copula model that includes a transformation of the observed data. I have a good architecture that I'm mostly happy with, except for the Jacobian adjustment, where I keep coming up against conceptual and implementation roadblocks. This is a little related to Jacobian adjustment, and I've implemented everything so far in pymc v5.6.1.
I've several questions and will probably post more here, but will start simple…
For a 1D FreeRV which is transformed (i.e. forget about my longer term issue on transforming the observed), I believe I can apply a Jacobian transform directly, see my Notebook Gist here: 991_mre_jacobiandet_issue · GitHub
Several questions here already(!)
Is this a faithful port of Bob Carpenter’s and @junpenglao’s work? (I think it is)
Are the effects that I'm seeing reasonable? I'm actually surprised to see the Model B variants with the explicit Jacobian adjustment produce a posterior theta that's further from the empirical mean…
The jacobian=True in compile_logp refers to the transformations done to RVs that don't have support on \mathbb R. For example, the log transformation of a half-normal to change its support from \mathbb R^+ to \mathbb R. See here, for example. Model A will have a transformation, a log-odds transformation of theta. You can check these in the mdla1.rvs_to_transforms dictionary. You can manually disable the transform by passing transform=None to pm.Beta. (But I'm not sure why you would need to?)
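A quick way to see both pieces (a sketch with hypothetical variable names, not taken from the Gist):

import pymc as pm

with pm.Model() as mdl:
    theta = pm.Beta("theta", alpha=2, beta=2)                          # auto-transformed
    theta_raw = pm.Beta("theta_raw", alpha=2, beta=2, transform=None)  # left on (0, 1)

# The automatic transform (or None) assigned to each free RV
print(mdl.rvs_to_transforms)

# Model logp with and without the jacobian terms of those transforms
logp_with_jac = mdl.compile_logp(jacobian=True)
logp_no_jac = mdl.compile_logp(jacobian=False)
point = mdl.initial_point()
print(logp_with_jac(point), logp_no_jac(point))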
I am not sure what you are thinking about in terms of jacobian adjustment. In all the examples you set up, you have some more or less specified prior, and a likelihood that eventually overwhelms the prior. It doesn't really matter how you defined that prior, as long as it allows for values matching a 0.3 mean.
Where do you need a jacobian? When you are sampling a variable according to a distribution by proposing values on one space and transforming them to another. For example, if you want to sample a unit uniform variable by proposing values on the (-inf, +inf) range, you need two things:
a transform that brings your unconstrained proposal to the unit range (like a log-odds transform), and
an adjustment for the fact that uniform draws on the real line do not correspond to uniform draws after the log-odds transform (spelled out just below).
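To spell that adjustment out for the log-odds case: if z is the unconstrained proposal and x = \operatorname{invlogit}(z), then \left|\frac{dx}{dz}\right| = x(1 - x), so the log-density correction is \log x + \log(1 - x), which is exactly the term added as a Potential in the example below.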
For PyMC, jacobian terms are only relevant for priors (and added automatically), but you seem to be thinking of them in relation to posteriors?
One example where you would need it is if you wanted to sample that uniform variable on the unconstrained space, but had specified your prior as a Flat:
import arviz as az
import pymc as pm

with pm.Model() as m:
    # A very laborious way of writing x = pm.Uniform("x")
    x_raw = pm.Flat("x_raw")
    x = pm.Deterministic("x", pm.math.invlogit(x_raw))
    adj = pm.Potential("adj", pm.math.log(x) + pm.math.log(1 - x))
    idata = pm.sample()

az.plot_posterior(idata, var_names=["x"])
But note that the jacobian adjustment becomes irrelevant once the posterior distribution is mostly determined by the likelihood (there was no likelihood in the model above). All that happens without it is that I've specified a slightly different prior than the one I had in mind.
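For example (my own sketch, not part of the Gist): with a reasonably large Bernoulli dataset, the posterior of x is essentially the same whether or not the Potential is included:

import numpy as np
import pymc as pm

rng = np.random.default_rng(1)
y = rng.binomial(1, 0.3, size=500)

def fit(with_adjustment):
    with pm.Model():
        x_raw = pm.Flat("x_raw")
        x = pm.Deterministic("x", pm.math.invlogit(x_raw))
        if with_adjustment:
            # makes the implied prior on x exactly Uniform(0, 1)
            pm.Potential("adj", pm.math.log(x) + pm.math.log(1 - x))
        pm.Bernoulli("y", p=x, observed=y)
        return pm.sample(random_seed=1, progressbar=False)

idata_adj = fit(True)
idata_flat = fit(False)
# Both posteriors of x concentrate around ~0.3; the prior difference is negligible here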
@jessegrabowski Yeah, I suspected that compile_logp was being clever under the hood and applying adjustments… This ModelA1 is purely for later comparison to see if I can make the Jacobian adjustment happen in Models B1 and B2.
@ricardoV94 My intention here is to get some baseline knowledge before tackling my problems yet to come:
A 2D Jacobian adjustment for free RVs, which I can't make work
A 2D Jacobian adjustment of transformed observed data
… All quite tricky to get my head around
Just to rephrase then, is the adjustment really just “fiddling around the edges” when there’s a large amount of data?
I’ll rerun now with a smaller dataset and see what happens (and also set ModelA1 to transform=None)
Just to rephrase then, is the adjustment really just “fiddling around the edges” when there’s a large amount of data?
Yes, like any prior choice. As long as it has nonzero mass over the true value, it will converge to a normal centered around it with infinite data (and a sufficiently “nice” model).
My hunch is that you're thinking of jacobians in a setting where they matter (copulas? weakly informative likelihood?), but trying to understand them in one where they don't (highly informative likelihood over a "misspecified" prior)?
Gotcha, that makes sense. And yes that’s exactly what I’m trying to do: understand the concepts first with something easy and then progress to the copula issue
The model is exactly the same with transform=None or left as default; it's just that NUTS can't really sample well without the transform. A Slice sampler would give you the same results.
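For instance, something along these lines (a minimal example of my own, not from the Gist) samples an untransformed Beta with Slice and recovers the same posterior the transformed version gives with NUTS:

import pymc as pm

with pm.Model() as m:
    theta = pm.Beta("theta", alpha=2, beta=2, transform=None)  # stays on (0, 1)
    pm.Binomial("y", n=100, p=theta, observed=30)
    # NUTS struggles on the constrained space; Slice handles it fine
    idata = pm.sample(step=pm.Slice())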
Checking the Stan article you shared, note that what they did there is something you can’t really do in PyMC (at least not without hacking under the hood).
parameters {
  real alpha;
}
transformed parameters {
  real<lower=0, upper=1> theta = inv_logit(alpha);
}
model {
  theta ~ uniform(0, 1);
  y ~ bernoulli(alpha);
}
I don’t understand the Stan syntax completely (I would expect theta in the bernoulli, not alpha), but I think it would correspond to something like:
with pm.Model() as m:
    alpha = pm.Flat("alpha")
    theta = pm.Deterministic("theta", pm.math.invlogit(alpha))
    pm.Potential("theta_prior", pm.logp(pm.Uniform.dist(), theta))
    pm.Bernoulli("y", p=theta, observed=y)
Which is trying to give theta a Uniform prior but not quite doing it because the jacobian adjustment is missing (as in my example above).
The jacobian issue that gets discussed so often in Stan circles is a non-concern for PyMC users, because they never write models like the ones above (i.e., defining parameters as Flats, transforming them, and then adding Potentials on the transformed values as an attempt to define their priors).
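(For completeness, the idiomatic PyMC version of that model is just the following, with y standing in for whatever observed data you have; the transform and its jacobian are handled automatically.)

import pymc as pm

y = [0, 1, 0, 0, 1]              # placeholder data
with pm.Model() as m_idiomatic:
    theta = pm.Uniform("theta")  # auto log-odds transform + jacobian under the hood
    pm.Bernoulli("y", p=theta, observed=y)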
Just for good measure, I’ve included a new ModelB1 which uses pymc’s automatic transformation.
Curiously, the posteriors look higher for the Normal (Models B2 and B3) than for the Beta (Model B1), even though they all use a Jacobian adjustment. Do you reckon I can put this down to the different priors in the small-data setting? When I try this with larger data (n=1000), the differences start to disappear, though the Normal posteriors are still slightly higher.
The distinction between pm.Beta(...) and pm.Beta(..., transform=None) is irrelevant. It’s an implementation detail and you get exactly the same posterior as long as the sampler converges. This will work if you use step=pm.Slice().
The default NUTS sampler will complain without the transform and the posterior it returns is not correct. That’s obviously why we automatically transform variables under the hood.
The fact that we do, and that it requires a jacobian adjustment, is an implementation detail. All this is to say that I'm not sure those comparisons are going to help answer the questions you're struggling with.
It sounds like you're trying to break PyMC's automatic reparametrizations to understand how they work for another context, which is fine, but perhaps too indirect? You could just write down a simple Metropolis-Hastings sampler and play directly with that? Perhaps something like (link coming later)
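For example, a bare-bones random-walk Metropolis (just a sketch of my own, not the example I'll link) that targets a Beta(2, 2) variable by proposing on the logit scale; dropping the jacobian term silently changes what you end up sampling:

import numpy as np

def invlogit(z):
    return 1 / (1 + np.exp(-z))

def mh_beta_via_logit(n_draws=20_000, include_jacobian=True, seed=0):
    """Random-walk Metropolis on z = logit(x), targeting x ~ Beta(2, 2)."""
    rng = np.random.default_rng(seed)

    def log_target(z):
        x = invlogit(z)
        lp = np.log(x) + np.log(1 - x)       # unnormalised Beta(2, 2) log-density of x
        if include_jacobian:
            lp += np.log(x) + np.log(1 - x)  # log |dx/dz| for the inverse-logit
        return lp

    z, draws = 0.0, np.empty(n_draws)
    for i in range(n_draws):
        z_prop = z + rng.normal(scale=1.0)
        if np.log(rng.uniform()) < log_target(z_prop) - log_target(z):
            z = z_prop
        draws[i] = invlogit(z)
    return draws

good = mh_beta_via_logit(include_jacobian=True)   # draws look like Beta(2, 2)
bad = mh_beta_via_logit(include_jacobian=False)   # draws come out ~Uniform(0, 1) instead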
That's a fair point. As I said at the top, this is all my attempt to better understand what's going on under the hood so I can diagnose potential reasons for the issues I'm seeing elsewhere, in particular adapt_diag seeming to get stuck during init for NUTS in versions > 5.6.1.
Thanks for all this though; hopefully it's a helpful thread for others too, since the automatic transformations and Jacobian adjustments are great, but it helps to understand them explicitly.