Bounded variable transformation when not using HMC?

I’ve mainly read about why it’s helpful to transform bounded variables to an unbounded space in the context of sampling with HMC.

What are the pros and cons of these transforms in the context of more traditional, non-gradient-based samplers, especially the slice sampler and the Metropolis methods we have in pymc?

Here’s a bit of motivation for what prompted me to ask this.

I’m working with a model that has a parameter that takes positive values. If I give it a prior with support on [0, inf), pymc will log-transform the variable for the sampler.

This is a black box model without gradient information :broken_heart: so we use a more traditional step method. Let’s say Metropolis-Hastings with a uniform proposal, to keep it simple. What I am struggling to think about is how the change in scale (due to the log transform) will affect the efficiency of the sampler’s proposals.

Suppose the proposal width is 1 on the log scale. Near the lower end of the parameter range, say centered at 0.51 on the log scale, the two furthest possible proposals are e^{1.01}-e^{0.01} \approx 1.73 apart in original units. But suppose we begin the M-H step a bit further up the number line, at 6.5. Then the M-H step has a range of e^{7}-e^{6} \approx 693.2 in original units. Exponential growth is fast!
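A quick numeric check of that arithmetic (plain numpy):

import numpy as np

# Span (in original units) of a width-1 uniform proposal on the log scale,
# centered at two different points
for center in (0.51, 6.5):
    lo, hi = center - 0.5, center + 0.5
    print(center, np.exp(hi) - np.exp(lo))
# 0.51 1.7355...   (~1.73)
# 6.5  693.2044... (~693.2)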

Perhaps this makes it easier to explore the posterior close to zero. But I’ve also run into an issue with this in some models. Sometimes the step method (metropolis or slice) will propose values that are so large (in original units) that they result in numerical issues or exceptions when passed to the black-box likelihood.

Now, perhaps this could be addressed by setting a tighter prior over the positive parameter. And perhaps one could also modify the black-box likelihood to return logp=-inf or so if it encounters really extreme parameter values. However, the issue is also prevented by disabling the log-transform, and instead sampling on the original scale.
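For reference, disabling the automatic transform is just a keyword argument on the distribution:

import pymc as pm

with pm.Model():
    # Sampled directly on [0, inf) instead of on the log scale
    x = pm.HalfNormal("x", sigma=5, transform=None)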

Is this a case where the untransformed space could be preferable? What would I be missing out on?

n.b. I am still trying to make a minimum working example that only uses pymc and demonstrates the extreme parameter value issue, but I haven’t managed it yet…


A related example that also gets at what I am grappling with is: perhaps I know that a reasonable proposal distribution for my particular problem in original units is U(-b, b). But I don’t know of any good way to directly translate this knowledge to the log-transformed regime without it potentially getting very distorted.
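For instance, a step of +b in original units corresponds to a log-scale step of log(1 + b/x), which depends strongly on the current value x:

import numpy as np

# Log-scale step equivalent to moving +b in original units,
# as a function of the current value x
b = 1.0
for x in (0.5, 2.0, 50.0):
    print(x, np.log1p(b / x))
# 0.5 1.0986...
# 2.0 0.4054...
# 50.0 0.0198...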

Tuning should allow NUTS to find a reasonable step size near the posterior. Are you seeing failures during tuning? Setting unstable values to -inf might work, but if your likelihood fails with large values it seems the prior is at fault here.

You can also try to tweak the initial point, but I don’t think that would suffice for your case.

Not important, but the smallest step size with a proposal of 1 is actually smaller, because the proposal can now go into the negative domain: it would be e^{-700} - e^{-701} (exponentiation underflows to zero around e^{-746} in float64, so you can’t go much more negative).
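For example:

import numpy as np

print(np.exp(-700.0))  # ~9.86e-305
print(np.exp(-745.0))  # 5e-324, the smallest subnormal double
print(np.exp(-746.0))  # 0.0 -- underflow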


Thanks @ricardoV94 :slight_smile:
In this case I cannot use NUTS, unfortunately. But still, you’re right: tuning for Slice or Metropolis or what have you should also do a fine job of adjusting the slice width or proposal variance. To that end, I’m wondering how to think about which cases make a transformed variable easier, and which make it harder, for the proposal method. I am curious about this theoretical question in general…

But in my case, tuning is indeed when the black-box likelihood fails. Here’s a MWE using that black-box likelihood, but it requires the external library pyddm.

model code
import arviz as az
import numpy as np
import pyddm
import pymc as pm
import pytensor
import pytensor.tensor as pt

# Generate our "observed" data,
# 500 reaction time samples from a model with linear collapsing bound of bslope=1
samp = pyddm.Model(bound=pyddm.BoundCollapsingLinear(B=1, t=1)).solve().resample(500)

# Create op for the log likelihood (to be passed to a Potential). The only free parameter
# of this model is "t", the slope of the collapsing decision bound.
m = pyddm.Model(bound=pyddm.BoundCollapsingLinear(B=1, t=pyddm.Fittable()))
@pytensor.compile.ops.as_op(itypes=[pt.scalar().type], otypes=[pt.scalar().type])
def ddm_lik_op(bslope):
    m.set_model_parameters((bslope,))
    neg_logp = pyddm.get_model_loss(m, sample=samp, lossfunction=pyddm.LossRobustLikelihood)
    return np.array(-neg_logp)

# pymc model
with pm.Model() as model:
    bslope = pm.HalfNormal("bslope", sigma=5)
    obs = pm.Potential("obs", ddm_lik_op(bslope))

The log-likelihood given the synthetic data looks like this:
[figure: log-likelihood as a function of bslope]

It’s a drift-diffusion model with a linear collapsing bound. The external library is solving a stochastic differential equation and some other things to get us the model likelihood. With its default settings for numerical precision, the likelihood will throw an exception for bslope>=200.

Running sampling on the model above, an exception is thrown in tuning because (for some reason I’m still trying to understand exactly) really large values get proposed for bslope. This occurs reliably for all of Slice, Metropolis, and DEMetropolisZ. But I’ve been mostly working with slice sampling.

with model:
    idata = pm.sample(step=pm.Slice(), chains=1, cores=1)

The issue is avoided if I set a tighter prior. sigma=1 seems to mostly work. However, I’m surprised that any sigma much larger than 1 is an issue. HalfNormal(sigma=5) has a 94% HDI of somewhere around [0.00…, 9.4]. It’s unexpected to me that values as high as 200 should be at all probable (and in this case, the proposals are often many orders of magnitude larger, sometimes overflowing to inf).
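A quick check of that figure with scipy:

from scipy import stats

# 94% of the HalfNormal(sigma=5) mass lies below about 9.4
print(stats.halfnorm.ppf(0.94, scale=5))  # ~9.404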

Further, sampling seems to have a much easier time if I don’t transform the bslope variable. Even with sigma=25, transform=None gives no problems during sampling/tuning! And thus my question…


FWIW it's also surprising to me that the exception has been occurring for me *only* during tuning. Setting `tune=0` in `pm.sample()` has more or less avoided the issue. I'm not sure if this is a red herring or not.

Also, indeed, setting a small initval helps a little bit but doesn’t avoid the issue. It mostly just delays how many steps occur before the proposal value shoots up.


Which sampler are you using? Setting tune=0 might mean the initial proposal size is not too bad (and it never had the opportunity to explore something worse… or better). If your step sampler is not too complicated (Metropolis or Slice) you can hand-tune the proposals yourself. For NUTS that would be quite an enterprise :smiley:
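For example, with the model above (a rough sketch: pm.Slice exposes the initial width w and a tune flag, and pm.Metropolis similarly exposes scaling):

import pymc as pm

with model:
    # Fix the slice width by hand and turn its tuning off;
    # pm.Metropolis(scaling=..., tune=False) works analogously
    idata = pm.sample(step=pm.Slice(w=0.5, tune=False), chains=1, cores=1)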

And yes, it may be fine to disable the transforms. The defaults are there mostly for use with NUTS, which cannot cope at all with sharp boundaries in the logp. Other samplers are much more okay with that, and it may justify not using a default transform. There is nothing “sacred” about them.

I think you are not veering anywhere too dangerous. Just be careful to run many chains from different initial points to confirm they converge well. You lose a lot of “warning diagnostics” by not using NUTS, so it’s easier to be blindsided.
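For example (a sketch; the initial values here are just illustrative):

import arviz as az
import pymc as pm

with model:
    # Overdispersed starting points; check convergence by hand since the
    # NUTS-specific diagnostics are unavailable
    idata = pm.sample(
        step=pm.Slice(),
        chains=4,
        initvals=[{"bslope": v} for v in (0.1, 0.5, 1.0, 2.0)],
    )
print(az.rhat(idata))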


Thanks for the helpful discussion @ricardoV94. Really appreciate the help and insight on automatic transforms. (FWIW I’m mainly using the slice sampler.)

I’ve come up against an issue when working with non-transformed models that do have bounded variables. It follows directly from this thread, so I thought I’d write here, but I’m happy to take the conversation somewhere else if that’d be preferred.


I’ve noticed that if a variable has a bounded distribution and is not transformed, the sampler may sometimes send out-of-bounds parameter values to the likelihood function. In particular what seems to be going on is,

  1. Step method proposes, or tests, a parameter value that is outside the bounds of that variable (this is fine and expected sometimes)
  2. pymc/pytensor calculates that the logp of that particular factor is -inf
  3. But pymc still cranks out the full sum of model logp terms, which means that the out-of-bounds value (whose factor logp is -inf) still gets passed on to the model likelihood term (or, in more complex models, the impossible value of a hyperprior gets passed on to some other node in the DAG)

This is not an issue when the default transforms are used, because any real number the step method proposes gets mapped back into that variable’s bounds. In my case, this basically means that I run up against the same exception from my likelihood function as we were discussing above.

With the built-in pymc distributions, it looks like the logp functions use switch to ensure that they are robust to any parameter value. But in my case, the black-box likelihood may fail if parameter values are too extreme. I had thought that specifying bounded prior distributions would be sufficient to ensure that out-of-bounds values do not get passed along the DAG when computing model logp, but currently this is not the case.

Here’s a MWE of what I mean:

import numpy as np
import pymc as pm
import pytensor.tensor as pt
from pytensor.raise_op import Assert

# Simple regression on one slope term. Let's enforce that the slope variable
# remains within the following bounds.
bounds = (-2, 2)

# Data generation
X = np.random.uniform(size=1000)
slope = 1
assert bounds[0] <= slope <= bounds[1]
Y = slope * X + np.random.randn(1000) * 0.1

# Model using an extended Normal logp that raises AssertionError if its
# coefficient is outside of the enforced bounds
def regression_logp(x, y, beta, sigma):
    b = Assert("unexpected beta!")(
        beta,
        pt.all([pt.ge(beta, bounds[0]), pt.le(beta, bounds[1])]),
    )
    val = x * b
    return pm.Normal.logp(value=y, mu=val, sigma=sigma)

with pm.Model() as model:
    b = pm.TruncatedNormal("b", mu=0, sigma=5, lower=bounds[0], upper=bounds[1], transform=None)
    y_obs = pm.Potential("y_obs", regression_logp(x=X, y=Y, beta=b, sigma=0.1))

    # Sampling will fail due to AssertionError. This occurs with any step method.
    # But sampling doesn't fail if we keep the default transform on b.
    idata = pm.sample()

I’m wondering if it could be reasonable to modify the behavior of the pymc model’s logp() function. What if, rather than calculating all of the logp terms (in this line), we stop if any of the terms is -inf? I’m not sure if this would have any bad side-effects that I can’t think of right now. Or how it should be implemented: would we want a new pytensor accumulator Op that stops if a certain condition is met?

What you are describing is part of why we use transformations: to not propose useless or invalid values in the first place. Our step samplers are mostly general and distribution-agnostic, so the best they can do is provide values and check if they are valid.

The only case where this has shown to be somewhat problematic is when there are random indexing operations and the logp function may try to index an out-of-bounds value. That can usually be sorted out by adding an explicit guard; you could probably do the same:

with pm.Model() as m:
    sigma = pm.HalfNormal("sigma", transform=None)
    safe_sigma_value = 1
    safe_sigma = pm.math.switch(sigma < 0, safe_sigma_value, sigma)
    pm.Potential(..., foo(safe_sigma, ...))

That way negative sigmas will still be rejected, but it won’t lead to any exceptions in foo.

I am not sure what you’re proposing with interrupting when a logp term is -inf; our non-NUTS samplers do need to propose and observe invalid logps.

You can also use a non-default transform of your liking.

Perhaps if you share the exact likelihood that is giving you trouble we can help more.

Right, this is making sense. Ultimately it may be more effective to fix this on the likelihood side (adjust the backend library to return -inf, or raise a specific exception, if any of the possible extreme-parameter cases are passed to it).
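Something like this, reusing the op from the MWE above (catching the broad Exception only as a sketch; catching the specific pyddm exception would be cleaner):

@pytensor.compile.ops.as_op(itypes=[pt.scalar().type], otypes=[pt.scalar().type])
def ddm_lik_op(bslope):
    try:
        m.set_model_parameters((bslope,))
        neg_logp = pyddm.get_model_loss(
            m, sample=samp, lossfunction=pyddm.LossRobustLikelihood
        )
        return np.array(-neg_logp)
    except Exception:
        # Reject the proposal instead of crashing sampling
        return np.array(-np.inf)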

The switch statement is cool. I’m hoping to be able to provide more generalized advice on working with this class of “generalized” DDM models. GDDMs can take arbitrary (even user-defined) extensions to their parameterizations, which would mean a different switch statement for each model.

What I was meaning with the “interrupting” is something like this:
Let’s say the “model logp” is the sum of the logp of its “factors,” where each “factor” is an RV or a potential. As best I can tell, pymc calculates the model logp as sum([sum(factor) for factor in logp_factors]), where logp_factors is ordered in a particular way.
So if any term of the sum is -inf, the total sum will also be -inf.* Why, then, do we need to keep calculating the rest of the terms in the list comprehension once an earlier term has evaluated to -inf?

I think this makes sense in terms of being statistically well-founded (the step methods would still get -inf for the logp in what I’m asking)… but maybe I’m wrong. I’m also not sure if there are other considerations: implementing this in a pytensor Op, whether the Sum accumulator necessarily computes the sum serially and in the order the terms are passed, etc…

* well an edge case would be if one term is -inf and another is inf, then it’s not exactly obvious what we want to return. In numpy this gives a nan and maybe a warning. Not sure if we would experience this case ever or what would happen…
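Numpy’s behavior for that edge case:

import numpy as np

print(np.sum([-np.inf, 1.0]))     # -inf
print(np.sum([-np.inf, np.inf]))  # nan, with a RuntimeWarning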

Yeah, we can’t abort the sum early like that, exactly because of undefined values like inf - inf or inf + nan, which must return nan.

Also accommodating these special cases would probably generate slower machine code due to the introduction of branching. This would be undesirable, because we expect the special branching to not be needed in the majority of cases.

If you really wanted, PyTensor has a lazy IfElse that could be used to that effect, but the switch would probably still be faster.
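For illustration, a rough sketch (prior_logp and expensive_likelihood_logp are hypothetical stand-ins for real graph variables; ifelse only evaluates the branch it needs):

import numpy as np
import pytensor.tensor as pt
from pytensor.ifelse import ifelse

# Hypothetical scalar logp terms standing in for a real model graph
prior_logp = pt.scalar("prior_logp")
expensive_likelihood_logp = pt.scalar("expensive_likelihood_logp")

# The likelihood branch is only computed when the prior term is not -inf
total_logp = ifelse(
    pt.eq(prior_logp, -np.inf),
    np.array(-np.inf),
    prior_logp + expensive_likelihood_logp,
)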
