Truncated inverse Gaussian distribution (also known as the Wald distribution)

I am testing a model that describes a set of observations that follow an inverse Gaussian/Wald distribution that is right-truncated (i.e., the support is 0\leq x \leq a). As far as I understand, I cannot use the Bound method already shipped with PyMC3 for the likelihood distribution. I found this example on censored data, but I have to admit it is not super clear to me what is going on (disclaimer: I can hack my way around, but I do not have a strong foundation in probability, unfortunately). Moreover, I believe truncation and censoring are two different things (I do not know the points that were censored).

From the Stan manual I found that an upper-bounded pdf is defined as p_{[-\infty, b]}(x)=\frac{p(x)}{\int_{-\infty}^{b}p(u)\,du}, where the denominator is the CDF evaluated at the bound, F_X(b) (which acts as a normalizing factor). In Stan code, an upper-bounded likelihood (e.g., \mathcal{N}(0, 1) restricted to the interval -\infty < x \leq 2.1) is defined as:

target += normal_lpdf(y | 0, 1);
if (y > 2.1)
    target += negative_infinity();
else
    target += -normal_lcdf(2.1 | 0, 1);

The Stan code is quite understandable. Can a PyMC3 truncated distribution be defined like that (given that I would actually like to work with an inverse Gaussian distribution instead of the normal one)?

Yes, they are different: censored data usually involve an integral of the tail area. Intuitively, you do not have observations beyond the censoring point, so you take the expectation by integrating over the tail area.

The equivalent of the Stan code above is:

with pm.Model() as model:
    pm.Normal('obs', 0., 1., observed=y)
    # Potential works the same as target += ... in Stan; tt.switch works the same as the if/else
    pm.Potential('upper', tt.switch(tt.gt(y, 2.1), -np.inf, -normal_lcdf(0., 1., 2.1)))
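
To make that concrete, a minimal self-contained sketch could look like the following (toy data, arbitrary priors, and a plain log-CDF helper written straight from the textbook formula; a numerically stabler helper is quoted further down the thread):

import numpy as np
import pymc3 as pm
import theano.tensor as tt

def normal_lcdf(mu, sigma, x):
    # plain log-CDF of the normal distribution; fine away from the far tails
    return tt.log(0.5 * tt.erfc(-(x - mu) / (sigma * tt.sqrt(2.))))

upper = 2.1
y = np.random.randn(500)   # toy data from N(0, 1)
y = y[y <= upper]          # keep only the observations below the truncation point

with pm.Model() as truncated_model:
    mu = pm.Normal('mu', mu=0., sd=10.)
    sigma = pm.HalfNormal('sigma', sd=10.)
    pm.Normal('obs', mu=mu, sd=sigma, observed=y)
    # one term per observation; PyMC3 sums the potential, so this amounts to
    # subtracting len(y) * normal_lcdf(mu, sigma, upper) from the total logp
    pm.Potential('upper',
                 tt.switch(tt.gt(y, upper), -np.inf, -normal_lcdf(mu, sigma, upper)))
    trace = pm.sample(1000)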

Thank you @junpenglao!

I created a test notebook following your suggestions, and it seems everything is sort of working. In this notebook, for example, I am trying to recover \mu and \sigma of a truncated normal distribution.

I have some doubts though. The observations in the notebook were drawn from a normal distribution \mathcal{N}(0, 3), which was truncated at upper=5. The posterior distribution seems to fit the observations well.


However, the HPD intervals of the marginal posteriors for \mu and \sigma do not contain the true values 0 and 3, respectively.


Am I missing something?

EDIT: there was a mistake in the Theano function: I did not compute the log. I fixed it, but the problem remains.

I didn't check your implementation of the logcdf, but you should use the one from censor_example.py:

import theano.tensor as tt

def normal_lcdf(mu, sigma, x):
    z = (x - mu) / sigma
    return tt.switch(
        tt.lt(z, -1.0),
        tt.log(tt.erfcx(-z / tt.sqrt(2.)) / 2.) - tt.sqr(z) / 2.,
        tt.log1p(-tt.erfc(z / tt.sqrt(2.)) / 2.)
    )

OK, I used the implementation in the example and it gives the right answer.
And then I realized I made a stupid mistake: I used tt.sqr instead of tt.sqrt. Gosh. With that fixed, the results I got with my implementation are the same. Good!

I actually prefer my implementation, because it follows the standard notation that one can find on Wikipedia. Therefore one can apply this method to a range of different distributions (in fact, now I will try to apply it to the inverse Gaussian distribution).
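
For the inverse Gaussian, that would mean something along these lines (a rough sketch, written straight from the Wikipedia CDF formula and not numerically hardened at all):

import theano.tensor as tt

def wald_lcdf(mu, lam, x):
    # textbook CDF of the inverse Gaussian / Wald distribution:
    #   F(x) = Phi(sqrt(lam/x) * (x/mu - 1)) + exp(2*lam/mu) * Phi(-sqrt(lam/x) * (x/mu + 1))
    # where Phi is the standard normal CDF; note that exp(2*lam/mu) can overflow for large lam/mu
    def std_normal_cdf(z):
        return 0.5 * tt.erfc(-z / tt.sqrt(2.))
    a = tt.sqrt(lam / x) * (x / mu - 1.)
    b = -tt.sqrt(lam / x) * (x / mu + 1.)
    return tt.log(std_normal_cdf(a) + tt.exp(2. * lam / mu) * std_normal_cdf(b))

The potential would then use -wald_lcdf(mu, lam, upper) in place of -normal_lcdf(mu, sigma, upper).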

Anyway, I got interested in the implementation you suggested. I do not understand the code in the switch statement:

return tt.switch(
        tt.lt(z, -1.0),
        tt.log(tt.erfcx(-z / tt.sqrt(2.)) / 2.) - tt.sqr(z) / 2.,
        tt.log1p(-tt.erfc(z / tt.sqrt(2.)) / 2.)
    )

from Wikipedia: let z=\frac{x-\mu}{\sigma}; then the CDF is F(x)=\Phi(z)=\frac{1}{2}\left[ 1 + \operatorname{erf}\left( \frac{z}{\sqrt{2}} \right) \right] = \frac{1}{2}\operatorname{erfc}\left( -\frac{z}{\sqrt{2}} \right) = 1 - \frac{1}{2}\operatorname{erfc}\left( \frac{z}{\sqrt{2}} \right). In code, that last form should be the last line of the switch statement:

tt.log1p(-tt.erfc(z / tt.sqrt(2.)) / 2.)

I do not understand why we are using a switch statement based on the value of z, though, or what the second branch of the switch statement means:

tt.log(tt.erfcx(-z / tt.sqrt(2.)) / 2.) - tt.sqr(z) / 2.

I am not exactly sure why it is implemented this way, but I imagine it is either for speed or for numerical stability. You can ask @domenzain on this PR https://github.com/pymc-devs/pymc3/pull/2688 about the logic of the implementation.
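
For what it's worth, a quick check outside of Theano (a sketch using SciPy) suggests it is mainly about numerical stability: far in the left tail the log1p/erfc form underflows to -inf, while the erfcx form stays finite:

import numpy as np
from scipy import special

z = -10.0
# "textbook" branch: log(Phi(z)) = log1p(-erfc(z / sqrt(2)) / 2)
naive = np.log1p(-special.erfc(z / np.sqrt(2.)) / 2.)
# erfcx branch used in the example when z < -1
stable = np.log(special.erfcx(-z / np.sqrt(2.)) / 2.) - z**2 / 2.
print(naive)                 # -inf, because erfc(z / sqrt(2)) rounds to exactly 2.0
print(stable)                # about -53.23
print(special.log_ndtr(z))   # SciPy reference value, also about -53.23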

Oh, I see there are many distributions already implemented in that PR, including the Wald one. I need to have a look at those. Hope they will be part of the next PyMC3 release :tada:

I am still trying to figure out the truncated distribution stuff. I am wondering, why do we need a switch here:

pm.Potential('upper', tt.switch(tt.gt(y, 2.1), -np.inf, -normal_lcdf(0., 1., 2.1)))

If my observations y are truncated at 2.1, they will never be greater than 2.1. I am quite sure my thinking is flawed (also because if I remove the switch the estimates are off), but I can't really understand why :thinking:

EDIT: I got it now.

  1. I fit the likelihood based on the observed data.
  2. The potential then cuts off the tail by setting the pdf to 0 above the upper bound (that is, the log-pdf to -inf).
  3. The potential finally normalizes the pdf below the upper bound by dividing by the CDF value at the upper bound, that is, it subtracts the log-CDF (summarized in one formula below).
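
Putting the three steps together, the log-density of each truncated observation (with upper bound a and untruncated CDF F) is \log p_{[-\infty, a]}(y_i) = \log p(y_i) - \log F(a) if y_i \leq a, and -\infty otherwise, which is exactly what the likelihood plus the Potential add to the model's log-probability.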

And, by the way, I tried to use the Wald logcdf implemented in PR #2688 without success. The estimates are way off. I need to look into that more carefully, I guess.

I don't think that's the case. The reason why you are getting a different estimate without the tt.switch is that you did not account for the number of observations in y. I think the line below should give you the same result:

pm.Potential('upper', -len(y) * normal_lcdf(0., 1., 2.1))

Oh. Then I really didn't understand how potentials work. I need to read up on that; potentials are still kind of obscure to me. Thank you, though.

As you said, y is never larger than 2.1, so to me the switch expression is more for completeness (i.e., useful when you are trying to sample from the distribution).

To me the explanation in the censor_example is quite good. But the example here is slightly different: it is normalizing the upper-bounded distribution so that it integrates to 1.

OK, I think I am starting to understand now. A potential is a multiplier/divisor (or an addend/subtrahend in log space) for the likelihood. I need to account for the number of observations in -len(y) * normal_lcdf(0., 1., 2.1) because the likelihood of each observation needs to be rescaled.
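
Writing it down: the total log-likelihood of the truncated model is \sum_{i=1}^{N}\left[\log p(y_i) - \log F(a)\right] = \sum_{i=1}^{N}\log p(y_i) - N\log F(a), so when the Potential is written as a single scalar rather than one term per observation, the -\log F(a) part has to be multiplied by N = len(y).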

Do you know if there is any best practice when dealing with truncated likelihoods? For example, in Stan, for a truncated normally distributed variable, one would write something like

data {
   real<upper=U> y;
}

model {
   y ~ normal(mu, sigma) T[, U];
}

to constrain the data for y. Maybe in PyMC3 we do not need something like that.

Yep, that’s correct.

Didn't know that you could do this in Stan. I don't think this is easy to do in PyMC3 currently, as we don't have lcdf functions for the distributions; thus, if you want to do this correctly, you need to account for the truncated part by hand. However, the potential part is just a constant, which means you can drop it when you are sampling. In other words, it would be the same as if you were using a bounded RV. I am not sure how Stan deals with it internally, but I think the default is to drop all the constants.

Because I am also wondering: in the case of the Wald/inverse Gaussian distribution, both \mu and \lambda need to be strictly greater than 0. And this is fine if I have a simple model, say y\sim Wald(\mu, \lambda), where the prior on \mu and \lambda is HalfNormal(100) to restrict them to positive values. But I like to model my data using random effects :wink:

And here's my doubt. My model is still y\sim Wald(\mu, \lambda), but \mu is now estimated as \mu=\beta+\gamma, where the prior on the fixed effect is \beta\sim HalfNormal(100), and the prior on the random effect (i.e., the deviation around the estimated group effect \beta) is \gamma\sim \mathcal{N}(0, 1). Now the sampler could take values that make \beta+\gamma<0, because \gamma can in fact be negative.

Maybe I should not worry about it, but I would like to know how to correctly restrict the values that the sampler can take. I find that the Wald distribution is a bit of a mess to deal with. Any ideas? (I hope I explained myself clearly enough :slight_smile: )

I often find it's easier to model your parameters on the real line and transform them to positive values using exp().

Right, I should have thought about that. So it would be something like:

with pm.Model() as model:
    # Fixed effect
    beta = pm.Normal('beta', mu=.., sd=..)
    
    # Random effect (Non centered parametrization)
    sigma = pm.HalfNormal('sigma', sd=..)
    gamma_offset = pm.Normal('gamma_offset', mu=.., sd=..)
    gamma = pm.Deterministic('gamma', gamma_offset * sigma)  
    
    mu = tt.exp(beta + gamma) # <-- here the transformation to get only positive numbers for mu
    lam = pm.HalfNormal('lam', sd=..)
        
    y_likelihood = pm.Wald('y_likelihood', mu=mu, lam=lam, observed=observations)
    
    trace = pm.sample(500)

And then I need to change my priors accordingly. If I still expect \mu and \lambda to be small (say, \mu=\lambda=1), I should tighten the priors (e.g., around 0), otherwise the values for \mu and \lambda will explode. I will try to apply this model.
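
As a rough sanity check on the prior scale: if \beta\sim\mathcal{N}(0, \sigma_\beta^2), then e^{\beta} is log-normal with median e^{0}=1 and a central 95% interval of roughly [e^{-1.96\sigma_\beta}, e^{1.96\sigma_\beta}]. With \sigma_\beta=1 that is about [0.14, 7.1], a sensible range when \mu is expected to be near 1; with \sigma_\beta=100 the upper end is around e^{196}\approx 10^{85}, so a wide prior really does make the exponentiated parameter explode.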