Modeling with soft truncation

Hi all,

I would like to code with pymc a linear regression (to take a simple example) in which the outcome variable is “soft truncated”. By soft truncation I mean that the probability of a data point being missing goes smoothly from 1 to 0, instead of jumping abruptly between the two values as in the usual truncation case. Missing values are completely missing (we do not even know how many of them there are).

We can focus on the truncation case described in the PyMC example on truncated and censored regression (https://www.pymc.io/projects/examples/en/latest/generalized_linear_models/GLM-truncated-censored-regression.html), changing the hard truncation function into a cumulative normal function.
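
To make this concrete, here is a minimal sketch of the selection probability I have in mind (the threshold and the unit width of the selection function are purely illustrative choices):

import numpy as np
from scipy.stats import norm

t = -1.0                        # illustrative threshold
y = np.linspace(-6.0, 4.0, 9)   # a few example outcome values

# hard truncation: a point below t is always missing
p_keep_hard = (y >= t).astype(float)

# soft truncation: the probability of keeping a point rises
# smoothly from 0 to 1 around t (here a unit-sd normal CDF)
p_keep_soft = norm.cdf(y - t)

print(np.column_stack([y, p_keep_hard, p_keep_soft]))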

If you are interested in a bit of background: soft truncation is common in astronomy (faint objects/features are often, but not always, missed), and various case studies, modeled in JAGS, are described in Bayesian Methods for the Physical Sciences: Learning from Examples in Astronomy and Physics (Springer).

To sum up, is there a way to infer the parameters of a fit in pymc when the data are soft truncated?

Regards, S.

Could you share one of those JAGS examples? I have a bit of a hard time imagining what a soft truncation process looks like.

How would you generate the data: with a stochastic dropout probability that depends on how far in the tails an observation falls?

Here is a full example of soft truncation, including inference of the slope (biased, because the fit does not account for the soft truncation).

# modified from https://www.pymc.io/projects/examples/en/latest/generalized_linear_models/GLM-truncated-censored-regression.html#glm-truncated-censored-regression

import matplotlib.pyplot as plt
import numpy as np
import pymc as pm
from numpy.random import default_rng
from scipy.stats import norm, truncnorm
import arviz as az


%config InlineBackend.figure_format = 'retina'
rng = default_rng(12345)

# generate the full sample, including data that will never be observed
slope, intercept, scat, N = 1, 0, 2, 200
x = rng.uniform(-10, 10, N)
y = rng.normal(loc=slope * x + intercept, scale=scat)

def soft_truncate_y(x, y, threshold):
    # probability of keeping each point (smooth selection function)
    py = norm.cdf(y + threshold)
    myrand = np.random.uniform(size=y.size)
    keep = (py >= myrand)
    return (x[keep], y[keep])
    
# select the points included in the observed sample
threshold = -1
xt, yt = soft_truncate_y(x, y, threshold)
print('included in the observed sample', yt.size, 'out of', y.size)

# plot all
plt.plot(x, y, ".", c=[0.7, 0.7, 0.7],label='_nolegend_')
plt.axhline(threshold, c="r", ls="--")
plt.xlabel("x"), plt.ylabel("y")
plt.plot(xt, yt, ".", c=[0, 0, 0],label='Observed data')

# (biased) fit, neglecting soft truncation
def linear_regression(x, y):
    with pm.Model() as model:
        slope = pm.Normal("slope", mu=0, sigma=1)
        intercept = pm.Normal("intercept", mu=0, sigma=1)
        sig = pm.HalfNormal("sig", sigma=1)
        pm.Normal("obs", mu=slope * x + intercept, sigma=sig, observed=y)

    return model

trunc_linear_model = linear_regression(xt, yt)

with trunc_linear_model:
    trunc_linear_fit = pm.sample(return_inferencedata = True)

# plot derived (biased) posterior, not accounting for soft selection
az.plot_posterior(trunc_linear_fit, var_names=["slope"], ref_val=slope)

You can wrap your code in triple backticks ``` to format it correctly.

I think this works:

# modified from https://www.pymc.io/projects/examples/en/latest/generalized_linear_models/GLM-truncated-censored-regression.html#glm-truncated-censored-regression

import matplotlib.pyplot as plt
import numpy as np
import pymc as pm
import pytensor.tensor as pt
from scipy.stats import norm


rng = np.random.default_rng(123)

# generate the full sample, including data that will never be observed
slope, intercept, scat, N = 1, 0, 2, 2000
x = rng.uniform(-10, 10, N)
y = rng.normal(loc=slope * x + intercept, scale=scat)

def soft_truncate_y(x, y, threshold):
    # Same as 1 - norm.cdf(threshold, y, 1)
    py = 1 - norm.cdf(threshold - y, 0, 1)
    myrand = rng.uniform(size=y.size)
    keep = (py >= myrand)
    return (x[keep], y[keep])
    
# select the points included in the observed sample
threshold = -1
xt, yt = soft_truncate_y(x, y, threshold)
print('included in the observed sample', yt.size, 'out of', y.size)

# plot all
plt.plot(x, y, ".", c=[0.7, 0.7, 0.7],label='_nolegend_')
plt.axhline(threshold, c="r", ls="--")
plt.xlabel("x"), plt.ylabel("y")
plt.plot(xt, yt, ".", c=[0, 0, 0],label='Observed data')

# fit, with optional correction for the soft truncation
def linear_regression(x, y, adjustment=False):
    def logp(value, mu, sigma):
        base_logp = pm.logp(pm.Normal.dist(mu, sigma), value)
        # The probability of being kept (non-truncation) is given by the
        # logccdf of the Normal(mu, sigma) distribution convolved with
        # Normal(0, 1), evaluated at the threshold.
        # (The per-observation selection factor Phi(value - threshold) does
        # not depend on mu or sigma, so it only shifts the logp by a constant
        # and can be dropped.)
        convolved_normal = pm.Normal.dist(mu, pt.sqrt(sigma ** 2 + 1 ** 2))
        truncation_adj = pt.log1mexp(pm.logcdf(convolved_normal, threshold))
        return base_logp - truncation_adj
    
    with pm.Model() as model:
        slope = pm.Normal("slope", mu=0, sigma=2)
        intercept = pm.Normal("intercept", mu=0, sigma=1)
        sigma = pm.HalfNormal("sigma", sigma=1)
        mu = slope * x + intercept
        if adjustment:
            pm.CustomDist("obs", mu, sigma, logp=logp, observed=y)
        else:
            pm.Normal("obs", mu=mu, sigma=sigma, observed=y)

    return model

for adjustment in (False, True):
    with linear_regression(xt, yt, adjustment):
        trunc_linear_fit = pm.sample(random_seed=rng)
    pm.plots.plot_posterior(trunc_linear_fit, var_names=["intercept", "slope", "sigma"], ref_val=[intercept, slope, scat])

I modified the threshold logic a bit to be more similar to the logp. Also, you were not truncating at the threshold; I imagined you wanted y - threshold instead of y + threshold?

Without adjustment: [posterior plot]

With adjustment: [posterior plot]


I have to admit I got to the solution more from intuition than mathematical principles. So it could be flawed. The idea was the following.

A truncated variable has the same density as the original variable Y, divided by the probability of non-truncation, P(keep). For a “hard” lower truncation at threshold t, P(keep) = 1 - P(Y \le t).

Now we need to find P(keep) for the “soft” lower truncation. The expression that is most intuitive for me is P(keep) = 1 - P(Y \le N(t, 1)).

But it may help to derive it from where you started originally with P(keep) = P(N(0, 1) \le Y - t). Then we can move Y to the other side P(keep) = P(N(0, 1) - Y \le -t).

Now it’s important that Y is also a Normal RV! The difference of two RVs is obtained by a convolution, but in the case of Normals it has a simple closed-form solution: N(\mu_1, \sigma_1) - N(\mu_2, \sigma_2) = N(\mu_1 - \mu_2, \sqrt{\sigma_1^2 + \sigma_2^2}). So we get P(keep) = P(N(-\mu, \sqrt{\sigma^2 + 1}) \le -t), which is what we need! This is the CDF of the convolved Normals evaluated at -t.

It turns out that P(N(-\mu, \sigma) \le -t) = 1 - P(N(\mu, \sigma) \le t), which is what I did in my code.


You can arrive at the same solution from the more “intuitive” form above, with the extra trick that N(t, 1) = N(0, 1) + t.


For intuition, it helped me to see that if we reduce the standard deviation of the threshold, we approach the “hard” truncation case: P(keep) = 1 - P(Y - N(0, \epsilon) \le t), which tends to 1 - P(Y \le t) as \epsilon \to 0.
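
A quick numerical check of the convolution argument, with arbitrary values of \mu, \sigma and t chosen only for the test (not taken from the example above):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu, sigma, t = 1.3, 2.0, -1.0   # arbitrary values, only for this check

# Monte Carlo estimate of P(keep) = E[ Phi(Y - t) ] with Y ~ N(mu, sigma)
y = rng.normal(mu, sigma, size=1_000_000)
p_keep_mc = norm.cdf(y - t).mean()

# closed form via the convolved Normal: 1 - CDF of N(mu, sqrt(sigma^2 + 1)) at t
p_keep_analytic = 1 - norm.cdf(t, loc=mu, scale=np.sqrt(sigma**2 + 1))

print(p_keep_mc, p_keep_analytic)   # the two numbers should agree closely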


Interesting problem @Stefano! I thought of a different way to get the expression for the probability density given the soft truncation.

You can think of your problem like this:

  1. Generate a value for variable X given its probability distribution: x \sim f_{X}(x)
  2. Generate a value for a binary random variable Y. If y=1, you keep the value x for variable X and return it. If y=0, then you go back to step 1. The probability that y=1 is given by some function g(x).

You can write down the joint density for variables X and Y like this:

P(X=x, Y=y) = g(x)^{y} (1-g(x))^{1-y}f_{X}(x)

To get the truncated distribution’s probability density function, we need to take the above joint probability and arrive at the conditional probability P(X=x | Y=1). Using Bayes' theorem, we can derive that as:

P(X=x | Y=1) = \frac{P(X=x, Y=1)}{\int P(X=x, Y=1) dx} \\ = \frac{g(x) f_{X}(x)}{\int g(x) f_{X}(x) dx}

Since you assume that g(x) is a normal cumulative function, and that X follows a normal distribution, you can compute the integral in the denominator using the property \int \mathrm{normcdf}(x - t, 1)\, N(x|\mu, \sigma)\, dx = \mathrm{normcdf}\left( \frac{\mu - t}{\sqrt{1+\sigma^2}} \right) to finally get:

P(X=x | Y=1) = \frac{\mathrm{normcdf}(x-t, 1) N(x|\mu, \sigma)}{\mathrm{normcdf}\left( \frac{\mu - t}{\sqrt{1+\sigma^2}} \right)}

As far as I can tell, my expression for the denominator is exactly the same as the one @ricardoV94 posted above for P(keep), which makes sense since P(keep) = \int P(X=x, Y=1) dx. The only benefit of my expression is that you can plug in generic functions g(x) and distributions f_{X}(x) and then try to work out the resulting soft-truncated probability density function (spoiler alert: there may be integrals in the denominator that are impossible to solve analytically).
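
As a sanity check of that closed form (again with arbitrary values of \mu, \sigma and t, not tied to the example above), the closed-form denominator can be compared against a numerical integral of the numerator:

import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

mu, sigma, t = 1.3, 2.0, -1.0   # arbitrary values, only for this check

# numerator of the soft-truncated density: g(x) * f_X(x)
numerator = lambda x: norm.cdf(x - t) * norm.pdf(x, loc=mu, scale=sigma)

# closed-form normalizing constant: normcdf((mu - t) / sqrt(1 + sigma^2))
denom_closed = norm.cdf((mu - t) / np.sqrt(1 + sigma**2))

# numerical normalizing constant: integrate the numerator over the real line
denom_numeric, _ = quad(numerator, -np.inf, np.inf)

print(denom_closed, denom_numeric)   # the two numbers should match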
