Advice on Modeling Censored Data from a Ratio of Variables

Hello everyone,
I’ve previously benefited from the expertise on this forum on modeling censored data using standard censored distributions in PyMC. I’m now working on a slightly more complex problem and would appreciate your insights.
I’m working on a model where the observed variable is derived from two measurements: an intensive property X_i (which is subject to censoring) and the mass analyzed Z_i (fully observed). I calculate the ratio Y_i=X_i/Z_i and assume that Y_i is lognormally distributed. However, since X_i scales with Z_i (doubling Z_i doubles X_i), I am unsure about the best approach to model X_i.

My current approach is to parametrize the relation as:

\log(X_i) = \log(Z_i) + \log(Y_i)

with

\log(Y_i) \sim \mathcal{N}(\mu, \sigma).

X_i follows a lognormal distribution with parameters \mu + \log(Z_i) and \sigma. To handle the censoring in X_i, I would use PyMC’s pm.Censored.

This is a sketch of my proposed implementation:

import pymc as pm
import numpy as np

# Assume we have our observed data arrays:
# Zi: mass analyzed (fully observed)
# Xi: measured intensive property (with censoring)
# lower_limit: censoring threshold for Xi

with pm.Model() as model:
    # Priors for the lognormal parameters for Yi
    # These priors values are just for the example
    mu = pm.Normal('mu', mu=0, sigma=2)
    sigma = pm.HalfNormal('sigma', sigma=2)
    
    log_Xi_expected = mu + np.log(Zi)
    log_lower_limit = np.log(lower_limit)
    
    log_Xi = pm.Censored('log_Xi',
                        pm.Normal.dist(mu=log_Xi_expected, sigma=sigma),
                        lower=log_lower_limit,
                        observed=np.log(Xi))

I’d appreciate any feedback or suggestions on this approach. In particular:

  • Is this reparameterization and use of pm.Censored appropriate for handling the dependency between X_i and Z_i?
  • Are there any improvements or alternative strategies to better capture the censoring in X_i while considering the lognormal assumption on Y_i ?
  • To assess the distribution of Y_i across different experimental settings, would you recommend imputing the censored values for X_i and then reconstructing Y_i, or use the posterior estimates of \mu and \sigma directly to construct the Y_i distribution?

Thank you in advance for your insights.

Best regards,
Andrea

Looks reasonable, I would perhaps use a lognormal likelihood directly to avoid converting the censoring bounds.