OOS predictions of covariate with "out-of-range" missing values

Hi all,

apologies for the rather obscure title. Part of the problem is that I do not know exactly how to frame the issue I am trying to tackle.
The (partially simplified) setup is as follows:

  • I have a data set of measurements as a function of time (the value of the dependent variable should decrease with time).
  • The time information is missing for a large proportion of the measurements. More importantly, I know that time was recorded only for measurements collected at early times; the measurements with a missing covariate may therefore correspond to times later than any present in the dataset.
  • The goal is both to learn the relationship between the dependent variable and the covariate and to impute the missing time values (I can reach the first goal without using the incomplete records, although of course that comes with a loss of information).

The first thing I tried (I was a bit sceptical that it would work) was to develop a model that includes the missing values in the typical PyMC3 way: a normal random variable for the time, with mean and standard deviation defined by hyperpriors, and the masked time array as the observed data. Here’s the code for the model (where train is the training dataset):

import numpy as np
import pymc3 as pm

days = train["age_in_days"].values
days_missing = np.ma.masked_array(days, np.isnan(days))
road_cat_idx = train["road_cat"].cat.codes.values
n_cat = len(np.unique(road_cat_idx))
deg_obs = train["qual_idx"].values
days_max = days_missing.max()
days_std = days_missing / days_max  # rescale times to [0, 1]
cat_names = train["road_cat"].cat.categories.values

coords = {
    "categories": cat_names,
    "obs_id": np.arange(len(days_std)),  # one integer label per observation
}

with pm.Model(coords=coords) as betabinmod:

    # Hyperpriors
    a_mu = pm.Normal("a_mu", mu=-3, sd=0.1)
    a_sd = pm.HalfNormal("a_sd", sd=0.3)
    b_sd = pm.HalfNormal("b_sd", sd=1)
    phi_mu = pm.Exponential("phi_mu", lam=0.1)
    phi_sd = pm.HalfNormal("phi_sd", sd=0.1)
    age_mu = pm.Normal("age_mu", mu=1.7, sd=1.7)
    age_sd = pm.HalfNormal("age_sd", sd=1.7)

    # Priors (non-centered parametrization for the category-level effects)
    a = pm.Normal("a", mu=0, sd=1, dims="categories")
    b = pm.HalfNormal("b", sd=2, dims="categories")
    phi = pm.HalfNormal("phi", sd=0.1, dims="categories")
    # masked observed array: PyMC3 imputes the masked entries automatically
    age = pm.Normal("age", mu=age_mu, sd=age_sd, observed=days_std, dims="obs_id")

    theta = pm.Deterministic("theta", (phi_mu + phi[road_cat_idx] * phi_sd) + 2,
                             dims="obs_id")
    pbar = pm.Deterministic(
        "pbar",
        pm.math.invlogit((a_mu + a[road_cat_idx] * a_sd)
                         + (b[road_cat_idx] * b_sd) * age),
        dims="obs_id",
    )

    # Likelihood
    road_qual = pm.BetaBinomial(
        "road_qual",
        n=200,
        alpha=pbar * theta,
        beta=(1 - pbar) * theta,
        observed=deg_obs,
        dims="obs_id",
    )

I should mention that I can’t sample the posterior if I use a shared theano variable for the covariate instead of the masked array. If I do that, the initial value for age is set to NaN when sampling starts, and PyMC3 does not print its usual imputation message. Any thoughts on why this happens?
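For reference, here is a stripped-down toy example of the difference I mean (not my actual model). My understanding is that the automatic imputation machinery is only triggered by a masked array, while NaNs inside a shared variable are passed through as data:

import numpy as np
import pymc3 as pm
import theano

days = np.array([0.1, 0.4, np.nan, 0.9])

with pm.Model():
    # masked array: PyMC3 replaces the masked entries with an "age_missing"
    # free variable and prints the usual imputation message
    pm.Normal("age", mu=0, sd=1, observed=np.ma.masked_invalid(days))

with pm.Model():
    # shared variable: no imputation happens, the NaN stays in the data
    # and the initial model logp evaluates to NaN
    pm.Normal("age", mu=0, sd=1, observed=theano.shared(days))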
Regardless, with this approach the model seems unable to infer that the lower values of the dependent variable associated with missing times should correspond to later times; instead it assumes they all fall within the range of the observed values. Essentially, the model squeezes all the data into the observed time range and concludes that the dependent variable drops faster than it really does.

I guess this is not surprising, given that I have no information for larger times, but I was wondering if there could be any way to encode this (more stringent prior specifications? see the sketch below for what I mean).
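To illustrate: something like this sketch (toy data and prior values, purely hypothetical) would put the mass for the missing times beyond the largest observed time by construction:

import numpy as np
import pymc3 as pm
import theano.tensor as tt

# toy standardized times; NaN marks a missing covariate value
days_std = np.array([0.1, 0.3, np.nan, 0.8, np.nan])
missing = np.isnan(days_std)

with pm.Model() as shifted_prior_model:
    # the missing times get their own prior, shifted past the observed
    # range, encoding the belief that unrecorded measurements are later
    offset = pm.HalfNormal("offset", sd=0.5, shape=int(missing.sum()))
    age_missing = pm.Deterministic("age_missing",
                                   np.nanmax(days_std) + offset)

    # assemble the full time vector: data where observed, parameters elsewhere
    age = tt.as_tensor_variable(np.where(missing, 0.0, days_std))
    age = tt.set_subtensor(age[missing.nonzero()[0]], age_missing)
    # `age` would then replace the imputed covariate in the regression terms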

A different (and I guess less principled) approach was to split my two goals: first, sample the posterior using only the data for which the time information is available. This works well and recovers the expected behaviour of the dependent variable. Second, use a test dataset with missing time values to do inference and obtain samples for the time variable. I’m struggling to conceptualize whether this is something the model should even be able to do. I implemented this approach following hints from previous discussions (shared variable sizes seem to not get updated correctly):

and in particular I ended up constructing a model factory that I use to run the “training” and the “imputation” steps separately (please disregard the different dimension names).

def betabinom_model_factory(coords, days_std, days_max, categories, obs):
    with pm.Model(coords=coords) as model:
        # pm.Data containers so the arrays can be swapped between steps;
        # note that pm.Data does not trigger automatic imputation of NaNs
        days_shared = pm.Data("days_shared", days_std, dims="days_array")
        cat_shared = pm.Data("cat_shared", categories, dims="cat_array")
        obs_qual = pm.Data("obs_qual", obs, dims="obs_array")

        # Hyperpriors
        a_mu = pm.Normal("a_mu", mu=-3, sd=0.1)
        a_sd = pm.HalfNormal("a_sd", sd=0.3)
        b_sd = pm.HalfNormal("b_sd", sd=1)
        phi_mu = pm.Exponential("phi_mu", lam=0.1)
        phi_sd = pm.HalfNormal("phi_sd", sd=0.1)
        age_mu = pm.Normal("age_mu", mu=1.7, sd=1.7)
        age_sd = pm.HalfNormal("age_sd", sd=1.7)

        # Priors
        a = pm.Normal("a", mu=0, sd=1, dims="categories")
        b = pm.HalfNormal("b", sd=2, dims="categories")
        phi = pm.HalfNormal("phi", sd=0.1, dims="categories")
        age = pm.Normal("age", mu=age_mu, sd=age_sd,
                        observed=days_shared, dims="days_array")

        theta = pm.Deterministic("theta", (phi_mu + phi[cat_shared] * phi_sd) + 2,
                                 dims="days_array")
        pbar = pm.Deterministic(
            "pbar",
            pm.math.invlogit((a_mu + a[cat_shared] * a_sd)
                             + (b[cat_shared] * b_sd) * age),
            dims="days_array",
        )

        # Likelihood
        road_qual = pm.BetaBinomial(
            "road_qual",
            n=200,
            alpha=pbar * theta,
            beta=(1 - pbar) * theta,
            observed=obs_qual,
            dims="obs_array",
        )
    return model
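For context, this is roughly how I wire the two steps together (the test-set variable names here are just illustrative):

# Step 1: fit on the subset with complete time information
train_model = betabinom_model_factory(train_coords, days_train_std, days_max,
                                      cat_train_idx, qual_train)
with train_model:
    trace = pm.sample()

# Step 2: rebuild the model around the test set (missing times) and draw
# posterior predictive samples, hoping to recover plausible times
test_model = betabinom_model_factory(test_coords, days_test_std, days_max,
                                     cat_test_idx, qual_test)
with test_model:
    ppc = pm.sample_posterior_predictive(trace, var_names=["age", "road_qual"])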

This “works” in the sense that the sampling runs smoothly, but it returns essentially the same value of age across all samples (though a different one for each category). The same happens when I use an approach closer to the one suggested by @junpenglao. Also, with this approach PyMC3 does not create an age_missing variable, since no missing values are present when the posterior is sampled.

Is what I am trying to do feasible?
I am still learning about PyMC3’s capabilities and about Bayesian statistical approaches in general, so apologies for the lack of clarity and for the likely blunders!

Thanks for your help!

Hi all,

apologies for asking for help again. I was wondering if anyone might have input, particularly concerning the approach that follows these steps:

  1. Train the model on a complete dataset
  2. Generate predictions from the posterior distribution for entries with missing values

If anyone has thoughts on how best to approach this problem it would be super helpful.

Thanks,
Paolo

@PaoloAndrich did you ever figure out a way to solve this issue? I can’t figure out how to run OOS posterior predictive with missing values, not even with all the goodies brought into PyMC4.

Hi @anchovy, no, unfortunately I never found a way around it, and I abandoned the attempt soon after. I don’t know whether any new capabilities in PyMC4 could help with this; it would be great to hear if someone has an idea!

If there’s a relationship between missingness and a covariate that you are interested in, you have to model that explicitly. A model cannot give you more than the structure you put in its prior.

You could, for instance, model y and x jointly as a multivariate normal with partially missing entries in both y and x; the covariance could then be learned from the data. There’s an old PyMC3 talk that touches on this: Partial Missing Multivariate Observation and What to Do With Them by Junpeng Lao
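A rough, untested sketch of the idea, factoring the bivariate normal into a marginal for x and a conditional for y so the masked-array imputation can handle the missing entries in each (x and y here are toy arrays with NaNs marking missing values):

import numpy as np
import pymc3 as pm

x = np.array([1.0, 2.0, np.nan, 4.0])
y = np.array([0.5, np.nan, 1.5, 2.0])

with pm.Model() as joint_model:
    mu_x = pm.Normal("mu_x", 0, 1)
    mu_y = pm.Normal("mu_y", 0, 1)
    sd_x = pm.HalfNormal("sd_x", 1)
    sd_y = pm.HalfNormal("sd_y", 1)
    rho = pm.Uniform("rho", -1, 1)  # correlation, learned from the data

    # marginal for x: the masked array imputes the missing x entries
    x_obs = pm.Normal("x_obs", mu=mu_x, sd=sd_x,
                      observed=np.ma.masked_invalid(x))
    # conditional for y given x carries the learned covariance
    cond_mu = mu_y + rho * sd_y / sd_x * (x_obs - mu_x)
    cond_sd = sd_y * pm.math.sqrt(1 - rho ** 2)
    pm.Normal("y_obs", mu=cond_mu, sd=cond_sd,
              observed=np.ma.masked_invalid(y))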

Another example is a soft-truncation process: Modeling with soft truncation
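The gist, in a very rough sketch of my own (not the exact model from that post): rather than hard-truncating, you add a smooth “probability that the time went unrecorded” term to the log-probability, which pushes imputed times past the observed range:

import pymc3 as pm

with pm.Model():
    # imputed time for a measurement whose timestamp was not recorded
    t_missing = pm.Normal("t_missing", mu=1.0, sd=1.0)
    # the chance that a timestamp is missing grows smoothly with t;
    # t_cut and k are illustrative values, not fitted quantities
    t_cut, k = 1.0, 10.0
    pm.Potential("p_missing",
                 pm.math.log(pm.math.invlogit(k * (t_missing - t_cut))))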

Hey Ricardo,

Thanks for this.
The idea of soft truncation sounds like what I was seeing in my data, so I will look into that!

As an aside, I am trying to understand your comment about the model prior, as I think I am missing something.
The situation I had was one where the covariate was not missing at random; rather, I knew that missingness happened more often at long times, so the model was only learning from data at early times. However, I was thinking that, since the model learns the shape of the time dependence from the complete data, it should be able to infer that points with a small value of the dependent variable are more likely to be associated with large times. Somehow, though, the credible intervals for the times were all squeezed into the range of “existing” times, independently of the prior on the covariate. What am I missing?

I don’t see any parameters in your model that relate to this time dependence. PyMC can’t infer patterns out of thin air; it can only infer parameters that are explicitly and functionally related to your data.
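To make that concrete, a minimal hypothetical sketch of modeling the missingness explicitly: treat the missing-indicator itself as data whose probability depends on time, so the imputed times are informed by the very fact that they are missing (all priors and values here are illustrative):

import numpy as np
import pymc3 as pm

# toy times (NaN = missing) and the corresponding missingness indicator
t = np.array([0.1, 0.3, np.nan, 0.8, np.nan])
is_missing = np.isnan(t).astype(int)

with pm.Model():
    t_mu = pm.Normal("t_mu", 0.5, 1)
    t_sd = pm.HalfNormal("t_sd", 1)
    # masked observed array: missing times become free parameters
    t_all = pm.Normal("t_all", mu=t_mu, sd=t_sd,
                      observed=np.ma.masked_invalid(t))
    # missingness depends on time: later measurements are more likely
    # to lack a timestamp, so imputed times get pushed to larger values
    alpha = pm.Normal("alpha", 0, 1)
    beta = pm.HalfNormal("beta", 2)
    p_miss = pm.math.invlogit(alpha + beta * t_all)
    pm.Bernoulli("missing_obs", p=p_miss, observed=is_missing)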