OOS predictions of covariate with "out-of-range" missing values

Hi all,

apologies for the rather obscure title for this topic. Part of the problem is that I do not know exactly how to frame the issue I am trying to tackle.
The (somewhat simplified) setup is overall pretty simple:

  • I have a data set of measurements as a function of time (the value of the dependent variable should decrease with time).
  • The time information is missing for a large proportion of the measurements. More importantly, I know that time is recorded only for measurements collected at early times, so the measurements with a missing covariate may correspond to times longer than any present in the dataset (the toy sketch below illustrates this pattern).
  • The goal is to learn the relationship between the dependent variable and the covariate, but also to impute the missing time values (I can reach the first goal without the incomplete rows, although that of course comes with a loss of information).
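
To make the setup concrete, here is a toy version of the data (entirely synthetic; all the numbers are made up purely to illustrate the missingness pattern):

import numpy as np

rng = np.random.default_rng(0)

# Quality decreases with age, and age is recorded only for the
# early measurements; the later ages are missing.
n = 100
age_true = rng.uniform(0, 10, size=n)
quality = rng.binomial(n=200, p=1 / (1 + np.exp(0.8 * (age_true - 3))))
age_recorded = np.where(age_true < 4, age_true, np.nan)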

The first thing I tried (I was a bit sceptical it would work) was a model that includes the missing values in the typical PyMC3 way: a normal random variable for the time, with mean and standard deviation defined by hyperpriors, and the masked time array passed as the observed data. Here's the code for the model (train is the training dataset):

import numpy as np
import pymc3 as pm

days = train["age_in_days"].values
days_missing = np.ma.masked_array(days, np.isnan(days))  # the mask flags the missing times
road_cat_idx = train["road_cat"].cat.codes.values
n_cat = len(np.unique(road_cat_idx))
deg_obs = train["qual_idx"].values
days_max = days_missing.max()        # maximum over the observed entries only
days_std = days_missing / days_max   # rescale the observed times to [0, 1]
cat_names = train["road_cat"].cat.categories.values

coords = {
    "categories": cat_names,
    # positional labels; using days_std itself would give NaN/duplicated labels
    "obs_id": np.arange(len(days_std)),
}

with pm.Model(coords=coords) as betabinmod:

    # Hyperpriors
    a_mu = pm.Normal("a_mu", mu=-3, sd=0.1)
    a_sd = pm.HalfNormal("a_sd", sd=0.3)
    b_sd = pm.HalfNormal("b_sd", sd=1)
    phi_mu = pm.Exponential("phi_mu", lam=0.1)
    phi_sd = pm.HalfNormal("phi_sd", sd=0.1)
    age_mu = pm.Normal("age_mu", mu=1.7, sd=1.7)
    age_sd = pm.HalfNormal("age_sd", sd=1.7)

    # Priors (non-centered parameterization for the per-category effects)
    a = pm.Normal("a", mu=0, sd=1, dims="categories")
    b = pm.HalfNormal("b", sd=2, dims="categories")
    phi = pm.HalfNormal("phi", sd=0.1, dims="categories")
    # The masked array triggers automatic imputation: PyMC3 adds a free
    # "age_missing" variable for the masked entries
    age = pm.Normal("age", mu=age_mu, sd=age_sd, observed=days_std, dims="obs_id")

    theta = pm.Deterministic("theta", (phi_mu + phi[road_cat_idx] * phi_sd) + 2, dims="obs_id")
    pbar = pm.Deterministic(
        "pbar",
        pm.math.invlogit((a_mu + a[road_cat_idx] * a_sd) + (b[road_cat_idx] * b_sd) * age),
        dims="obs_id",
    )

    # Likelihood
    road_qual = pm.BetaBinomial(
        "road_qual",
        n=200,
        alpha=pbar * theta,
        beta=(1 - pbar) * theta,
        observed=deg_obs,
        dims="obs_id",
    )

I should mention that I can't seem to sample the posterior if I use a shared theano variable for the covariate: when sampling starts, the initial value of age is set to NaN and PyMC3 never prints the imputation message. Any thoughts on why this happens? My guess is that the mask cannot be stored in the shared variable, so the automatic imputation machinery is never triggered.
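
A minimal sketch of the two behaviours, with made-up toy data:

import numpy as np
import pymc3 as pm

toy = np.array([0.1, 0.5, np.nan])

# 1) A numpy masked array: PyMC3 detects the mask and creates a free
#    "age_missing" variable to impute the masked entry.
with pm.Model():
    pm.Normal("age", mu=0, sd=1, observed=np.ma.masked_invalid(toy))

# 2) A shared variable: the mask is lost when the array is stored, the
#    NaN passes through as a plain float, and no *_missing variable is
#    created, so the sampler starts from NaN.
with pm.Model():
    toy_shared = pm.Data("toy_shared", toy)
    pm.Normal("age", mu=0, sd=1, observed=toy_shared)
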
Regardless, with this approach the model seems unable to infer that the lower values of the dependent variable associated with missing time values should correspond to larger times; it assumes instead that they all fall within the range of the observed values. Essentially, the model squeezes all the data into the time range I observe and concludes that the dependent variable drops faster than it really does.

I guess this is not surprising given that I have no information for larger times, but I was wondering whether there is any way to encode this knowledge (more informative priors? something like the sketch below?).
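
For example, one option I was considering (a sketch only, reusing days, days_std, days_max and coords from above; theano.tensor as tt, and the hyperparameter values are just the ones from my age prior) would be to bypass the automatic imputation and declare the missing times explicitly, with a prior truncated below at the largest observed (rescaled) time:

import theano.tensor as tt

obs_idx = np.flatnonzero(~np.isnan(days))  # rows with a recorded time
mis_idx = np.flatnonzero(np.isnan(days))   # rows with a missing time
days_obs = days[obs_idx] / days_max        # observed, rescaled times

with pm.Model(coords=coords) as truncated_model:
    age_mu = pm.Normal("age_mu", mu=1.7, sd=1.7)
    age_sd = pm.HalfNormal("age_sd", sd=1.7)

    # Missing times are constrained to lie above the observed range
    # (lower=1.0 because the times were rescaled by days_max)
    BoundedNormal = pm.Bound(pm.Normal, lower=1.0)
    age_missing = BoundedNormal("age_missing", mu=age_mu, sd=age_sd,
                                shape=len(mis_idx))

    # Stitch observed and imputed times back into a single vector
    age = tt.zeros(len(days))
    age = tt.set_subtensor(age[obs_idx], days_obs)
    age = tt.set_subtensor(age[mis_idx], age_missing)
    # ... rest of the model as above, using this `age` in pbar
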

A different (and I guess less proper) approach I wanted to try was to split my two goals. First, use only the data for which the time information is available to sample the posterior; this works well and recovers the expected behaviour of the dependent variable. Second, use a test dataset with missing time values to do inference and obtain samples for the time variable. I'm struggling to conceptualize whether this is something the model should even be able to do. I implemented this approach following hints from previous discussions (shared variable sizes do not seem to get updated correctly), and in particular I ended up constructing a model factory that I use to run the "training" step and the "imputation" step separately (please disregard the different dimension names; a sketch of how I call it follows the factory).

def betabinom_model_factory(coords, days_std, days_max, categories, obs):
    with pm.Model(coords=coords) as model:
        days_shared = pm.Data("days_shared", days_std, dims="days_array")
        cat_shared = pm.Data("cat_shared", categories, dims="cat_array")
        obs_qual = pm.Data("obs_qual", obs, dims="obs_array")

        # Hyperpriors
        a_mu = pm.Normal("a_mu", mu=-3, sd=0.1)
        a_sd = pm.HalfNormal("a_sd", sd=0.3)
        b_sd = pm.HalfNormal("b_sd", sd=1)
        phi_mu = pm.Exponential("phi_mu", lam=0.1)
        phi_sd = pm.HalfNormal("phi_sd", sd=0.1)
        age_mu = pm.Normal("age_mu", mu=1.7, sd=1.7)
        age_sd = pm.HalfNormal("age_sd", sd=1.7)

        # Priors
        a = pm.Normal("a", mu=0, sd=1, dims="categories")
        b = pm.HalfNormal("b", sd=2, dims="categories")
        phi = pm.HalfNormal("phi", sd=0.1, dims="categories")
        age = pm.Normal("age", mu=age_mu, sd=age_sd, observed=days_shared, dims="days_array")

        theta = pm.Deterministic("theta", (phi_mu + phi[cat_shared] * phi_sd) + 2, dims="days_array")
        pbar = pm.Deterministic(
            "pbar",
            pm.math.invlogit((a_mu + a[cat_shared] * a_sd) + (b[cat_shared] * b_sd) * age),
            dims="days_array",
        )

        # Likelihood
        road_qual = pm.BetaBinomial(
            "road_qual",
            n=200,
            alpha=pbar * theta,
            beta=(1 - pbar) * theta,
            observed=obs_qual,
            dims="obs_array",
        )
    return model
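
For concreteness, the two steps look roughly like this (the test_* names are placeholders for the dataset with missing times):

# "Training" step: sample on the rows where the time is observed
with betabinom_model_factory(train_coords, train_days_std, days_max,
                             train_cat_idx, train_obs):
    trace = pm.sample(2000, tune=2000, return_inferencedata=True)

# "Imputation" step: rebuild the model around the test data and draw
# posterior-predictive samples, including for age
with betabinom_model_factory(test_coords, test_days_std, days_max,
                             test_cat_idx, test_obs):
    ppc = pm.sample_posterior_predictive(trace, var_names=["age", "road_qual"])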

This "works" in the sense that sampling runs smoothly, but it returns essentially the same value of age across all samples (though different for each category). The same happens when I use an approach closer to the one suggested by @junpenglao. Also, with this route PyMC3 never creates an age_missing variable, since there were no missing values when the posterior was sampled.

Is what I am trying to do feasible?
I am still learning about PyMC3's capabilities and about Bayesian statistics in general, so apologies for the lack of clarity and for the likely blunders!

Thanks for your help!

Hi all,

apologies for asking again for help. I was wondering if anyone might have input, particularly concerning the approach that follows these steps:

  1. Train the model on the complete portion of the dataset (rows with observed time values)
  2. Generate predictions from the posterior distribution for the entries with missing time values

If anyone has thoughts on how best to approach this problem, it would be super helpful.

Thanks,
Paolo