Two likelihoods dependent on missing predictors


I am trying to solve a silly indexing issue and realize I’m not sure I understand pymc3 syntax. Suppose I want to model a y that is normally distributed, and have predictors x0 and x1. x1 has only sometimes been measured, but I do not wish to impute it. Instead, I want to define two likelihoods for y. The two likelihoods would have different linear models, the less complex one where x1 is missing has, instead of x1 as a predictor, an additional residual uncertainty (sigma). Thus, when complete data is available, I want to model y like so:

y \sim \mathcal{N}(\mu, \sigma) \\ \mu = a+\beta_0 * x_0 + \beta_1 * x_1

and in the case I only know x0:

y \sim \mathcal{N}(\mu, \sigma + \sigma_{x1}) \\ \mu = a+\beta_0 * x_0

In stan I can do this like so (with N observations, and x1_measured an indicator vector):

   // likelihood, which we only evaluate conditionally
        for (i in 1:N) {
           if (x1_measured[i]==1) y[i] ~ normal(mu[i] , sigma ) ;
           if (x1_measured[i]==0) y[i] ~ normal(mu[i] , sigma+sigma_x1 ) ; // more uncertainty when non-measured 

I tried this approach in pymc3 (after defining priors etc, idx_without_m is an array of indeces):

   with pm.Model() as model: 
       # define priors...(not shown)
       mu_with = a + beta_x0*x0 + beta_x1*x1
       mu_without = a + beta_x0 * x0 

       y_likelihood_with = pm.Normal("y_with",mu_with, sigma, observed=np.array(dw)[idx_with_m])
       y_likelihood_without = pm.Normal("y_lik_without",mu_without, sigma+sigma_m, observed=np.array(dw)[idx_without_m])

Is this “canonical”? Should I used Theano shared variables etc?

That looks fine. (At least pymc syntax wise, I really don’t know about the imputation/non-imputation).
There is no need to use theano shared variables, that is only something you can use to switch out datasets.
Shouldn’t the second sigma be something like tt.sqrt(sigma**2 + sigma_m**2)?

1 Like

Thanks! And you are right.