Calculating the likelihood based on not missing observed values

square · May 25, 2018, 10:58am

Hello,
I am trying to handle missing target values by masking them, and I want to calculate the likelihood only at the points where the target values are not missing. How should I do this?

https://github.com/pymc-devs/pymc3/blob/master/pymc3/examples/lasso_missing.py

My model looks like this.

    import numpy as np
    import pymc3 as pm
    import theano
    import theano.tensor as T

    # X_train, Y_train, init_1, init_out and n_hidden1 are defined elsewhere
    ann_input = theano.shared(X_train)
    ann_output = theano.shared(Y_train)

    with pm.Model() as neural_network:
        # Weights: input -> hidden layer, hidden layer -> output
        weights_in_1 = pm.Normal('w_in_1', 0, sd=1, shape=(5, 3), testval=init_1)
        weights_1_out = pm.Normal('w_1_out', 0, sd=1, shape=(3,), testval=init_out)

        # Biases for the hidden layer and the output
        hidden1_bias = pm.Normal('hidden1_bias', sd=1, shape=n_hidden1)
        hidden_out_bias = pm.Normal('hidden_out_bias', mu=3, sd=1)

        act_1 = pm.math.tanh(pm.math.dot(ann_input, weights_in_1) + hidden1_bias[None, :])
        regression = T.dot(act_1, weights_1_out) + hidden_out_bias

        # Likelihood; ann_output contains the masked target values
        pm.Normal('out', mu=regression, sd=np.sqrt(0.9), observed=ann_output)

        train_trace = pm.sample()

In the likelihood, ann_output has some values masked; part of it looks like this:

Following the example from the disaster case study, I set the missing values to -999 and masked that value, and got this result:

To check whether the data were really masked, I changed the placeholder to 999 and masked the value 999 instead, and got this result:
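
A minimal sketch of how a placeholder value such as -999 can be turned into a NumPy masked array. The values in `y_raw` are made up for illustration and are not the data from the model above:

```python
import numpy as np

# Hypothetical targets; -999 is the placeholder for a missing value
y_raw = np.array([1.2, -999.0, 0.7, -999.0, 2.1])

# masked_values marks every entry (approximately) equal to the placeholder
y_masked = np.ma.masked_values(y_raw, value=-999.0)

print(y_masked.mask)          # which entries are treated as missing
print(y_masked.compressed())  # only the observed entries
```

Printing the mask is a quick way to check that the placeholder was actually recognised before the array is passed on.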

junpenglao · May 25, 2018, 11:33am

I am not sure I understand what you are trying to do. But in general you don't have a point value as the likelihood for missing data: the missing values are represented by some kind of density, and if you sample, the MCMC samples of the missing data are drawn from that density.

rlouf · May 26, 2018, 8:12am

I concur with @junpenglao: in Bayesian analysis you don't have to give missing values special treatment. To compute the likelihood, you only have to care about the values that are observed.
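
As a rough NumPy illustration of evaluating the likelihood only at observed values: the sketch below sums a Normal log-density over the unmasked entries. The data, the predictions `mu`, and the helper `normal_logp` are all hypothetical and not part of PyMC3:

```python
import numpy as np

def normal_logp(y, mu, sd):
    """Elementwise log-density of Normal(mu, sd) evaluated at y."""
    return -0.5 * np.log(2 * np.pi * sd**2) - 0.5 * ((y - mu) / sd) ** 2

# Hypothetical targets (-999 marks a missing value) and model predictions
y = np.ma.masked_values([1.0, -999.0, 0.5, -999.0], -999.0)
mu = np.array([0.8, 0.1, 0.6, 1.5])
sd = np.sqrt(0.9)

# True where the target is observed
observed = ~np.ma.getmaskarray(y)

# Only observed entries contribute to the log-likelihood
loglik = normal_logp(y.data[observed], mu[observed], sd).sum()
print(loglik)
```

The masked entries never enter the sum, so the placeholder value has no influence on the result.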

square · May 29, 2018, 9:17am

Hello,
thanks for the reply. It seems that when the variable is wrapped in theano.shared, the likelihood function no longer sees the mask. When we tried without theano.shared, feeding the input directly into the model, it worked well, as @junpenglao mentioned.
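
This behaviour would be consistent with the mask being dropped when the masked array is converted to a plain array. A NumPy-only sketch of that effect, under the assumption (not verified here) that theano.shared stores its value as a plain `ndarray`; `np.asarray` stands in for that conversion:

```python
import numpy as np

y_masked = np.ma.masked_values([1.0, -999.0, 0.5], -999.0)

# Stand-in for the conversion assumed to happen inside theano.shared
y_plain = np.asarray(y_masked)

print(hasattr(y_masked, "mask"))  # the masked array carries a mask
print(hasattr(y_plain, "mask"))   # the plain array does not
print(y_plain)                    # the placeholder is now ordinary data
```

Once the mask is gone, the placeholder would be treated as a genuine observation, which could explain why changing -999 to 999 changes the result.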

gokl · June 11, 2018, 10:32pm

I’m facing a very similar problem atm where I have a masked array in a pm.Minibatch in a theano.shared var. This could explain the wrong estimates I am getting. Is there an open issue for this somewhere?

Where and how is the treatment of missing observed values for NumPy masked arrays implemented? I would like to look into it to understand whether this is my problem.

junpenglao · June 12, 2018, 5:40am

I am not sure this would work; at least, I don't think it is one of the cases that we tested.

Internally, PyMC3 searches for masked values in the observed data and creates a free random variable for them. In effect, it adds a new random variable and draws prior predictive samples from it:

https://github.com/pymc-devs/pymc/blob/dd7caabe366f6ae7a9917f50154ea779ffb4b110/pymc3/model.py#L817-L836
      
          elif isinstance(data, dict):
              with self:
                  var = MultiObservedRV(name=name, data=data, distribution=dist,
                                        total_size=total_size, model=self)
              self.observed_RVs.append(var)
              if var.missing_values:
                  self.free_RVs += var.missing_values
                  self.missing_values += var.missing_values
                  for v in var.missing_values:
                      self.named_vars[v.name] = v
          else:
              with self:
                  var = ObservedRV(name=name, data=data,
                                   distribution=dist,
                                   total_size=total_size, model=self)
              self.observed_RVs.append(var)
              if var.missing_values:
                  self.free_RVs.append(var.missing_values)
                  self.missing_values.append(var.missing_values)
                  self.named_vars[var.missing_values.name] = var.missing_values

https://github.com/pymc-devs/pymc/blob/dd7caabe366f6ae7a9917f50154ea779ffb4b110/pymc3/model.py#L1250-L1274
      
          def as_tensor(data, name, model, distribution):
              dtype = distribution.dtype
              data = pandas_to_array(data).astype(dtype)
          
              if hasattr(data, 'mask'):
                  from .distributions import NoDistribution
                  testval = np.broadcast_to(distribution.default(), data.shape)[data.mask]
                  fakedist = NoDistribution.dist(shape=data.mask.sum(), dtype=dtype,
                                                 testval=testval, parent_dist=distribution)
                  missing_values = FreeRV(name=name + '_missing', distribution=fakedist,
                                          model=model)
                  constant = tt.as_tensor_variable(data.filled())
          
                  dataTensor = tt.set_subtensor(
                      constant[data.mask.nonzero()], missing_values)
                  dataTensor.missing_values = missing_values
                  return dataTensor
              elif sps.issparse(data):
                  data = sparse.basic.as_sparse(data, name=name)
                  data.missing_values = None

This file has been truncated.
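
A NumPy analogue of the masked branch of `as_tensor` above may make the mechanism clearer. This is only a sketch: `missing_values` is a plain array of hypothetical values standing in for the `FreeRV`, and the indexed assignment stands in for `tt.set_subtensor`:

```python
import numpy as np

data = np.ma.masked_values([1.0, -999.0, 0.5, -999.0], -999.0)

# Stand-in for the free random variable: one value per masked entry
missing_values = np.array([0.3, 1.7])  # hypothetical draws
assert missing_values.shape == (data.mask.sum(),)

# data.filled() replaces masked entries with the fill value ...
constant = data.filled()

# ... which are then overwritten by the free variable's values
data_tensor = constant.copy()
data_tensor[data.mask.nonzero()] = missing_values

print(data_tensor)
```

The observed entries stay fixed, while the masked positions are filled by the new variable, so this branch is only taken when the data object still has a `mask` attribute.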
