Ignoring missing data or allowing posterior check on DensityDist

I have the model below:

with m:
    param_gist = pm.Beta('param_gist', alpha=1, beta=1, shape=num_subj)
    param_generalize = pm.Exponential('param_generalize', lam=1, shape=num_subj)
    param_memory = pm.Exponential('param_memory', lam=1, shape=num_subj)
    param = at.as_tensor_variable([param_gist, param_generalize, param_memory], dtype='float64')
    llh_new = pm.Deterministic('llh_new', logp(param))
    obs = pm.Bernoulli('obs', p=at.exp(llh_new), observed=response_)

Here, the function logp outputs a tensor of the predicted log-likelihood of choosing option A over option B, and response_ is the observed data. However, response_ has missing data encoded as np.nan before it was converted into an Aesara tensor, and pm.Bernoulli seems to fail because of those missing values. I thought about excluding the NaNs before modeling, but that would mess up the tensor structure, since different subjects have different amounts of missing data; if I exclude them, I can no longer put all subjects into the same tensor. I understand a workaround is pm.DensityDist, where I treat logp as a black-box function and exclude the NaNs inside that function through Aesara. However, by doing so I can't take advantage of pm.sample_posterior_predictive to simulate predicted data straight from the fitted traces. Is there a workaround, either to allow pm.Bernoulli to ignore NaNs, or to make pm.DensityDist compatible with pm.sample_posterior_predictive for simulating from the model?

Most of the time, the solution to ragged arrays is to use long-form data: have one observation per row and use indexing to match few-to-many variables (e.g., one parameter value per subject) to the observations.
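
For your model, a minimal sketch of that idea could look like the following. It assumes response_ is a (num_subj, num_trials) NumPy array with np.nan marking missing trials, and that your logp returns a matching (num_subj, num_trials) tensor of per-trial log-probabilities; subj_idx, trial_idx, and obs_flat are just names introduced here for illustration.

import numpy as np
import aesara.tensor as at
import pymc as pm

# Long form: keep one row per valid (subject, trial) observation
subj_idx, trial_idx = np.nonzero(~np.isnan(response_))
obs_flat = response_[subj_idx, trial_idx].astype("int64")

with pm.Model() as m:
    param_gist = pm.Beta("param_gist", alpha=1, beta=1, shape=num_subj)
    param_generalize = pm.Exponential("param_generalize", lam=1, shape=num_subj)
    param_memory = pm.Exponential("param_memory", lam=1, shape=num_subj)
    # Stack the three parameter vectors into a (3, num_subj) tensor,
    # playing the same role as the as_tensor_variable call in your model
    param = at.stack([param_gist, param_generalize, param_memory])

    # Index the predicted log-likelihoods with the same long-form indices,
    # so the Bernoulli only ever sees trials that were actually observed
    llh_new = pm.Deterministic("llh_new", logp(param)[subj_idx, trial_idx])
    obs = pm.Bernoulli("obs", p=at.exp(llh_new), observed=obs_flat)

Because obs stays a plain pm.Bernoulli, pm.sample_posterior_predictive works as usual, and the saved subj_idx/trial_idx arrays let you map each predicted value back to its subject and trial.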


I see. So instead of making a tensor of shape (nSubj, nTrials), we make it one-dimensional, with one entry per observed trial (dropping the missing ones), so its length is at most nSubj * nTrials. Then, once the model is fitted, we use other index arrays to map the results back to the corresponding subject and trial numbers, something like the sketch below. Is there a notebook tutorial somewhere about this? I feel like a lot of people would benefit from it! Thanks again for answering the question.
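
For instance, if I save the index arrays used to flatten the data (subj_idx and trial_idx from the sketch above, with num_trials being the original number of trials per subject and "obs" the observed variable), I imagine something like:

import numpy as np
import pymc as pm

with m:
    idata = pm.sample()
    ppc = pm.sample_posterior_predictive(idata)

# Posterior-predictive draws come back in the same flat order as the
# observations, so the saved indices scatter them into a
# (chain, draw, num_subj, num_trials) array, leaving NaN wherever a
# trial was missing in the data
ppc_flat = ppc.posterior_predictive["obs"].values
full = np.full(ppc_flat.shape[:2] + (num_subj, num_trials), np.nan)
full[:, :, subj_idx, trial_idx] = ppc_flat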

There are plenty, though they often assume that your data is in long format rather than instruct/illustrate the value of doing so (see here for a recent discussion of this). This is probably the canonical example, but there are plenty of others (e.g., this one).
