Dealing with missing data and custom distribution

I believe I have solved my issue. In the spirit of good forum practice, for whomever it may concern in the future, here is the solution:

import pandas as pd
import numpy as np
import pymc3 as pm
import theano.tensor as tt

df = pd.DataFrame([["alice", "x", 1],
                   ["alice", "y", 1],
                   ["bob", "x", 1],
                   ["bob", "y", 0],
                   ["charlie", "x", 1]],
                  columns=['user', 'question', 'correct'])

# Pivot to a users-by-questions matrix; unanswered cells become NaN
data = df.pivot(index='user', columns='question', values='correct')
obs = tt._shared(np.ma.masked_invalid(data.values))

with pm.Model() as model:

    ## Independent priors
    alpha = pm.Normal('User', mu=0, sigma=3, shape=(1, len(data)))      # per-user ability
    gamma = pm.Normal('Task', mu=0, sigma=3, shape=(data.shape[1], 1))  # per-question difficulty

    ## Log-likelihood
    def logp(obs):
        # Rasch model: P(correct) = sigmoid(ability - centred difficulty)
        rasch = tt.nnet.sigmoid(alpha - (gamma - gamma.mean(0)))
        # Zero out NaN cells so missing answers contribute nothing
        corrects = tt.switch(tt.isnan(obs), 0, obs)
        incorrects = tt.switch(tt.isnan(obs), 0, 1 - obs)
        correct = tt.transpose(corrects) * tt.log(rasch)
        incorrect = tt.transpose(incorrects) * tt.log(1 - rasch)
        return correct + incorrect

    ll = pm.DensityDist('ll', logp, observed=obs)
    trace = pm.sample(cores=1)
    trace = trace[250:]  # drop additional early draws as burn-in

I check for NaNs in my observations and remove those entries from the computation. An important additional detail: I no longer pass the masked NumPy array to observed directly. A masked NumPy array gets converted to a Theano tensor with its NaNs replaced by 0, so the missingness information is lost. Instead I create the Theano tensor myself with Theano's _shared, then use tt.switch on it to exclude the missing entries from the computation.
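The switch-based masking trick can be illustrated in plain NumPy, without Theano or PyMC3. This is a hypothetical toy example (the obs matrix mirrors the pivoted data above, where charlie never answered "y"; the flat probability p is a stand-in for the sigmoid output), showing how zeroing out NaN cells keeps them from poisoning the log-likelihood:

```python
import numpy as np

# Toy users-by-questions matrix: charlie's answer to "y" is missing (NaN)
obs = np.array([[1.0, 1.0],
                [1.0, 0.0],
                [1.0, np.nan]])

p = np.full(obs.shape, 0.8)  # stand-in for the model's sigmoid probabilities

# Naive log-likelihood: NaN in obs propagates into the result
naive = obs * np.log(p) + (1 - obs) * np.log(1 - p)

# Same idea as the tt.switch calls: replace NaN cells with 0 so they
# contribute exactly 0 to the log-likelihood
corrects = np.where(np.isnan(obs), 0.0, obs)
incorrects = np.where(np.isnan(obs), 0.0, 1.0 - obs)
ll = corrects * np.log(p) + incorrects * np.log(1.0 - p)
```

The missing cell ends up as 0 * log(p) + 0 * log(1 - p) = 0, so summing ll gives the log-likelihood of the observed entries only, while naive would be NaN.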

Thank you very much ricardoV94 for helping me debug this :slight_smile:
