Dealing with missing data and custom distribution

I’m a PyMC3 newbie. I’m trying to create a Bayesian Rasch model, but I don’t have all the observations (i.e. an observation for every user, for every question). I understand that ordinarily, PyMC3 automatically imputes missing observations by sampling from the prior. In my case, however, this doesn’t seem to be working; I suspect it’s a consequence of my custom distribution.

In the example below I am deliberately missing an observation for charlie on item y. Remove charlie and everything works beautifully.

import pandas as pd
import numpy as np
import pymc3 as pm
import theano.tensor as tt

df = pd.DataFrame([["alice", "x", 1],
                   ["alice", "y", 1],
                   ["bob", "x", 1],
                   ["bob", "y", 0],
                   ["charlie", "x", 1]], columns=['user', 'question', 'correct'])

data = df.pivot(index='user', columns='question', values='correct')
observations = np.ma.masked_invalid(data.values)

with pm.Model() as model:

    ## Independent priors
    alpha = pm.Normal('User', mu = 0, sigma = 3, shape = (1, len(data)))
    gamma = pm.Normal('Task', mu = 0, sigma = 3, shape = (data.shape[1], 1))

    ## Log-Likelihood
    def logp(d):
        rasch = tt.nnet.sigmoid(alpha - (gamma - gamma.mean(0)))
        #P(x=1)
        v1 = tt.transpose(d) * tt.log(rasch)
        #P(x=0)
        v2 = tt.transpose((1-d)) * tt.log(1 - rasch)
        return v1 + v2

    ll = pm.DensityDist('ll', logp, observed = {'d': observations})
    trace = pm.sample(cores=1)
    trace = trace[250:]
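As a quick sanity check on the data preparation above (plain pandas/numpy, independent of the model): the pivot leaves a NaN for the missing charlie/y cell, and np.ma.masked_invalid then marks exactly that cell as masked, which is what triggers PyMC3’s imputation machinery.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame([["alice", "x", 1],
                   ["alice", "y", 1],
                   ["bob", "x", 1],
                   ["bob", "y", 0],
                   ["charlie", "x", 1]], columns=['user', 'question', 'correct'])

data = df.pivot(index='user', columns='question', values='correct')
# charlie never answered 'y', so the pivot fills that cell with NaN
print(data.loc['charlie', 'y'])          # nan

observations = np.ma.masked_invalid(data.values)
# exactly one cell (charlie/y) ends up masked
print(observations.mask.sum())           # 1
```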

What’s the correct way of solving this?

You would need to add a random method to your DensityDist. Otherwise pymc3 won’t know how to sample from the prior for that missing datapoint.

That did it! Thank you very much

Sorry, spoke too soon. Having defined random, I still get an error when working with incomplete data. Say, for example, I adjust the script as follows:

ll = pm.DensityDist('ll', logp, random=gamma.random, observed = {'d': observations})

When working with incomplete data, I still get the error:
ValueError: Mass matrix contains zeros on the diagonal.
The derivative of RV User.ravel()[2] is zero.

That doesn’t sound like an issue on the random side anymore, but with the observed data. Do you still see the problem even without the missing value?

What do you see if you write model.check_test_point()?

When all values are present it runs fine. If I remove the “charlie” row from the dataframe, everything is good, no errors.

With missing data:

User        -6.05
Task        -4.04
d_missing    0.00
ll          -4.16
Name: Log-probability of test_point, dtype: float64

Without missing data:

User   -4.04
Task   -4.04
ll     -2.77
Name: Log-probability of test_point, dtype: float64

Sorry, I don’t think the random has anything to do with it. The sampling of missing values will still be done with MCMC.

Is it possible that proposed values ever evaluate to -inf / nan in your custom logp?

It could be a failure in the DensityDist as well. I don’t know if it was ever tested with missing values.
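As an aside, here is one concrete way a logp like this can reach -inf (a plain-numpy sketch of the float64 behaviour, not the actual sampler trace): with wide priors, a large proposed logit saturates the sigmoid to exactly 1.0, so the log(1 - rasch) term becomes log(0).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# a moderate logit is fine: log(1 - p) stays finite
p_ok = sigmoid(5.0)
print(np.log(1 - p_ok))      # finite

# but a large proposed value saturates to exactly 1.0 in float64...
p = sigmoid(40.0)
print(p == 1.0)              # True
# ...so the "incorrect answer" term becomes log(0) = -inf
with np.errstate(divide='ignore'):
    print(np.log(1 - p))     # -inf
```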

OK, I’m still not sure, but I think we’re getting warm.

Hypothetically, if rasch evaluated to 0 or negative, I suppose I’d get log(0) = -inf.
Or, more likely, it’s where I transpose my original data containing NaNs and multiply by the log of rasch.
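The NaN suspicion is easy to confirm in plain numpy: any arithmetic involving a NaN observation propagates NaN into the corresponding log-likelihood entry rather than being silently dropped (hypothetical probabilities below, just to illustrate).

```python
import numpy as np

d = np.array([1.0, 0.0, np.nan])      # last observation missing
rasch = np.array([0.8, 0.6, 0.7])     # hypothetical P(correct) values

# same form as the custom logp: d*log(p) + (1-d)*log(1-p)
ll = d * np.log(rasch) + (1 - d) * np.log(1 - rasch)
print(ll)                  # third entry is nan
print(np.isnan(ll).any())  # True
```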

Sorry, I don’t think it’s either of those. In fact, I don’t think logp ever sees the missing values. It looks like the data is converted to a theano tensor with that partial row omitted.

The logp shouldn’t see the missing values; it should see whatever values are proposed by NUTS in place of the missing values.

This is the minimal example I can get to work:

with pm.Model() as model:

    ## Independent priors
    alpha = pm.Normal('User', mu=0, sigma=1)
    gamma = pm.Normal('Task', mu=0, sigma=1)

    ## Log-Likelihood
    def logp(d):
        return pm.Normal.dist(alpha, pm.math.exp(gamma)).logp(d)

    ll = pm.DensityDist('ll', logp, observed=observations)
    trace = pm.sample()

Passing the dictionary to observed seems to fail.
But removing it is not enough for your example. I still think it has something to do with your logp returning an invalid value or gradient at some point. The sampler always seems to fail at around 38% or so. We can also look at where it fails to get a better idea.

Very interesting. That makes me wonder if I’m overcomplicating this?

The current approach is very concise in terms of implementation, but perhaps I would be better off somehow modeling an observation-shaped Bernoulli.

If you can use a standard distribution that’s always better.

I believe I have solved my issue. In the spirit of good forum practice, for whomever it may concern in future, here is the solution:

import pandas as pd
import numpy as np
import pymc3 as pm
import theano.tensor as tt

df = pd.DataFrame([["alice", "x", 1],
                   ["alice", "y", 1],
                   ["bob", "x", 1],
                   ["bob", "y", 0],
                   ["charlie", "x", 1]], columns=['user', 'question', 'correct'])

data = df.pivot(index='user', columns='question', values='correct')
obs = tt._shared(np.ma.masked_invalid(data.values))

with pm.Model() as model:

    ## Independent priors
    alpha = pm.Normal('User', mu = 0, sigma = 3, shape = (1, len(data)))
    gamma = pm.Normal('Task', mu = 0, sigma = 3, shape = (data.shape[1], 1))

    ## Log-Likelihood
    def logp(obs):
        rasch = tt.nnet.sigmoid(alpha - (gamma - gamma.mean(0)))
        corrects = tt.switch(tt.isnan(obs), 0, obs)
        incorrects = tt.switch(tt.isnan(obs), 0, (1-obs))
        correct = tt.transpose(corrects) * tt.log(rasch)
        incorrect = tt.transpose(incorrects) * tt.log(1 - rasch)
        result = correct + incorrect
        return result

    ll = pm.DensityDist('ll', logp, observed = obs)
    trace = pm.sample(cores=1)
    trace = trace[250:]

I check for NaNs in my observations and remove those entries from the computation. An important additional detail is that I no longer pass the masked numpy array in directly: a masked numpy array is converted to a theano tensor with the NaNs as 0, so they are lost. Instead I create the theano tensor myself using theano’s _shared, then use tt.switch on tt.isnan to exclude those entries from the computation.
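The effect of the tt.switch guard can be reproduced with np.where (a plain-numpy sketch of the same idea, with a made-up uniform rasch probability and the transpose omitted for clarity): masked entries contribute exactly zero to the log-likelihood instead of propagating NaN.

```python
import numpy as np

obs = np.array([[1.0, 1.0],
                [1.0, 0.0],
                [1.0, np.nan]])        # charlie's 'y' answer is missing
rasch = np.full_like(obs, 0.6)         # hypothetical P(correct) per cell

# zero out the indicator wherever the observation is NaN
corrects = np.where(np.isnan(obs), 0, obs)
incorrects = np.where(np.isnan(obs), 0, 1 - obs)
ll = corrects * np.log(rasch) + incorrects * np.log(1 - rasch)

print(np.isnan(ll).any())   # False: no NaN propagates
print(ll[2, 1])             # 0.0: the missing cell contributes nothing
```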

Thank you very much ricardoV94 for helping me debug this :slight_smile: