Dealing with missing data and custom distribution

I’m a PyMC3 newbie. I’m trying to create a Bayesian Rasch model, but I don’t have all the observations (i.e. an observation for every user, for every question). I understand that ordinarily, PyMC3 automatically imputes missing observations by sampling from the prior. In my case, however, this doesn’t seem to be working; I suspect it’s a consequence of my custom distribution.

In the example below I am deliberately missing an observation for charlie on item y. Remove charlie and everything works beautifully.

import pandas as pd
import numpy as np
import pymc3 as pm
import theano.tensor as tt

df = pd.DataFrame([["alice", "x", 1],
                   ["alice", "y", 1],
                   ["bob", "x", 1],
                   ["bob", "y", 0],
                   ["charlie", "x", 1]], columns=['user', 'question', 'correct'])

data = df.pivot(index='user', columns='question', values='correct')
observations = np.ma.masked_invalid(data.values)

with pm.Model() as model:

    ## Independent priors
    alpha = pm.Normal('User', mu = 0, sigma = 3, shape = (1, len(data)))
    gamma = pm.Normal('Task', mu = 0, sigma = 3, shape = (data.shape[1], 1))

    ## Log-Likelihood
    def logp(d):
        rasch = tt.nnet.sigmoid(alpha - (gamma - gamma.mean(0)))
        #P(x=1)
        v1 = tt.transpose(d) * tt.log(rasch)
        #P(x=0)
        v2 = tt.transpose((1-d)) * tt.log(1 - rasch)
        return v1 + v2

    ll = pm.DensityDist('ll', logp, observed = {'d': observations})
    trace = pm.sample(cores=1)
    trace = trace[250:]
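As a quick sanity check on the data preparation above (plain pandas/numpy, independent of the model): the pivot leaves a NaN for the missing charlie/y cell, and np.ma.masked_invalid then marks exactly that cell as masked, which is what triggers PyMC3’s imputation machinery.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame([["alice", "x", 1],
                   ["alice", "y", 1],
                   ["bob", "x", 1],
                   ["bob", "y", 0],
                   ["charlie", "x", 1]], columns=['user', 'question', 'correct'])

data = df.pivot(index='user', columns='question', values='correct')
# charlie never answered 'y', so the pivot fills that cell with NaN
print(data.loc['charlie', 'y'])          # nan

observations = np.ma.masked_invalid(data.values)
# exactly one cell (charlie/y) ends up masked
print(observations.mask.sum())           # 1
```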

What’s the correct way of solving this?

You would need to add a random method to your DensityDist. Otherwise pymc3 won’t know how to sample from the prior for that missing datapoint.

That did it! Thank you very much

Sorry, spoke too soon. Having defined random, I still get an error when working with incomplete data. Say, for example, I adjust the script as follows:

ll = pm.DensityDist('ll', logp, random=gamma.random, observed = {'d': observations})

When working with incomplete data, I still get the error:
ValueError: Mass matrix contains zeros on the diagonal.
The derivative of RV User.ravel()[2] is zero.

That doesn’t sound like an issue on the random side anymore, but with the observed data. Do you still see the problem even without the missing value?

What do you see if you write model.check_test_point()?

When all values are present it runs fine. If I remove the “charlie” row from the dataframe, everything is good, no errors.

With missing data:

User        -6.05
Task        -4.04
d_missing    0.00
ll          -4.16
Name: Log-probability of test_point, dtype: float64

Without missing data:

User   -4.04
Task   -4.04
ll     -2.77
Name: Log-probability of test_point, dtype: float64

Sorry, I don’t think the random has anything to do with it. The sampling of missing values will still be done with MCMC.

Is it possible that proposed values ever evaluate to -inf / nan in your custom logp?

It could be a failure in the DensityDist as well. I don’t know if it was ever tested with missing values.
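As an aside, here is one concrete way a logp like this can reach -inf (a plain-numpy sketch of the float64 behaviour, not the actual sampler trace): with wide priors, a large proposed logit saturates the sigmoid to exactly 1.0, so the log(1 - rasch) term becomes log(0).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# a moderate logit is fine: log(1 - p) stays finite
p_ok = sigmoid(5.0)
print(np.log(1 - p_ok))      # finite

# but a large proposed value saturates to exactly 1.0 in float64...
p = sigmoid(40.0)
print(p == 1.0)              # True
# ...so the "incorrect answer" term becomes log(0) = -inf
with np.errstate(divide='ignore'):
    print(np.log(1 - p))     # -inf
```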

OK, I’m still not sure, but I think we’re getting warm.

Hypothetically, if rasch evaluated to 0 or negative, I suppose I’d get log(0) = -inf.
Or, more likely, it’s where I transpose my original data containing NaNs and multiply by the log of rasch.
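The NaN suspicion is easy to confirm in plain numpy: any arithmetic involving a NaN observation propagates NaN into the corresponding log-likelihood entry rather than being silently dropped (hypothetical probabilities below, just to illustrate).

```python
import numpy as np

d = np.array([1.0, 0.0, np.nan])      # last observation missing
rasch = np.array([0.8, 0.6, 0.7])     # hypothetical P(correct) values

# same form as the custom logp: d*log(p) + (1-d)*log(1-p)
ll = d * np.log(rasch) + (1 - d) * np.log(1 - rasch)
print(ll)                  # third entry is nan
print(np.isnan(ll).any())  # True
```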

Sorry, I don’t think it’s either of those. In fact, I don’t think logp ever sees the missing values. It looks like the data is converted to a theano tensor with that partial row omitted.

The logp shouldn’t see the missing values; it should see whatever values are proposed by NUTS in place of the missing values.

This is the minimal example I can get to work:

with pm.Model() as model:

    ## Independent priors
    alpha = pm.Normal('User', mu=0, sigma=1)
    gamma = pm.Normal('Task', mu=0, sigma=1)

    ## Log-Likelihood
    def logp(d):
        return pm.Normal.dist(alpha, pm.math.exp(gamma)).logp(d)

    ll = pm.DensityDist('ll', logp, observed=observations)
    trace = pm.sample()

Passing the dictionary to observed seems to fail.
But removing it is not enough for your example. I still think it has something to do with your logp returning an invalid value or gradient at some point. The sampler always seems to fail at around 38% or so. We can also look at where it fails to get a better idea.

Very interesting. That makes me wonder if I’m overcomplicating this?

The current approach is very concise in terms of implementation, but perhaps I would be better off somehow modeling an observation-shaped Bernoulli.

If you can use a standard distribution that’s always better.

I believe I have solved my issue. In the spirit of good forum practice, for whomever it may concern in future, here is the solution:

import pandas as pd
import numpy as np
import pymc3 as pm
import theano.tensor as tt

df = pd.DataFrame([["alice", "x", 1],
                   ["alice", "y", 1],
                   ["bob", "x", 1],
                   ["bob", "y", 0],
                   ["charlie", "x", 1]], columns=['user', 'question', 'correct'])

data = df.pivot(index='user', columns='question', values='correct')
obs = tt._shared(np.ma.masked_invalid(data.values))

with pm.Model() as model:

    ## Independent priors
    alpha = pm.Normal('User', mu = 0, sigma = 3, shape = (1, len(data)))
    gamma = pm.Normal('Task', mu = 0, sigma = 3, shape = (data.shape[1], 1))

    ## Log-Likelihood
    def logp(obs):
        rasch = tt.nnet.sigmoid(alpha - (gamma - gamma.mean(0)))
        corrects = tt.switch(tt.isnan(obs), 0, obs)
        incorrects = tt.switch(tt.isnan(obs), 0, (1-obs))
        correct = tt.transpose(corrects) * tt.log(rasch)
        incorrect = tt.transpose(incorrects) * tt.log(1 - rasch)
        result = correct + incorrect
        return result

    ll = pm.DensityDist('ll', logp, observed = obs)
    trace = pm.sample(cores=1)
    trace = trace[250:]

I check for NaNs in my observations and remove those entries from the computation. An important additional detail is that I no longer pass the masked numpy array in directly: a masked numpy array is converted to a theano tensor with the NaNs as 0, so they are lost. Instead I create the theano tensor myself using theano’s _shared, then use tt.switch on tt.isnan to exclude those entries from the computation.
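The effect of the tt.switch guard can be reproduced with np.where (a plain-numpy sketch of the same idea, with a made-up uniform rasch probability and the transpose omitted for clarity): masked entries contribute exactly zero to the log-likelihood instead of propagating NaN.

```python
import numpy as np

obs = np.array([[1.0, 1.0],
                [1.0, 0.0],
                [1.0, np.nan]])        # charlie's 'y' answer is missing
rasch = np.full_like(obs, 0.6)         # hypothetical P(correct) per cell

# zero out the indicator wherever the observation is NaN
corrects = np.where(np.isnan(obs), 0, obs)
incorrects = np.where(np.isnan(obs), 0, 1 - obs)
ll = corrects * np.log(rasch) + incorrects * np.log(1 - rasch)

print(np.isnan(ll).any())   # False: no NaN propagates
print(ll[2, 1])             # 0.0: the missing cell contributes nothing
```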

Thank you very much ricardoV94 for helping me debug this :slight_smile: