Problem with imputation of missing data for a Bernoulli distribution

theano

#1

I’m trying to test out some simple imputation of missing observed values with a Bernoulli distribution and hit a theano problem, and was wondering if anyone had any ideas about solving it, or if it’s a theano bug. I’m using PyMC3 version 3.6 and theano version 1.0.3. A simple version of my code is as follows:

import numpy as np
import pymc3 as pm
from scipy.stats import bernoulli

# set "true" probability of rain
true_rain = 0.41

# set number of previous "observations"
nobs = 1000

# set the observations
has_rained = bernoulli.rvs(true_rain, size=nobs)

# try substituting in a missing sample
has_rained[20] = -1  # add missing samples as -1
has_rained = np.ma.masked_values(has_rained, value=-1)  # create masked array

with pm.Model() as model:
    prain = pm.Uniform('prain', 0.0, 1.0)  # prior on probability of rain

    # likelihood of the observed rain outcomes given prain
    rain = pm.Bernoulli('rain', p=prain, observed=has_rained)

    trace = pm.sample(2000, tune=6000, discard_tuned_samples=True, chains=2)

The final lines of the error message that this produces are:

~/.conda/envs/survival/lib/python3.6/site-packages/theano/tensor/type.py in filter_variable(self, other, allow_convert)
    232             dict(othertype=other.type,
    233                  other=other,
--> 234                  self=self))
    235 
    236     def value_validity_msg(self, a):

TypeError: Cannot convert Type TensorType(int64, vector) (of Variable 
rain_missing_shared__) into Type TensorType(int64, (True,)). You can try to manually 
convert rain_missing_shared__ into a TensorType(int64, (True,)).

I can only assume that this is failing due to an issue with the Bernoulli distribution's use of integer or boolean types, as this isn’t a problem that is noted in this example.

I also see the same error if trying to pass a theano shared variable, created from a numpy array of ones and zeros, as observations to a Bernoulli distribution.


#2

Yes, there is an issue with masking only 1 value: https://github.com/pymc-devs/pymc3/issues/3122

Unfortunately, we don’t have a fix yet…
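
A quick way to check whether an observed array falls into the single-masked-value case is to count its masked entries before building the model. This is only a numpy sketch: it mirrors the data setup from the first post but uses a seeded `RandomState` instead of `scipy.stats.bernoulli` (an assumption made for reproducibility), and it does not touch PyMC3 or theano at all.

```python
import numpy as np

# Simulate 0/1 observations (seeded generator is an assumption, for repeatability)
rng = np.random.RandomState(0)
has_rained = rng.binomial(1, 0.41, size=1000)

# Mark one observation as missing, as in the original post
has_rained[20] = -1
masked = np.ma.masked_values(has_rained, value=-1)

# Count how many entries are masked
n_missing = int(np.ma.count_masked(masked))
print(n_missing)
```

If the count is exactly 1, the array is in the situation described in the linked issue.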


#3

Thanks, I’ll have a think about whether this might be a problem for me and if I have any ideas for a fix I’ll be sure to post them on the open issue (although with my very, very limited theano knowledge I doubt I’ll be much help!)


#4

I’ve just posted a potential fix for this here. It just involves adding the lines

if isinstance(var.tag.test_value, np.ndarray):
    if len(var.tag.test_value) == 1:
        shared.type = theano.tensor.TensorType(var.dtype, (True,))

in model.py after this line.
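
To make the condition in those lines concrete, here is a theano-free sketch of the same check. `requested_pattern` is a hypothetical helper written only for illustration (it is not part of PyMC3 or theano); it returns the broadcastable pattern the patch would ask for, or `None` where the patch leaves the shared variable's type alone. It assumes 1-D test values, as in the example from the first post.

```python
import numpy as np

def requested_pattern(test_value):
    """Hypothetical helper mirroring the condition in the proposed patch."""
    if isinstance(test_value, np.ndarray):
        if len(test_value) == 1:
            # A single missing value: the patch requests a broadcastable dimension
            return (True,)
    # Otherwise the patch does not change the type
    return None

one_missing = requested_pattern(np.array([0]))
two_missing = requested_pattern(np.array([0, 1]))
print(one_missing, two_missing)
```

In theano, a `True` entry in the pattern marks a dimension as broadcastable (length 1), which matches the `TensorType(int64, (True,))` named in the error message.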


#5

This is fixed in PyMC3 with this PR. This is not in a release yet though.