Trying to impute missing categorical data

I’m trying to use a pymc model to impute missing categorical values in my dataset.

When I set the observed data in a Categorical distribution to a masked array:

import numpy as np
import pymc as pm

data = np.ma.masked_equal([1, 1, 0, 0, 2, -1, -1], -1)
with pm.Model():
    idx = pm.Categorical("idx", p=[0.1, 0.2, 0.7], observed=data)
    print(idx.dtype)

this prints ‘float64’. So I can’t use the output as an index in my graph.
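To see why the float64 dtype is a problem, here's a minimal NumPy sketch (independent of PyMC, with made-up values) showing that a float-valued index is rejected:

```python
import numpy as np

coeffs = np.array([10.0, 20.0, 30.0])

# An integer index works fine:
coeffs[np.int64(1)]  # -> 20.0

# A float index is rejected, which is why a float64 categorical
# can't be used for indexing downstream in the graph:
try:
    coeffs[np.float64(1.0)]
except IndexError as e:
    print(e)  # only integers, slices, ... are valid indices
```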

If the observed data is not a masked array:

data = [1, 1, 0, 0, 2, 0, 0]
with pm.Model():
    idx = pm.Categorical("idx", p=[0.1, 0.2, 0.7], observed=data)
    print(idx.dtype)

then this prints ‘int64’, as expected.

Any ideas or workarounds? Looks like it might be a bug to be honest.

What version of PyMC are you using?

pm.__version__

‘4.3.0’

and I get the same issue in 4.4.0 and 5.0.1

Looks like a bug. As a short-term solution you can cast the result variable:

idx = idx.astype("int64")
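For intuition, the same cast works on plain NumPy arrays; this is just a sketch with made-up float64 samples standing in for the masked-array output, not actual PyMC draws:

```python
import numpy as np

# Hypothetical float64 draws, standing in for the masked-array output
draws = np.array([1.0, 1.0, 0.0, 0.0, 2.0, 2.0, 1.0])

idx = draws.astype("int64")                  # cast back to an integer dtype
values = np.array([10.0, 20.0, 30.0])[idx]   # indexing now works
print(values)  # [20. 20. 10. 10. 30. 30. 20.]
```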

Do you mind opening an issue in the GitHub repo?

I believe this is a bug related to how the imputed values are re-combined with the observed values, and opened an issue here.

In the meantime, you can work around this using at.cast to manually convert the datatype of idx, as in:

import aesara.tensor as at  # use `import pytensor.tensor as at` on PyMC >= 5

data = np.ma.masked_equal([1, 1, 0, 0, 2, -1, -1], -1)
with pm.Model():
    idx = pm.Categorical("idx", p=[0.1, 0.2, 0.7], observed=data)
    idx = at.cast(idx, 'int64')

This workaround doesn’t seem to work for me.

When I cast to int64, the sampled values aren’t in the range of my categorical variable and hence I get index errors:

data = np.ma.masked_equal([1, 1, 0, 0, 2, -1, -1], -1)
with pm.Model():
    idx = pm.Categorical("idx", p=[0.1, 0.2, 0.7], observed=data)
    idx = at.cast(idx, 'int64')
    num = pm.Normal("num", 0, 1, size=3)[idx]
    pm.sample()

Output:

    rval = inputs[0].__getitem__(tuple(inputs[1:]))
IndexError: index 5 is out of bounds for axis 0 with size 3

I ran into this as well when trying to track down the first bug; I was hoping it was just something wrong with my system. I think there’s a glitch that sets the number of classes in the categorical to the length of the observed data (rather than inferring it from the length of p).
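If that hypothesis is right, the imputed draws would come from a categorical over len(data) classes instead of len(p). A NumPy sketch of the two behaviors (an illustration of the hypothesis, not PyMC's actual code):

```python
import numpy as np

rng = np.random.default_rng(42)
p = np.array([0.1, 0.2, 0.7])
data_len = 7  # length of the observed data in the examples above

# Correct behavior: draws stay in {0, 1, 2}, the support of p
ok = rng.choice(len(p), size=1000, p=p)
assert ok.max() < len(p)

# Hypothesized glitch: number of classes taken from the data length,
# so draws like 5 appear and overflow an index of size len(p)
buggy = rng.integers(0, data_len, size=1000)
assert buggy.max() >= len(p)
```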

Could you test that:

  1. There’s no error if the length of the observed data (still with missing values) is exactly equal to the length of p, and;
  2. If (1) works successfully, that if you sample from data with fewer observations than the number of classes, the largest class you get in your samples is the length of the data, not the length of p?

If I’m right I think I can track the bug down pretty quickly.

I opened a separate issue for this bug here:
