Trying to impute missing categorical data

I’m trying to use a pymc model to impute missing categorical values in my dataset.

When I set the observed data in a Categorical distribution to a masked array:

import numpy as np
import pymc as pm

data = np.ma.masked_equal([1, 1, 0, 0, 2, -1, -1], -1)
with pm.Model():
    idx = pm.Categorical("idx", p=[0.1, 0.2, 0.7], observed=data)
    print(idx.dtype)

this prints ‘float64’. So I can’t use the output as an index in my graph.
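To see why the float64 dtype is a problem, here's a minimal NumPy sketch (independent of PyMC, with made-up values) showing that a float-valued index is rejected:

```python
import numpy as np

coeffs = np.array([10.0, 20.0, 30.0])

# An integer index works fine:
coeffs[np.int64(1)]  # -> 20.0

# A float index is rejected, which is why a float64 categorical
# can't be used for indexing downstream in the graph:
try:
    coeffs[np.float64(1.0)]
except IndexError as e:
    print(e)  # only integers, slices, ... are valid indices
```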

If the observed data is not a masked array:

data = [1, 1, 0, 0, 2, 0, 0]
with pm.Model():
    idx = pm.Categorical("idx", p=[0.1, 0.2, 0.7], observed=data)
    print(idx.dtype)

then this prints ‘int64’, as expected.

Any ideas or workarounds? Looks like it might be a bug to be honest.

What version of PyMC are you using?

pm.__version__

‘4.3.0’

and I get the same issue in 4.4.0 and 5.0.1

Looks like a bug. As a short-term solution you can cast the result variable:

idx = idx.astype("int64")
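For intuition, the same cast works on plain NumPy arrays; this is just a sketch with made-up float64 samples standing in for the masked-array output, not actual PyMC draws:

```python
import numpy as np

# Hypothetical float64 draws, standing in for the masked-array output
draws = np.array([1.0, 1.0, 0.0, 0.0, 2.0, 2.0, 1.0])

idx = draws.astype("int64")                  # cast back to an integer dtype
values = np.array([10.0, 20.0, 30.0])[idx]   # indexing now works
print(values)  # [20. 20. 10. 10. 30. 30. 20.]
```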

Do you mind opening an issue in the GitHub repo?

I believe this is a bug related to how the imputed values are re-combined with the observed values, and opened an issue here.

In the meantime, you can work around this using at.cast to manually convert the datatype of idx, as in:

import aesara.tensor as at  # use `import pytensor.tensor as at` on PyMC >= 5

data = np.ma.masked_equal([1, 1, 0, 0, 2, -1, -1], -1)
with pm.Model():
    idx = pm.Categorical("idx", p=[0.1, 0.2, 0.7], observed=data)
    idx = at.cast(idx, 'int64')

This workaround doesn’t seem to work for me.

When I cast to int64, the sampled values aren’t in the range of my categorical variable and hence I get index errors:

data = np.ma.masked_equal([1, 1, 0, 0, 2, -1, -1], -1)
with pm.Model():
    idx = pm.Categorical("idx", p=[0.1, 0.2, 0.7], observed=data)
    idx = at.cast(idx, 'int64')
    num = pm.Normal("num", 0, 1, size=3)[idx]
    pm.sample()

Output:

    rval = inputs[0].__getitem__(tuple(inputs[1:]))
IndexError: index 5 is out of bounds for axis 0 with size 3

I ran into this as well when trying to track down the first bug; I was hoping it was just something wrong with my system. I think there’s a glitch that sets the number of classes in the categorical to the length of the observed data (rather than inferring it from the length of p).
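If that hypothesis is right, the imputed draws would come from a categorical over len(data) classes instead of len(p). A NumPy sketch of the two behaviors (an illustration of the hypothesis, not PyMC's actual code):

```python
import numpy as np

rng = np.random.default_rng(42)
p = np.array([0.1, 0.2, 0.7])
data_len = 7  # length of the observed data in the examples above

# Correct behavior: draws stay in {0, 1, 2}, the support of p
ok = rng.choice(len(p), size=1000, p=p)
assert ok.max() < len(p)

# Hypothesized glitch: number of classes taken from the data length,
# so draws like 5 appear and overflow an index of size len(p)
buggy = rng.integers(0, data_len, size=1000)
assert buggy.max() >= len(p)
```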

Could you test that:

  1. There’s no error if the length of the observed data (still with missing values) is exactly equal to the length of p, and;
  2. If (1) works successfully, that if you sample from data with fewer observations than the number of classes, the largest class you get in your samples is the length of the data, not the length of p?

If I’m right I think I can track the bug down pretty quickly.

I opened a separate issue for this bug here:
