Why are MaskedArrays no longer supported, and is there a suggested workaround?

In old v4 models I used MaskedArrays to deal with missing data. Running the same model on v5.5.0 now raises an error saying MaskedArrays are no longer supported. Is there an explanation for this change?

/opt/conda/envs/pm5/lib/python3.11/site-packages/numpy/ma/core.py:467: RuntimeWarning: invalid value encountered in cast
  fill_value = np.array(fill_value, copy=False, dtype=ndtype)
/root/Codes/Bagger/development/../../Bagger/bagger/models_pm5.py:170: RuntimeWarning: invalid value encountered in cast
  TD_observed = pm.MutableData('TD_observed', dataDict['TD'].astype(int), dims=('sample', 'k4'))
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
Cell In[4], line 1
----> 1 m = SNV_depth_GLM_1bc_v46(bcDict['trainData'], 8)

File ~/Codes/Bagger/development/../../Bagger/bagger/models_pm5.py:170, in SNV_depth_GLM_1bc_v46(dataDict, mu_pattern_init, mu_pattern_lower, adDis)
    167 m.add_coord('k4', dataDict['dataDF']['k4Str'], mutable=True)
    168 m.add_coord('sample', np.arange(dataDict['nSample']), mutable=True)
--> 170 TD_observed = pm.MutableData('TD_observed', dataDict['TD'].astype(int), dims=('sample', 'k4'))
    171 AD_observed = pm.MutableData('AD_observed', dataDict['AD'].astype(int), dims=('sample', 'k4'))
    172 context_idx = pm.MutableData('context_idx', dataDict['dataDF']['context'].values, dims='k4')

File /opt/conda/envs/pm5/lib/python3.11/site-packages/pymc/data.py:314, in MutableData(name, value, dims, coords, export_index_as_coords, infer_dims_and_coords, **kwargs)
    308     infer_dims_and_coords = export_index_as_coords
    309     warnings.warn(
    310         "Deprecation warning: 'export_index_as_coords; is deprecated and will be removed in future versions. Please use 'infer_dims_and_coords' instead.",
    311         DeprecationWarning,
    312     )
--> 314 var = Data(
    315     name,
    316     value,
    317     dims=dims,
    318     coords=coords,
    319     infer_dims_and_coords=infer_dims_and_coords,
    320     mutable=True,
    321     **kwargs,
    322 )
    323 return cast(SharedVariable, var)

File /opt/conda/envs/pm5/lib/python3.11/site-packages/pymc/data.py:437, in Data(name, value, dims, coords, export_index_as_coords, infer_dims_and_coords, mutable, **kwargs)
    435     mutable = False
    436 if mutable:
--> 437     x = pytensor.shared(arr, name, **kwargs)
    438 else:
    439     x = pt.as_tensor_variable(arr, name, **kwargs)

File /opt/conda/envs/pm5/lib/python3.11/site-packages/pytensor/compile/sharedvalue.py:202, in shared(value, name, strict, allow_downcast, **kwargs)
    199     raise TypeError("Shared variable values can not be symbolic.")
    201 try:
--> 202     var = shared_constructor(
    203         value,
    204         name=name,
    205         strict=strict,
    206         allow_downcast=allow_downcast,
    207         **kwargs,
    208     )
    209     add_tag_trace(var)
    210     return var

File /opt/conda/envs/pm5/lib/python3.11/functools.py:909, in singledispatch.<locals>.wrapper(*args, **kw)
    905 if not args:
    906     raise TypeError(f'{funcname} requires at least '
    907                     '1 positional argument')
--> 909 return dispatch(args[0].__class__)(*args, **kw)

File /opt/conda/envs/pm5/lib/python3.11/site-packages/pytensor/tensor/sharedvar.py:69, in tensor_constructor(value, name, strict, allow_downcast, borrow, shape, target, broadcastable)
     59 r"""`SharedVariable` constructor for `TensorType`\s.
     60 
     61 Notes
   (...)
     66 
     67 """
     68 if isinstance(value, np.ma.MaskedArray):
---> 69     raise NotImplementedError("MaskedArrays are not supported")
     71 if broadcastable is not None:
     72     warnings.warn(
     73         "The `broadcastable` keyword is deprecated; use `shape`.",
     74         DeprecationWarning,
     75     )

NotImplementedError: MaskedArrays are not supported

They were never properly supported and could result in buggy behavior. You can use ConstantData with nan values, or pass them directly to observed to trigger automatic imputation.
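For example, a minimal sketch of the nan-based imputation (the data here are made up): convert the MaskedArray to a plain float array with np.nan marking the missing entries, and pass it to observed so PyMC imputes them automatically.

import numpy as np
import pymc as pm

masked = np.ma.MaskedArray([1.0, 2.0, 3.0], mask=[False, True, False])
data = masked.filled(np.nan)  # plain float array, nan where masked

with pm.Model() as m:
    mu = pm.Normal("mu", 0, 1)
    # nan entries in `observed` trigger automatic imputation: PyMC adds
    # a free RV for the missing values and samples it alongside mu
    pm.Normal("y", mu=mu, sigma=1, observed=data)
    idata = pm.sample()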


A follow-up question in this case: what if one needs to predict on hold-out data, given that this is why MutableData was used in the first place?

Out-of-sample predictions with imputed variables are a bit tricky and depend on the specific model structure. If you can share more details about your model and how you want to do out-of-sample predictions, that would be helpful.

Dang, is this still the case?

I am having to impute missing values for my test X variable, and would like to be able to assess the model fit by generating predictions on test data.

Is there more info I can give to help answer whether this is possible or not?

Can you share some example of what you need to do?

Certainly,

I have this model (that I’m still figuring out):

with pm.Model() as model:
    
    # Priors
    home_intercept = pm.Poisson('home_intercept', mu=2)
    away_intercept = pm.Poisson('away_intercept', mu=2)
    home_coeff = pm.Normal('home_coeff', mu=0, sigma=1, shape=home_train.shape[1])
    away_coeff = pm.Normal('away_coeff', mu=0, sigma=1, shape=away_train.shape[1])
   
    home_imp = pm.Normal("home_imp", mu=0, sigma=3, observed = home_train)
    away_imp = pm.Normal("away_imp", mu=0, sigma=3, observed = away_train)
    
    # Model error
    eps = pm.HalfCauchy('eps', beta=1)
   
    # Likelihood function
    mu = (home_intercept + pm.math.dot(home_coeff, home_imp.T)) - \
                    (away_intercept + pm.math.dot(away_coeff, away_imp.T))
    
                    
    likelihood = pm.Normal('y', mu=mu, sigma=eps, observed = y_train)
   
    # Inference
    trace = pm.sample(draws=1000, tune=1000, chains=4)

And I train it with x_train, y_train, but I would like to be able to then make predictions with some data I have stored as x_test to investigate how well it fits y_test.

Is the default use of set_data not sufficient?

https://www.pymc.io/projects/docs/en/latest/api/model/generated/pymc.model.core.set_data.html
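For reference, a sketch of that pattern (x_train, y_train, and x_test are stand-ins for your data): wrap the predictors in pm.MutableData and tie the likelihood's shape to the container, so set_data can swap in the hold-out inputs before sample_posterior_predictive.

import numpy as np
import pymc as pm

x_train = np.random.rand(50)
y_train = 1 + 2 * x_train + np.random.normal(0, 0.1, 50)
x_test = np.random.rand(10)

with pm.Model() as model:
    x = pm.MutableData("x", x_train)
    a = pm.Normal("a", 0, 1)
    w = pm.Normal("w", 0, 1)
    sigma = pm.HalfNormal("sigma", 1)
    # shape=x.shape lets the likelihood resize when x is swapped out
    pm.Normal("y", mu=a + w * x, sigma=sigma, observed=y_train, shape=x.shape)
    idata = pm.sample()

with model:
    pm.set_data({"x": x_test})  # swap in the hold-out predictors
    ppc = pm.sample_posterior_predictive(idata)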

Hey @ricardoV94, I am facing a similar issue. I have a model where my observed outcome variable has missing data, and I would like to automatically impute the data using the sampling distribution. My PyMC model looks like this:

with pm.Model(coords=coords) as STS:

    time = pm.Data("time", train_time_scaled)
    
    ft = pm.Data("fourier_series", train_fourier_series.to_numpy())

    k0 = pm.Normal("k0", 0, 1)
    m = pm.Normal("m", 0, 1)

    b = pm.Normal("b", 0, 1, dims="fourier_coef")
    seasonal = pm.Deterministic("seasonal", pm.math.dot(ft, b))

    mu = np.linspace(0.1, 0.8, len(coords['changepoints']))
    s = pm.Data("s", mu, dims="changepoints")

    delta = pm.Laplace("delta", mu=0, b=0.2, dims="changepoints")
    a = pm.Deterministic("a", pm.math.where(time[:, None] < s[None, :], np.array(0.), np.array(1.)))

    gamma = pm.Deterministic("gamma", -s*delta, dims="changepoints")
    func = (k0 + pm.math.dot(a, delta))*time + (m + pm.math.dot(a, gamma))

    trend = pm.Deterministic("trend", func)

    error = pm.HalfNormal("error", sigma=1)
    pm.Normal(
        "likelihood",
        mu=trend + seasonal,
        sigma=error,
        observed=train_xs_scaled
    )

I am passing train_xs_scaled directly to observed because it has missing values, so I am not able to use pm.Data().

Now I want to produce OOS forecasts using this model. For that I have:

with STS:
    pm.set_data(
        new_data=dict(
            time=test_time_scaled,
            fourier_series=test_fourier_series
        )
    )
    predictions = pm.sample_posterior_predictive(idata)

But this results in an error:

IndexError: boolean index did not match indexed array along dimension 0; dimension is 49 but corresponding boolean dimension is 432

Here the training set has length 432 and the test set has length 49. Is there any way to modify the model so that it does not use the old observed data for the 'likelihood', since it should not be needed to produce the forecast?

You can always create a new model just for posterior predictive sampling; that may be the simpler approach in your case.
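Roughly like this, for a made-up toy model: the parameter names in the prediction model must match the training model so their posterior draws can be looked up in idata, and the likelihood carries no observed values, so no stale training data is attached.

import numpy as np
import pymc as pm

x_train, x_test = np.random.rand(432), np.random.rand(49)
y_train = 2 * x_train + np.random.normal(0, 0.1, 432)

with pm.Model() as train_model:
    w = pm.Normal("w", 0, 1)
    error = pm.HalfNormal("error", sigma=1)
    pm.Normal("likelihood", mu=w * x_train, sigma=error, observed=y_train)
    idata = pm.sample()

# same structure and names, test inputs, no observed data attached
with pm.Model() as pred_model:
    w = pm.Normal("w", 0, 1)
    error = pm.HalfNormal("error", sigma=1)
    pm.Normal("likelihood", mu=w * x_test, sigma=error)
    predictions = pm.sample_posterior_predictive(
        idata, var_names=["likelihood"], predictions=True
    )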

Thank you, @ricardoV94! This is very helpful.

Is this still the solution? Using observed? If so, can you provide an example? Or is it just:

x = pm.Normal('x', observed = train_x)

Yes, just as in the examples above.

How does that fit with sampling posterior predictive results?

When I create a new model, with new dimensions, then try to sample, I get a shape compatibility error.

import numpy as np
import pymc as pm

train_x = np.random.rand(10) * 10
y = 1 + 0.5 * train_x + np.random.rand(10)
train_x[5] = np.nan
train_x[3] = np.nan

with pm.Model() as model:
    y = pm.MutableData('y',y)
    x = pm.Normal('x', 0,1,observed = train_x)
    a = pm.Normal('a', 0,1)
    w = pm.Normal('w', 0,1)
    mu = pm.Deterministic('mu', a + w*x)
    sigma = pm.HalfNormal('sigma', 1)
    pm.Normal('y_obs', mu=mu, sigma=sigma, observed=y)

with model:
    trace = pm.sample()

train_x2 = np.random.rand(20)*10
#y = 1 + 0.5*train_x2 + np.random.rand(10)
train_x2[15] = np.nan
train_x2[11]= np.nan
train_x2[7]= np.nan

with pm.Model() as model2:
    y = pm.MutableData('y',np.zeros(len(train_x2)))
    x = pm.Normal('x', 0,1,observed = train_x2)
    a = pm.Normal('a', 0,1)
    w = pm.Normal('w', 0,1)
    mu = pm.Deterministic('mu', a + w*x)
    sigma = pm.HalfNormal('sigma', 1)
    pm.Normal('y_obs', mu=mu, sigma=sigma, observed=y)

with model2:
    ppc_samples = pm.sample_posterior_predictive(
            trace,
            var_names=["mu"],
            random_seed=1,
        )
    trace.extend(ppc_samples)

If you're only interested in mu, why bother setting up y in the second model?

I have some function that makes a new call to generate the "new model". So the first time it gets called, it needs y; the second time, it no longer does.

The main issue is that your new model has incompatible shapes with the original one regarding x, and you're not resampling x.

Your error is similar to sampling a variable x of size 2 and then trying to use that trace in a new model where x is defined with size 3.

Partial observation creates two variables, x_observed and x_unobserved, and in your second model x_unobserved has a different shape than the one found in the trace. I'm not sure what your goal was, as sample_posterior_predictive can't do inference; it can only reuse draws from the trace or draw posterior predictive draws for a variable. In your case these would be prior draws, since x does not depend on anything else.
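As a quick illustration of that split (these are the names created by recent PyMC versions; older releases used an x_missing suffix instead):

import numpy as np
import pymc as pm

data = np.array([1.0, np.nan, 2.0])
with pm.Model() as m:
    x = pm.Normal("x", 0, 1, observed=data)

print(m.free_RVs)      # [x_unobserved]: one free RV for the nan entries
print(m.observed_RVs)  # [x_observed]: the non-missing data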

I guess my question is how I would do the latter: draw posterior predictive draws for the new variable, mu in this case. I think the model would need some prior draws for the missing x from a Normal(0, 1) distribution (x is always normally distributed).
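One plausible way to get that, sketched against the code above (assuming the automatically created variable is named x_unobserved in your PyMC version): list it in var_names, so sample_posterior_predictive redraws the missing entries from their Normal(0, 1) prior instead of looking them up, with the wrong shape, in the trace, while a and w still come from the posterior.

with model2:
    ppc_samples = pm.sample_posterior_predictive(
        trace,
        # resample the missing entries of x from their prior instead of
        # reading them (with the wrong shape) from the trace; a and w
        # are still taken from the posterior draws
        var_names=["x_unobserved", "mu"],
        random_seed=1,
    )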