Why are MaskedArrays no longer supported, and is there a suggested workaround?

In old v4 models I used MaskedArrays to deal with missing data. Running the same model on v5.5.0 now raises an error saying MaskedArrays are no longer supported. Is there an explanation for this change?

/opt/conda/envs/pm5/lib/python3.11/site-packages/numpy/ma/core.py:467: RuntimeWarning: invalid value encountered in cast
  fill_value = np.array(fill_value, copy=False, dtype=ndtype)
/root/Codes/Bagger/development/../../Bagger/bagger/models_pm5.py:170: RuntimeWarning: invalid value encountered in cast
  TD_observed = pm.MutableData('TD_observed', dataDict['TD'].astype(int), dims=('sample', 'k4'))
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
Cell In[4], line 1
----> 1 m = SNV_depth_GLM_1bc_v46(bcDict['trainData'], 8)

File ~/Codes/Bagger/development/../../Bagger/bagger/models_pm5.py:170, in SNV_depth_GLM_1bc_v46(dataDict, mu_pattern_init, mu_pattern_lower, adDis)
    167 m.add_coord('k4', dataDict['dataDF']['k4Str'], mutable=True)
    168 m.add_coord('sample', np.arange(dataDict['nSample']), mutable=True)
--> 170 TD_observed = pm.MutableData('TD_observed', dataDict['TD'].astype(int), dims=('sample', 'k4'))
    171 AD_observed = pm.MutableData('AD_observed', dataDict['AD'].astype(int), dims=('sample', 'k4'))
    172 context_idx = pm.MutableData('context_idx', dataDict['dataDF']['context'].values, dims='k4')

File /opt/conda/envs/pm5/lib/python3.11/site-packages/pymc/data.py:314, in MutableData(name, value, dims, coords, export_index_as_coords, infer_dims_and_coords, **kwargs)
    308     infer_dims_and_coords = export_index_as_coords
    309     warnings.warn(
    310         "Deprecation warning: 'export_index_as_coords; is deprecated and will be removed in future versions. Please use 'infer_dims_and_coords' instead.",
    311         DeprecationWarning,
    312     )
--> 314 var = Data(
    315     name,
    316     value,
    317     dims=dims,
    318     coords=coords,
    319     infer_dims_and_coords=infer_dims_and_coords,
    320     mutable=True,
    321     **kwargs,
    322 )
    323 return cast(SharedVariable, var)

File /opt/conda/envs/pm5/lib/python3.11/site-packages/pymc/data.py:437, in Data(name, value, dims, coords, export_index_as_coords, infer_dims_and_coords, mutable, **kwargs)
    435     mutable = False
    436 if mutable:
--> 437     x = pytensor.shared(arr, name, **kwargs)
    438 else:
    439     x = pt.as_tensor_variable(arr, name, **kwargs)

File /opt/conda/envs/pm5/lib/python3.11/site-packages/pytensor/compile/sharedvalue.py:202, in shared(value, name, strict, allow_downcast, **kwargs)
    199     raise TypeError("Shared variable values can not be symbolic.")
    201 try:
--> 202     var = shared_constructor(
    203         value,
    204         name=name,
    205         strict=strict,
    206         allow_downcast=allow_downcast,
    207         **kwargs,
    208     )
    209     add_tag_trace(var)
    210     return var

File /opt/conda/envs/pm5/lib/python3.11/functools.py:909, in singledispatch.<locals>.wrapper(*args, **kw)
    905 if not args:
    906     raise TypeError(f'{funcname} requires at least '
    907                     '1 positional argument')
--> 909 return dispatch(args[0].__class__)(*args, **kw)

File /opt/conda/envs/pm5/lib/python3.11/site-packages/pytensor/tensor/sharedvar.py:69, in tensor_constructor(value, name, strict, allow_downcast, borrow, shape, target, broadcastable)
     59 r"""`SharedVariable` constructor for `TensorType`\s.
     60 
     61 Notes
   (...)
     66 
     67 """
     68 if isinstance(value, np.ma.MaskedArray):
---> 69     raise NotImplementedError("MaskedArrays are not supported")
     71 if broadcastable is not None:
     72     warnings.warn(
     73         "The `broadcastable` keyword is deprecated; use `shape`.",
     74         DeprecationWarning,
     75     )

NotImplementedError: MaskedArrays are not supported

They were never properly supported and could result in buggy behavior. You can use ConstantData with nan values, or pass them directly to observed to trigger automatic imputation.
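For example, a minimal sketch of the nan-based imputation (the data here are made up): convert the MaskedArray to a plain float array with np.nan marking the missing entries, and pass it to observed so PyMC imputes them automatically.

import numpy as np
import pymc as pm

masked = np.ma.MaskedArray([1.0, 2.0, 3.0], mask=[False, True, False])
data = masked.filled(np.nan)  # plain float array, nan where masked

with pm.Model() as m:
    mu = pm.Normal("mu", 0, 1)
    # nan entries in `observed` trigger automatic imputation: PyMC adds
    # a free RV for the missing values and samples it alongside mu
    pm.Normal("y", mu=mu, sigma=1, observed=data)
    idata = pm.sample()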


A follow-up question in this case: what if one needs to predict on hold-out data, given that this is why MutableData was used in the first place?

Out-of-sample predictions with imputed variables are a bit tricky and depend on the specific model structure. If you can share more details about your model and how you want to do out-of-sample predictions, that would be helpful.

Dang, is this still the case?

I am having to impute missing values for my test X variable, and would like to be able to assess the model fit by generating predictions on test data.

Is there more info I can give to help answer whether this is possible or not?

Can you share some example of what you need to do?

Certainly,

I have this model (that I’m still figuring out):

with pm.Model() as model:
    
    # Priors
    home_intercept = pm.Poisson('home_intercept', mu=2)
    away_intercept = pm.Poisson('away_intercept', mu=2)
    home_coeff = pm.Normal('home_coeff', mu=0, sigma=1, shape=home_train.shape[1])
    away_coeff = pm.Normal('away_coeff', mu=0, sigma=1, shape=away_train.shape[1])
   
    home_imp = pm.Normal("home_imp", mu=0, sigma=3, observed = home_train)
    away_imp = pm.Normal("away_imp", mu=0, sigma=3, observed = away_train)
    
    # Model error
    eps = pm.HalfCauchy('eps', beta=1)
   
    # Likelihood function
    mu = (home_intercept + pm.math.dot(home_coeff, home_imp.T)) - \
                    (away_intercept + pm.math.dot(away_coeff, away_imp.T))
    
                    
    likelihood = pm.Normal('y', mu=mu, sigma=eps, observed = y_train)
   
    # Inference
    trace = pm.sample(draws=1000, tune=1000, chains=4)

And I train it with x_train, y_train, but I would like to be able to then make predictions with some data I have stored as x_test to investigate how well it fits y_test.

Is the default use of set_data not sufficient?

https://www.pymc.io/projects/docs/en/latest/api/model/generated/pymc.model.core.set_data.html
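For reference, a sketch of that pattern (x_train, y_train, and x_test are stand-ins for your data): wrap the predictors in pm.MutableData and tie the likelihood's shape to the container, so set_data can swap in the hold-out inputs before sample_posterior_predictive.

import numpy as np
import pymc as pm

x_train = np.random.rand(50)
y_train = 1 + 2 * x_train + np.random.normal(0, 0.1, 50)
x_test = np.random.rand(10)

with pm.Model() as model:
    x = pm.MutableData("x", x_train)
    a = pm.Normal("a", 0, 1)
    w = pm.Normal("w", 0, 1)
    sigma = pm.HalfNormal("sigma", 1)
    # shape=x.shape lets the likelihood resize when x is swapped out
    pm.Normal("y", mu=a + w * x, sigma=sigma, observed=y_train, shape=x.shape)
    idata = pm.sample()

with model:
    pm.set_data({"x": x_test})  # swap in the hold-out predictors
    ppc = pm.sample_posterior_predictive(idata)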

Hey @ricardoV94, I am facing a similar issue. I have a model where my observed outcome variable has missing data, and I would like to automatically impute the data using the sampling distribution. My PyMC model looks like this:

with pm.Model(coords=coords) as STS:

    time = pm.Data("time", train_time_scaled)
    
    ft = pm.Data("fourier_series", train_fourier_series.to_numpy())

    k0 = pm.Normal("k0", 0, 1)
    m = pm.Normal("m", 0, 1)

    b = pm.Normal("b", 0, 1, dims="fourier_coef")
    seasonal = pm.Deterministic("seasonal", pm.math.dot(ft, b))

    mu = np.linspace(0.1, 0.8, len(coords['changepoints']))
    s = pm.Data("s", mu, dims="changepoints")

    delta = pm.Laplace("delta", mu=0, b=0.2, dims="changepoints")
    a = pm.Deterministic("a", pm.math.where(time[:, None] < s[None, :], np.array(0.), np.array(1.)))

    gamma = pm.Deterministic("gamma", -s*delta, dims="changepoints")
    func = (k0 + pm.math.dot(a, delta))*time + (m + pm.math.dot(a, gamma))

    trend = pm.Deterministic("trend", func)

    error = pm.HalfNormal("error", sigma=1)
    pm.Normal(
        "likelihood",
        mu=trend + seasonal,
        sigma=error,
        observed=train_xs_scaled
    )

I am passing train_xs_scaled directly to observed because it has missing values, so I am not able to use pm.Data().

Now I want to produce OOS forecasts using this model. For that I have:

with STS:
    pm.set_data(
        new_data=dict(
            time=test_time_scaled,
            fourier_series=test_fourier_series
        )
    )
    predictions = pm.sample_posterior_predictive(idata)

But this results in an error:

IndexError: boolean index did not match indexed array along dimension 0; dimension is 49 but corresponding boolean dimension is 432

Here the training set has length 432 and the test set has length 49. Is there any way to modify the model so that it does not use the old observed data for the 'likelihood', since it should not be needed to produce the forecast?

You can always create a new model just for posterior predictive sampling; that may be the simpler approach in your case.
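Roughly like this, for a made-up toy model: the parameter names in the prediction model must match the training model so their posterior draws can be looked up in idata, and the likelihood carries no observed values, so no stale training data is attached.

import numpy as np
import pymc as pm

x_train, x_test = np.random.rand(432), np.random.rand(49)
y_train = 2 * x_train + np.random.normal(0, 0.1, 432)

with pm.Model() as train_model:
    w = pm.Normal("w", 0, 1)
    error = pm.HalfNormal("error", sigma=1)
    pm.Normal("likelihood", mu=w * x_train, sigma=error, observed=y_train)
    idata = pm.sample()

# same structure and names, test inputs, no observed data attached
with pm.Model() as pred_model:
    w = pm.Normal("w", 0, 1)
    error = pm.HalfNormal("error", sigma=1)
    pm.Normal("likelihood", mu=w * x_test, sigma=error)
    predictions = pm.sample_posterior_predictive(
        idata, var_names=["likelihood"], predictions=True
    )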

Thank you, @ricardoV94! This is very helpful.

Is this still the solution? Using observed? If so, can you provide an example? Or is it just:

x = pm.Normal('x', observed = train_x)

Yes, just as in the examples above.

How does that fit with sampling posterior predictive results?

When I create a new model, with new dimensions, then try to sample, I get a shape compatibility error.

import numpy as np
import pymc as pm

train_x = np.random.rand(10) * 10
y = 1 + 0.5 * train_x + np.random.rand(10)
train_x[5] = np.nan
train_x[3] = np.nan

with pm.Model() as model:
    y = pm.MutableData('y',y)
    x = pm.Normal('x', 0,1,observed = train_x)
    a = pm.Normal('a', 0,1)
    w = pm.Normal('w', 0,1)
    mu = pm.Deterministic('mu', a + w*x)
    sigma = pm.HalfNormal('sigma', 1)
    pm.Normal('y_obs', mu=mu, sigma=sigma, observed=y)

with model:
    trace = pm.sample()

train_x2 = np.random.rand(20)*10
#y = 1 + 0.5*train_x2 + np.random.rand(10)
train_x2[15] = np.nan
train_x2[11]= np.nan
train_x2[7]= np.nan

with pm.Model() as model2:
    y = pm.MutableData('y',np.zeros(len(train_x2)))
    x = pm.Normal('x', 0,1,observed = train_x2)
    a = pm.Normal('a', 0,1)
    w = pm.Normal('w', 0,1)
    mu = pm.Deterministic('mu', a + w*x)
    sigma = pm.HalfNormal('sigma', 1)
    pm.Normal('y_obs', mu=mu, sigma=sigma, observed=y)

with model2:
    ppc_samples = pm.sample_posterior_predictive(
            trace,
            var_names=["mu"],
            random_seed=1,
        )
    trace.extend(ppc_samples)

If you're only interested in mu, why bother setting up y in the second model?

I have some function that makes a new call to generate the "new model". So the first time it gets called, it needs y; the second time, it no longer does.

The main issue is that your new model has incompatible shapes with the original one regarding x, and you're not resampling x.

Your error is similar to sampling a variable x of size 2 and then trying to use that trace in a new model where x is defined with size 3.

Partial observation creates two variables, x_observed and x_unobserved, and in your second model x_unobserved has a different shape than the one found in the trace. I'm not sure what your goal was, as sample_posterior_predictive can't do inference; it can only reuse draws from the trace or draw posterior predictive draws for a variable. In your case these would be prior draws, since x does not depend on anything else.
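As a quick illustration of that split (these are the names created by recent PyMC versions; older releases used an x_missing suffix instead):

import numpy as np
import pymc as pm

data = np.array([1.0, np.nan, 2.0])
with pm.Model() as m:
    x = pm.Normal("x", 0, 1, observed=data)

print(m.free_RVs)      # [x_unobserved]: one free RV for the nan entries
print(m.observed_RVs)  # [x_observed]: the non-missing data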

I guess my question is how I would do the latter: draw posterior predictive draws for the new variable, mu in this case. I think the model would need some prior draws for the missing x from a Normal(0, 1) distribution (x is always normally distributed).
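One plausible way to get that, sketched against the code above (assuming the automatically created variable is named x_unobserved in your PyMC version): list it in var_names, so sample_posterior_predictive redraws the missing entries from their Normal(0, 1) prior instead of looking them up, with the wrong shape, in the trace, while a and w still come from the posterior.

with model2:
    ppc_samples = pm.sample_posterior_predictive(
        trace,
        # resample the missing entries of x from their prior instead of
        # reading them (with the wrong shape) from the trace; a and w
        # are still taken from the posterior draws
        var_names=["x_unobserved", "mu"],
        random_seed=1,
    )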