Modeling Bimodal Data with Missing Values

hayfreed · June 20, 2023, 2:20pm

I am trying to fit a model to various data features that are generally bimodal. The picture below shows the computed posteriors when a Normal distribution is used. I would like to find a way to fit the data better than I can achieve just using a Normal.

The obvious way would be to use a two-component NormalMixture model. However, this data has missing values, and PyMC does not support mixture models when the observed data contains missing values (when I test it out, I get a NotImplementedError).

Any recommendations for how this type of data can best be fit using PyMC?

Note: the exact error message I receive when using the NormalMixture model is:
Automatic inputation is only supported for univariate RandomVariables. {my_rv} of type <class 'pymc.distributions.mixture.MarginalMixtureRV'> is not supported.

cluhmann · June 21, 2023, 1:42pm

Can you say more about the missingness?

hayfreed · June 22, 2023, 11:07am

Hi @cluhmann, sure. I am integrating some environmental data from various sources, using year as an index. Some of the data has been observed at yearly intervals, whereas other data has been observed every two years, or in some cases every 5 years. So for example, I have data about CO2 emissions in the United States for every year between 1949 and 2020, and also data about air pollution in the United States for every other year within that interval. The data is missing at regular intervals, not at random.
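To make the missingness pattern concrete, here is a minimal sketch of aligning a yearly series and an every-other-year series on a common year index. The variable names and the values are placeholders I made up for illustration, not the real measurements.

```python
import math

years = list(range(1949, 2021))  # 1949..2020 inclusive

# Placeholder values only; the real series would come from the data sources.
co2 = {year: float(i) for i, year in enumerate(years)}        # observed every year
air = {year: float(i) for i, year in enumerate(years[::2])}   # observed every other year


def align(series, index):
    """Return the series on the full index, with NaN where no value was recorded."""
    return [series.get(year, math.nan) for year in index]


air_aligned = align(air, years)
observed_mask = [not math.isnan(value) for value in air_aligned]
```

Here `air_aligned` has one entry per year, and `observed_mask` marks which entries were actually measured. The NaN entries fall on a fixed schedule, which is the situation described above.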

I am able to successfully model the fully observed variables using a bimodal NormalMixture. The problem is that the NormalMixture distribution is not compatible with observed data that has missing values.

In this post Ricardo suggests using a Potential to address this. However, the implementation example he provided didn’t work for me, and I didn’t really understand the solution well enough to troubleshoot it.
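For what it is worth, here is a minimal sketch of the idea as I understand it, not the code from the linked post. If the imputed values are not needed elsewhere in the model, and the missingness schedule carries no information about the values, the missing entries contribute nothing to the likelihood. The mixture log-likelihood can then be summed over the observed entries only, which is the kind of term a Potential would add. All function names and numbers below are hypothetical.

```python
import math


def normal_logpdf(x, mu, sigma):
    """Log density of Normal(mu, sigma) at x."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def mixture_logpdf(x, weights, mus, sigmas):
    """Log density of a Normal mixture at x, using log-sum-exp for stability."""
    terms = [
        math.log(w) + normal_logpdf(x, m, s)
        for w, m, s in zip(weights, mus, sigmas)
    ]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def observed_loglik(series, weights, mus, sigmas):
    """Sum the mixture log density over the non-missing entries only."""
    return sum(
        mixture_logpdf(x, weights, mus, sigmas)
        for x in series
        if not math.isnan(x)
    )


# Hypothetical series with every other value missing.
series = [0.1, math.nan, 4.2, math.nan, -0.3, math.nan, 3.9]
params = dict(weights=[0.5, 0.5], mus=[0.0, 4.0], sigmas=[1.0, 1.0])

loglik_with_gaps = observed_loglik(series, **params)
loglik_dropped = observed_loglik([x for x in series if not math.isnan(x)], **params)
```

The two totals are identical, which is the point: under those assumptions, fitting the mixture to only the observed years gives the same likelihood for the parameters as carrying the gaps along.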

hayfreed · October 10, 2023, 5:11pm

Note: this feature seems to have been added in a version released since then:

github.com/pymc-devs/pymc

Support automatic imputation for multivariate and symbolic distributions

pymc-devs:main ← ricardoV94:partially_observed_rv

opened 12:35PM - 27 Jun 23 UTC

ricardoV94

+492 -125

Closes #5260
Related to #6626
Related to #5255
Related to #6645

This PR creates a new Op: `PartialObservedRV` that splits the sample space according to a boolean mask and allows separate variables/values for these. This enables automatic imputation for cases not supported before.

For multivariate cases it is not always possible to "attribute" the logp to one variable or the other, so they are all associated with the observed variable. This means the values will show up exclusively in the model `log_likelihood`, even if some (or all) entries were imputed. Directly related to: https://github.com/pymc-devs/pymc/issues/5255

There is some logic to avoid the use of the `PartialObservedRV` for pure Multivariate variables when it's safe to do so (i.e., there is no mixed indexing across the support dims). This has the benefit that automatic transforms will still apply. It also avoids the logp issue mentioned above. This behavior is the same that existed until now for univariate pure RandomVariables, which keep behaving as before.

The new `PartialObservedRV` allows for symbolic (constant or mutable) mask, but this is not accessible to users using the current API based on `MaskedArray` or `nan` entries. Similarly, one can't use `ConstantData`/`MutableData` because those try to convert such arrays to `MaskedArray` which aren't supported in PyTensor: https://github.com/pymc-devs/pytensor/issues/259. I added an early error for that.

An alternative would be to provide a separate API for users, something like `pm.Imputed(name, dist, obs_mask, obs_data)` where both data and the mask could be tensors wrapped in `Data`.

Documentation preview: https://pymc--6797.org.readthedocs.build/en/6797/
