Modeling Bimodal Data with Missing Values

I am trying to fit a model to various data features that are generally bimodal. The picture below shows the computed posteriors when a Normal distribution is used. I would like to find a way to fit the data better than I can achieve just using a Normal.

The obvious approach would be a two-component NormalMixture model. However, this data has missing values, and PyMC does not support automatic imputation for mixture models (when I try it, I get a NotImplementedError).

Any recommendations for how this type of data can best be fit using PyMC?

Note: the exact error message I receive when using the NormalMixture model is:
Automatic inputation is only supported for univariate RandomVariables. {my_rv} of type <class 'pymc.distributions.mixture.MarginalMixtureRV'> is not supported.

Can you say more about the missingness?

Hi @cluhmann, sure. I am integrating some environmental data from various sources, using year as an index. Some of the data has been observed yearly, whereas other data has been observed every two years, or in some cases every 5 years. So for example, I have data about CO2 emissions in the United States for every year from 1949 to 2020, and also data about air pollution in the United States for every other year within that interval. The data is missing at regular intervals, not at random.

I am able to successfully model the fully observed variables using a bimodal NormalMixture. The problem is that NormalMixture does not work with observed data that contains missing values.

In this post Ricardo suggests using a Potential to address this. However, the implementation example he provided didn’t work for me, and I didn’t really understand the solution well enough to troubleshoot it.
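For PyMC versions where automatic imputation raises the error quoted above, one possible workaround is to split the series by hand. This is a minimal sketch, not the Potential-based approach from the linked post. It assumes the gaps come purely from the sampling schedule, so that leaving the missing years out of the likelihood is reasonable. The helper names `split_observed` and `build_mixture_model`, the variable names, and the priors are all hypothetical choices for illustration.

```python
import math


def split_observed(values):
    """Split a series into its observed values and the positions of the gaps.

    Entries that are None or NaN are treated as missing.
    """
    def is_missing(v):
        return v is None or math.isnan(v)

    observed = [v for v in values if not is_missing(v)]
    missing_idx = [i for i, v in enumerate(values) if is_missing(v)]
    return observed, missing_idx


def build_mixture_model(values):
    """Build a two-component NormalMixture on the observed entries only.

    Assumption: missingness is unrelated to the values themselves.
    PyMC and NumPy are imported here so the helper above stays dependency-free.
    """
    import numpy as np
    import pymc as pm

    observed, missing_idx = split_observed(values)
    y_obs = np.asarray(observed, dtype=float)

    with pm.Model() as model:
        # Illustrative weakly informative priors scaled to the data.
        # No ordering constraint is placed on mu, so label switching is possible.
        w = pm.Dirichlet("w", a=np.ones(2))
        mu = pm.Normal("mu", mu=y_obs.mean(), sigma=10 * y_obs.std(), shape=2)
        sigma = pm.HalfNormal("sigma", sigma=y_obs.std(), shape=2)

        # Likelihood uses only the years that were actually measured.
        pm.NormalMixture("y_obs", w=w, mu=mu, sigma=sigma, observed=y_obs)

        # Optional: a free mixture variable for the missing years, so the
        # sampler also returns draws for them (manual imputation).
        if missing_idx:
            pm.NormalMixture(
                "y_missing", w=w, mu=mu, sigma=sigma, shape=len(missing_idx)
            )

    return model
```

To try it, pass one column with NaN in the unobserved years to `build_mixture_model` and call `pm.sample()` inside the returned model's context. If you only need the mixture parameters and not imputed values, the `y_missing` variable can be dropped.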

Note: this feature seems to have been added in a later release.