Handling NaN Values in a Complex Economic Model with PyMC

Hello PyMC community,

I’m implementing a complex economic model using PyMC and have encountered an issue when handling data with potential NaN values. Here’s the error I’m facing:

# Print summary statistics
print(pm.summary(trace))
/anaconda3/lib/python3.11/site-packages/pymc/distributions/timeseries.py:619: UserWarning: Initial distribution not specified, defaulting to `Normal.dist(0, 100, shape=...)`. You can specify an init_dist manually to suppress this warning.
warnings.warn(
# This warning repeats 8 times

Traceback (most recent call last):

Cell In[7], line 166
model = estimate_economic_model(data)

Cell In[7], line 74 in estimate_economic_model
model_vars[var] = pm.Data(var, data_var) # Use pm.Data for observed variables

File ~/anaconda3/lib/python3.11/site-packages/pymc/data.py:410 in Data
raise NotImplementedError(

NotImplementedError: Masked arrays or arrays with `nan` entries are not supported. Pass them directly to `observed` if you want to trigger auto-imputation

Here’s a simplified version of the relevant code:

import pymc as pm
import numpy as np
import pandas as pd

def safe_get(data, key):
    if isinstance(data, pd.DataFrame):
        if key in data.columns:
            return pd.to_numeric(data[key], errors='coerce').values
        else:
            return np.zeros(len(data))
    elif isinstance(data, dict):
        if key in data:
            return np.array(pd.to_numeric(data[key], errors='coerce'))
        else:
            return np.zeros(len(next(iter(data.values()))))
    else:
        raise ValueError("Data must be either a pandas DataFrame or a dictionary")

def create_time_varying_parameter(name, data, model):
    with model:
        mu = pm.Normal(f'{name}_mu', mu=0, sigma=1)
        sigma = pm.HalfNormal(f'{name}_sigma', sigma=1)
        return pm.Normal(name, mu=mu, sigma=sigma, shape=len(data))

def estimate_economic_model(data):
    model = pm.Model()
    with model:
        # Create time-varying parameters
        param1 = create_time_varying_parameter('param1', safe_get(data, 'var1'), model)
        param2 = create_time_varying_parameter('param2', 1 / (1 + safe_get(data, 'var2') + 1e-10), model)
        # ... (other parameters)

        # Model variables
        var_names = ['var1', 'var2', 'var3', 'var4', 'var5', 'var6', 'var7', 'var8', 'var9', 'var10']
        
        model_vars = {}
        for var in var_names:
            data_var = safe_get(data, var)
            model_vars[var] = pm.Data(var, data_var) # Use pm.Data for observed variables

        # ... (rest of the model definition)

    return model

# Usage
data = pd.read_csv('processed_data.csv')
model = estimate_economic_model(data)

# Perform MCMC sampling
with model:
    trace = pm.sample(chains=2, cores=2)

# Print summary statistics
print(pm.summary(trace))

The main issue seems to be that pm.Data doesn’t support arrays with NaN entries. I’ve tried using safe_get to handle potential NaN values, but the error persists.

My questions are:

  1. What’s the best way to handle NaN values in this context? Should I be using pm.MutableData instead of pm.Data?
  2. How can I modify the safe_get function to properly handle NaN values while still providing the necessary data for the model?
  3. Are there any best practices for dealing with missing or NaN data in complex economic models using PyMC?

Any insights or suggestions would be greatly appreciated. Thank you in advance for your help!

John

What do you want to do with the nans?

Regarding the handling of NaN values, I would like to simply ignore them: no dropping of rows, no filling, etc. I am estimating a non-linear Bayesian TVP-DSGE model and would simply like to resolve the error above:

NotImplementedError: Masked arrays or arrays with nan entries are not supported. Pass them directly to observed if you want to trigger auto-imputation

You can’t have rectangular data with NaNs and have those values ignored, because PyMC doesn’t support masked arrays. If you’re not interested in imputation, you have to manually remove the NaN entries from your model.
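
As a rough sketch of what manual removal could look like (the names `x`, `y`, `beta`, `split_observed`, and `build_model`, and the linear mean itself, are hypothetical and not taken from the model in the question): find the positions of the non-NaN entries, then index both the data and the matching model expression with them, so the likelihood only sees the observed points.

```python
import numpy as np


def split_observed(y):
    """Return the indices of the non-NaN entries of y and the values there."""
    y = np.asarray(y, dtype=float)
    obs_idx = np.flatnonzero(~np.isnan(y))
    return obs_idx, y[obs_idx]


def build_model(x, y):
    """Hypothetical regression whose likelihood skips the NaN entries of y."""
    import pymc as pm  # imported lazily so the helper above works without PyMC

    obs_idx, y_obs = split_observed(y)
    with pm.Model() as model:
        beta = pm.Normal("beta", mu=0, sigma=1)
        sigma = pm.HalfNormal("sigma", sigma=1)
        mu = beta * np.asarray(x, dtype=float)
        # Only the observed positions enter the likelihood; the NaN
        # positions contribute nothing and are not imputed.
        pm.Normal("y", mu=mu[obs_idx], sigma=sigma, observed=y_obs)
    return model


# Toy data: positions 1 and 3 are missing
y = np.array([1.0, np.nan, 3.0, np.nan, 5.0])
obs_idx, y_obs = split_observed(y)
```

The model keeps its full-length latent quantities (here `mu`), so time-varying parameters still have one value per period; only the likelihood term is restricted to the periods that were actually observed.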

It turns out that simply ignoring NaNs is not the best option for the model.

If I understand correctly, letting PyMC impute missing values means the missing data are treated as latent variables and estimated alongside the model parameters. The missing values are inferred from the observed data and the model structure.

I am running a non-linear Bayesian time-varying-parameter DSGE (Dynamic Stochastic General Equilibrium) model.

How can I set up my model so that PyMC automatically imputes the missing values during sampling?

In that case, you can’t use pm.Data; you must pass the data directly to observed.
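
A minimal sketch of that, assuming a toy Normal likelihood rather than the DSGE model from the question (the names `y_with_nan`, `mu`, `sigma`, and `build_imputation_model` are made up for illustration): the array containing NaNs goes straight into `observed`, without being wrapped in `pm.Data`. PyMC should then warn that it is imputing automatically and add latent variables for the missing entries, which are sampled together with the other parameters.

```python
import numpy as np

# Toy series with two missing entries (illustrative values only)
y_with_nan = np.array([0.5, np.nan, 1.2, 0.9, np.nan, 1.1])
n_missing = int(np.isnan(y_with_nan).sum())


def build_imputation_model(y):
    """Hypothetical model in which the NaN entries of y are auto-imputed."""
    import pymc as pm  # imported lazily; the data preparation needs only NumPy

    with pm.Model() as model:
        mu = pm.Normal("mu", mu=0, sigma=1)
        sigma = pm.HalfNormal("sigma", sigma=1)
        # Pass the raw array, NaNs included, directly to `observed`.
        # Do not wrap it in pm.Data, which raises the NotImplementedError.
        pm.Normal("y", mu=mu, sigma=sigma, observed=y)
    return model
```

Calling `pm.sample()` inside `with build_imputation_model(y_with_nan):` would then also return posterior draws for the two missing entries. The name of that imputed variable depends on the PyMC version (I believe `y_unobserved` in recent releases and `y_missing` in older ones), so it is worth checking `model.named_vars` rather than relying on either name.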