Hello PyMC community,
I’m implementing a complex economic model using PyMC and have encountered an issue when handling data with potential NaN values. Here’s the error I’m facing:
# Print summary statistics
print(pm.summary(trace))
/anaconda3/lib/python3.11/site-packages/pymc/distributions/timeseries.py:619: UserWarning: Initial distribution not specified, defaulting to `Normal.dist(0, 100, shape=...)`. You can specify an init_dist manually to suppress this warning.
  warnings.warn(
# This warning repeats 8 times
Traceback (most recent call last):
  Cell In[7], line 166
    model = estimate_economic_model(data)
  Cell In[7], line 74 in estimate_economic_model
    model_vars[var] = pm.Data(var, data_var) # Use pm.Data for observed variables
  File ~/anaconda3/lib/python3.11/site-packages/pymc/data.py:410 in Data
    raise NotImplementedError(
NotImplementedError: Masked arrays or arrays with `nan` entries are not supported. Pass them directly to `observed` if you want to trigger auto-imputation
Here’s a simplified version of the relevant code:
import pymc as pm
import numpy as np
import pandas as pd

def safe_get(data, key):
    if isinstance(data, pd.DataFrame):
        if key in data.columns:
            return pd.to_numeric(data[key], errors='coerce').values
        else:
            return np.zeros(len(data))
    elif isinstance(data, dict):
        if key in data:
            return np.array(pd.to_numeric(data[key], errors='coerce'))
        else:
            return np.zeros(len(next(iter(data.values()))))
    else:
        raise ValueError("Data must be either a pandas DataFrame or a dictionary")

def create_time_varying_parameter(name, data, model):
    with model:
        mu = pm.Normal(f'{name}_mu', mu=0, sigma=1)
        sigma = pm.HalfNormal(f'{name}_sigma', sigma=1)
        return pm.Normal(name, mu=mu, sigma=sigma, shape=len(data))

def estimate_economic_model(data):
    model = pm.Model()
    with model:
        # Create time-varying parameters
        param1 = create_time_varying_parameter('param1', safe_get(data, 'var1'), model)
        param2 = create_time_varying_parameter('param2', 1 / (1 + safe_get(data, 'var2') + 1e-10), model)
        # ... (other parameters)

        # Model variables
        var_names = ['var1', 'var2', 'var3', 'var4', 'var5', 'var6', 'var7', 'var8', 'var9', 'var10']
        model_vars = {}
        for var in var_names:
            data_var = safe_get(data, var)
            model_vars[var] = pm.Data(var, data_var) # Use pm.Data for observed variables
        # ... (rest of the model definition)
    return model

# Usage
data = pd.read_csv('processed_data.csv')
model = estimate_economic_model(data)

# Perform MCMC sampling
with model:
    trace = pm.sample(chains=2, cores=2)

# Print summary statistics
print(pm.summary(trace))
The main issue seems to be that pm.Data doesn’t support arrays with NaN entries. I’ve tried using safe_get to handle potential NaN values, but the error persists.
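A small check outside the model suggests where the NaNs may be coming from: as far as I can tell, pd.to_numeric with errors='coerce' replaces every value it cannot parse with NaN, so safe_get can introduce NaNs rather than remove them. The column values below are made up for illustration and are not from my dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical column: one unparseable string and one missing entry
raw = pd.Series(["1.5", "n/a", None, "4"])

# errors='coerce' turns anything non-numeric into NaN instead of raising
coerced = pd.to_numeric(raw, errors="coerce").to_numpy(dtype=float)

# Count the NaNs that would end up being passed to pm.Data
nan_count = int(np.isnan(coerced).sum())
print(coerced.dtype, nan_count)
```

If that is right, any non-numeric or empty cell in processed_data.csv would be enough to trigger the NotImplementedError above.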
My questions are:
- What’s the best way to handle NaN values in this context? Should I be using pm.MutableData instead of pm.Data?
- How can I modify the safe_get function to properly handle NaN values while still providing the necessary data for the model?
- Are there any best practices for dealing with missing or NaN data in complex economic models using PyMC?
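Since the error message says arrays with NaN entries should be passed directly to `observed`, one direction I am considering is to keep only rows with complete predictors and to mask the missing outcome values. This is only a sketch: split_complete is a hypothetical helper, and treating var1 and var2 as predictors and var3 as the outcome is an assumption made for the example, not how my real model is laid out.

```python
import numpy as np
import pandas as pd

def split_complete(df, predictors, outcome):
    """Hypothetical helper: complete predictor rows plus a masked outcome."""
    # Coerce every selected column to numeric; unparseable cells become NaN
    numeric = df[predictors + [outcome]].apply(pd.to_numeric, errors="coerce")
    # Keep only rows where all predictors are present
    keep = numeric[predictors].notna().all(axis=1).to_numpy()
    X = numeric.loc[keep, predictors].to_numpy(dtype=float)
    # Mask (rather than drop) missing outcome values
    y = np.ma.masked_invalid(numeric.loc[keep, outcome].to_numpy(dtype=float))
    return X, y, keep

# Made-up data for illustration only
df = pd.DataFrame({
    "var1": [1.0, 2.0, "bad", 4.0],
    "var2": [0.5, 0.1, 0.2, 0.3],
    "var3": [10.0, None, 30.0, 40.0],
})
X, y, keep = split_complete(df, ["var1", "var2"], "var3")
print(X.shape, int(np.ma.getmaskarray(y).sum()))
```

My understanding is that X, which contains no NaNs, could then go into pm.Data, while y would be passed as `observed=` to a likelihood so that PyMC imputes the masked entries. I have not yet tested this in the full model.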
Any insights or suggestions would be greatly appreciated. Thank you in advance for your help!
John