Hi “long” time reader, first time writer, experiencing problems when trying to use the nuts-numpyro sampling method on data containing NaNs.
I originally tried to use, for different models, the nuts-numpyro sampling method on data sets that I’m currently working with, all of which contain some number of NaNs. Now the setup is as followed, for any of the models, each data set is assigned to the observation of the GEV distribution in PyMC Experimental. When preforming sample from prior predictive, each of the setups generates the following warning:
RNG Variable RandomGeneratorSharedVariable(<Generator(PCG64) at 0x2B84F390B760>) has multiple clients. This is likely an inconsistent random graph. warnings.warn(
From there, after running the simulation using the nuts-numpyro sampling to completion, the following error message is raised:
ValueError: Graph contains shared RandomType variables which cannot be safely replaced
It’s worth mentioning that the models work when using the PyMC nuts sampling.
From the above, I considered the GEV example notebook, using the above sampling method, which worked. However, when adding a np.nan to the example data set, the same error is raised.
Code below
import numpy as np
import pandas as pd
import pymc as pm
import arviz as az
import numpyro
import pymc_experimental
import pymc_experimental.distributions as pmx
print("numpy:", np.__version__)
print("pandas:", pd.__version__)
print("pymc:", pm.__version__)
print("ariz:", az.__version__)
print("numpyro:", numpyro.__version__)
print("pymc_expr:", pymc_experimental.__version__)
>>numpy: 1.23.5
>>pandas: 2.1.1
>>pymc: 5.9.0
>>ariz: 0.16.0
>>numpyro: 0.13.0
>>pymc_expr: 0.0.12
RANDOM_SEED = 19881129
rng = np.random.default_rng(RANDOM_SEED)
#Two datasets, the first is the same as in the GEV example, with two additional np.nan entries.
NaNs = np.array([4.03, np.nan, np.nan, 3.83, 3.65, 3.88, 4.01, 4.08, 4.18, 3.80,
4.36, 3.96, 3.98, 4.69, 3.85, 3.96, 3.85, 3.93,
3.75, 3.63, 3.57, 4.25, 3.97, 4.05, 4.24, 4.22,
3.73, 4.37, 4.06, 3.71, 3.96, 4.06, 4.55, 3.79,
3.89, 4.11, 3.85, 3.86, 3.86, 4.21, 4.01, 4.11,
4.24, 3.96, 4.21, 3.74, 3.85, 3.88, 3.66, 4.11,
3.71, 4.18, 3.90, 3.78, 3.91, 3.72, 4.00, 3.66,
3.62, 4.33, 4.55, 3.75, 4.08, 3.90, 3.88, 3.94,
4.33])
#The second one as in the example.
noNaNs = np.array([4.03, 3.83, 3.65, 3.88, 4.01, 4.08, 4.18, 3.80,
4.36, 3.96, 3.98, 4.69, 3.85, 3.96, 3.85, 3.93,
3.75, 3.63, 3.57, 4.25, 3.97, 4.05, 4.24, 4.22,
3.73, 4.37, 4.06, 3.71, 3.96, 4.06, 4.55, 3.79,
3.89, 4.11, 3.85, 3.86, 3.86, 4.21, 4.01, 4.11,
4.24, 3.96, 4.21, 3.74, 3.85, 3.88, 3.66, 4.11,
3.71, 4.18, 3.90, 3.78, 3.91, 3.72, 4.00, 3.66,
3.62, 4.33, 4.55, 3.75, 4.08, 3.90, 3.88, 3.94,
4.33])
#Two models, one acting on the set NaNs, the other on the set noNaNs.
with pm.Model() as model_NaNs:
# Priors
μ = pm.Normal("μ", mu=3.8, sigma=0.2)
σ = pm.HalfNormal("σ", sigma=0.3)
ξ = pm.TruncatedNormal("ξ", mu=0, sigma=0.2, lower=-0.6, upper=0.6)
# Estimation
gev = pmx.GenExtreme("gev", mu=μ, sigma=σ, xi=ξ, scipy=True, observed=NaNs)
# Return level
z_p = pm.Deterministic("z_p", μ - σ / ξ * (1 - (-np.log(1 - 1/10)) ** (-ξ)))
>>ImputationWarning: Data in gev contains missing values and will be automatically imputed from the sampling distribution.
warnings.warn(impute_message, ImputationWarning)
with pm.Model() as model_noNaNs:
# Priors
μ = pm.Normal("μ", mu=3.8, sigma=0.2)
σ = pm.HalfNormal("σ", sigma=0.3)
ξ = pm.TruncatedNormal("ξ", mu=0, sigma=0.2, lower=-0.6, upper=0.6)
# Estimation
gev = pmx.GenExtreme("gev", mu=μ, sigma=σ, xi=ξ, scipy=True, observed=noNaNs)
# Return level
z_p = pm.Deterministic("z_p", μ - σ / ξ * (1 - (-np.log(1 - 1/10)) ** (-ξ)))
#Next we sample from the priors.
with model_NaNs:
NaNsidata = pm.sample_prior_predictive(samples=5000, random_seed=rng)
>> UserWarning: RNG Variable RandomGeneratorSharedVariable(<Generator(PCG64) at 0x2AB96EA50740>) has multiple clients. This is likely an inconsistent random graph.
warnings.warn(
Sampling: [gev_observed, gev_unobserved, μ, ξ, σ]
with model_noNaNs:
noNaNsidata = pm.sample_prior_predictive(samples=5000, random_seed=rng)
>>>Sampling: [gev, μ, ξ, σ]
#We run the simulation
with model_NaNs:
NaNstrace = pm.sample(1000, tune=1000, chains=2, cores=2, nuts_sampler="numpyro")
>>>Compiling...
>>>Compilation time = 0:00:03.183715
>>>Sampling...
>>>
>>>Running chain 0: 100%
>>>2000/2000 [00:04<00:00, 5260.12it/s]
>>>Running chain 1: 100%
>>>2000/2000 [00:04<00:00, 5455.31it/s]
>>>
>>>Sampling time = 0:00:04.936128
>>>Transforming variables...
>>>ValueError Traceback (most recent call last)
Cell In[16], line 2
1 with model_NaNs:
----> 2 NaNstrace = pm.sample(1000, tune=1000, chains=2, cores=2, nuts_sampler="numpyro")
:::
:::
:::
>>> shared_variables = [var for var in graph_inputs(graph) if isinstance(var, SharedVariable)]
101 if any(isinstance(var.type, RandomType) for var in shared_variables):
--> 102 raise ValueError(
103 "Graph contains shared RandomType variables which cannot be safely replaced"
104 )
106 if any(var.default_update is not None for var in shared_variables):
107 raise ValueError(
108 "Graph contains shared variables with default_update which cannot "
109 "be safely replaced."
110 )
>>>ValueError: Graph contains shared RandomType variables which cannot be safely replaced
with model_noNaNs:
noNaNstrace = pm.sample(1000, tune=1000, chains=2, cores=2, nuts_sampler="numpyro")
>>>Compiling...
>>>Compilation time = 0:00:01.546892
>>>Sampling...
>>>
>>>Running chain 0: 100%
>>>2000/2000 [00:03<00:00, 7812.54it/s]
>>>Running chain 1: 100%
>>>2000/2000 [00:03<00:00, 6127.70it/s]
>>>
>>>Sampling time = 0:00:03.619498
>>>Transforming variables...
>>>Transformation time = 0:00:00.226946
print(noNaNstrace)
>>>Inference data with groups:
>>> > posterior
>>> > sample_stats
>>> > observed_data
Is there any way to work around this problem, which allows for missing values in the data set?