Why does adding extra RVs to the model (that are conditionally independent of the likelihood) create more divergences?

Full outputs are in this notebook (GroupLevelPredictions.ipynb · GitHub), but here's the summary.

The first model below produces 81 divergences. The second model produces roughly 1,000 divergences, even though nothing has changed for any of the parameters used in the likelihood calculation. The logp at the test point is the same for both, so I'm curious why sampling differs. Maybe it's something as simple as the initial start point being different?

with pm.Model() as model_hierarchical_salad_sales_predictions:
   
    σ = pm.HalfNormal("σ", 20)
    
    β_μ_hyperprior = pm.Normal("β_μ_hyperprior", 10, 10)
    β_σ_hyperprior = pm.HalfNormal("β_σ_hyperprior", 10)
    β_offset = pm.Normal('β_offset', mu=0, sd=1, shape=6)
    
    β = pm.Deterministic("β", β_μ_hyperprior + β_offset * β_σ_hyperprior)
    
    μ = pm.Deterministic('μ', β[location_category.codes] * hierarchical_salad_df.customers)
    
    sales = pm.Normal("sales", mu=μ, sd=σ, observed=hierarchical_salad_df.sales)
    
    trace_hierarchical_salad_sales_noncentered = pm.sample(random_seed=0)
    
with pm.Model() as model_hierarchical_salad_sales_extra_nodes:
   
    # All this stuff is the same until the next comment
    σ = pm.HalfNormal("σ", 20)
    
    β_μ_hyperprior = pm.Normal("β_μ_hyperprior", 10, 10)
    β_σ_hyperprior = pm.HalfNormal("β_σ_hyperprior", 10)
    β_offset = pm.Normal('β_offset', mu=0, sd=1, shape=6)
    
    β = pm.Deterministic("β", β_μ_hyperprior + β_offset * β_σ_hyperprior)
    
    μ = pm.Deterministic('μ', β[location_category.codes] * hierarchical_salad_df.customers)
    
    sales = pm.Normal("sales", mu=μ, sd=σ, observed=hierarchical_salad_df.sales)
    
    # Extra nodes for group and individual level predictions
    β_group = pm.Normal("group_beta_prediction", β_μ_hyperprior, β_σ_hyperprior)
    group_prediction = pm.Normal("group_prediction", β_group*out_of_sample_customers, σ)
    location_4_predictions = pm.Normal("location_4_predictions", β[4]*out_of_sample_customers, σ)
    
    trace_hierarchical_salad_sales_extra_nodes = pm.sample(random_seed=0)

The model logp changes after you add the new RVs: they become free random variables that NUTS tries to sample. The model dimension changes, and that can alter the geometry of the posterior, which leads to more divergences.
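You can see the enlarged parameter space directly. A quick check, reusing the two model names from the question (pm.Model exposes the number of free dimensions as .ndim):

# the extra RVs become free parameters, enlarging the space NUTS must explore
print(model_hierarchical_salad_sales_predictions.ndim)  # first model
print(model_hierarchical_salad_sales_extra_nodes.ndim)  # larger once the prediction RVs are added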

Compare with Stan: it has a generated quantities block, which does not contribute to the log_prob computation and uses only forward random generation. You can do something similar in PyMC3 with a theano/aesara RNG, but it is not a very common practice.
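Something like this untested sketch, reusing the names from the model above: the draw is a plain theano random stream wrapped in a Deterministic, so a forward sample is recorded per iteration without adding a logp term.

import theano
from theano.tensor.shared_randomstreams import RandomStreams

srng = RandomStreams(seed=0)

with model_hierarchical_salad_sales_predictions:
    # standard-normal stream rescaled by the hyperpriors; Deterministic means
    # the value is stored in the trace but contributes nothing to the logp
    β_group = pm.Deterministic(
        "group_beta_prediction",
        β_μ_hyperprior + β_σ_hyperprior * srng.normal(size=(1,)),
    )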


You can use numpy + xarray to do these kinds of calculations from the InferenceData result, in a vectorized way and without having to care about broadcasting. I can add an example if you are interested. Using xarray for this is a bit obscure, but the rng sampling is generally a single line with no strange indexing or newaxis tricks.
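For instance, a rough sketch with made-up out-of-sample values, assuming the trace has been converted to InferenceData (e.g. idata = az.from_pymc3(trace)):

import numpy as np
import xarray as xr

post = idata.posterior
rng = np.random.default_rng(0)

# one new group-level β per posterior draw, labelled by (chain, draw)
β_group = xr.apply_ufunc(rng.normal, post["β_μ_hyperprior"], post["β_σ_hyperprior"])

# hypothetical out-of-sample customer counts as a labelled dimension
new_customers = xr.DataArray([50.0, 75.0, 100.0], dims="new_customer")

# xarray aligns (chain, draw) against new_customer automatically, so the
# forward simulation needs no manual newaxis or reshaping
group_prediction = xr.apply_ufunc(rng.normal, β_group * new_customers, post["σ"])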

Thanks @junpenglao
@OriolAbril Don't worry about an example, I can make one. I appreciate you offering!


I realized I already wrote an example for this in the sampling wrappers notebook: Refitting PyMC3 models with ArviZ (and xarray) — ArviZ dev documentation


I didn't know about this notebook, it is EXCELLENT! Thanks for sharing.


Just a thought: instead of doing it in post-processing, could you use dist, random() and Deterministics?

EDIT: I forgot to wrap it in a Theano shared variable, since the output of random() is a NumPy ndarray. Updated it.

e.g.

β_group = pm.Deterministic(
    "group_beta_prediction",
    theano.shared(
        pm.Normal.dist(mu=β_μ_hyperprior, sigma=β_σ_hyperprior).random(size=1)
    ),
)
... etc

I didn't realize this was possible. Does it work? Either way, I'll give it a try.

Cheers @jonsedar