Why does adding extra RVs to the model (that are conditionally independent of the likelihood) create more divergences?

Full outputs are in this notebook (GroupLevelPredictions.ipynb · GitHub), but here's the summary.

The first model below produces 81 divergences. The second model produces roughly 1,000 divergences, even though nothing has changed for any of the parameters used in the likelihood calculation. The logp at the test point is the same for both, so I'm curious why sampling differs. Maybe it's something as simple as the initial start point being different?

with pm.Model() as model_hierarchical_salad_sales_predictions:
   
    σ = pm.HalfNormal("σ", 20)
    
    β_μ_hyperprior = pm.Normal("β_μ_hyperprior", 10, 10)
    β_σ_hyperprior = pm.HalfNormal("β_σ_hyperprior", 10)
    β_offset = pm.Normal('β_offset', mu=0, sd=1, shape=6)
    
    β = pm.Deterministic("β", β_μ_hyperprior + β_offset * β_σ_hyperprior)
    
    μ = pm.Deterministic('μ', β[location_category.codes] * hierarchical_salad_df.customers)
    
    sales = pm.Normal("sales", mu=μ, sd=σ, observed=hierarchical_salad_df.sales)
    
    trace_hierarchical_salad_sales_noncentered = pm.sample(random_seed=0)
    
with pm.Model() as model_hierarchical_salad_sales_extra_nodes:
   
    # All this stuff is the same until the next comment
    σ = pm.HalfNormal("σ", 20)
    
    β_μ_hyperprior = pm.Normal("β_μ_hyperprior", 10, 10)
    β_σ_hyperprior = pm.HalfNormal("β_σ_hyperprior", 10)
    β_offset = pm.Normal('β_offset', mu=0, sd=1, shape=6)
    
    β = pm.Deterministic("β", β_μ_hyperprior + β_offset * β_σ_hyperprior)
    
    μ = pm.Deterministic('μ', β[location_category.codes] * hierarchical_salad_df.customers)
    
    sales = pm.Normal("sales", mu=μ, sd=σ, observed=hierarchical_salad_df.sales)
    
    # Extra nodes for group and individual level predictions
    β_group = pm.Normal("group_beta_prediction", β_μ_hyperprior, β_σ_hyperprior)
    group_prediction = pm.Normal("group_prediction", β_group*out_of_sample_customers, σ)
    location_4_predictions = pm.Normal("location_4_predictions", β[4]*out_of_sample_customers, σ)
    
    trace_hierarchical_salad_sales_extra_nodes = pm.sample(random_seed=0)

The model logp changes after you add the new RVs: they become free random variables that NUTS tries to sample. The model dimension changes, and that can alter the geometry of the posterior, which leads to more divergences.
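You can see the enlarged parameter space directly. A quick check, reusing the two model names from the question (pm.Model exposes the number of free dimensions as .ndim):

# the extra RVs become free parameters, enlarging the space NUTS must explore
print(model_hierarchical_salad_sales_predictions.ndim)  # first model
print(model_hierarchical_salad_sales_extra_nodes.ndim)  # larger once the prediction RVs are added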

Compare with Stan: it has a generated quantities block, which does not contribute to the log_prob computation and uses only forward random generation. You can do something similar in PyMC3 with a theano/aesara RNG, but it is not a very common practice.
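Something like this untested sketch, reusing the names from the model above: the draw is a plain theano random stream wrapped in a Deterministic, so a forward sample is recorded per iteration without adding a logp term.

import theano
from theano.tensor.shared_randomstreams import RandomStreams

srng = RandomStreams(seed=0)

with model_hierarchical_salad_sales_predictions:
    # standard-normal stream rescaled by the hyperpriors; Deterministic means
    # the value is stored in the trace but contributes nothing to the logp
    β_group = pm.Deterministic(
        "group_beta_prediction",
        β_μ_hyperprior + β_σ_hyperprior * srng.normal(size=(1,)),
    )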


You can use numpy + xarray to do these kinds of calculations from the InferenceData result, in a vectorized way and without having to care about broadcasting. I can add an example if you are interested. Using xarray for this is a bit obscure, but the rng sampling is generally a single line with no strange indexing or newaxis tricks.
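For instance, a rough sketch with made-up out-of-sample values, assuming the trace has been converted to InferenceData (e.g. idata = az.from_pymc3(trace)):

import numpy as np
import xarray as xr

post = idata.posterior
rng = np.random.default_rng(0)

# one new group-level β per posterior draw, labelled by (chain, draw)
β_group = xr.apply_ufunc(rng.normal, post["β_μ_hyperprior"], post["β_σ_hyperprior"])

# hypothetical out-of-sample customer counts as a labelled dimension
new_customers = xr.DataArray([50.0, 75.0, 100.0], dims="new_customer")

# xarray aligns (chain, draw) against new_customer automatically, so the
# forward simulation needs no manual newaxis or reshaping
group_prediction = xr.apply_ufunc(rng.normal, β_group * new_customers, post["σ"])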

Thanks @junpenglao
@OriolAbril Don't worry about an example, I can make one. I appreciate you offering!


I realized I already wrote an example for this in the sampling wrappers notebook: Refitting PyMC3 models with ArviZ (and xarray) — ArviZ dev documentation


I didn't know about this notebook, it is EXCELLENT! Thanks for sharing.


Just a thought: instead of doing it in post-processing, could you use dist, random() and Deterministics?

EDIT: I forgot to wrap it in a Theano shared variable, since the output of random() is a NumPy ndarray. Updated it.

e.g.

β_group = pm.Deterministic(
    "group_beta_prediction",
    theano.shared(
        pm.Normal.dist(mu=β_μ_hyperprior, sigma=β_σ_hyperprior).random(size=1)
    ),
)
... etc

I didn't realize this was possible. Does it work? Either way, I'll give it a try.

Cheers @jonsedar