I was skimming through the code in `pymc4.sample_prior_predictive`, and I wanted to clarify some concepts.
I realized that it always returns all the results in the `prior_predictive` group of the inference data object, which I think will be confusing for users (especially as the importance of both prior and prior predictive checks increases) and can also make it harder to use all of ArviZ's features.
Would it be possible to divide the variables into `prior` and `prior_predictive` groups, in the same way that variables are divided between `posterior` and `posterior_predictive`?
I have found that conceptually distinguishing between `prior` and `prior_predictive` is generally harder than distinguishing between `posterior` and `posterior_predictive`, and keeping them combined in PyMC4 will probably keep the confusion alive. Below I list the two main arguments for keeping both quantities combined that came to mind, because I am not sure I completely grasp the whole situation.
I know that both quantities can be sampled at the same time, and therefore doing something like `prior = pm.sample_prior(model); prior_pred = pm.sample_prior_predictive(prior, model)` is not efficient at all. However, both quantities can still be sampled in a single pass and each stored in the corresponding group of the resulting inference data. When the return value is an inference data object, computational efficiency and storing them in different groups seem perfectly compatible.
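To illustrate what I mean by "sampled together but stored separately", here is a minimal sketch using `arviz.from_dict`. The model, variable names (`theta`, `y_star`), and draws are all hypothetical; the point is only that a single set of forward draws can be split across the `prior` and `prior_predictive` groups of one `InferenceData` object:

```python
import numpy as np
import arviz as az

rng = np.random.default_rng(0)

# Hypothetical draws from a single forward-sampling pass:
# theta is a latent variable -> belongs in the "prior" group,
# y_star are simulated observations -> belong in "prior_predictive".
# Shapes follow ArviZ's (chain, draw, *shape) convention.
theta = rng.normal(size=(1, 500))
y_star = rng.normal(loc=theta[..., None], size=(1, 500, 10))

# One InferenceData object, two separate groups.
idata = az.from_dict(
    prior={"theta": theta},
    prior_predictive={"y_star": y_star},
)

print(idata.groups())
```

Both groups live in the same object, so nothing is sampled twice; the split is purely about where each variable is stored.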
I have seen the `sample_from_observed` argument, which may make distinguishing between the two quantities difficult; however, I have not been able to understand what it does conceptually. To me, neither the prior, p(\theta), nor the prior predictive, \int p(y^*|\theta) p(\theta) d\theta, knows about the observed data y, so I can't wrap my head around what is computed by `pm.sample_prior_predictive` with `sample_from_observed=False`. We would only get samples of \theta (prior/posterior variables) whose distribution is somehow conditional on the observed data y, yet it clearly isn't the posterior.
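For reference, here is how I understand the two quantities, written out in plain NumPy (not the PyMC4 API) for a toy model where \theta ~ Normal(0, 1) and y|\theta ~ Normal(\theta, 1). The model and names are illustrative assumptions; note that the observed data y never enters either computation:

```python
import numpy as np

rng = np.random.default_rng(42)
n_draws = 1000

# Prior: draws from p(theta), here theta ~ Normal(0, 1).
theta = rng.normal(0.0, 1.0, size=n_draws)

# Prior predictive: y* ~ p(y*|theta) with theta drawn from the prior,
# i.e. a Monte Carlo approximation of \int p(y*|theta) p(theta) dtheta.
# Here y|theta ~ Normal(theta, 1), so marginally y* ~ Normal(0, 2).
y_star = rng.normal(theta, 1.0)

# The observed data y plays no role above; only the model structure does,
# which is why I don't see what conditioning on y could mean here.
print(theta.std(), y_star.std())  # roughly 1 and sqrt(2)
```

Under this reading, both quantities are pure forward draws from the generative model, which is why `sample_from_observed` confuses me.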