Is there a way to generate synthetic data sets using PyMC, where the synthetic data would capture the relationships and distributions of both the model and the original data set? Thanks. Paul
Hi Paul!
`pm.sample_posterior_predictive` generates artificial (observed) data, given the model and covariates; this might be what you want? There's an example notebook showing how it's used here
Welcome!
I’m not sure what the “correct” relationships/distributions of both the model and the data would be. You can generate posterior predictive samples, which are draws from the posterior used to generate credible (synthetic) data. Or, if you want to investigate the dependencies among model parameters (ignoring observed data), you can sample from your model without including any observed variables. That yields an MCMC trace containing draws from your posterior that are then pushed through the rest of your model:
import pymc as pm

with pm.Model() as model:
    a = pm.Gamma("a", alpha=1, beta=1)
    b = pm.Normal("b", mu=a, sigma=1)
    c = pm.StudentT("c", mu=b, sigma=1, nu=3)
    idata = pm.sample(10)

print(idata.posterior)
yields:
Coordinates:
* chain (chain) int64 0 1 2 3
* draw (draw) int64 0 1 2 3 4 5 6 7 8 9
Data variables:
b (chain, draw) float64 1.368 1.93 1.213 3.031 ... 1.81 0.6625 0.2195
c (chain, draw) float64 3.798 3.018 4.84 5.91 ... 1.844 2.292 -0.6863
a (chain, draw) float64 0.4783 0.5831 0.8971 ... 0.2333 0.7063 0.4604
Attributes:
created_at: 2022-07-15T01:55:13.283567
arviz_version: 0.12.1
inference_library: pymc
inference_library_version: 4.1.2
sampling_time: 0.702103853225708
tuning_steps: 1000
Are either of those what you are looking for?
Thank you.
Paul
Yes, this is what I was looking for.
Thanks.
Paul