Is there a way to generate synthetic data

Is there a way to generate synthetic data sets using PyMC, where the synthetic data would capture the correct relationships and distributions of the model and the original data set? Thanks, Paul

Hi Paul!

pm.sample_posterior_predictive generates artificial values of the observed variables, given the fitted model and the covariates; this might be what you want. There’s an example notebook showing how it is used here

Welcome!

I’m not sure what the “correct” relationships/distributions of the model and the data might be. You can generate posterior predictive samples: parameter draws from the posterior are used to generate plausible (synthetic) data. Or, if you want to investigate the dependencies among model parameters (ignoring observed data), you can sample from your model without including any of the observed variables. That should yield an MCMC trace in which each draw is pushed through the rest of your model; since nothing is observed, the posterior being sampled is just the model’s joint prior:

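import pymc as pm

with pm.Model() as model:
    # No observed variables: b depends on a, and c depends on b.
    a = pm.Gamma("a", alpha=1, beta=1)
    b = pm.Normal("b", mu=a, sigma=1)
    c = pm.StudentT("c", mu=b, sigma=1, nu=3)

    # Keep 10 draws per chain (after the default tuning steps).
    idata = pm.sample(10)
    print(idata.posterior)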

yields:

Coordinates:
  * chain    (chain) int64 0 1 2 3
  * draw     (draw) int64 0 1 2 3 4 5 6 7 8 9
Data variables:
    b        (chain, draw) float64 1.368 1.93 1.213 3.031 ... 1.81 0.6625 0.2195
    c        (chain, draw) float64 3.798 3.018 4.84 5.91 ... 1.844 2.292 -0.6863
    a        (chain, draw) float64 0.4783 0.5831 0.8971 ... 0.2333 0.7063 0.4604
Attributes:
    created_at:                 2022-07-15T01:55:13.283567
    arviz_version:              0.12.1
    inference_library:          pymc
    inference_library_version:  4.1.2
    sampling_time:              0.702103853225708
    tuning_steps:               1000

Is either of those what you are looking for?

thank you
Paul

Yes, this is what I was looking for.

thanks

Paul