Is there a standard idiom for “check for this cached file (inference data), and if it’s there, read it, otherwise run this sampling command”? Seems like this would be very useful for notebooks that use PyMC on models that are expensive to sample.
I have been hand-writing stuff like this:
import os
import arviz as az

idata_file = "myfilename.nc"
if os.path.exists(idata_file):
    idata = az.from_netcdf(idata_file)
else:
    idata = <some expensive computation>
if not os.path.exists(idata_file):
    az.to_netcdf(idata, idata_file)
but it’s not great.
Probably this could be streamlined with a decorator, as long as one was careful to avoid name collisions.
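A hedged sketch of what such a decorator might look like (the name `cached` and the pickle-based defaults are my own; for inference data you would pass `az.from_netcdf` and `az.to_netcdf` as the `load`/`save` callbacks instead):

```python
import functools
import os
import pickle


def cached(path, load=None, save=None):
    """Cache a function's return value on disk at `path`.

    `load`/`save` default to pickle for illustration only; for PyMC
    inference data you would pass az.from_netcdf / az.to_netcdf.
    """
    def _pickle_load(p):
        with open(p, "rb") as fh:
            return pickle.load(fh)

    def _pickle_save(obj, p):
        with open(p, "wb") as fh:
            pickle.dump(obj, fh)

    load = load or _pickle_load
    save = save or _pickle_save

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if os.path.exists(path):        # cache hit: skip the computation
                return load(path)
            result = func(*args, **kwargs)  # cache miss: compute and store
            save(result, path)
            return result
        return wrapper
    return decorator
```

The name-collision caveat still applies: each model needs its own `path`, otherwise two different models would silently share one cache file.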
I just wrote a function that wraps pm.sample() and some other common functions (e.g. sample_posterior_predictive()) that automatically checks for a cache and loads it if one is available. It's not ideal, but it works well enough at the moment. I'm tempted to turn it into a simple package so I can use it in other projects. It would be great to hear what others do, because this is a rather annoying issue that I have to believe many others have dealt with.
Would this also be able to detect when a model has changed and not just whether the trace file with a given name already exists or not? That would be really helpful.
For my use case, I didn't necessarily want automatic cache invalidation, but if you can create some sort of hash of the model, then it should be pretty straightforward. Since there is Theano magic at play, it may be worth looking through the attributes of the pm.Model to find some sort of unique identifier.
For example, say we have the following model:
import pymc3 as pm

with pm.Model() as model:
    a = pm.Normal("a", 0, 1)
    b = pm.Normal("b", 0, 1)
    mu = a + b
    sigma = pm.HalfNormal("sigma", 2)
    y = pm.Normal("y", mu, sigma)
Then we could use the output of print(model):
a ~ Normal
b ~ Normal
sigma_log__ ~ TransformedDistribution
y ~ Normal
sigma ~ HalfNormal
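One hedged sketch along these lines (the helper name `model_fingerprint` is mine): hash the model's string representation and bake it into the cache filename. Note from the output above that str(model) records distribution names but not their parameters, so e.g. changing a prior's scale would not change the hash.

```python
import hashlib


def model_fingerprint(model_description):
    """Short, stable hash of a model description, e.g. str(model)."""
    digest = hashlib.sha256(model_description.encode("utf-8")).hexdigest()
    return digest[:12]


# Usage sketch: derive the cache filename from the model itself, e.g.
# idata_file = f"trace_{model_fingerprint(str(model))}.nc"
```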
Or we could gather the random, deterministic, and observed variables (using model.free_RVs, model.deterministics, and model.observed_RVs) to create a specific identifier that recognizes the model.
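A minimal sketch of that idea (the function name is mine; it only assumes the model exposes the three attribute lists mentioned above, each entry having a `.name`, as pm.Model variables do):

```python
def model_identifier(model):
    """Build an identifier string from a model's variable names.

    Assumes `model` exposes free_RVs, deterministics, and observed_RVs
    lists whose entries have a .name attribute, as pm.Model does.
    """
    names = sorted(
        v.name
        for v in model.free_RVs + model.deterministics + model.observed_RVs
    )
    return "|".join(names)
```

Variable names alone won't catch a change to a prior or to how the variables are wired together, so this is a coarse identifier rather than a full content hash.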
What would be ideal for reproducibility, though I have no idea whether it's possible at all, would be to somehow store the lines of code within the model context as a string, which could then be added to the InferenceData object as an attribute.
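One possible route, if the model is built inside a function rather than a bare with-block: `inspect.getsource` can recover that function's source. A sketch (`build_linear_model` is a hypothetical example, and attaching the string via `idata.posterior.attrs` assumes the usual xarray attrs dict on the posterior group):

```python
import inspect


def build_linear_model():
    """Hypothetical model-building function; its source is recoverable."""
    with pm.Model() as model:
        ...
    return model


# Capture the model-building code as a string...
model_source = inspect.getsource(build_linear_model)

# ...and, after sampling, stash it on the InferenceData, e.g.:
# idata.posterior.attrs["model_source"] = model_source
```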