Determining if a model has changed (caching)

How might one determine if a new model context is identical to an old one?

For example:

import numpy as np
import pymc3 as pm

y = np.random.randn(10)

with pm.Model() as model_1:
    noise = pm.Gamma('noise', alpha=2, beta=1)
    y_observed = pm.Normal('y_observed', mu=0, sigma=noise, observed=y)
with pm.Model() as model_2:
    noise = pm.Gamma('noise', alpha=2, beta=1)
    y_observed = pm.Normal('y_observed', mu=0, sigma=noise, observed=y)

model_1 and model_2 are identical. If I’ve sampled from one, I don’t need to waste time sampling from the other, but I’m not sure how to test for their identity: model_1 == model_2 and hash(model_1) == hash(model_2) are both False. Maybe this can be done by recursively comparing each element of __getstate__() or __dict__, but before I try that I’d be interested to know if there is a neater solution.
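For what it’s worth, that behavior is standard Python: a class that defines neither __eq__ nor __hash__ falls back to object identity. A minimal stand-in class (not PyMC itself, just an illustration) shows why two structurally identical instances compare unequal:

```python
class StandInModel:
    """Stand-in for a class that, like pm.Model, defines no __eq__/__hash__."""
    def __init__(self, alpha, beta):
        self.alpha = alpha
        self.beta = beta

m1 = StandInModel(alpha=2, beta=1)
m2 = StandInModel(alpha=2, beta=1)

# Default == compares object identity, not structure
print(m1 == m2)              # False: distinct objects
# Default hash is derived from id(), so it differs between live instances
print(hash(m1) == hash(m2))  # False
```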

pm.Model does not have __eq__, but maybe we can hash the compiled Theano graph. @brandonwillard might have some ideas here.

That would be cool. If it did that, would it use the content of Theano shared variables in its hash? If it did, then it would literally be the same model all the way through. If it didn’t, that could actually be an advantage because then the model and the data (assuming that is stored in shared variables for training/testing) would be separate (a hash could be computed separately for data).

First thoughts: I don’t know about the pm.Model object itself, but the Theano objects it references will need to be in canonical form in order to make a useful equality check (e.g. a check that doesn’t simply consider whether or not log-likelihood graphs are literally identical).

Here is the solution currently working for me, though it only works in an IPython-like environment (e.g. Jupyter Notebook or JupyterLab). I import the joblib library, and at the very end of the cell where I define my model, I add this line:

model.code_hash = joblib.hash('\n'.join(In[-1].split('\n')[:-1]))

This takes all the text in the current cell (i.e. the model definition), drops the last line (where I make this assignment), and hashes the rest to a short string. Note that this hashes the literal text of the cell, so the hash will change if you edit comments, and variable names will not resolve to their actual values.

So I compute some other hashes as well: specifically, I hash a dictionary of any model hyperparameters that I had stored as variables. Finally, I use Theano shared variables for all of my data, so anytime I update the data I hash it at the same time as the shared-variable update.
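The hyperparameter and data hashes can be computed the same way with joblib.hash; the variable names below are just placeholders, not the ones from my actual model:

```python
import numpy as np
import joblib

# Hypothetical hyperparameters kept in ordinary Python variables
hyperparams = {'noise_alpha': 2, 'noise_beta': 1}
param_hash = joblib.hash(hyperparams)

# Hash the data array each time the shared variable is updated
y = np.random.randn(10)
data_hash = joblib.hash(y)
```

joblib.hash is deterministic for equal inputs (it pickles the object and hashes the bytes), which is what makes it usable as a cache key.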

Altogether I have three hashes, one for basic model structure, one for model hyperparameters, and one for data, and I concatenate these together into one long hash. I use these long hashes to store and lookup sampling traces and posterior predictions made under those same model+parameter+data combinations, saving me a lot of computation time.
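A minimal sketch of that store-and-lookup step, using the combined hash as a filename; the helper names and the choice of stdlib pickle for storage are my own, not from the original setup:

```python
import os
import pickle

def cache_key(code_hash, param_hash, data_hash):
    # Concatenate the three component hashes into one long lookup key
    return f"{code_hash}-{param_hash}-{data_hash}"

def load_or_sample(key, sample_fn, cache_dir="trace_cache"):
    """Return the cached result for `key` if present; otherwise call
    `sample_fn` (e.g. a function wrapping pm.sample), store its result
    under `key`, and return it."""
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, key + ".pkl")
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    trace = sample_fn()
    with open(path, "wb") as f:
        pickle.dump(trace, f)
    return trace
```

On the second call with the same key, the expensive sampling function is never invoked; the trace comes straight from disk.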

There are probably some holes in this approach, but so far it is working for me.