Determining if a model has changed (caching)

rgerkin · December 21, 2019, 6:57am

How might one determine if a new model context is identical to an old one?

For example:

import numpy as np
import pymc3 as pm

y = np.random.randn(10)

with pm.Model() as model_1:
    noise = pm.Gamma('noise', alpha=2, beta=1)
    y_observed = pm.Normal('y_observed', mu=0, sigma=noise, observed=y)
    
with pm.Model() as model_2:
    noise = pm.Gamma('noise', alpha=2, beta=1)
    y_observed = pm.Normal('y_observed', mu=0, sigma=noise, observed=y)

model_1 and model_2 are identical. If I’ve sampled from one I don’t need to waste time sampling from the other. But I’m not sure how to test whether they are identical. model_1 == model_2 and hash(model_1) == hash(model_2) are both False. Maybe this can be done by recursively checking through each element of __getstate__() or __dict__, but before I try that I’d be interested to know if there is a neater solution.

junpenglao · December 21, 2019, 6:59am

pm.Model does not have __eq__, but maybe we can hash the compiled theano graph. @brandonwillard might have some idea here.

rgerkin · December 21, 2019, 6:07pm

That would be cool. If it did that, would it use the content of Theano shared variables in its hash? If it did, then it would literally be the same model all the way through. If it didn’t, that could actually be an advantage because then the model and the data (assuming that is stored in shared variables for training/testing) would be separate (a hash could be computed separately for data).

brandonwillard · December 21, 2019, 6:20pm

First thoughts: I don’t know about the pm.Model object itself, but the Theano objects it references will need to be in canonical form in order to make a useful equality check (e.g. a check that doesn’t simply consider whether or not log-likelihood graphs are literally identical).

rgerkin · May 20, 2021, 11:46pm

Here is the solution currently working for me. It works only in an IPython-like environment (e.g. Jupyter Notebook or Lab). I import joblib (a third-party library, not part of the standard library), and then at the very end of the cell where I define my model, I add this line:

model.code_hash = joblib.hash('\n'.join(In[-1].split('\n')[:-1]))

This takes all the text in the current cell (i.e. the model definition), removes the last line (where I make this assignment) and hashes it to a short string. Note that this is the literal text of the cell, so the hash will change if you update comments, and variables will not resolve to their actual values.
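If the sensitivity to comments is a nuisance, one possible workaround is to hash the parsed syntax tree instead of the raw text. The following is only a sketch using the standard library (ast and hashlib rather than joblib); structure_hash is a hypothetical helper name, and it assumes the cell is plain Python with no IPython magics. Variables still do not resolve to their values, and the output of ast.dump can differ between Python versions, so the hash should only be compared within one Python version.

```python
import ast
import hashlib


def structure_hash(cell_source: str) -> str:
    """Hash Python source via its AST so comments and spacing are ignored.

    Hypothetical helper, not part of PyMC3 or joblib.
    """
    tree = ast.parse(cell_source)
    # ast.dump keeps names and literal values but drops comments,
    # blank lines, and whitespace differences.
    canonical = ast.dump(tree, include_attributes=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Same code, different comments and spacing: hashes match.
version_a = "x = 1  # first version\ny = x + 2\n"
version_b = "# reworded comment\nx = 1\n\ny = x+2\n"
# A changed literal: hash differs.
version_c = "x = 1\ny = x + 3\n"

same = structure_hash(version_a) == structure_hash(version_b)
different = structure_hash(version_a) != structure_hash(version_c)
```

Docstrings are part of the syntax tree, so editing a docstring would still change this hash.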

So I compute some other hashes as well: one for any model hyperparameters that I had stored as variables (by hashing a dictionary of those variables), and one for the data. I use Theano shared variables for all of my data, so any time I update the data I hash it at the same time as I update the shared variable.

Altogether I have three hashes, one for basic model structure, one for model hyperparameters, and one for data, and I concatenate these together into one long hash. I use these long hashes to store and look up sampling traces and posterior predictions made under those same model+parameter+data combinations, saving me a lot of computation time.
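A minimal sketch of the three-part key and the lookup might look like the following. It is not the exact code I run: it swaps joblib.hash for hashlib plus json so that it has no dependencies, and short_hash, cache_key, get_or_sample, and trace_cache are hypothetical names. It assumes the hyperparameters and data are JSON-serializable (e.g. y.tolist() for a NumPy array), and it keeps results in an in-memory dict where a real setup would more likely write to disk.

```python
import hashlib
import json


def short_hash(obj) -> str:
    """Hash a JSON-serializable object (stdlib stand-in for joblib.hash)."""
    # sort_keys makes the result independent of dict insertion order.
    text = json.dumps(obj, sort_keys=True)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


def cache_key(code_text: str, hyperparams: dict, data: list) -> str:
    """Concatenate the structure, hyperparameter, and data hashes."""
    parts = [short_hash(code_text), short_hash(hyperparams), short_hash(data)]
    return "-".join(parts)


trace_cache = {}  # hypothetical store: key -> trace or predictions


def get_or_sample(key: str, sample_fn):
    """Return the cached result for key, calling sample_fn only on a miss."""
    if key not in trace_cache:
        trace_cache[key] = sample_fn()
    return trace_cache[key]


# Usage: the second call with an identical key skips the expensive step.
calls = []


def fake_sampler():
    # Stands in for something like pm.sample(); records each invocation.
    calls.append(1)
    return "trace"


key = cache_key("model text", {"alpha": 2, "beta": 1}, [0.1, 0.2])
first = get_or_sample(key, fake_sampler)
second = get_or_sample(key, fake_sampler)
```

Keeping the three parts separate in the key (rather than hashing everything at once) makes it easy to see which of structure, hyperparameters, or data changed when a lookup misses.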

There are probably some holes in this approach, but so far it is working for me.
