Simple question: correct way to save traces?

So, this is a bit embarrassing to ask, but using pm.save_trace(trace=my_trace) doesn’t seem to create any file in my jupyter notebook’s folder (which is also the root folder of the VM that I am running Python in).

Is there a reason for this, or do I need to specify some additional arguments in saving a trace?

I am using Python 3.6, as well as latest PyMC3.

The reason you don’t see anything is that the default directory where the traces are saved is hidden. The default dirname is “.pymc_{number identifier}.trace”. Within that directory, each trace in the MultiTrace is saved to a separate file.
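
To check whether such a hidden directory was created, something like the following sketch may help. It assumes only the ".pymc_{number identifier}.trace" naming pattern described above; the helper name `find_hidden_trace_dirs` is made up for illustration and is not part of PyMC3.

```python
from pathlib import Path


def find_hidden_trace_dirs(root="."):
    """Return directories directly under `root` named like '.pymc_<n>.trace'."""
    # The glob pattern starts with a dot, so hidden directories are matched.
    return sorted(p for p in Path(root).glob(".pymc_*.trace") if p.is_dir())


if __name__ == "__main__":
    for trace_dir in find_hidden_trace_dirs():
        print(trace_dir)
```

On Linux or macOS, `ls -a` in the notebook's folder should show the same directories.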

That at least explains why I didn’t see it. However, loading the trace back in with the same name fails for me, and I suspect my VM may not allow such hidden files.

If you would please help me out some more, I see two solutions to this problem:

  1. I am retrieving the traces using pm.load_trace() incorrectly somehow, so what specifically am I supposed to put into it?
  2. Is there a way to save the traces into a visible folder, so at least there is verification the issue lies not with the files being missing?

You can pass the directory argument to both the save and load trace functions. That gives you control over where the traces are stored.

Then that ends up giving me a warning that it doesn’t want to overwrite the current directory, which made me think I was doing something wrong. At any rate, thanks for letting me know, I’ll try again in a safe folder.

You can also set overwrite=True to overwrite the contents of the folder. You should have a look at the save_trace docstring to see all the available options.

So, I tried specifying an empty folder to save into within the same jupyter notebook directory with directory=, and specified overwrite=True, and consequently the folder becomes populated with two separate sub-folders, I assume corresponding to both of my chains.

However, how do I load these back into a singular trace within my Python session? pm.load_trace keeps giving me "list index out of range" errors, could you tell me how to fix this?

I could try to help if you add the full traceback here. One question, is the model being inferred from the context in your case?

Another question, did you try the more direct approach of just using pickle dump and load?

No, I was simply calling pm.load_trace() in its own cell, which, based on your post right now, I immediately inferred was wrong and corrected and put within the model context, where now it successfully works.
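
For reference, the round trip that works can be sketched roughly as below. This assumes PyMC3 3.x, where `pm.save_trace` and `pm.load_trace` are available; the wrapper name `save_and_reload` and its arguments are only illustrative, and `directory` is assumed not to exist yet so that overwrite=True is not needed.

```python
def save_and_reload(model, trace, directory):
    """Save `trace` to `directory`, then load it back inside the model context.

    Assumes PyMC3 3.x and that `directory` does not exist yet.
    """
    # Imported lazily so the sketch can be read without PyMC3 installed.
    import pymc3 as pm

    # save_trace returns the path of the directory it wrote to.
    saved_dir = pm.save_trace(trace, directory=directory)

    # load_trace infers the model from the surrounding `with` block.
    with model:
        return pm.load_trace(saved_dir)
```

Calling `pm.load_trace` outside any model context, without passing a model, is what I was doing wrong at first.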

However, if I change some of the model settings, like a couple of variables’ sd= values, the trace still gets recovered, and everything seems fine. Would a trace be affected if I change all of a model’s variables to something other than what originally created the trace?

At any rate, thank you very much for answering my questions in this thread, lucianopaz.

For future reference, these were the things that confused me about saving traces:

  1. pm.save_trace()'s default behavior saves the traces in a hidden directory, making it easy to think nothing happened.
  2. To save the traces in a non-hidden way, the directory= option needs to be set, along with overwrite=True if that directory already exists, which is not immediately obvious.

Hi! This is helpful feedback. I wrote a lot of the code around the save/load functions.

For context, adding a non-pickle save function to pymc3 was difficult, due to some design decisions that gave the MultiTrace (which is a store of data) access to the Model object. This is why we have to load the trace in a model context. I am actually not sure what particular bugs will occur if you reload the trace in a model that is not yours! I am surprised and worried that it worked, actually. My guess is sample_posterior_predictive would go poorly for you.

That’s a good point about a hidden file! I have trouble deciding where to write by default to a user’s file system, especially when it is a directory, and not a file. Maybe the right answer is to provide no default, and give an informative error message instead.

My mental model was to make pm.save_trace() and pm.load_trace() very easy to use, but also to never delete anything on a user’s machine without prompting. I still think we should not overwrite without prompting. Is there a better way to phrase the error message that would have helped you?

And thank you for your reply.

The way I imagine any Python session should save things, if it should at all, is to the folder that the current Python script/Jupyter notebook is running from. This makes the most sense, as files and folders should be organized hierarchically on everybody’s computers, and programmers do this automatically most of the time, for example by creating a Linear Regression folder within a PyMC3 Projects folder.

How could anyone complain if PyMC3 created either a file or folder in such a location? If this default action were not preferred by some, locating the files that have already been created would at least be easy.

As for overwriting, I find the name counter-intuitive: having to set (as in my case) overwrite=True just to be able to save traces in their own visible folder is not overwriting anything. To minimize the chances of overwriting anything, the default name for a saved object should be highly specific to PyMC3, like trace_nuts_4chains_041519.
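
A naming scheme along these lines could be sketched as follows. The helper `fresh_trace_dir` and its parameters are hypothetical, not anything PyMC3 provides; it just builds a name in the style of trace_nuts_4chains_041519 and appends a counter if that name is already taken, so nothing ever needs to be overwritten.

```python
from datetime import datetime
from pathlib import Path


def fresh_trace_dir(base=".", sampler="nuts", chains=4, now=None):
    """Return a path under `base` that does not exist yet.

    The name follows the pattern trace_<sampler>_<chains>chains_<MMDDYY>.
    """
    now = now or datetime.now()
    stem = f"trace_{sampler}_{chains}chains_{now:%m%d%y}"
    candidate = Path(base) / stem
    suffix = 1
    # Append _1, _2, ... until the name is unused.
    while candidate.exists():
        candidate = Path(base) / f"{stem}_{suffix}"
        suffix += 1
    return candidate
```

The returned path could then be passed as directory= when saving, without setting overwrite=True.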

The simplest error message, I believe, would be “Save directory not specified”, and leave it at that. If something is saved, it is saved there.

I hope I don’t come off as pedantic :smile:

I think these are great suggestions, and roughly the direction we are thinking of moving in. IIUC, something similar is already in place in ArviZ?

Yes – If we did this over, I think using xarray as a datastore directly (as in arviz) would be the right answer. Then all your data is in a common, portable format, and you could even think about weird things like sampling in Stan, and doing posterior predictive sampling in PyMC3.

You make a number of good points about default places to write - there are places where libraries are “allowed” to write, but those are typically in like ~/.local/pymc_data/. I haven’t had time to look at making a PR to update this, and any such PR I would want to carefully review, thinking about how easy or hard it will be for PyMC3 to accidentally delete an important folder, or the whole hard drive!

“Under no circumstances will PyMC be held responsible or liable in any way for any claims, damages, losses, expenses, costs or liabilities whatsoever (including, without limitation, any direct or indirect damages for loss of profits, business interruption or loss of information) resulting or arising directly or indirectly from your use of or inability to use this tool or any tool linked to it, or from your reliance on this tool, even if PyMC has been advised of the possibility of such damages in advance.”

I typically want to save the model as well, so I just use a small convenience function and Python’s built-in pickle.

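import pickle

def pickle_model(output_path: str, model, trace):
    """Pickle a PyMC3 model and its trace together in one file."""
    with open(output_path, "wb") as buff:
        # Store both objects in a single dict so they can be restored together.
        pickle.dump({"model": model, "trace": trace}, buff)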
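
The matching load step might look like the sketch below. The name `unpickle_model` is made up here; it only assumes the file was written by pickle_model above, so it contains a dict with "model" and "trace" keys.

```python
import pickle


def unpickle_model(input_path: str):
    """Load the {"model": ..., "trace": ...} dict written by pickle_model."""
    with open(input_path, "rb") as buff:
        data = pickle.load(buff)
    return data["model"], data["trace"]
```

As with any pickle, only load files you created yourself, and the session doing the loading needs the same libraries importable as the one that did the saving.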

Note:
If you want to later reuse your model for predictions, you might also want to pickle the theano.shared observed variables.

Would someone mind providing a code example for correctly saving and loading the model, the trace, and the observed variables, so that everything can be easily reinitialised and used for prediction?

If I pickle a pm.Model() object, are its variables and structure going to be saved as well?
Will I, for example, be able to update priors (using the from_posterior example) for online learning when new data is available?

Also, what would be the best way to have a placeholder for the predictors/observed variable in, e.g., logistic regression, so that I can create a Python class BayesianLogisticRegression() without passing any data or even making an assumption about the data shape?

Holy shit, I just killed everything in my folder with pm.save_trace(trace, directory=".", overwrite=True)

Managed to save the open notebook, but everything else is gone. Wasn’t under version control either, so …

I might humbly tag that as undesirable behaviour.
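
For anyone else reading: a small guard like the sketch below, run before any call that uses overwrite=True, would have stopped me. The function `check_overwrite_target` is my own hypothetical helper, not part of PyMC3. It is deliberately conservative and only accepts a directory that does not exist yet or is empty, and it always rejects the current working directory.

```python
from pathlib import Path


def check_overwrite_target(directory):
    """Raise ValueError if overwriting `directory` could destroy unrelated files."""
    target = Path(directory).resolve()
    if target == Path.cwd().resolve():
        raise ValueError(f"Refusing to overwrite the working directory: {target}")
    if target.exists() and not target.is_dir():
        raise ValueError(f"Not a directory: {target}")
    # Any existing content is treated as something worth keeping.
    if target.is_dir() and any(target.iterdir()):
        raise ValueError(f"Directory is not empty: {target}")
    return target
```

With this, `check_overwrite_target(".")` raises before anything is touched, while a fresh sub-folder such as "traces" passes.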

I have found it useful to instead create a builder function that takes the model’s data as input and returns the model and any shared variables. I then sample/fit the model and save those results to file (I use pm.save_trace() for the trace and just pickle everything else). When I need everything back again, I just use the builder to create a new model and shared variables (with the same data as before), and load in the sampling/fit results from disk (using pm.load_trace() for the trace). I’m not sure if this is a method that the core devs would endorse, but it has worked so far. :slight_smile:
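
A rough sketch of what such a builder can look like, assuming PyMC3 3.x with Theano. The linear regression, the variable names, and the "linreg_trace" directory used in the comments are placeholders for illustration, not my actual model.

```python
def build_model(x, y):
    """Rebuild the same model and shared variables from the same data.

    Assumes PyMC3 3.x with Theano; the regression itself is only a placeholder.
    """
    # Imported lazily so the sketch can be read without PyMC3 installed.
    import numpy as np
    import pymc3 as pm
    import theano

    # A shared variable lets the predictors be swapped out later for prediction.
    x_shared = theano.shared(np.asarray(x, dtype="float64"))

    with pm.Model() as model:
        slope = pm.Normal("slope", mu=0.0, sd=10.0)
        intercept = pm.Normal("intercept", mu=0.0, sd=10.0)
        noise = pm.HalfNormal("noise", sd=1.0)
        pm.Normal("y_obs", mu=intercept + slope * x_shared, sd=noise, observed=y)

    return model, {"x": x_shared}


# First session:
#     model, shared = build_model(x, y)
#     with model:
#         trace = pm.sample()
#     pm.save_trace(trace, directory="linreg_trace")
#
# Later session, with the same x and y:
#     model, shared = build_model(x, y)
#     with model:
#         trace = pm.load_trace("linreg_trace")
```

Because the model is always rebuilt by the same function, the trace is loaded into a model with the same variables that produced it.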

I hope this helped.
