How to save PyMC v5 models?

What is a recommended v5 way to save a PyMC model object?

I looked at pickle but got AttributeError: Can't pickle local object '_make_nice_attr_error.<locals>.fn'.

Maybe dill would work, but I wanted to see what people generally do in the current state.

dill · PyPI

Here’s a tutorial: Using ModelBuilder class for deploying PyMC models — PyMC example gallery
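Roughly, you subclass ModelBuilder once and then saving/loading looks like this sketch (paraphrasing the tutorial; MyLinearModel, X, y and X_new are placeholders for your own subclass and data):

from pymc_experimental.model_builder import ModelBuilder  # pip install pymc-experimental

# MyLinearModel is a hypothetical ModelBuilder subclass that implements
# build_model(), _data_setter(), etc. as shown in the tutorial.
model = MyLinearModel()
idata = model.fit(X, y)                     # wraps pm.sample()

model.save("linear_model.nc")               # persists trace + model config
model_2 = MyLinearModel.load("linear_model.nc")
preds = model_2.predict(X_new)              # out-of-sample prediction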


The ModelBuilder class is clearly the way to go, but if you’re looking for a quick and dirty solution, I’ve been wrapping my trace and model inside a Python dict and saving it as a pickle.

import cloudpickle

pickle_filepath = 'path/to/pickle.pkl'

# Bundle everything needed to reload later: the model, the trace, and any
# preprocessing objects (here a z-score recovery dict).
dict_to_save = {'model': model_name,
                'idata': idata,
                'recovery_dict': z_score_recovery_dict,
                }

with open(pickle_filepath, 'wb') as buff:
    cloudpickle.dump(dict_to_save, buff)

Then the load would be:

import cloudpickle
import pymc as pm

pickle_filepath = 'path/to/pickle.pkl'
with open(pickle_filepath, 'rb') as buff:
    model_dict = cloudpickle.load(buff)

idata = model_dict['idata']
model = model_dict['model']

# The reloaded model behaves like the original, so posterior predictive
# sampling proceeds as usual:
with model:
    ppc_logit = pm.sample_posterior_predictive(idata)

I’ve had issues saving NetCDF files on Databricks, and as long as you keep the Python and cloudpickle versions consistent between saving and loading, you should be ok.
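For contrast, the NetCDF route I had trouble with is just the standard ArviZ round trip (file name made up); note it only stores the InferenceData, not the model object, which is another reason the pickled dict is handy:

import arviz as az

idata.to_netcdf('path/to/trace.nc')            # stores only the InferenceData
idata_loaded = az.from_netcdf('path/to/trace.nc')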


@twiecki What about the case of model checkpointing? I am working on a compute cluster where I may get pre-empted after a certain amount of time. Is there any way I can save the model at set intervals with this workflow, so it can be loaded later to continue sampling where I left off?

Currently that’s not supported. You could just sample 100 draws at a time, then save, then continue, etc. There’s also GitHub - pymc-devs/mcbackend: A backend for storing MCMC draws, by @michaelosthege, for storing traces on another machine.
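A rough sketch of that pattern (build_model is a placeholder for however you construct your model; note each chunk re-tunes, so it is not identical to one long run):

import arviz as az
import cloudpickle
import pymc as pm

model = build_model()     # placeholder: returns your pm.Model
chunks = []
for i in range(5):        # 5 chunks of 100 draws each
    with model:
        chunk = pm.sample(draws=100, tune=500)
    chunks.append(chunk)
    # Checkpoint after every chunk, so a pre-emption loses at most one chunk.
    with open(f'checkpoint_{i}.pkl', 'wb') as buff:
        cloudpickle.dump(chunk, buff)

# Stitch the chunks back together into one trace.
idata = az.concat(chunks, dim='draw')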

@twiecki That makes sense, I really appreciate your response! I assume a model sampled for 200 draws is roughly equivalent to one sampled for 100 draws, saved, loaded, and then sampled for 100 more.

Hello Kraftfaust

We also ran into file size constraints in Databricks while saving the .nc files.
As this thread is two years old, have you found a better solution for saving and loading the model to MLflow in Databricks?

Best wishes

Hello,

Great question. So far, saving and loading a cloudpickle object has been sufficient for my day-to-day needs. However, I have also been wondering if there is a more robust MLflow-based solution.

On a previous non-PyMC project I built out a custom MLflow pyfunc class which was flexible enough to handle a variety of edge cases. That’s where I would start if I wanted to wrap a PyMC model in MLflow: mlflow.pyfunc.

Apologies for jumping in, but your answer seemed to line up with a recent question I had. For out-of-sample prediction, how are you handling the use of pyfunc?

MLflow recommends passing a config file to a custom predict method, but to my knowledge we can’t do that as directly in PyMC. Are you just putting the function that defines your PyMC model inside some fit method of the class, and MLflow allows this to be served on its endpoint? The MLflow documentation has trivial examples, and I’m not really sure how to proceed at this time.

Thank you for your time!

Hello, no apologies necessary, happy to try to help. It’s been a while since I built out the pyfunc class, but yes, I would wrap all your PyMC model inference calls inside the custom predict method of an MLflow wrapper class. When you call .predict() on an MLflow model object you are actually calling that custom predict method, so whatever you put in there will be run, and it can be anything. And then yes, when you host and serve the MLflow endpoint and call predict, everything in that .predict() method is run.

# Model wrapper class
# Classic Iris example that you can modify to use:

import mlflow
import pandas as pd


class ModelWrapper(mlflow.pyfunc.PythonModel):
    # Initialize with an already-fitted model in the constructor
    def __init__(self, model):
        self.model = model

    # Prediction function
    def predict(self, context, model_input):
        # This is where you would add your PyMC custom code, e.g.:
        # with model:
        #     ...

        # Predict the classes and class probabilities
        class_labels = ["setosa", "versicolor", "virginica"]
        predictions = self.model.predict(model_input)
        predicted_probabilities = self.model.predict_proba(model_input)

        # Create a DataFrame to hold the results
        result = pd.DataFrame(
            predicted_probabilities,
            columns=[f'prob_{label}' for label in class_labels],
        )
        result['prediction'] = [class_labels[p] for p in predictions]

        return result
Deploy Python code with Model Serving | Databricks on AWS
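To get it behind an endpoint, you would then log and reload the wrapper roughly like this (a sketch; "iris_model" is a made-up artifact path, and model / X_new are your fitted estimator and new input data):

import mlflow

with mlflow.start_run() as run:
    mlflow.pyfunc.log_model(
        artifact_path="iris_model",
        python_model=ModelWrapper(model),
    )

# Later (or behind a serving endpoint), load it back and predict:
loaded = mlflow.pyfunc.load_model(f"runs:/{run.info.run_id}/iris_model")
result = loaded.predict(X_new)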

Oh, that makes it look way easier than the MLflow docs suggest xD

So if my use case were more to train a model, save it, and then do out-of-sample prediction (after I have a trained model), would it be appropriate within the wrapper to do something like:

# Model wrapper class
# PyMC version of the Iris wrapper above (rough sketch):

import mlflow
import pandas as pd
import pymc as pm


class ModelWrapper(mlflow.pyfunc.PythonModel):
    # Initialize model in the constructor
    def __init__(self):
        # assume I have some logic here to check if it's been fit already
        self.model = None
        self.idata = None

    # Build (and fit) the model
    def build_model(self, X, y):
        # This is where you would add the PyMC custom code,
        # putting X and y into pm.Data containers.
        with pm.Model() as model:
            # ... priors and likelihood go here ...
            self.idata = pm.sample()
        self.model = model

    # Prediction function
    def predict(self, context, model_input):
        # Point the data container at the new data and sample the posterior
        # predictive ("X" / "y_obs" are whatever names were used when
        # building the model).
        with self.model:
            pm.set_data({"X": model_input})
            ppc = pm.sample_posterior_predictive(self.idata)

        # Leaving this part loose because I can just convert the ArviZ
        # output to pandas via numpy, e.g. the posterior predictive mean:
        preds = ppc.posterior_predictive["y_obs"].mean(dim=("chain", "draw"))
        result = pd.DataFrame({"prediction": preds.values})

        return result

Right now, for out-of-sample prediction I am basically dumping the model metadata to a NetCDF file that MLflow logs and stores for me… but this seems inelegant and slower than necessary. For instance, if I train a BART model, I must also save the file that contains the split rules, which seems silly (I realize that pickle and the like implicitly do this, but it just feels like I’m trying to do too much here).

After I train a model, I basically just want to save it and call it from another notebook. I could do this with a pickle, but as you know, I would lose a lot of the MLflow functionality.
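In other words, something like this rough sketch (file name made up) is closer to what I’m imagining: combine the cloudpickled dict from earlier in the thread with MLflow artifact logging, so the run is still tracked but I can pull the model into another notebook:

import cloudpickle
import mlflow

# Save the fitted model + trace and attach it to an MLflow run.
with mlflow.start_run() as run:
    with open("bart_model.pkl", "wb") as buff:
        cloudpickle.dump({"model": model, "idata": idata}, buff)
    mlflow.log_artifact("bart_model.pkl")

# In the other notebook: pull the artifact back down and unpickle it.
local_path = mlflow.artifacts.download_artifacts(
    run_id=run.info.run_id, artifact_path="bart_model.pkl"
)
with open(local_path, "rb") as buff:
    model_dict = cloudpickle.load(buff)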