The ModelBuilder class is clearly the way to go, but if you’re looking for a quick and dirty solution, I’ve been wrapping my trace and model inside a python dict and saving it as a pickle.
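Roughly like this (a sketch; the file and variable names are placeholders, and for models with custom functions cloudpickle tends to be more forgiving than the stdlib pickle):

import pickle

# stash the model and its trace together and pickle the dict
with open("model_and_trace.pkl", "wb") as f:
    pickle.dump({"model": model, "idata": idata}, f)

# later, in a fresh session
with open("model_and_trace.pkl", "rb") as f:
    saved = pickle.load(f)
model, idata = saved["model"], saved["idata"]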
@twiecki What about the case of model checkpointing? I am working on a compute cluster where I may get pre-empted after a certain amount of time. Is there any way I can save the model at set intervals with this workflow, to be loaded and continue sampling where I left off?
@twiecki That makes sense, I really appreciate your response! I assume a model sampled 200 times is roughly equivalent to one that has been sampled 100 times, saved, loaded, and sampled 100 more times.
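Concretely, the save-and-resume workflow would look roughly like this (a sketch; the checkpoint file name is a placeholder, the second run re-tunes from scratch, and az.concat just stacks the two runs along the draw dimension):

import arviz as az
import cloudpickle
import pymc as pm

# assumes an earlier run saved {"model": model, "idata": idata} to checkpoint.pkl with cloudpickle
# after pre-emption: reload the checkpointed model and trace
with open("checkpoint.pkl", "rb") as f:
    saved = cloudpickle.load(f)

# draw another 100 samples with the restored model (tuning starts over)
with saved["model"]:
    idata_more = pm.sample(draws=100)

# stack the old and new draws into one trace
idata_combined = az.concat(saved["idata"], idata_more, dim="draw")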
We also ran into file size constraints in Databricks while saving the .nc files.
As the thread is 2 years old, have you found a better solution to save and load the model to MLflow in Databricks?
Great question. So far saving and loading a cloudpickle object has been sufficient for my day-to-day needs. However, I have also been wondering if there is a more robust MLflow-based solution.
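A lightweight middle ground is to just log the cloudpickle file as a run artifact, roughly like this (a sketch; the file and run names are placeholders):

import cloudpickle
import mlflow

# save model + trace, then attach the file to an MLflow run
with open("pymc_model.pkl", "wb") as f:
    cloudpickle.dump({"model": model, "idata": idata}, f)

with mlflow.start_run(run_name="pymc-model"):
    mlflow.log_artifact("pymc_model.pkl")

# later: download the artifact from the run and cloudpickle.load() it back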
On a previous non-PyMC-based project I built out a custom MLflow pyfunc class which was flexible enough to handle a variety of edge cases. That’s where I would start if I wanted to wrap a PyMC model into MLflow: mlflow.pyfunc.
Apologies for jumping in, but your answer seemed to line up with a recent question I just had. For out-of-sample prediction, how are you handling the use of pyfunc?
MLflow recommends passing a config file to a custom predict method, but to my knowledge we can’t do that as directly in PyMC. Are you just passing the function in which you defined your PyMC model to some fit method of the class, and MLflow allows this to be served on its endpoint? The documentation for MLflow has trivial examples, and I’m not really sure how to proceed at this time.
Hello, no apologies necessary, happy to try to help. It’s been a while since I built out the pyfunc class, but yes, I would wrap all your PyMC model inference calls inside a custom MLflow predict method. When you call .predict() on an MLflow model object you are actually calling your custom class’s predict() method, so whatever you put in there will be run, which can be anything. And then yes, when you host and serve the MLflow endpoint and call predict, everything in .predict() is run.
# Classic Iris example that you can modify to use:
import mlflow.pyfunc
import pandas as pd

# Model wrapper class
class ModelWrapper(mlflow.pyfunc.PythonModel):
    # Initialize model in the constructor
    def __init__(self, model):
        self.model = model

    # Prediction function
    def predict(self, context, model_input):
        # This is where you would add your PyMC custom code, e.g.:
        # with model:
        #     ...
        # Predict the probabilities and class
        class_labels = ["setosa", "versicolor", "virginica"]
        predictions = self.model.predict(model_input)
        prediction_probabilities = self.model.predict_proba(model_input)
        # Create a DataFrame to hold the results
        result = pd.DataFrame(prediction_probabilities, columns=[f'prob_{label}' for label in class_labels])
        result['prediction'] = [class_labels[prediction] for prediction in predictions]
        return result
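Logging and serving the wrapper is then the standard pyfunc flow, roughly (a sketch; clf and X_new are placeholders for your fitted model and new data, and the artifact path is arbitrary):

import mlflow
import pandas as pd

# log the wrapped model to an MLflow run
with mlflow.start_run():
    model_info = mlflow.pyfunc.log_model(
        artifact_path="iris_model",
        python_model=ModelWrapper(clf),
    )

# load it back (or serve it with `mlflow models serve -m <model_uri>`)
loaded = mlflow.pyfunc.load_model(model_info.model_uri)
predictions = loaded.predict(pd.DataFrame(X_new))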
Oh, that makes it look way easier than the MLflow docs suggest xD
So if my use case were more to train a model, save it, and then do out-of-sample prediction (after I have a trained model), would it be appropriate within the wrapper to do something like:
# Iris example adapted to a PyMC workflow:
import mlflow.pyfunc
import numpy as np
import pandas as pd
import pymc as pm

# Model wrapper class
class ModelWrapper(mlflow.pyfunc.PythonModel):
    # Initialize model in the constructor
    def __init__(self):
        # assume I have some logic here to check if it's been fit already
        self.model = None
        self.idata = None

    # Build and fit the model
    def build_model(self, X, y):
        # This is where you would add the PyMC custom code:
        with pm.Model() as model:
            # ... pm.Data containers named "X" and "y", priors, likelihood "y_obs" ...
            self.idata = pm.sample()
        self.model = model

    # Prediction function
    def predict(self, context, model_input):
        # set up new data containers and sample out of sample
        with self.model:
            pm.set_data({"X": model_input})
            post_pred = pm.sample_posterior_predictive(self.idata)
        # leaving the rest here because I can just convert arviz to pandas via numpy
        # Create a DataFrame to hold the results
        class_labels = ["setosa", "versicolor", "virginica"]
        y_pred = post_pred.posterior_predictive["y_obs"]  # "y_obs" = whatever the likelihood is named
        # class probability = share of posterior predictive draws landing in each class
        probs = np.stack([(y_pred == k).mean(dim=("chain", "draw")).values for k in range(len(class_labels))], axis=1)
        result = pd.DataFrame(probs, columns=[f'prob_{label}' for label in class_labels])
        result['prediction'] = [class_labels[i] for i in probs.argmax(axis=1)]
        return result
Right now, for out-of-sample prediction I am basically dumping the model metadata to a netcdf file that MLflow is logging and storing for me, but this seems inelegant and slower than necessary. For instance, if I train a BART model, I must save the file that contains the split rules, which seems silly (I realize that pkl and the like implicitly do this, but it just seems that I’m trying to do too much here).
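For reference, that netcdf route is basically (a sketch; the file and variable names are placeholders):

import arviz as az
import mlflow

# dump the trace/metadata to netcdf and let MLflow store it
idata.to_netcdf("model_idata.nc")
with mlflow.start_run():
    mlflow.log_artifact("model_idata.nc")

# in the other notebook: download the artifact, then
idata = az.from_netcdf("model_idata.nc")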
After I train a model, I basically just want to save it and call it from another notebook. I could do this with pkl, but as you know I lose a lot of the MLflow functionality.