How to make a TensorFlow ML model compatible with PyMC3?

I am trying to implement MCMC using PyMC3. The issue is that I need to feed the PyMC distributions a, b, c into a machine learning model (built in TensorFlow). However, the TensorFlow model doesn’t recognize the distributions. Is there a way to make the TensorFlow ML model compatible with PyMC3?

ndraws=100
nburn=30

with pm.Model() as mcmc_model:
    
    a = pm.Uniform('yn', lower=15, upper=17)
    b = pm.Uniform('Tion', lower=3.5, upper=6.5)
    c = pm.Uniform('dsr', lower=2, upper=6)
    params = pm.math.stack([a, b, c])

    ml_output = pm.Deterministic('output', model.predict(params))
    pm.Normal('likelihood', mu=ml_output, sd=0.01*ml_output, observed=data)
    trace = pm.sample(ndraws, tune=nburn, discard_tuned_samples=True)

# plot the traces
_ = az.plot_trace(trace)

Yes! I have done this a few times, but it will take some legwork. Long story short, you will need to write a Theano Op that builds a bridge between PyMC/Theano and your TensorFlow model. There is an excellent blog post by @dfm on how to do this, which you can find here. You should be able to adapt that code to your problem.
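
To give a rough idea of the shape of that bridge, here is a minimal sketch (not the blog post’s code) of a forward-pass Op wrapping a hypothetical trained Keras/TensorFlow model called keras_model. Note that without a grad method you’d be restricted to gradient-free samplers, which is what the rest of this thread gets into.

import numpy as np
import theano.tensor as tt

class TFModelOp(tt.Op):

    itypes = [tt.dvector]  # the parameter vector theta = [a, b, c]
    otypes = [tt.dvector]  # the model prediction

    def __init__(self, keras_model):
        self.keras_model = keras_model  # hypothetical trained TF/Keras model

    def perform(self, node, inputs, outputs):
        (theta,) = inputs
        # The sampler hands us concrete numbers here, so the TF model can be called
        pred = self.keras_model.predict(theta[None, :])
        outputs[0][0] = np.asarray(pred, dtype=np.float64).ravel()

With something like this in place, the line in your model would become ml_output = pm.Deterministic('output', TFModelOp(keras_model)(params)) instead of calling model.predict on a symbolic tensor directly.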


Thank you @jlindbloom for your pointers!

I wrote a custom Theano Op for my TensorFlow model, but without the grad method, so I can’t use any gradient-based sampling techniques. How would one compute the gradient of an ML model prediction?

You will need to use some code involving tf.gradients / tf.GradientTape (see more about these here). The blog post I linked to above uses tf.gradients; I think your surest bet would be to mimic that code. The idea here is that your ML model is just some function F: \Theta \to \mathbb{R}^d. Your grad method needs to compute \frac{\partial F}{\partial \theta}, which is exactly what tf.gradients gives you.
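
If it helps, here is a minimal TF2-style sketch of that computation using tf.GradientTape (the blog post does the equivalent in graph mode with tf.gradients). keras_model, theta, and the output-gradient seed g are stand-ins for your own model and the arrays Theano hands you.

import numpy as np
import tensorflow as tf

def model_vjp(keras_model, theta, g):
    """Vector-Jacobian product g . dF/dtheta for a trained Keras model F."""
    theta_t = tf.convert_to_tensor(theta[None, :], dtype=tf.float32)
    g_t = tf.convert_to_tensor(g[None, :], dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(theta_t)
        y = keras_model(theta_t)
    # Reverse-mode autodiff seeded with g: exactly the quantity a Theano
    # grad method is asked to return.
    vjp = tape.gradient(y, theta_t, output_gradients=g_t)
    return vjp.numpy().ravel().astype(np.float64)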

Maybe this will make the grad Op a little less daunting. Here’s a bare-bones example of a grad Op that I’ve used before:

import numpy as np
import theano.tensor as tt

class GradOp(tt.Op):

    itypes = [tt.dscalar, tt.dvector]  # first input is theta, second is the output-gradient seed
    otypes = [tt.dscalar]

    def __init__(self, nn_model):
        self.nn_model = nn_model  # object exposing a .derivative(theta) method

    def perform(self, node, inputs, outputs):
        theta, g = inputs
        # dF/dtheta evaluated at theta, dotted with the seed g (a vector-Jacobian product)
        result = np.asarray(np.squeeze(self.nn_model.derivative(theta)), dtype=np.float64)
        outputs[0][0] = np.dot(result, g)
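
For completeness, here is a hedged sketch (reusing the imports above) of how a GradOp like this gets wired into the forward Op; nn_model is a stand-in object with predict and derivative methods, not a real TensorFlow API.

class ForwardOp(tt.Op):

    itypes = [tt.dscalar]  # theta
    otypes = [tt.dvector]  # model prediction F(theta)

    def __init__(self, nn_model):
        self.nn_model = nn_model
        self.grad_op = GradOp(nn_model)

    def perform(self, node, inputs, outputs):
        (theta,) = inputs
        outputs[0][0] = np.asarray(self.nn_model.predict(theta), dtype=np.float64)

    def grad(self, inputs, output_grads):
        (theta,) = inputs
        (g,) = output_grads
        # Theano supplies g, the gradient w.r.t. this Op's output, and asks
        # for the gradient w.r.t. theta; GradOp computes that dot product.
        return [self.grad_op(theta, g)]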

The difference between this and the grad Op in that post is that, in reality, getting the derivative from TensorFlow isn’t as simple as calling something like nn_model.derivative(). The grad Op from the post is:

class _TensorFlowGradOp(tt.Op):
    """A custom Theano Op defining the gradient of a TensorFlowOp
    
    Args:
        base_op (TensorFlowOp): The original Op
    
    """
    def __init__(self, base_op):
        self.base_op = base_op
        
        # Build the TensorFlow operation to apply the reverse mode
        # autodiff for this operation
        # The placeholder is used to include the gradient of the
        # output as a seed
        self.dy = tf.placeholder(tf.float64, base_op.output_shape)
        self.grad_target = tf.gradients(base_op.target,
                                        base_op.parameters,
                                        grad_ys=self.dy)

        # This operation will take the original inputs and the gradient
        # seed as input
        types = [_to_tensor_type(shape) for shape in base_op.shapes]
        self.itypes = tuple(types + [_to_tensor_type(base_op.output_shape)])
        self.otypes = tuple(types)
 
    def infer_shape(self, node, shapes):
        return self.base_op.shapes

    def perform(self, node, inputs, outputs):
        feed_dict = dict(zip(self.base_op.parameters, inputs[:-1]),
                         **self.base_op._feed_dict)
        feed_dict[self.dy] = inputs[-1]
        result = self.base_op.session.run(self.grad_target, feed_dict=feed_dict)
        for i, r in enumerate(result):
            outputs[i][0] = np.array(r)

You can think of the extra code in the __init__ and perform methods as the extra overhead needed to actually fetch the gradients from TensorFlow.

p.s. If you are able to share your code through something like a Google Colab notebook, I’d be happy to help you get this working.


I can’t share my code easily, so I’ve put together an example using the Boston housing dataset and a scikit-learn RandomForest instead: Google Colab notebook. I tried implementing it following the link you attached, but I had issues getting it to compile and run.

The first issue you would run into is that it isn’t clear to me how you can get the gradient term \frac{\partial F}{\partial \theta} from the scikit-learn interface, particularly for RandomForestRegressor. I think this is probably a herculean task and you would have to write a bunch of code for it yourself. If you were just doing something like a multivariate linear regression, this wouldn’t be too hard, since you could access the coefficient/intercept matrices and pass them through an analytic formula for the model gradient.
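
To make the linear case concrete: for a fitted sklearn LinearRegression (call it fitted_lr, a name I’m making up here), F(\theta) = \text{coef\_} \, \theta + \text{intercept\_}, so the Jacobian is just coef_ and the vector-Jacobian product your grad Op needs is a one-liner:

import numpy as np

def linear_model_vjp(fitted_lr, g):
    """g . dF/dtheta for a fitted sklearn LinearRegression F."""
    coef = np.atleast_2d(fitted_lr.coef_)  # shape (n_outputs, n_features)
    return g @ coef                        # shape (n_features,)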

The second issue is more fundamental: I don’t think you are playing to the strengths of PyMC (and Bayesian analysis in general) by doing this. By training a neural network you’re estimating a single set of parameters \theta^* for a model that takes you from [AGE, RM, DIS] to [PRICE, TAX], i.e. a mapping F_{\theta^*}: \{ \text{all ages} \} \times \{ \text{all rm} \} \times \{ \text{all dis} \} \to \{ \text{all prices} \} \times \{ \text{all taxes} \}. You’re then trying to gauge the uncertainty (given \theta^*) in [AGE, RM, DIS] given an observation of [PRICE, TAX]. IMO the “Bayesian” way of doing this would be to estimate a mapping in the reverse direction, with uncertainties in \theta, i.e. the mapping F_{\theta}: \{ \text{all prices} \} \times \{ \text{all taxes} \} \to \{ \text{all ages} \} \times \{ \text{all rm} \} \times \{ \text{all dis} \}; then, given some [PRICE, TAX], you would automatically get the uncertainties on [AGE, RM, DIS]. These uncertainties would include the parametric uncertainties coming from your knowledge of \theta, in addition to the noisiness of the observations.

This basically comes down to the difference between estimating a posterior on your machine learning model parameters \theta versus fixing \theta^* and asking for the posterior on inputs given some outputs. You certainly can do the latter, but in this case it isn’t clear to me how you could get the model gradients from scikit-learn.

As for the former, you have two options. First, you could use a much simpler model (linear regression or some other model you could code in Theano) to estimate the inverse mapping, sample the posterior, and get the uncertainties for free. You wouldn’t have to worry about gradients or a custom Theano Op. Your second option, which gets complicated very quickly, is to use a Bayesian neural network. Here you would be training a neural network for the inverse mapping that learns distributions on its weights and biases rather than single values (the usual way). Then, when you query the neural network for the output given some input [PRICE, TAX], you get back a probability distribution over the vector [AGE, RM, DIS], which seems to be what you’re after in the first place.
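
Here’s a rough sketch of what the first option could look like, with hypothetical arrays X_obs holding the observed [PRICE, TAX] columns and Y_obs holding [AGE, RM, DIS]. Everything is native PyMC3/Theano, so NUTS gets its gradients for free.

import pymc3 as pm

# X_obs: (n, 2) array of [PRICE, TAX]; Y_obs: (n, 3) array of [AGE, RM, DIS]
with pm.Model() as inverse_model:
    W = pm.Normal('W', mu=0.0, sd=10.0, shape=(2, 3))
    b = pm.Normal('b', mu=0.0, sd=10.0, shape=3)
    noise = pm.HalfNormal('noise', sd=1.0)
    mu = pm.math.dot(X_obs, W) + b
    pm.Normal('Y', mu=mu, sd=noise, observed=Y_obs)
    trace = pm.sample(1000, tune=1000)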

I initially thought you were trying to train a fixed-parameter mapping F_{\eta^*} : \Theta \to \mathcal{Y} from a parameter space \Theta to a feature space \mathcal{Y}, but you are actually training a fixed mapping F_{\eta^*}: \mathcal{X} \to \mathcal{Y} between two feature spaces and then trying to treat an underlying x \in \mathcal{X} as your “parameters”. The first approach works for things like surrogate models (I was learning solutions of parametric ODEs), but unfortunately not for your case.


Thank you for your detailed reply, @jlindbloom. It helped with my understanding.

It is not straightforward to me how to compute gradients from an existing trained ML model (from sklearn/TensorFlow/PyTorch) that doesn’t have Theano as its backend. Converting the trained model to Theano may be the way to go. I will leave this to the experts.

Using a Bayesian neural network to obtain probability distributions for a set of parameters \theta^* was my first intuition for solving the problem. After much deliberation, though, it is easier to explain a posterior on \theta^* obtained through the mapping \theta^* \rightarrow \theta (which is described by a semi-analytical physics model) than one obtained through the mapping \theta \rightarrow \theta^* from the learned weight and bias distributions of a BNN.