How can NUTS sample even when it fails to differentiate?

This model learns an optimal value for x.
However, we cannot compute the derivative of s with respect to x. I show a TensorFlow example below (Example 2) to verify this. I am more familiar with TF, so I use it instead of Theano, assuming the two are equally powerful at automatic differentiation.

NUTS is a variant of HMC, and thus needs the first-order derivative of the log posterior to guide the sampling.
So why does it still work here?
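
For reference, here is a schematic of the leapfrog integrator at the heart of HMC/NUTS (a minimal sketch, not PyMC3's actual implementation), showing exactly where the gradient enters:

import numpy as np

def leapfrog(q, p, grad_logp, eps):
    # One leapfrog step: the gradient of the log posterior steers the
    # trajectory; without it, only the momentum moves the position.
    p = p + 0.5 * eps * grad_logp(q)  # half-step on momentum
    q = q + eps * p                   # full step on position
    p = p + 0.5 * eps * grad_logp(q)  # half-step on momentum
    return q, p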

Example 1:

import numpy as np
import pymc3 as pm
import theano
import theano.tensor as tt

with pm.Model() as model:
    x = pm.Normal('x', mu=0, sd=3)

    a = theano.shared(np.array([1., 2., 3., 4., 5.]))
    b = a + x
    c = a[(b > 5)]  # boolean mask: x only affects which entries are selected
    s = tt.sum(c)

    sigma = pm.HalfCauchy('sigma', 2.5)
    y_obs = pm.Normal("s", mu=s, sd=sigma, observed=15)

    trace = pm.sample(500)

Example 2:

import tensorflow as tf

a = tf.constant([1., 2., 3., 4., 5.])
x = tf.constant([3.])
# x = tf.get_variable("x1", initializer=tf.constant([3.]))
b = a + x
c = tf.boolean_mask(a, b > 5)  # x enters only through the boolean condition
s = tf.reduce_sum(c)
ga = tf.gradients(s, [a])  # well-defined: gradient flows to the selected entries
gx = tf.gradients(s, [x])

with tf.Session() as sess:
    print(sess.run(s))
    print(sess.run(ga))
    print(sess.run(gx))  # This fails; same for a variable x.
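
(For the record, tf.gradients(s, [x]) returns [None] here: the only path from x to s goes through the boolean condition b > 5, which has no gradient, and sess.run then rejects the None fetch.)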

Stealing part of Dave’s comment from a TFP issue here:
The way you’ve written the model, with a continuous prior, all RVs are, technically speaking, continuous variables, and it seems to work because HMC resorts to random-walk behavior in the absence of gradients. So while there is no gradient of s w.r.t. x, the sampler follows the initial random momentum in each HMC step, essentially yielding a kind of discrete MH proposal.
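
To make that concrete, here is a minimal sketch (illustrative only) of what an HMC proposal collapses to when the gradient term vanishes:

import numpy as np

rng = np.random.default_rng(0)

def hmc_proposal_without_grad(q, eps=0.1, n_steps=10):
    # With grad(log p) == 0, each leapfrog update reduces to
    # q <- q + eps * p, so the whole trajectory is a straight line
    # along the initial random momentum: a Gaussian random-walk proposal.
    p = rng.standard_normal(np.shape(q))
    return q + n_steps * eps * p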

I see. HMC degenerates to MH in the absence of gradients.
Thank you very much, @junpenglao!

Actually, my answer is not correct.
You can take the gradient w.r.t. x (see the TensorFlow example below).


So my guess is that the posterior is piecewise differentiable; as long as the sampler (i.e., the leapfrog steps) does not land exactly on a discontinuity in the posterior parameter space, everything should work fine.
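
A quick numpy check of that piecewise picture (an illustrative sketch reusing the model's masking):

import numpy as np

a = np.array([1., 2., 3., 4., 5.])

def s_of_x(x):
    # Same masking as in the model: x only decides which entries survive.
    b = a + x
    return a[b > 5].sum()

# s is a step function of x: flat almost everywhere (zero gradient),
# jumping only where an entry of b crosses 5, i.e. at x = 0, 1, 2, 3, 4.
for x in [0.5, 0.9, 1.1, 2.5, 3.5]:
    print(x, s_of_x(x))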
As for the TensorFlow example: since sampling and gradient evaluation use the log-likelihood as the target, you might want to replicate that setup in TensorFlow as well.

Something like:

import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions

tf.reset_default_graph()
a = tf.constant([1., 2., 3., 4., 5.])
x = tf.get_variable("x", initializer=tf.constant(3.))
sigma = tf.get_variable("sigma", initializer=tf.constant(0.1))

b = a + x
c = tf.boolean_mask(a, b > 5)
s = tf.reduce_sum(c)

loglike = tfd.Normal(loc=s, scale=sigma).log_prob(15.) + tfd.HalfCauchy(
    loc=0., scale=2.5).log_prob(sigma) + tfd.Normal(loc=0., scale=3.).log_prob(x)

ga = tf.gradients(loglike, [a])
gx = tf.gradients(loglike, [x])

with tf.Session() as sess:
    init = tf.global_variables_initializer()
    sess.run(init)
    print(sess.run(s))
    print(sess.run(ga))
    print(sess.run(gx))  # now well-defined: x appears directly in its prior term
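
For these initial values (x = 3, sigma = 0.1), working it out by hand: s = 12, ga ≈ [0, 0, 300, 300, 300] (the likelihood factor (15 - s) / sigma² = 300 flows only to the masked-in entries of a), and gx ≈ -0.333, which comes entirely from the Normal(0, 3) prior (-x / 3² = -1/3), since ds/dx is zero wherever it is defined.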

Thank you so much for your detailed further investigation, Junpeng.
As I see in your TF example, the gradient of x comes only from its prior, which pulls the sampler toward values of x around zero. By its nature, though, it is still doing model selection in terms of x.

Thanks again for your help! I've really learned a lot of useful skills from you.


That’s a great observation! It would be interesting to run some experiments to see how much it affects the efficiency of HMC sampling.