Using stick breaking algorithm example but receiving gradient problem

Good evening!
I am attempting to use the example found at http://docs.pymc.io/notebooks/dependent_density_regression.html# to model a simpler data set so I know what is going on within the example. I am able to successfully run the example above itself so it must be an issue on my end.

import pymc3 as pm
import numpy as np
import matplotlib.pyplot as plt
from theano import shared
import theano.tensor as tt
import pandas as pd

x1 = np.linspace(0., 9.9, 10) 
x2 = np.linspace(10., 30.9, 30)
x = np.concatenate((x1, x2), axis = 0)

y1 = 10.*x1 + 5 + np.random.ranf(x1.size)*100.
y2 = y1[-1:] + 40.*x2 + -40.*x2[0] + np.random.ranf(x2.size)*300.
y = np.concatenate((y1, y2), axis = 0)

plt.plot(x1, y1, linestyle = 'None', marker = '.')
plt.plot(x2, y2, linestyle = 'None', marker = '.')

def norm_cdf(z):
    return 0.5 * (1 + tt.erf(z / np.sqrt(2)))

def stick_breaking(v):
    return v * tt.concatenate([tt.ones_like(v[:, :1]), tt.extra_ops.cumprod(1. - v, axis=1)[:, :-1]], axis=1)

x = x[:, np.newaxis]
y = y[:, np.newaxis]
X = shared(x, broadcastable=(False, True))
K = 20

with pm.Model() as model:
    alpha = pm.Normal('alpha', 0., 1., shape=K)
    beta = pm.Normal('beta', 0., 1., shape=K)
    v = norm_cdf(alpha + beta * X)
    w = pm.Deterministic('w', stick_breaking(v))
    
    gamma = pm.Normal('gamma', 0., 10., shape=K)
    delta = pm.Normal('delta', 0., 10., shape=K)
    mu = pm.Deterministic('mu', gamma + delta * X)
    
    tau = pm.Gamma('tau', 1., 1., shape=K)
    obs = pm.NormalMixture('obs', w, mu, tau=tau, observed=y)

with model:
    trace = pm.sample(20000, init='advi+adapt_diag', progressbar = True)

The error is: “Auto-assigning NUTS sampler…
Initializing NUTS using advi+adapt_diag…
ERROR (theano.gof.opt): Optimization failure due to: local_grad_log_erfc_neg
ERROR (theano.gof.opt): node: Elemwise{true_div,no_inplace}(Elemwise{mul,no_inplace}.0, Elemwise{erfc,no_inplace}.0)
ERROR (theano.gof.opt): TRACEBACK:
ERROR (theano.gof.opt): Traceback (most recent call last):
File “/Users/dhooghe/anaconda3/lib/python3.6/site-packages/theano/gof/opt.py”, line 2019, in process_node
replacements = lopt.transform(node)
File “/Users/dhooghe/anaconda3/lib/python3.6/site-packages/theano/tensor/opt.py”, line 6780, in local_grad_log_erfc_neg
if not exp.owner.inputs[0].owner:
AttributeError: ‘NoneType’ object has no attribute ‘owner’”

I am fundamentally missing something here so any illumination that you could give would be greatly appreciated. Thank you!

Could you please put your notebook on gist? It is not rendering here.

A quick thing to try is to change the initialization to init='adapt_diag'. The jitter sometimes push the starting value outside of support.

Here is the gist: https://gist.github.com/jdhooghe/9719a464071bb9fa899ca31a1a1fa6d3

Sorry about the terrible formatting earlier.

No problem!

So here is my typical workflow of how to debug a model:

1, I run the code until before sampling, the model complied without a problem.

2, Try to sample using the default option:

with model:
    trace = pm.sample()

and confirm the same error. (tips: run with njobs=1 or cores=1 if you are on master, as the error is much cleaner)

3, run with

with model:
    trace = pm.sample(cores=1, init='adapt_diag')

and it can sample, indeed as explained above, the jitter is causing the problem in this case (we are trying to find a proper solution for it.)

4, However, you got lots of warning:

The chain contains only diverging samples. The model is probably misspecified.
The gelman-rubin statistic is larger than 1.4 for some parameters. The sampler did not converge.
The estimated number of effective samples is smaller than 200 for some parameters.

That’s quite likely because we have too few data and too many parameters. Also mixture model is difficult to sample, see also some of the related post on this discourse: Gaussian Mixture of regression, Mixture with multiple observations.

5, Try to optimized the model by setting K=2, got a ValueError: Bad initial energy: inf. The model might be misspecified.

6, debug following Get nan or inf from model.logp (model.test point) is an attestation of incorrectly configured model?

for RV in model.basic_RVs:
    print(RV.name, RV.logp(model.test_point))

alpha -1.8378770664093453
beta -1.8378770664093453
gamma -6.443047252397437
delta -6.443047252397437
tau_log__ -2.0
obs -inf

found -inf logp of the node obs

7, double check the input to obs and its logp:

# use .tag.test_value to check the current default
w.tag.test_value
mu.tag.test_value
tau.tag.test_value

the problem is that w is not summed to 1, a normalization within the model should do the trick.

    w_ = stick_breaking(v)
    w = pm.Deterministic('w', w_/w_.sum(axis=1, keepdims=True))

As a final note:

Mixture models are difficult to sample, please be very very careful!!!

Thank you so much, junpenglao. Your post not only helped with the main post but gave me invaluable tools for future problems I might have. Unfortunately I’m still getting infinite energies but perhaps that is just because, as you said, they’re hard to sample. I have some follow up questions if you don’t mind.

  1. How would you construct a more robust model for this hierarchal data? I was thinking perhaps a categorical probability?

  2. why did the original example not need normalization for the mixtures?

Thank you so much!

Depends on your model. From the data generation, it seems that it would be best to model it as two linear regression with a latent break point. Currently, you are modelling the mixture weight as a linear function of X also, which is quite difficult and not at all realistic to your data generation process.

When you are using lots of component (large K), the stick-breaking is almost summed to 1.