Numerical Issues with StickBreaking in ADVI

After further diagnosis, it turns out that the distribution of decomp is not multimodal. Using the NUTS sampler reveals distributions of decomp_stickbreaking__ that could be approximated nicely by ADVI with a normal distribution:

>>> trace = pm.sample(model=model, draws=5000)
>>> ax = pd.DataFrame(trace['decomp_stickbreaking__']).plot.kde()
>>> ax.set_xlim(-10, 10)

[image: KDE of the marginals of decomp_stickbreaking__ from the NUTS trace]
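To test that claim directly, one could overlay samples drawn from the mean-field approximation on the same KDE. This is only a sketch; it assumes mean_field is the Approximation returned by the earlier pm.fit(method='advi') call in this thread:

>>> # hypothetical: `mean_field` is the Approximation from the earlier ADVI fit
>>> approx_trace = mean_field.sample(5000)
>>> ax = pd.DataFrame(trace['decomp_stickbreaking__']).plot.kde()
>>> pd.DataFrame(approx_trace['decomp_stickbreaking__']).plot.kde(ax=ax, ls='--')
>>> ax.set_xlim(-10, 10)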
The transformed decomp_stickbreaking__ seems biased, but this may just be an artifact of the StickBreaking transformation, since decomp itself looks as desired:

>>> np.mean(trace['decomp'], axis=0)
array([0.10333791, 0.09897287, 0.10272751, 0.09945381, 0.09798678,
       0.09716101, 0.09705229, 0.09979954, 0.10609969, 0.09740859])
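For a reference point on that bias, one could map the target composition (all ten weights equal to 0.1) through the forward transform and compare it against the peaks of the KDE above. This is only a sketch using the same StickBreaking transform as further below; note that the exact parametrization of the forward map differs between PyMC3 versions:

>>> from pymc3.distributions.transforms import StickBreaking
>>> # unconstrained coordinates of the ideal uniform composition,
>>> # to compare against the KDE modes above
>>> StickBreaking().forward(np.full(10, 0.1)).eval()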

However, sampling with NUTS produced some divergences:

>>> trace['diverging'].nonzero()[0].size
106

The points of divergence seem to be at low values of the first coordinates of decomp:

import matplotlib.pyplot as plt

def pairplot_divergence(trace, var1, var2, i1=0, i2=0):
    # scatter two selected coordinates against each other and highlight divergent samples
    v1 = trace.get_values(varname=var1, combine=True)[:, i1]
    v2 = trace.get_values(varname=var2, combine=True)[:, i2]
    _, ax = plt.subplots(1, 1, figsize=(10, 5))
    ax.plot(v1, v2, 'o', color='b', alpha=.5)
    divergent = trace['diverging']
    ax.plot(v1[divergent], v2[divergent], 'o', color='r')
    ax.set_xlabel('{}[{}]'.format(var1, i1))
    ax.set_ylabel('{}[{}]'.format(var2, i2))
    ax.set_title('scatter plot between {}[{}] and {}[{}]'.format(var1, i1, var2, i2))
    return ax

pairplot_divergence(trace, 'decomp', 'decomp', i1=0, i2=1)

[image: scatter of decomp[0] vs decomp[1] with divergent samples highlighted]
In the image above, which looks at the first two coordinates of decomp, the divergences cluster close to the origin, as opposed to the image below, which looks at the last two coordinates of decomp.

pairplot_divergence(trace, 'decomp', 'decomp', i1=8, i2=9)

[image: scatter of decomp[8] vs decomp[9] with divergent samples highlighted]
However, looking at the transformed values does not reveal such a tendency:

pairplot_divergence(trace, 'decomp_stickbreaking__', 'decomp_stickbreaking__', i1=0, i2=8)

[image: scatter of decomp_stickbreaking__[0] vs decomp_stickbreaking__[8] with divergent samples highlighted]
So I still suspect that the StickBreaking transformation itself introduces some numerical difficulties.
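One way to probe this suspicion is a round-trip check: push a few unconstrained points, including extreme ones, through backward and then forward again and look at the reconstruction error. This is only a sketch and the test points are arbitrary:

import numpy as np
from pymc3.distributions.transforms import StickBreaking

sb = StickBreaking()

# one moderate point and one whose extreme first coordinates push the
# corresponding simplex weights towards zero
y_moderate = np.zeros(9)
y_extreme = np.array([-12., -10., 0., 0., 0., 0., 0., 0., 0.])

for y in (y_moderate, y_extreme):
    x = sb.backward(y).eval()        # simplex point implied by y
    y_back = sb.forward(x).eval()    # map it back to the unconstrained space
    # a large round-trip error here would indicate a loss of precision in the transform
    print(np.abs(y_back - y).max(), x.min())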

Increasing target_accept and adding more tuning steps, however, seems to remove this issue entirely:

>>> trace2 = pm.sample(model=model, draws=50000, tune=2000, target_accept=.99)
>>> trace2['diverging'].nonzero()[0].size
0
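To confirm that the higher target_accept actually forced smaller steps, the adapted step sizes of the two runs can be compared. This assumes the default NUTS sampler statistics are recorded under 'step_size_bar', alongside the 'diverging' flag used above:

>>> # average adapted step size: original run vs. target_accept=.99 run
>>> trace['step_size_bar'].mean(), trace2['step_size_bar'].mean()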

One could argue that extremely large gradients throw off the stochastic optimization of ADVI as well; the high sensitivity of the model to the choice of ADVI optimizer further supports this theory. Just as the NUTS divergences could be eliminated by increasing the target acceptance rate, and hence decreasing the step size, using a much more conservative stochastic optimizer seems to mitigate the problem:

>>> mean_field2 = pm.fit(model=model, method='advi', n=int(1e6), progressbar=False,
...                      obj_optimizer=pm.adamax(learning_rate=1e-4, beta1=1-1e-4, beta2=1-1e-5))
>>> decomp = mean_field2.bij.rmap(mean_field2.mean.get_value())
>>> from pymc3.distributions.transforms import StickBreaking
>>> print(StickBreaking().backward(decomp['decomp_stickbreaking__']).eval())
[0.06584135 0.06966171 0.07110197 0.07978009 0.08543297 0.08582187
 0.11319634 0.11781051 0.15662726 0.15472593]
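Since such a conservative optimizer may simply not have converged after 1e6 iterations, it is also worth looking at the ELBO history before reading too much into the result. A sketch of how one might do that, re-running the same fit through a pm.ADVI object so the objective history is available on its hist attribute (variable names here are hypothetical):

>>> import matplotlib.pyplot as plt
>>> inference = pm.ADVI(model=model)
>>> approx = inference.fit(n=int(1e6), progressbar=False,
...                        obj_optimizer=pm.adamax(learning_rate=1e-4, beta1=1-1e-4, beta2=1-1e-5))
>>> # a still-decreasing tail would suggest the remaining bias is partly an
>>> # optimization artifact rather than a property of the approximation
>>> plt.plot(inference.hist)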

There is still a notable bias in the result, so I would welcome any suggestions on how to improve it.