Questions about convergence when sampling hierarchical mixture models

I'm a beginner, and I am following the example "Dirichlet process mixtures for density estimation" on the PyMC3 website.
https://docs.pymc.io/notebooks/dp_mix.html

For the convergence discussion, I post the code here.

from matplotlib import pyplot as plt
import numpy as np
import pymc3 as pm
import scipy as sp
import seaborn as sns
from theano import tensor as tt
import pandas as pd



blue, *_ = sns.color_palette()
SEED = 5132290 # from random.org

np.random.seed(SEED)
N = 20
K = 30

alpha = 2.
P0 = sp.stats.norm

# Stick-breaking construction: draw beta_k ~ Beta(1, alpha) and turn them into weights
beta = sp.stats.beta.rvs(1, alpha, size=(N, K))
w = np.empty_like(beta)
w[:, 0] = beta[:, 0]
w[:, 1:] = beta[:, 1:]*(1 - beta[:, :-1]).cumprod(axis=1)
# Atom locations drawn from the base measure P0
omega = P0.rvs(size=(N, K))
x_plot = np.linspace(-3, 3, 200)
# Each sampled DP realization gives a step-function CDF
sample_cdfs = (w[..., np.newaxis]*np.less.outer(omega, x_plot)).sum(axis=1)
fig, ax = plt.subplots(figsize=(8, 6))

ax.plot(x_plot, sample_cdfs.T, c='gray', alpha=0.75,
        label='DP sample CDFs');
ax.plot(x_plot, P0.cdf(x_plot), c='k', label='Base CDF');

ax.set_title(r'$\alpha = {}$'.format(alpha));
ax.legend(loc=2);
fig, (l_ax, r_ax) = plt.subplots(ncols=2, sharex=True, sharey=True, figsize=(16, 6))  
K = 50 
alpha = 10.  
beta = sp.stats.beta.rvs(1, alpha, size=(N, K)) 
w = np.empty_like(beta) 
w[:, 0] = beta[:, 0] 
w[:, 1:] = beta[:, 1:]*(1 - beta[:, :-1]).cumprod(axis=1)  
omega = P0.rvs(size=(N, K))  
sample_cdfs = (w[..., np.newaxis] * np.less.outer(omega, x_plot)).sum(axis=1)  
l_ax.plot(x_plot, sample_cdfs[0], c='gray', alpha=0.75, label='DP sample CDFs')
l_ax.plot(x_plot, sample_cdfs[1:].T, c='gray', alpha=0.75)
l_ax.plot(x_plot, P0.cdf(x_plot), c='k', label='Base CDF')
l_ax.set_title(r'$\alpha = {}$'.format(alpha))
l_ax.legend(loc=2)
K = 200 
alpha = 50.  
beta = sp.stats.beta.rvs(1, alpha, size=(N, K)) 
w = np.empty_like(beta) 
w[:, 0] = beta[:, 0] 
w[:, 1:] = beta[:, 1:] * (1 - beta[:, :-1]).cumprod(axis=1)  
omega = P0.rvs(size=(N, K))  
sample_cdfs = (w[..., np.newaxis] * np.less.outer(omega, x_plot)).sum(axis=1)  
r_ax.plot(x_plot, sample_cdfs[0], c='gray', alpha=0.75, label='DP sample CDFs')
r_ax.plot(x_plot, sample_cdfs[1:].T, c='gray', alpha=0.75)
r_ax.plot(x_plot, P0.cdf(x_plot), c='k', label='Base CDF')
r_ax.set_title(r'$\alpha = {}$'.format(alpha))
r_ax.legend(loc=2);

N = 5
K = 30

alpha = 2
P0 = sp.stats.norm
f = lambda x, theta: sp.stats.norm.pdf(x, theta, 0.3)

beta = sp.stats.beta.rvs(1, alpha, size=(N, K))
w = np.empty_like(beta)
w[:, 0] = beta[:, 0]
w[:, 1:] = beta[:, 1:] * (1 - beta[:, :-1]).cumprod(axis=1)

theta = P0.rvs(size=(N, K))

dpm_pdf_components = f(x_plot[np.newaxis, np.newaxis, :], theta[..., np.newaxis])
dpm_pdfs = (w[..., np.newaxis] * dpm_pdf_components).sum(axis=1)

fig, ax = plt.subplots(figsize=(8, 6))

ax.plot(x_plot, dpm_pdfs.T, c='gray');

ax.set_yticklabels([]);

old_faithful_df = pd.read_csv(pm.get_data('old_faithful.csv'))
old_faithful_df['std_waiting'] = (old_faithful_df.waiting - old_faithful_df.waiting.mean()) / old_faithful_df.waiting.std()
old_faithful_df.head()

fig, ax = plt.subplots(figsize=(8, 6))

n_bins = 20
ax.hist(old_faithful_df.std_waiting, bins=n_bins, color=blue, lw=0, alpha=0.5);

ax.set_xlabel('Standardized waiting time between eruptions');
ax.set_ylabel('Number of eruptions');

def stick_breaking(beta):
    portion_remaining = tt.concatenate([[1], tt.extra_ops.cumprod(1 - beta)[:-1]])
    return beta * portion_remaining

with pm.Model() as model:
    alpha = pm.Gamma('alpha', 1., 1.)
    beta = pm.Beta('beta', 1., alpha, shape=K)
    w = stick_breaking(beta)
    w = pm.Deterministic('w', w/w.sum(keepdims=True))
    tau = pm.Gamma('tau', 1., 1., shape=K)
    lambda_ = pm.Uniform('lambda_', 0, 5, shape=K)
    mu = pm.Normal('mu', 0, tau=lambda_ * tau, shape=K)
    obs = pm.NormalMixture('obs', w, mu, tau=lambda_ * tau,
                           observed=old_faithful_df.std_waiting.values)

with model:
    trace = pm.sample(1000, cores=1)

The sampling does not converge; the details are shown below:

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Sequential sampling (2 chains in 1 job)
NUTS: [mu, lambda_, tau, beta, alpha]
100%|██████████████████████████████████████| 1500/1500 [02:02<00:00, 12.26it/s]
100%|██████████████████████████████████████| 1500/1500 [01:59<00:00, 14.55it/s]
C:\Users\PrettyGirlFeiWu\Anaconda3\lib\site-packages\mkl_fft\_numpy_fft.py:1044: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  output = mkl_fft.rfftn_numpy(a, s, axes)
There were 560 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.3341195732727439, but should be close to 0.8. Try to increase the number of tuning steps.
There were 649 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.5850455150818334, but should be close to 0.8. Try to increase the number of tuning steps.
The gelman-rubin statistic is larger than 1.4 for some parameters. The sampler did not converge.
The estimated number of effective samples is smaller than 200 for some parameters.

To fix this problem, I resorted to reparameterization and transformed the model to a non-centered version.
The modified model is shown below.

2nd Model

with pm.Model() as model:
    alpha = pm.Gamma('alpha', 1., 1.)
    beta = pm.Beta('beta', 1., alpha, shape=K)
    w = pm.Deterministic('w', stick_breaking(beta))
    tau = pm.Gamma('tau', 1., 1., shape=K)
    lambda_ = pm.Uniform('lambda_', 0, 5, shape=K)
    mu_offset = pm.Normal('mu_offset', 0, tau=1, shape=K)
    mu = pm.Deterministic('mu', 0 + mu_offset / (lambda_ * tau))
    obs = pm.NormalMixture('obs', w, mu, tau=lambda_ * tau,
                           observed=old_faithful_df.std_waiting.values)

The divergences still cannot be eliminated. The details are shown below:

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Sequential sampling (2 chains in 1 job)
NUTS: [mu_offset, lambda_, tau, beta, alpha]
100%|██████████████████████████████████████| 1500/1500 [03:44<00:00,  7.21it/s]
100%|██████████████████████████████████████| 1500/1500 [03:19<00:00,  7.10it/s]
C:\Users\PrettyGirlFeiWu\Anaconda3\lib\site-packages\mkl_fft\_numpy_fft.py:1044: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  output = mkl_fft.rfftn_numpy(a, s, axes)
There were 112 divergences after tuning. Increase `target_accept` or reparameterize.
There were 288 divergences after tuning. Increase `target_accept` or reparameterize.
The estimated number of effective samples is smaller than 200 for some parameters.

Continuing on, the remaining code is posted below.

sunspot_df = pd.read_csv(pm.get_data('sunspot.csv'), sep=';', names=['time', 'sunspot.year'], usecols=[0, 1])
sunspot_df.head()
K = 30
N = sunspot_df.shape[0]

# 3rd Model

with pm.Model() as model:
    alpha = pm.Gamma('alpha', 1., 1.)
    beta = pm.Beta('beta', 1, alpha, shape=K)
    w = pm.Deterministic('w', stick_breaking(beta))

    mu = pm.Uniform('mu', 0., 5., shape=K)
    obs = pm.Mixture('obs', w, pm.Poisson.dist(mu), observed=sunspot_df['sunspot.year'])

with model:
    trace = pm.sample(1000, random_seed=SEED)

The sampling does not converge; the details are shown below:

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [mu, beta, alpha]
Sampling 2 chains: 100%|████████████████| 3000/3000 [01:35<00:00, 29.83draws/s]
C:\Users\PrettyGirlFeiWu\Anaconda3\lib\site-packages\mkl_fft\_numpy_fft.py:1044: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  output = mkl_fft.rfftn_numpy(a, s, axes)
There were 207 divergences after tuning. Increase `target_accept` or reparameterize.
There were 340 divergences after tuning. Increase `target_accept` or reparameterize.
The estimated number of effective samples is smaller than 200 for some parameters.

However, we cannot follow the reparameterization method above to transform this model to a non-centered version, since the Poisson rate mu has no location-scale structure to shift.

Finally, my questions are:

  1. How can I make sampling of the 3rd model converge?

  2. How can I eliminate the divergences in the 2nd model?

  3. I do not feel safe relying on the samples because of the divergences. With the 2nd model, can the sampling result still be guaranteed to converge to the true value in the presence of divergences, or not? If not, how can I get the right one?

  4. I just copied the code from the example, yet convergence is not achieved. How was convergence achieved in the example? I am confused.

Have you tried the tips given in the warnings, namely increasing tuning steps and target_accept?
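For example, something like the following (a minimal sketch; the exact way to pass the target acceptance rate depends on your PyMC3 version, so adjust as needed):

with model:
    # More tuning steps and a higher target acceptance rate make NUTS take
    # smaller, more careful steps, which often removes divergences.
    trace = pm.sample(1000, tune=3000, target_accept=0.95, cores=1)
    # On older PyMC3 releases you may need to pass it via nuts_kwargs instead:
    # trace = pm.sample(1000, tune=3000, nuts_kwargs=dict(target_accept=0.95), cores=1)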

If those don't help, you should look at where exactly the divergences happen. ArviZ is building some nice plots that show you this, like this one: https://arviz-devs.github.io/arviz/examples/plot_parallel.html That might tell you where exactly the divergences occur and where you might want to try to reparameterize. Feel free to post the divergence plots here.
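A minimal sketch of how that could look (assuming the arviz package is installed; the variable names follow your 2nd model):

import arviz as az

with model:
    # Convert the PyMC3 trace into ArviZ's InferenceData format
    idata = az.from_pymc3(trace)

# Parallel-coordinates plot; divergent draws are highlighted, which shows
# in which region of parameter space the divergences occur
az.plot_parallel(idata, var_names=['alpha', 'tau', 'lambda_'])
plt.show()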

Mixture models are difficult to do inference on; there are quite a few discussions on this Discourse, e.g. Properly sampling mixture models.
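One trick that comes up in those discussions is breaking the label-switching symmetry by ordering the component means. A hedged sketch for your 2nd model (assuming your PyMC3 version provides the ordered transform; the testval is just an arbitrary increasing starting point):

with pm.Model() as model:
    alpha = pm.Gamma('alpha', 1., 1.)
    beta = pm.Beta('beta', 1., alpha, shape=K)
    w = pm.Deterministic('w', stick_breaking(beta))
    tau = pm.Gamma('tau', 1., 1., shape=K)
    lambda_ = pm.Uniform('lambda_', 0, 5, shape=K)
    # Ordering the means removes the permutation symmetry between components
    mu = pm.Normal('mu', 0, tau=lambda_ * tau, shape=K,
                   transform=pm.distributions.transforms.ordered,
                   testval=np.linspace(-2., 2., K))
    obs = pm.NormalMixture('obs', w, mu, tau=lambda_ * tau,
                           observed=old_faithful_df.std_waiting.values)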

Also note that the Dirichlet process mixture example needs some updating; it is generally a hard problem. I would change the inference to ADVI and check the model fit very carefully (as the result is generally biased).
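A minimal sketch of what switching to ADVI could look like for the sunspot model (the defaults may need adjusting, and the fit should be checked carefully):

with model:
    # Mean-field ADVI as an alternative to NUTS; increase n if the loss
    # has not flattened out yet
    approx = pm.fit(n=50000, method='advi')
    trace = approx.sample(1000)

# Plot the loss (negative ELBO) history to judge whether the optimization converged
plt.plot(approx.hist)
plt.xlabel('iteration')
plt.ylabel('loss (-ELBO)')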