Questions about convergence when sampling hierarchical mixture models

I'm a beginner, and I am following the example "Dirichlet process mixtures for density estimation" on the PyMC3 website.
https://docs.pymc.io/notebooks/dp_mix.html

For the convergence discussion, I post the code here.

from matplotlib import pyplot as plt
import numpy as np
import pymc3 as pm
import scipy as sp
import seaborn as sns
from theano import tensor as tt
import pandas as pd



blue, *_ = sns.color_palette()
SEED = 5132290 # from random.org

np.random.seed(SEED)
N = 20
K = 30

alpha = 2.
P0 = sp.stats.norm

# Stick-breaking construction: draw beta_k ~ Beta(1, alpha) and turn them into weights
beta = sp.stats.beta.rvs(1, alpha, size=(N, K))
w = np.empty_like(beta)
w[:, 0] = beta[:, 0]
w[:, 1:] = beta[:, 1:]*(1 - beta[:, :-1]).cumprod(axis=1)
# Atom locations drawn from the base measure P0
omega = P0.rvs(size=(N, K))
x_plot = np.linspace(-3, 3, 200)
# Each sampled DP realization gives a step-function CDF
sample_cdfs = (w[..., np.newaxis]*np.less.outer(omega, x_plot)).sum(axis=1)
fig, ax = plt.subplots(figsize=(8, 6))

ax.plot(x_plot, sample_cdfs.T, c='gray', alpha=0.75,
        label='DP sample CDFs');
ax.plot(x_plot, P0.cdf(x_plot), c='k', label='Base CDF');

ax.set_title(r'$\alpha = {}$'.format(alpha));
ax.legend(loc=2);
fig, (l_ax, r_ax) = plt.subplots(ncols=2, sharex=True, sharey=True, figsize=(16, 6))  
K = 50 
alpha = 10.  
beta = sp.stats.beta.rvs(1, alpha, size=(N, K)) 
w = np.empty_like(beta) 
w[:, 0] = beta[:, 0] 
w[:, 1:] = beta[:, 1:]*(1 - beta[:, :-1]).cumprod(axis=1)  
omega = P0.rvs(size=(N, K))  
sample_cdfs = (w[..., np.newaxis] * np.less.outer(omega, x_plot)).sum(axis=1)  
l_ax.plot(x_plot, sample_cdfs[0], c='gray', alpha=0.75, label='DP sample CDFs')
l_ax.plot(x_plot, sample_cdfs[1:].T, c='gray', alpha=0.75)
l_ax.plot(x_plot, P0.cdf(x_plot), c='k', label='Base CDF')
l_ax.set_title(r'$\alpha = {}$'.format(alpha))
l_ax.legend(loc=2)
K = 200 
alpha = 50.  
beta = sp.stats.beta.rvs(1, alpha, size=(N, K)) 
w = np.empty_like(beta) 
w[:, 0] = beta[:, 0] 
w[:, 1:] = beta[:, 1:] * (1 - beta[:, :-1]).cumprod(axis=1)  
omega = P0.rvs(size=(N, K))  
sample_cdfs = (w[..., np.newaxis] * np.less.outer(omega, x_plot)).sum(axis=1)  
r_ax.plot(x_plot, sample_cdfs[0], c='gray', alpha=0.75, label='DP sample CDFs')
r_ax.plot(x_plot, sample_cdfs[1:].T, c='gray', alpha=0.75)
r_ax.plot(x_plot, P0.cdf(x_plot), c='k', label='Base CDF')
r_ax.set_title(r'$\alpha = {}$'.format(alpha))
r_ax.legend(loc=2);

N = 5
K = 30

alpha = 2
P0 = sp.stats.norm
f = lambda x, theta: sp.stats.norm.pdf(x, theta, 0.3)

beta = sp.stats.beta.rvs(1, alpha, size=(N, K))
w = np.empty_like(beta)
w[:, 0] = beta[:, 0]
w[:, 1:] = beta[:, 1:] * (1 - beta[:, :-1]).cumprod(axis=1)

theta = P0.rvs(size=(N, K))

dpm_pdf_components = f(x_plot[np.newaxis, np.newaxis, :], theta[..., np.newaxis])
dpm_pdfs = (w[..., np.newaxis] * dpm_pdf_components).sum(axis=1)

fig, ax = plt.subplots(figsize=(8, 6))

ax.plot(x_plot, dpm_pdfs.T, c='gray');

ax.set_yticklabels([]);

old_faithful_df = pd.read_csv(pm.get_data('old_faithful.csv'))
old_faithful_df['std_waiting'] = (old_faithful_df.waiting - old_faithful_df.waiting.mean()) / old_faithful_df.waiting.std()
old_faithful_df.head()

fig, ax = plt.subplots(figsize=(8, 6))

n_bins = 20
ax.hist(old_faithful_df.std_waiting, bins=n_bins, color=blue, lw=0, alpha=0.5);

ax.set_xlabel('Standardized waiting time between eruptions');
ax.set_ylabel('Number of eruptions');

def stick_breaking(beta):
    portion_remaining = tt.concatenate([[1], tt.extra_ops.cumprod(1 - beta)[:-1]])
    return beta * portion_remaining

with pm.Model() as model:
    alpha = pm.Gamma('alpha', 1., 1.)
    beta = pm.Beta('beta', 1., alpha, shape=K)
    w = stick_breaking(beta)
    w = pm.Deterministic('w', w/w.sum(keepdims=True))
    tau = pm.Gamma('tau', 1., 1., shape=K)
    lambda_ = pm.Uniform('lambda_', 0, 5, shape=K)
    mu = pm.Normal('mu', 0, tau=lambda_ * tau, shape=K)
    obs = pm.NormalMixture('obs', w, mu, tau=lambda_ * tau,
                           observed=old_faithful_df.std_waiting.values)

with model:
    trace = pm.sample(1000, cores=1)

The sampling does not converge; the details are shown below:

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Sequential sampling (2 chains in 1 job)
NUTS: [mu, lambda_, tau, beta, alpha]
100%|██████████████████████████████████████| 1500/1500 [02:02<00:00, 12.26it/s]
100%|██████████████████████████████████████| 1500/1500 [01:59<00:00, 14.55it/s]
C:\Users\PrettyGirlFeiWu\Anaconda3\lib\site-packages\mkl_fft\_numpy_fft.py:1044: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  output = mkl_fft.rfftn_numpy(a, s, axes)
There were 560 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.3341195732727439, but should be close to 0.8. Try to increase the number of tuning steps.
There were 649 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.5850455150818334, but should be close to 0.8. Try to increase the number of tuning steps.
The gelman-rubin statistic is larger than 1.4 for some parameters. The sampler did not converge.
The estimated number of effective samples is smaller than 200 for some parameters.

To fix this problem, I resorted to reparameterization and transformed the model to a non-centered version.
The modified model is shown below.

2nd Model

with pm.Model() as model:
    alpha = pm.Gamma('alpha', 1., 1.)
    beta = pm.Beta('beta', 1., alpha, shape=K)
    w = pm.Deterministic('w', stick_breaking(beta))
    tau = pm.Gamma('tau', 1., 1., shape=K)
    lambda_ = pm.Uniform('lambda_', 0, 5, shape=K)
    mu_offset = pm.Normal('mu_offset', 0, tau=1, shape=K)
    mu = pm.Deterministic('mu', 0 + mu_offset / (lambda_ * tau))
    obs = pm.NormalMixture('obs', w, mu, tau=lambda_ * tau,
                           observed=old_faithful_df.std_waiting.values)

The divergences still cannot be eliminated. The details are shown below:

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Sequential sampling (2 chains in 1 job)
NUTS: [mu_offset, lambda_, tau, beta, alpha]
100%|██████████████████████████████████████| 1500/1500 [03:44<00:00,  7.21it/s]
100%|██████████████████████████████████████| 1500/1500 [03:19<00:00,  7.10it/s]
C:\Users\PrettyGirlFeiWu\Anaconda3\lib\site-packages\mkl_fft\_numpy_fft.py:1044: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  output = mkl_fft.rfftn_numpy(a, s, axes)
There were 112 divergences after tuning. Increase `target_accept` or reparameterize.
There were 288 divergences after tuning. Increase `target_accept` or reparameterize.
The estimated number of effective samples is smaller than 200 for some parameters.

Continuing on, the remaining code is posted below.

sunspot_df = pd.read_csv(pm.get_data('sunspot.csv'), sep=';', names=['time', 'sunspot.year'], usecols=[0, 1])
sunspot_df.head()
K = 30
N = sunspot_df.shape[0]

# 3rd Model

with pm.Model() as model:
    alpha = pm.Gamma('alpha', 1., 1.)
    beta = pm.Beta('beta', 1, alpha, shape=K)
    w = pm.Deterministic('w', stick_breaking(beta))

    mu = pm.Uniform('mu', 0., 5., shape=K)
    obs = pm.Mixture('obs', w, pm.Poisson.dist(mu), observed=sunspot_df['sunspot.year'])

with model:
    trace = pm.sample(1000, random_seed=SEED)

The sampling does not converge; the details are shown below:

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [mu, beta, alpha]
Sampling 2 chains: 100%|████████████████| 3000/3000 [01:35<00:00, 29.83draws/s]
C:\Users\PrettyGirlFeiWu\Anaconda3\lib\site-packages\mkl_fft\_numpy_fft.py:1044: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  output = mkl_fft.rfftn_numpy(a, s, axes)
There were 207 divergences after tuning. Increase `target_accept` or reparameterize.
There were 340 divergences after tuning. Increase `target_accept` or reparameterize.
The estimated number of effective samples is smaller than 200 for some parameters.

However, we cannot follow the reparameterization method above to transform this model to a non-centered version, since the Poisson rate mu has no location-scale structure to shift.

Finally, my questions are:

  1. How can I make sampling of the 3rd model converge?

  2. How can I eliminate the divergences in the 2nd model?

  3. I do not feel safe relying on the samples because of the divergences. With the 2nd model, can the sampling result still be guaranteed to converge to the true value in the presence of divergences, or not? If not, how can I get the right one?

  4. I just copied the code from the example, yet convergence is not achieved. How was convergence achieved in the example? I am confused.

Have you tried the tips given in the warnings, namely increasing tuning steps and target_accept?
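For example, something like the following (a minimal sketch; the exact way to pass the target acceptance rate depends on your PyMC3 version, so adjust as needed):

with model:
    # More tuning steps and a higher target acceptance rate make NUTS take
    # smaller, more careful steps, which often removes divergences.
    trace = pm.sample(1000, tune=3000, target_accept=0.95, cores=1)
    # On older PyMC3 releases you may need to pass it via nuts_kwargs instead:
    # trace = pm.sample(1000, tune=3000, nuts_kwargs=dict(target_accept=0.95), cores=1)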

If those don't help, you should look at where exactly the divergences happen. ArviZ is building some nice plots that show you this, like this one: https://arviz-devs.github.io/arviz/examples/plot_parallel.html That might tell you where exactly the divergences occur and where you might want to try to reparameterize. Feel free to post the divergence plots here.
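A minimal sketch of how that could look (assuming the arviz package is installed; the variable names follow your 2nd model):

import arviz as az

with model:
    # Convert the PyMC3 trace into ArviZ's InferenceData format
    idata = az.from_pymc3(trace)

# Parallel-coordinates plot; divergent draws are highlighted, which shows
# in which region of parameter space the divergences occur
az.plot_parallel(idata, var_names=['alpha', 'tau', 'lambda_'])
plt.show()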

Mixture models are difficult to do inference on; there are quite a few discussions on this Discourse, e.g. Properly sampling mixture models.
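One trick that comes up in those discussions is breaking the label-switching symmetry by ordering the component means. A hedged sketch for your 2nd model (assuming your PyMC3 version provides the ordered transform; the testval is just an arbitrary increasing starting point):

with pm.Model() as model:
    alpha = pm.Gamma('alpha', 1., 1.)
    beta = pm.Beta('beta', 1., alpha, shape=K)
    w = pm.Deterministic('w', stick_breaking(beta))
    tau = pm.Gamma('tau', 1., 1., shape=K)
    lambda_ = pm.Uniform('lambda_', 0, 5, shape=K)
    # Ordering the means removes the permutation symmetry between components
    mu = pm.Normal('mu', 0, tau=lambda_ * tau, shape=K,
                   transform=pm.distributions.transforms.ordered,
                   testval=np.linspace(-2., 2., K))
    obs = pm.NormalMixture('obs', w, mu, tau=lambda_ * tau,
                           observed=old_faithful_df.std_waiting.values)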

Also note that the Dirichlet process mixture example needs some updating; it is generally a hard problem. I would change the inference to ADVI and check the model fit very carefully (as the result is generally biased).
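A minimal sketch of what switching to ADVI could look like for the sunspot model (the defaults may need adjusting, and the fit should be checked carefully):

with model:
    # Mean-field ADVI as an alternative to NUTS; increase n if the loss
    # has not flattened out yet
    approx = pm.fit(n=50000, method='advi')
    trace = approx.sample(1000)

# Plot the loss (negative ELBO) history to judge whether the optimization converged
plt.plot(approx.hist)
plt.xlabel('iteration')
plt.ylabel('loss (-ELBO)')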