Improving model convergence and sampling speed

I have a question about how to improve model convergence and sampling speed.

Currently, I am fitting a hierarchical model with a Negative Binomial(alpha, mu) likelihood. The priors for the parameters alpha and mu are Exponential(hyper_lam) and Gamma(hyper_mu, hyper_sd), respectively. The hyperpriors on hyper_lam, hyper_mu and hyper_sd are flat Uniform distributions. I wish to generate not only the distributions of the parameters but also the posterior predictive distribution.

I am fitting approximately 14,000 observations across 200 classes to this model, and the data itself is count data on the order of 10^5.

On a modestly powered laptop, the model only gets through about 2 iterations per second with a Metropolis step, and even fewer with NUTS.

Is there anything I can change in the model or the data to improve sampling speed, or to require fewer than 200,000 samples for convergence?

The code used for the model is shown below.

import math

import pymc3 as pm
from sklearn import preprocessing

encoder = preprocessing.LabelEncoder()
classes_index = encoder.fit_transform(df['hierarchical_group_subset'])
classes = encoder.classes_

# counts and means are assumed to be pandas Series computed elsewhere
# (e.g. the raw counts and their per-class means)
upper_bound_lam = math.floor(counts.quantile(q=0.9))

upper_bound_mu = math.floor(means.quantile(q=0.9))
upper_bound_sd = math.floor(means.std())


with pm.Model() as model:
    hyper_alpha_lam = pm.Uniform('hyper_alpha_lam', lower=0, upper=upper_bound_lam)

    hyper_mu_mu = pm.Uniform('hyper_mu_mu', lower=0, upper=upper_bound_mu)
    hyper_mu_sd = pm.Uniform('hyper_mu_sd', lower=0, upper=upper_bound_sd)

    alpha = pm.Exponential(
        'alpha', lam=hyper_alpha_lam, shape=len(classes)
    )

    mu = pm.Gamma(
        'mu', mu=hyper_mu_mu, sd=hyper_mu_sd, shape=len(classes)
    )

    y_est = pm.NegativeBinomial(
        'y_est',
        mu=mu[classes_index],
        alpha=alpha[classes_index],
        observed=df['count'].values
    )

    y_pred = pm.NegativeBinomial(
        'y_pred',
        mu=mu[classes_index],
        alpha=alpha[classes_index],
        shape=df['hierarchical_group_subset'].shape
    )

    start = pm.find_MAP()
    step = pm.Metropolis()
    hierarchical_trace = pm.sample(500, start=start, step=step, progressbar=True)

A few suggestions:
1. Avoid Uniform priors. You can have a look at the prior choice recommendations from the Stan team. Generally a weakly informative prior speeds up the sampler while providing much better regularization. (A sketch combining these suggestions follows after this list.)

2. You don't need to specify

    y_pred = pm.NegativeBinomial(
        'y_pred',
        mu=mu[classes_index],
        alpha=alpha[classes_index],
        shape=df['hierarchical_group_subset'].shape
    )

for posterior prediction. Use pm.sample_ppc to draw posterior predictive samples instead.

3. You can also first try fitting the model with ADVI and check the result.
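
For illustration, a minimal sketch that combines these three suggestions (the HalfNormal hyperpriors and their scales are placeholder assumptions, not recommendations tuned to your data):

    with pm.Model() as model:
        # 1. weakly informative hyperpriors instead of flat Uniforms
        hyper_alpha_lam = pm.HalfNormal('hyper_alpha_lam', sd=10)
        hyper_mu_mu = pm.HalfNormal('hyper_mu_mu', sd=1e5)
        hyper_mu_sd = pm.HalfNormal('hyper_mu_sd', sd=1e5)

        alpha = pm.Exponential('alpha', lam=hyper_alpha_lam, shape=len(classes))
        mu = pm.Gamma('mu', mu=hyper_mu_mu, sd=hyper_mu_sd, shape=len(classes))

        # 2. only the observed likelihood; no separate y_pred variable
        y_est = pm.NegativeBinomial(
            'y_est',
            mu=mu[classes_index],
            alpha=alpha[classes_index],
            observed=df['count'].values
        )

    with model:
        # 3. quick ADVI fit to sanity-check the model
        approx = pm.fit(n=50000, method='advi')

        # full MCMC run, then posterior predictive samples via sample_ppc
        trace = pm.sample(2000, tune=1000)
        ppc = pm.sample_ppc(trace, samples=1000)  # ppc['y_est'] holds the predictive draws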


Also, Metropolis is generally not suitable for high-dimensional problems (although in your case I think it should perform OK). The apparent speed of Metropolis is usually an illusion, as what counts is the number of effective samples. For this model, you can also try SMC; see an example here: http://docs.pymc.io/en/latest/notebooks/SMC2_gaussians.html
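
Depending on your PyMC3 version, the SMC entry point is either a pm.SMC() step method or pm.sample_smc; on recent versions a minimal call looks roughly like this (the number of draws is an arbitrary choice here):

    with model:
        # Sequential Monte Carlo run; check the linked notebook for version-specific usage
        trace = pm.sample_smc(2000)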

Thank you for the suggestions!

From my personal experience, I usually find that hierarchical models tend to sample very slowly with NUTS. A method I have found that helps, especially with a large set of observations, is to approximate the posterior with ADVI and minibatches (which can speed up computation roughly two-fold) and then either sample from the approximation with sample_approx, if it is good enough, or use it as a starting point for Metropolis or NUTS. My experience with Metropolis, though, is that it tends to drift away from that starting point and give bad estimates of the posterior.
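
As a rough sketch of the minibatch + ADVI route (the batch size, iteration count, and HalfNormal scales below are placeholder assumptions):

    # Minibatches of the counts and of their class indices; both are created with
    # pm.Minibatch's default random seed - check on your version that the slices stay aligned.
    count_batch = pm.Minibatch(df['count'].values, batch_size=500)
    idx_batch = pm.Minibatch(classes_index, batch_size=500)

    with pm.Model() as minibatch_model:
        hyper_alpha_lam = pm.HalfNormal('hyper_alpha_lam', sd=10)
        hyper_mu_mu = pm.HalfNormal('hyper_mu_mu', sd=1e5)
        hyper_mu_sd = pm.HalfNormal('hyper_mu_sd', sd=1e5)

        alpha = pm.Exponential('alpha', lam=hyper_alpha_lam, shape=len(classes))
        mu = pm.Gamma('mu', mu=hyper_mu_mu, sd=hyper_mu_sd, shape=len(classes))

        y_est = pm.NegativeBinomial(
            'y_est',
            mu=mu[idx_batch],
            alpha=alpha[idx_batch],
            observed=count_batch,
            total_size=len(df)  # rescale the minibatch likelihood to the full data set
        )

        approx = pm.fit(n=100000, method='advi')
        advi_trace = approx.sample(2000)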

I have been looking at the following (some of the methods are outdated):

Something to keep in mind is that ADVI with the default option (i.e., mean-field approximation) will underestimate the uncertainty of the parameters (much narrower posteriors).
You can have a look at this code comparing different samplers on a mixed (hierarchical) model: https://github.com/junpenglao/GLMM-in-Python/blob/master/pymc3_different_sampling.py
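
If you want a quick check of how much mean-field underestimates the spread, one option is to fit both approximations and compare the posterior standard deviations against an MCMC trace (a sketch; the model and trace variables assume the code earlier in the thread):

    with model:
        mf_approx = pm.fit(n=50000, method='advi')           # mean-field ADVI
        fr_approx = pm.fit(n=50000, method='fullrank_advi')  # full-rank ADVI

    # per-class posterior standard deviations of mu under each approximation
    mf_sd = mf_approx.sample(2000)['mu'].std(axis=0)
    fr_sd = fr_approx.sample(2000)['mu'].std(axis=0)
    # compare with the MCMC estimate, e.g. trace['mu'].std(axis=0)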


@junpenglao - is there any literature on the amount by which ADVI underestimates, or on the conditions under which the underestimation is most severe for mean-field ADVI? Are other approximations, such as full-rank ADVI, also prone to this problem?

The original ADVI paper has some simulation studies showing that (for hierarchical regression, at least). Full-rank ADVI has the same problem, but it is usually a bit better.

