Improving model convergence and sampling speed

I have a question about how to improve model convergence and sampling speed.

Currently, I am fitting a hierarchical model with a Negative Binomial(alpha, mu) likelihood. The priors for the parameters alpha and mu are Exponential(hyper_lam) and Gamma(hyper_mu, hyper_sd), respectively. The hyperpriors on hyper_lam, hyper_mu and hyper_sd are flat Uniform distributions. I wish to generate not only the distributions of the parameters but also the posterior predictive distribution.

I am fitting approximately 14,000 observations across 200 classes to this model, and the data itself is count data on the order of 10^5.

On a modestly powered laptop, the model only gets through about 2 iterations per second with a Metropolis step, and even fewer with NUTS.

Is there anything I can change in the model or the data to improve sampling speed, or to require fewer than 200,000 samples for convergence?

The code used for the model is shown below.

import math

import pymc3 as pm
from sklearn import preprocessing

encoder = preprocessing.LabelEncoder()
classes_index = encoder.fit_transform(df['hierarchical_group_subset'])
classes = encoder.classes_

# counts and means are assumed to be pandas Series computed elsewhere
# (e.g. the raw counts and their per-class means)
upper_bound_lam = math.floor(counts.quantile(q=0.9))

upper_bound_mu = math.floor(means.quantile(q=0.9))
upper_bound_sd = math.floor(means.std())


with pm.Model() as model:
    hyper_alpha_lam = pm.Uniform('hyper_alpha_lam', lower=0, upper=upper_bound_lam)

    hyper_mu_mu = pm.Uniform('hyper_mu_mu', lower=0, upper=upper_bound_mu)
    hyper_mu_sd = pm.Uniform('hyper_mu_sd', lower=0, upper=upper_bound_sd)

    alpha = pm.Exponential(
        'alpha', lam=hyper_alpha_lam, shape=len(classes)
    )

    mu = pm.Gamma(
        'mu', mu=hyper_mu_mu, sd=hyper_mu_sd, shape=len(classes)
    )

    y_est = pm.NegativeBinomial(
        'y_est',
        mu=mu[classes_index],
        alpha=alpha[classes_index],
        observed=df['count'].values
    )

    y_pred = pm.NegativeBinomial(
        'y_pred',
        mu=mu[classes_index],
        alpha=alpha[classes_index],
        shape=df['hierarchical_group_subset'].shape
    )

    start = pm.find_MAP()
    step = pm.Metropolis()
    hierarchical_trace = pm.sample(500, start=start, step=step, progressbar=True)

A few suggestions:
1. Avoid Uniform priors. You can have a look at the prior choice recommendations from the Stan team. Generally a weakly informative prior speeds up the sampler while providing much better regularization. (A sketch combining these suggestions follows after this list.)

2. You don't need to specify

    y_pred = pm.NegativeBinomial(
        'y_pred',
        mu=mu[classes_index],
        alpha=alpha[classes_index],
        shape=df['hierarchical_group_subset'].shape
    )

for posterior prediction. Use pm.sample_ppc to draw posterior predictive samples instead.

3. You can also first try fitting the model with ADVI and check the result.
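
For illustration, a minimal sketch that combines these three suggestions (the HalfNormal hyperpriors and their scales are placeholder assumptions, not recommendations tuned to your data):

    with pm.Model() as model:
        # 1. weakly informative hyperpriors instead of flat Uniforms
        hyper_alpha_lam = pm.HalfNormal('hyper_alpha_lam', sd=10)
        hyper_mu_mu = pm.HalfNormal('hyper_mu_mu', sd=1e5)
        hyper_mu_sd = pm.HalfNormal('hyper_mu_sd', sd=1e5)

        alpha = pm.Exponential('alpha', lam=hyper_alpha_lam, shape=len(classes))
        mu = pm.Gamma('mu', mu=hyper_mu_mu, sd=hyper_mu_sd, shape=len(classes))

        # 2. only the observed likelihood; no separate y_pred variable
        y_est = pm.NegativeBinomial(
            'y_est',
            mu=mu[classes_index],
            alpha=alpha[classes_index],
            observed=df['count'].values
        )

    with model:
        # 3. quick ADVI fit to sanity-check the model
        approx = pm.fit(n=50000, method='advi')

        # full MCMC run, then posterior predictive samples via sample_ppc
        trace = pm.sample(2000, tune=1000)
        ppc = pm.sample_ppc(trace, samples=1000)  # ppc['y_est'] holds the predictive draws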


Also, Metropolis is generally not suitable for high-dimensional problems (although in your case I think it should perform OK). The apparent speed of Metropolis is usually an illusion, as what counts is the number of effective samples. For this model, you can also try SMC; see an example here: http://docs.pymc.io/en/latest/notebooks/SMC2_gaussians.html
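
Depending on your PyMC3 version, the SMC entry point is either a pm.SMC() step method or pm.sample_smc; on recent versions a minimal call looks roughly like this (the number of draws is an arbitrary choice here):

    with model:
        # Sequential Monte Carlo run; check the linked notebook for version-specific usage
        trace = pm.sample_smc(2000)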

Thank you for the suggestions!

From my personal experience, I usually find that hierarchical models tend to sample very slowly with NUTS. A method I have found that helps, especially with a large set of observations, is to approximate the posterior with ADVI and minibatches (which can speed up computation roughly two-fold) and then either sample from the approximation with sample_approx, if it is good enough, or use it as a starting point for Metropolis or NUTS. My experience with Metropolis, though, is that it tends to drift away from that starting point and give bad estimates of the posterior.
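
As a rough sketch of the minibatch + ADVI route (the batch size, iteration count, and HalfNormal scales below are placeholder assumptions):

    # Minibatches of the counts and of their class indices; both are created with
    # pm.Minibatch's default random seed - check on your version that the slices stay aligned.
    count_batch = pm.Minibatch(df['count'].values, batch_size=500)
    idx_batch = pm.Minibatch(classes_index, batch_size=500)

    with pm.Model() as minibatch_model:
        hyper_alpha_lam = pm.HalfNormal('hyper_alpha_lam', sd=10)
        hyper_mu_mu = pm.HalfNormal('hyper_mu_mu', sd=1e5)
        hyper_mu_sd = pm.HalfNormal('hyper_mu_sd', sd=1e5)

        alpha = pm.Exponential('alpha', lam=hyper_alpha_lam, shape=len(classes))
        mu = pm.Gamma('mu', mu=hyper_mu_mu, sd=hyper_mu_sd, shape=len(classes))

        y_est = pm.NegativeBinomial(
            'y_est',
            mu=mu[idx_batch],
            alpha=alpha[idx_batch],
            observed=count_batch,
            total_size=len(df)  # rescale the minibatch likelihood to the full data set
        )

        approx = pm.fit(n=100000, method='advi')
        advi_trace = approx.sample(2000)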

I have been looking at the following (some of the methods are outdated):

Something to keep in mind is that ADVI with the default option (i.e., mean-field approximation) will underestimate the uncertainty of the parameters (much narrower posteriors).
You can have a look at this code comparing different samplers on a mixed (hierarchical) model: https://github.com/junpenglao/GLMM-in-Python/blob/master/pymc3_different_sampling.py
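
If you want a quick check of how much mean-field underestimates the spread, one option is to fit both approximations and compare the posterior standard deviations against an MCMC trace (a sketch; the model and trace variables assume the code earlier in the thread):

    with model:
        mf_approx = pm.fit(n=50000, method='advi')           # mean-field ADVI
        fr_approx = pm.fit(n=50000, method='fullrank_advi')  # full-rank ADVI

    # per-class posterior standard deviations of mu under each approximation
    mf_sd = mf_approx.sample(2000)['mu'].std(axis=0)
    fr_sd = fr_approx.sample(2000)['mu'].std(axis=0)
    # compare with the MCMC estimate, e.g. trace['mu'].std(axis=0)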


@junpenglao - is there any literature on the amount by which ADVI underestimates, or on the conditions under which the underestimation is most severe for mean-field ADVI? Are other approximations, such as full-rank ADVI, also prone to this problem?

The original ADVI paper has some simulation studies showing that (for hierarchical regression, at least). Full-rank ADVI has the same problem, but it is usually a bit better.

