NUTS pauses after init

mcguenther · August 12, 2019, 1:12pm

Hi everyone,

I am facing a problem similar to the one described by the OP in "Long pause after initialization". My model, however, consists mostly of Normal RVs, which represent the influence of a software system's configuration options on its performance. My code for the model looks like this:

with pm.Model() as linear_model:
    root = pm.Normal('root', mu=0, sd=10)
    noise = pm.HalfNormal(noise_str, sd=10)
    pred = root

    for feature_name in feature_names:
        idx = pos_map[feature_name]
        vals = self.x_shared[:, idx]
        rv_id = "influence_{}".format(feature_name)
        term_val = pm.Normal(rv_id, mu=0, sd=coef_sd) * vals
        pred += term_val
    y_observed = pm.Normal('y_observed', mu=pred, sd=noise, observed=self.y)

    lin_trace = pm.sample(mcmc_samples, init='advi+adapt_diag', random_seed=seed_lin, tune=mcmc_tune, cores=mcmc_cores, chains=mcmc_cores, max_treedepth=tree_depth)

In some cases (I have yet to find out when exactly), the init with “advi+adapt_diag” finishes in reasonable time, but then the process pauses and yields no output until I terminate it after ~15 min.

The last lines of output typically read like this:

Auto-assigning NUTS sampler...
Initializing NUTS using advi+adapt_diag...
Average Loss = 3,307.1:  16%|█▌        | 31479/200000 [00:15<01:21, 2074.68it/s]
Convergence achieved at 31600
Interrupted at 31,599 [15%]: Average Loss = 6.6471e+07

In other cases, NUTS starts sampling after a couple of minutes or even seconds. With init='advi+adapt_diag', I usually get 100 to 1000 draws/s. Vectorizing my loop did not improve the sampling rate.
How could I find out whether Theano gets stuck optimizing? What could I do to prevent that?
What else could I check to get sampling started more quickly?

Any help would be appreciated, thanks.
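
One way to see where the process sits during such a pause is to have Python dump all thread stacks after a timeout. The following is a minimal sketch using only the standard library's faulthandler module; the helper name run_with_watchdog and the 120 s default are illustrative assumptions, not part of PyMC3 or Theano.

```python
import faulthandler
import sys


def run_with_watchdog(fn, timeout=120.0, file=None):
    """Call fn(); while it is still running, dump all thread stacks every
    `timeout` seconds. `file` must have a real file descriptor
    (hypothetical helper, defaults to stderr)."""
    if file is None:
        file = sys.stderr
    # Arm a watchdog thread that writes the tracebacks of all threads to `file`.
    faulthandler.dump_traceback_later(timeout, repeat=True, file=file)
    try:
        return fn()
    finally:
        # Disarm the watchdog whether fn() returned or raised.
        faulthandler.cancel_dump_traceback_later()


# Possible usage with the model above (not executed here):
#   with linear_model:
#       lin_trace = run_with_watchdog(lambda: pm.sample(mcmc_samples, cores=1, chains=1))
```

If the dumped frames are inside Theano's graph optimization or compilation code, the pause is probably spent building the sampling functions rather than sampling. faulthandler only sees the current process, so running with cores=1 while diagnosing keeps the sampler in the process being watched.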

Nadheesh · August 13, 2019, 10:56am

I have also run into similar situations. I assumed that the sampler was exploring a region whose points cannot be accepted under the acceptance criteria, so it appears stuck until it finds an acceptable sample.

I also noticed that you are trying to draw 200000 samples. That may be the issue; try reducing it to 20000.

junpenglao · August 13, 2019, 11:53am

I don't think this is the issue: the 200000 is the maximum number of ADVI training iterations, and ADVI converged before reaching it.
But you are right that reducing the number of samples would help diagnose the problem. Would you be able to draw just a few samples?
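
To illustrate why the 200000 in the progress bar is only an upper bound, here is a toy sketch of an optimizer with an iteration cap and a convergence check. It is not PyMC3's ADVI code; the function name, the quadratic objective, and the tolerance are assumptions chosen for illustration.

```python
def fit_with_early_stop(max_iter=200_000, tol=1e-2, lr=0.1, check_every=100):
    """Toy example: minimize (x - 3)^2 by gradient descent and stop as soon
    as the relative parameter change between checks falls below `tol`."""
    x = 0.0
    x_prev = x
    for i in range(1, max_iter + 1):
        grad = 2.0 * (x - 3.0)
        x -= lr * grad
        if i % check_every == 0:
            # Relative change since the last check; guard against division by zero.
            rel_change = abs(x - x_prev) / max(abs(x_prev), 1e-12)
            if rel_change < tol:
                return x, i  # converged before reaching the cap
            x_prev = x
    return x, max_iter  # cap reached without meeting the tolerance


x_opt, n_iter = fit_with_early_stop()
print(f"stopped after {n_iter} of 200000 iterations, x = {x_opt:.4f}")
```

The run ends well before the cap, which mirrors the "Convergence achieved at 31600" line in the log above: the progress bar total is the budget, not the number of draws requested from NUTS.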
