NUTS pauses after init

Hi everyone,

I am facing a problem similar to the OP's in "Long pause after initialization". My model, however, is composed mostly of Normal RVs, which represent the influence of a software system's configuration options on its performance. My code for the model looks like this:

with pm.Model() as linear_model:
    root = pm.Normal('root', mu=0, sd=10)
    noise = pm.HalfNormal(noise_str, sd=10)
    pred = root

    for feature_name in feature_names:
        idx = pos_map[feature_name]
        vals = self.x_shared[:, idx]
        rv_id = "influence_{}".format(feature_name)
        term_val = pm.Normal(rv_id, mu=0, sd=coef_sd) * vals
        pred += term_val
    y_observed = pm.Normal('y_observed', mu=pred, sd=noise, observed=self.y)

    lin_trace = pm.sample(mcmc_samples, init='advi+adapt_diag', random_seed=seed_lin, tune=mcmc_tune, cores=mcmc_cores, chains=mcmc_cores, max_treedepth=tree_depth)

In some cases (I have yet to find out when exactly), after init with "advi+adapt_diag" finishes in reasonable time, the process pauses and produces no output until I terminate it after ~15 min.

The last lines of output typically read like this:

Auto-assigning NUTS sampler...
Initializing NUTS using advi+adapt_diag...
Average Loss = 3,307.1:  16%|█▌        | 31479/200000 [00:15<01:21, 2074.68it/s]
Convergence achieved at 31600
Interrupted at 31,599 [15%]: Average Loss = 6.6471e+07

In other cases, NUTS starts sampling after a couple of minutes or even seconds. With init='advi+adapt_diag', I usually get 100 to 1000 draws/s. I did not see an improvement in sampling rate by vectorizing my loop.
How could I find out whether Theano gets stuck optimizing? What could I do to prevent that?
What else could I check to start sampling quicker?
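
One way to check whether the time goes into Theano compilation is to time the compile step on its own, before calling pm.sample. Below is a minimal sketch, not a definitive diagnostic. The `timed` helper is hypothetical (it is not part of PyMC3), and the commented usage assumes PyMC3 3.x, where `Model.logp_dlogp_function()` compiles the log-probability and gradient function that NUTS uses.

```python
import time


def timed(label, fn, *args, **kwargs):
    """Call fn, print how long it took, and return (result, seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print("{}: {:.1f} s".format(label, elapsed))
    return result, elapsed


# Hypothetical usage with the model from the post (requires PyMC3 3.x):
# timed("compile logp and gradient", linear_model.logp_dlogp_function)
```

If this call alone takes minutes, the pause is likely compilation rather than sampling. In that case, running with the environment variable THEANO_FLAGS=optimizer=fast_compile may shorten the wait, possibly at the cost of slower sampling.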

Any help would be appreciated, thanks.

I have also run into similar situations. I assumed that the sampler was exploring a region where the proposed points could not be accepted under the acceptance criterion, so it appears stuck until it finds an acceptable sample.

I also notice that you are trying to draw 200000 samples. That may be the issue; try reducing it to 20000.

I don't think this is the issue: 200000 is the maximum number of ADVI training iterations, and ADVI converged well before that.
But you are right that reducing the number of samples would help diagnose the problem. Are you able to draw just a few samples?
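
A small single-chain run could look roughly like the sketch below. The `smoke_test` function and its default values are made up for illustration. It assumes PyMC3 3.x and takes the already-built model (for example `linear_model` from the first post) as its argument. Running with chains=1 and cores=1 also rules out multiprocessing as the cause of the pause.

```python
import time


def smoke_test(model, draws=20, tune=20, init='advi+adapt_diag'):
    """Draw a handful of samples on one core and report the wall time."""
    # Imported lazily so the helper can be defined without PyMC3 present;
    # calling it assumes PyMC3 3.x is installed.
    import pymc3 as pm

    start = time.perf_counter()
    with model:
        trace = pm.sample(draws=draws, tune=tune, chains=1, cores=1,
                          init=init, progressbar=True)
    elapsed = time.perf_counter() - start
    print("{} draws took {:.1f} s in total".format(draws, elapsed))
    return trace, elapsed


# Hypothetical usage:
# trace, seconds = smoke_test(linear_model)
# trace, seconds = smoke_test(linear_model, init='adapt_diag')  # skip ADVI
```

If even this short run hangs after initialization, the number of draws is probably not the cause. Comparing the two init settings may also show whether the hand-off from ADVI plays a role.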