Why does MCMC without observations have difficulty sampling?

I’ve been trying to implement the qa_ver0 model by Michael Betancourt from here, and I found that removing the observed keyword from the likelihood specification massively slows down sampling and causes significant sampling difficulties (divergences, very low effective sample size). Why is that?

I understand that without observed data, we are effectively sampling from the prior. And I understand that the recommended way to sample the prior is via sample_prior_predictive (which uses ancestral sampling instead of MCMC). But I am not sure why MCMC sampling should be quite so ineffective. Yes, it needs to converge, but it needs to converge when sampling the posterior too. And yes, with a broad prior there is a large space for MCMC to explore - but the sheer extent of the difficulties still surprises me.
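For concreteness, this is the ancestral-sampling route I mean; it draws the parameters and passing_items forward from the priors without running MCMC (using the qa_ver0 model defined below):

with qa_ver0:
    prior_draws = pm.sample_prior_predictive()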

I think it’s a general question but for the sake of completeness here is my code. First with observations:

with pm.Model(coords={"observation_sample": range(N), "config": range(N_factor_configs)}) as qa_ver0:
    t_data = pm.ConstantData("obs_times", t, dims="observation_sample")
    y_data = pm.ConstantData("obs_passing_items", y, dims="observation_sample")
    
    baseline = pm.Normal("baseline", mu=-3, sigma=2)
    alpha = pm.Normal("alpha", mu=4, sigma=2)
    
    p = pm.Deterministic("p", pm.math.invlogit(alpha + baseline * t_data / 365.0), dims="observation_sample")
    
    passing_items = pm.Binomial("passing_items", p=p, n=N_samples, observed=y, dims="observation_sample")

with qa_ver0:
    trace_qa_ver0 = pm.sample()

No problems sampling:

Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 5 seconds.

Now if I remove observed data:

with pm.Model(coords={"observation_sample": range(N), "config": range(N_factor_configs)}) as qa_ver0_prior:
    t_data = pm.ConstantData("obs_times", t, dims="observation_sample")
    y_data = pm.ConstantData("obs_passing_items", y, dims="observation_sample")
    
    baseline = pm.Normal("baseline", mu=-3, sigma=2)
    alpha = pm.Normal("alpha", mu=4, sigma=2)
    
    p = pm.Deterministic("p", pm.math.invlogit(alpha + baseline * t_data / 365.0), dims="observation_sample")
    
    passing_items = pm.Binomial("passing_items", p=p, n=N_samples, dims="observation_sample")

with qa_ver0_prior:
    trace_qa_ver0_prior = pm.sample()
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 920 seconds.
The rhat statistic is larger than 1.01 for some parameters. This indicates problems during sampling. See https://arxiv.org/abs/1903.08008 for details
The effective sample size per chain is smaller than 100 for some parameters.  A higher number is needed for reliable rhat and ess computation. See https://arxiv.org/abs/1903.08008 for details

To emphasise the key points here: about 15 minutes to sample, high rhat, low effective sample size. In this specific example there are no divergences, but previously I had also observed divergences.
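(For reference, the divergence count can be read off the trace directly:

int(trace_qa_ver0_prior.sample_stats["diverging"].sum())

which is 0 for this run.)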

(For replication, the data (y, t, N, N_samples etc) are from here)

I suggest plotting the prior draws and seeing what they look like.

To be honest, I am not sure what I am looking for; hence the question, really…

Here is the plot of the prior:

az.plot_trace(trace_qa_ver0_prior.isel(observation_sample=0));

Evidently, baseline and alpha did not converge to Normal(-3, 2) and Normal(4, 2), respectively. But I guess we could have said that before too. There is hardly any mixing and the chains are completely different - but again I am not sure how to link this to the fact that we are sampling from a prior (since that doesn’t happen when I add observed and sample from the posterior).
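To put numbers on that, the diagnostics can be pulled out directly with az.summary (standard ArviZ, using the variable names from the model above):

az.summary(trace_qa_ver0_prior, var_names=["baseline", "alpha"])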

There is definitely a problem with the fact that the baseline and alpha are then fed into a Binomial. If I remove the Binomial entirely then pm.sample() samples Normal(-3,2) and Normal(4,2) just fine:

with pm.Model(coords={"observation_sample": range(N), "config": range(N_factor_configs)}) as qa_ver0_minimal_prior:
    baseline = pm.Normal("baseline", mu=-3, sigma=2)
    alpha = pm.Normal("alpha", mu=4, sigma=2)
    
with qa_ver0_minimal_prior:
    trace_qa_ver0_minimal_prior = pm.sample()

az.plot_trace(trace_qa_ver0_minimal_prior);  # no observation_sample dimension here, so no isel needed

Conversely, if I fix baseline and alpha and sample only from the Binomial, I again have problems (although less extreme):

with pm.Model(coords={"observation_sample": range(N), "config": range(N_factor_configs)}) as qa_ver0_prior_minimal_binom:
    t_data = pm.ConstantData("obs_times", t, dims="observation_sample")
    y_data = pm.ConstantData("obs_passing_items", y, dims="observation_sample")
    
    alpha = 4      # fixed at the prior mean of alpha
    baseline = -3  # fixed at the prior mean of baseline
    
    p = pm.Deterministic("p", pm.math.invlogit(alpha + baseline * t_data / 365.0), dims="observation_sample")
    
    passing_items = pm.Binomial("passing_items", p=p, n=N_samples, dims="observation_sample")

with qa_ver0_prior_minimal_binom:
    trace_qa_ver0_minimal_binom_prior = pm.sample()

az.plot_trace(trace_qa_ver0_minimal_binom_prior.isel(observation_sample=0));
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 550 seconds.
The rhat statistic is larger than 1.01 for some parameters. This indicates problems during sampling. See https://arxiv.org/abs/1903.08008 for details

So that tells me that pm.sample() only recovers the priors when it can use NUTS, but not with Metropolis. Is that correct? If so, why? And why only for priors and not posteriors?
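(A way to test this hypothesis would be to force Metropolis on the minimal Normal-only model and see whether it still recovers the priors, e.g.:

with qa_ver0_minimal_prior:
    trace_metropolis = pm.sample(step=pm.Metropolis())

I haven’t run this; it’s just the experiment I have in mind.)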

The second question that then arises: given that pm.sample() seems to deal fine with sampling the Normals via NUTS but doesn’t deal as well with the Binomial, why do the errors in the Binomial sampling "propagate back" to the Normals (and become exacerbated), even though there is no observed data? Is that just a fundamental feature of MCMC? Is that one of the reasons why we use ancestral sampling for prior sampling - to "decouple" the variables in more complex models?
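To make the "decoupling" idea concrete, by ancestral sampling I mean a single forward pass through the model graph, roughly like this (a plain numpy sketch, reusing t and N_samples from the data above):

import numpy as np

rng = np.random.default_rng()
baseline = rng.normal(-3, 2)
alpha = rng.normal(4, 2)
p = 1 / (1 + np.exp(-(alpha + baseline * t / 365.0)))  # invlogit
passing_items = rng.binomial(N_samples, p)             # one draw per observation_sample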

I guess another way of stating the question would be: if we forget all the structure that we know about the prior, and just treat it as a black box with a density we can evaluate (plus a bit of info about the domain of the prior), how hard is it to sample from that distribution?

I guess intuitively it might seem that the prior should be easier than the posterior, but I don’t think that really needs to be the case. It certainly can be, but it can just as easily be much harder.

Take for instance a simple model like this:

with pm.Model(coords={"group": range(n_groups)}):  # n_groups, group_idx, some_large_dataset are placeholders
    sigma = pm.HalfNormal("group_sigma")
    group_effect = pm.Normal("group_effect", mu=0, sigma=sigma, dims="group")
    
    # Some likelihood that allows precise estimation of sigma:
    pm.Normal("y", mu=group_effect[group_idx], observed=some_large_dataset)

With the dataset, the posterior might have a nice shape, and might be quite easy to sample. But without the data, this is a terrible funnel that will not converge at all!
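A rough sketch of what I mean, with the likelihood dropped and MCMC asked to sample the same priors (n_groups is again a placeholder):

with pm.Model(coords={"group": range(n_groups)}) as funnel_prior:
    sigma = pm.HalfNormal("group_sigma")
    group_effect = pm.Normal("group_effect", mu=0, sigma=sigma, dims="group")

    # Expect divergences and poor mixing: as sigma shrinks towards 0,
    # the group_effect dimensions collapse into a narrow funnel that NUTS
    # has trouble exploring.
    trace_funnel_prior = pm.sample()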

It looks like this is not really what’s happening in your example though. A big difference between sampling the prior and the posterior is that in the first case PyMC will also sample the dataset, whereas when you sample the posterior the dataset is assumed to be fixed. Since you are using a discrete likelihood, sampling the prior means sampling discrete parameters, where NUTS doesn’t work. Sampling the posterior doesn’t have that issue, and we can happily use gradient-based methods.

But sampling distributions with discrete parameters is really hard. So I think it is not at all unexpected that this might fail horribly for even relatively small and simple (looking) problems.
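One quick way to see this in the two models above is to look at the dtypes of the variables PyMC actually has to sample (free_RVs lists the unobserved random variables; an integer dtype means NUTS cannot be assigned to that variable):

print([(rv.name, rv.dtype) for rv in qa_ver0.free_RVs])        # baseline, alpha (floats)
print([(rv.name, rv.dtype) for rv in qa_ver0_prior.free_RVs])  # also passing_items (integers)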
