Why does MCMC without observations have difficulty sampling?

I guess another way of stating the question would be: if we forget all the structure that we know about the prior, and just treat it as a black box with a likelihood (plus a bit of info about the domain of the prior), how hard is it to sample from that distribution?

I guess intuitively it might seem that the prior should be easier than the posterior, but I don’t think that really needs to be the case. It certainly can be, but it can just as easily be much harder.

Take for instance a simple model like this:

```python
import pymc as pm

# `group_labels`, `group_idx` and `some_large_dataset` stand in for your actual data.
with pm.Model(coords={"group": group_labels}) as model:
    sigma = pm.HalfNormal("sigma")
    group_effect = pm.Normal("group_effect", mu=0, sigma=sigma, dims="group")

    # Some likelihood that allows precise estimation of sigma:
    pm.Normal("y", mu=group_effect[group_idx], sigma=1, observed=some_large_dataset)
```

With the dataset, the posterior might have a nice shape and be quite easy to sample. But without the data, this is a terrible funnel that will not converge at all!
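To see why, here is a minimal NumPy sketch (a standalone stand-in for the model above, not your actual code): forward-sampling the prior is trivial, but the conditional scale of `group_effect` that an MCMC sampler has to handle is `sigma` itself, and across typical prior draws that scale spans a couple of orders of magnitude, so no single step size works everywhere in the funnel.

```python
import numpy as np

rng = np.random.default_rng(0)

# Forward (ancestral) draws from the funnel prior are easy:
sigma = np.abs(rng.normal(size=100_000))   # HalfNormal(1) draws
effect = rng.normal(0.0, sigma)            # Normal(0, sigma) given sigma

# But an MCMC sampler exploring the *joint* density must cope with the
# conditional scale of `effect`, which is just sigma.  Across typical
# prior draws that scale varies enormously:
lo, hi = np.quantile(sigma, [0.01, 0.99])
print(hi / lo)  # a step size good at the wide top is far too big at the neck
```

This is exactly the pathology that non-centered parameterizations are designed to remove.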

It looks like this is not really what’s happening in your example, though. A big difference between sampling the prior and the posterior is that in the first case PyMC will also sample the dataset, while when you sample the posterior the dataset is held fixed. Since you are using a discrete likelihood, sampling the prior means sampling discrete parameters, where NUTS doesn’t work. Sampling the posterior doesn’t have that issue, and we can happily use gradient-based methods.
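As a concrete illustration of that difference (a hypothetical Poisson-count model, not your model): forward-simulating a discrete-likelihood model draws the data variable too, so a model "without observations" has discrete free variables, while conditioning on observed counts leaves only the continuous parameter free.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical Poisson-count model, just to illustrate the point:
lam = rng.gamma(2.0, 2.0, size=1000)   # continuous rate parameter
y_prior = rng.poisson(lam)             # the "data" is sampled too -> discrete draws

# In the posterior, y is fixed to the observed counts; the only free
# variable left is the continuous `lam`, so gradient-based samplers apply.
print(y_prior.dtype)  # an integer dtype: the prior includes discrete variables
```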

But sampling distributions with discrete parameters is really hard. So I think it is not at all unexpected that this might fail horribly even for relatively small and simple(-looking) problems.
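Without gradients, samplers have to fall back on something like random-walk Metropolis for the discrete variables. Here is a minimal sketch of that kind of move on a toy Poisson(5) target (my own illustration, not PyMC's actual implementation):

```python
import numpy as np
from math import lgamma, log

rng = np.random.default_rng(1)

# Unnormalized log-probability of a toy discrete target: Poisson(5).
def logp(k, lam=5.0):
    if k < 0:
        return -np.inf
    return k * log(lam) - lam - lgamma(k + 1)

k = 5
samples = []
for _ in range(50_000):
    prop = k + rng.choice([-1, 1])                 # integer random-walk proposal
    if rng.uniform() < np.exp(min(0.0, logp(prop) - logp(k))):
        k = prop                                   # Metropolis accept/reject
    samples.append(k)

print(np.mean(samples))  # close to the target mean of 5
```

Each move can only step by ±1 and there is no gradient information to guide it, so mixing degrades quickly as the discrete state space grows or becomes multimodal.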
