General advice on modifying priors to avoid multiple modes/peaks for parameters?

I don't like to ask such questions lightly, but despite my own research and efforts, I would appreciate some experience-backed advice on how best to change priors to improve a model’s fit.

For the following sine-wave data, I fit a basic GP with a Periodic kernel, and all the results (under many different sampler settings) produced good-looking posterior plots:

[Figure: GP fit to the sine-wave data]

However, having generated the data myself, I knew that the estimate of the period in the periodic kernel should have been around 6, yet I consistently got traceplots that were multimodal / had multiple peaks:

[Figure: traceplot of period showing multiple modes]

I ended up fixing this by specifying a tighter pm.HalfNormal() prior on period, and the following results look good:

[Figure: traceplot of period after tightening the prior]

So my general question, on which I would greatly appreciate any advice, is two-fold:

  1. In which situations might parameters genuinely have multiple posterior modes (apart from explicit mixture models)? And, if multiple peaks are just a symptom of MCMC difficulties,
  2. what is a good heuristic for modifying priors to get rid of multiple peaks? Is there even one?

For 2, I naively specified a prior concentrated closer to 0, and that happened to work: but what if I had accidentally concentrated my prior in the other direction, so that the original secondary peak around 12.5 became the only one the chains picked up?

Code:

import numpy as np
import matplotlib.pyplot as plt

## Data generation
x = np.linspace(0, 40, num=300)
noise1 = np.random.normal(0, 0.3, 300)
y = np.sin(x) + noise1

# Add a quadratic trend (plus extra noise) to the second half of the series
temp = x[150:]
noise2 = 0.004 * temp**2 + np.random.normal(0, 0.1, 150)
y[150:] = y[150:] + noise2

true_line = np.sin(x)
true_line[150:] = true_line[150:] + 0.004 * temp**2

# Keep only the purely sinusoidal first half for the GP fit
x_sin = x[:150]
y_sin = y[:150]

X_sin = np.expand_dims(x, axis=1)
Y_sin = np.expand_dims(y, axis=1)

test_X_sin_1dim = np.linspace(-20, 40, 500)
test_X_sin_2dim = np.expand_dims(test_X_sin_1dim, axis=1)

plt.plot(x_sin, y_sin)

import pymc3 as pm

## Model with correct prior on period
with pm.Model() as gp_model:

    ## Periodic (seasonal) component

    period = pm.HalfNormal("period", sd=5)

    ℓ_psmooth = pm.Gamma("ℓ_psmooth", alpha=4, beta=3)

    cov_seasonal = pm.gp.cov.Periodic(1, period, ℓ_psmooth)
    gp_seasonal = pm.gp.Marginal(cov_func=cov_seasonal)

    σ = pm.HalfNormal("σ", sd=10, testval=5)

    gp = gp_seasonal

    y_ = gp.marginal_likelihood("y", X=x_sin[:, None], y=y_sin, noise=σ)

In general, these are unidentifiable models with multiple modes. Usually they are models with exchangeable parameters, mixture models with unconstrained parameters, models with a rotation or normalization in them, etc. You can see another example here: Inferencing trigonometric time series model

An informative prior is the answer, and the way to choose the right one is to use prior predictive samples. You should have a look at [1708.07487] The prior can generally only be understood in the context of the likelihood.

In practice, besides following good practice, a lot of trial and error is unavoidable. In the case of GPs, inference with MCMC is known to be difficult and you really need strong priors. You can have a look at Mike Betancourt’s Robust GP series: Robust Gaussian Processes in Stan
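For example, a minimal prior predictive check for the gp_model above could look like the sketch below (just a sketch, not the one true recipe): it only draws from the priors, before any fitting, and inspects what they imply for period.

import numpy as np
import pymc3 as pm

# Draw from the priors (no conditioning on data) and check whether the
# implied periods are in a plausible range for the problem.
with gp_model:
    prior_draws = pm.sample_prior_predictive(samples=500)

periods = prior_draws["period"]
print("95% of prior mass on period:", np.percentile(periods, [2.5, 97.5]))
# If most of the prior mass sits far from any plausible period (here roughly
# 2*pi ≈ 6.3), it is worth rethinking the prior before running the sampler.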


Thanks so much for sharing those links, junpenglao! narendramukherjee’s post on his model issue, and the visualizations that follow it, are helpful for building intuition; they also got me thinking about other things, but that is a digression.

Regarding the paper by Gelman, Simpson, and Betancourt: I have read it several times before, and it is full of unexpected insights and reads well. Having re-read it, however, I am still a little confused about its main point. Could you explain what you mean by “using prior predictive samples”? I assume this is what the paper refers to by “considering the context of the likelihood”?

Does this just mean generating samples from one’s model before fitting it to any data? While it seems prudent to check that the model generates something akin to the data one has (that everything seems in order), it also feels like double-counting the data: fitting to it before you fit to it.

Yes - there is a somewhat more formal way to do this; you can check this paper on visual Bayesian workflow: https://arxiv.org/pdf/1709.01449.pdf

There is ongoing debate on this (e.g., see Against Arianism 3: Consider the cognitive models of the field | Statistical Modeling, Causal Inference, and Social Science). Personally, I take it as a way to validate your intuition about your prior: the aim is not to come up with a prior that exactly generates the data, but to eliminate priors that generate impossible data. For this purpose you don’t even need the actual data (for example, if you are estimating a human population and your prior gives values larger than the total population on Earth), which avoids the problem of peeking at the data or using it twice.
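A toy illustration of that idea for the period prior in this thread (just a sketch; the wide prior below is hypothetical, not something from the model above):

import numpy as np

# Compare what two candidate priors for `period` imply on their own,
# without touching the observed data.
rng = np.random.default_rng(0)
tight = np.abs(rng.normal(0, 5, size=10_000))    # HalfNormal(sd=5), as in the model
wide = np.abs(rng.normal(0, 100, size=10_000))   # a hypothetical, much vaguer HalfNormal

# The training segment only spans about 20 units of x, so periods of several
# hundred could never be identified from it and can be ruled out up front.
print("tight prior 95% interval:", np.percentile(tight, [2.5, 97.5]))
print("wide prior 95% interval:", np.percentile(wide, [2.5, 97.5]))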

Taking this a bit further, I would even argue that there is no problem with using the data twice: if you come up with a prior that produces a data likelihood exactly matching the observed data, sampling with this prior would just give you back the same distribution as the input prior (well, in practice the MCMC samples also capture the covariance in the posterior). You are not gaining information by using the data twice (you don’t run into problems like inflated false positives, as you would in a frequentist framework).


Would a Fourier transform help?
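For example (just a rough sketch, reusing x_sin and y_sin from the data-generation code above), a periodogram gives a ballpark period estimate that could then inform the prior:

import numpy as np

# Rough periodogram-based period estimate from the training segment.
dx = x_sin[1] - x_sin[0]                         # sample spacing
freqs = np.fft.rfftfreq(len(y_sin), d=dx)        # cycles per unit of x
power = np.abs(np.fft.rfft(y_sin - y_sin.mean()))**2

dominant = freqs[np.argmax(power[1:]) + 1]       # skip the zero-frequency bin
print("rough period estimate:", 1.0 / dominant)  # should land near 2*pi ≈ 6.3 here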