Learning GP prior covariance hyperparameters -- why does a StudentT noise model perform better than a Normal one?

The original pymc3 example notebook is here: https://docs.pymc.io/notebooks/GP-Latent.html

This is my notebook: https://github.com/leka0024/pymc3/blob/master/latent_GPprior_covHyper.ipynb
My main goal is to learn the GP prior covariance hyperparameters. I don’t care about the noise model hyperparameters, except insofar as learning them helps recover the covariance hyperparameters better.

The four cases in my notebook come from crossing two choices: a StudentT (as in the pymc3 example notebook) or a Normal for the noise model, and either fixing the noise hyperparameters at their true values or putting priors on them and learning them too (though that isn’t really my goal).
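
For concreteness, here is a minimal sketch of what I mean (hypothetical data and priors, not the exact code from my notebook; the gp.Latent setup follows the linked example):

```python
import numpy as np
import pymc3 as pm

# Hypothetical synthetic data standing in for the notebook's setup
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 50)[:, None]
f_true = np.sin(X).ravel()
y = f_true + 0.3 * rng.standard_normal(50)

with pm.Model() as model:
    # GP prior covariance hyperparameters -- these are what I actually care about
    ell = pm.Gamma("ell", alpha=2, beta=1)
    eta = pm.HalfNormal("eta", sigma=2)
    cov = eta**2 * pm.gp.cov.ExpQuad(1, ell)

    gp = pm.gp.Latent(cov_func=cov)
    f = gp.prior("f", X=X)  # latent GP values (theta in the math below)

    # Case: StudentT noise model with priors on its hyperparameters
    sigma = pm.HalfNormal("sigma", sigma=1)
    nu = pm.Gamma("nu", alpha=2, beta=0.1)
    y_ = pm.StudentT("y_obs", mu=f, sigma=sigma, nu=nu, observed=y)

    # The other cases swap this likelihood out, e.g. Normal with the true noise level:
    # y_ = pm.Normal("y_obs", mu=f, sigma=0.3, observed=y)

    trace = pm.sample(1000, tune=1000, target_accept=0.9)
```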

The math, a hierarchical model I believe: $p(\theta, \phi \mid y) \propto p(y \mid \theta)\, p(\theta \mid \phi)\, p(\phi)$
$y$ - the noisy data points, $\theta$ - the latent GP values, $\phi$ - the hyperparameters
$p(y \mid \theta)$ - noise model, $p(\theta \mid \phi)$ - GP prior, $p(\phi)$ - priors on the hyperparameters
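
Written out more explicitly (my reading, with $K_\phi(X, X)$ the kernel covariance matrix built from the hyperparameters $\phi$):

$$
p(\theta, \phi \mid y) \;\propto\;
\underbrace{\prod_{i=1}^{n} p(y_i \mid \theta_i)}_{\text{noise model (Normal or StudentT)}}\;
\underbrace{\mathcal{N}\!\left(\theta \mid 0,\, K_\phi(X, X)\right)}_{\text{GP prior}}\;
\underbrace{p(\phi)}_{\text{hyperpriors}}
$$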

Observations/questions:

  1. Why does the StudentT noise model work better than the Normal? It learns the covariance hyperparameters better, regardless of whether the noise hyperparameters are fixed or learned. In fact, when using the Normal with learned noise hyperparameters, there are stretches of divergences in the main trace.
  2. I know there is also a gp.Marginal implementation that might handle the Normal noise case better than the way I’ve done it, but I don’t see why this way (the “manual” way, perhaps) shouldn’t work just as well?
  3. Does it make sense in this scenario to use just a univariate StudentT or Normal, instead of MvStudentT or MvNormal? My understanding is that a GP is essentially an MvNormal whose dimension equals the length/size of the support (might be the wrong word) -- see the quick check after this list.
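
A quick sanity check with made-up numbers on my understanding for #3: given the latent values, independent noise means the likelihood factorizes, so a product of univariate Normals is exactly an MvNormal with diagonal covariance.

```python
import numpy as np
from scipy import stats

# Made-up values: latent GP values f and noisy observations y at 5 inputs
rng = np.random.default_rng(1)
f = rng.standard_normal(5)
y = f + 0.3 * rng.standard_normal(5)
sigma = 0.3

# Sum of univariate Normal log-densities vs. one MvNormal with diagonal covariance
lp_univariate = stats.norm(loc=f, scale=sigma).logpdf(y).sum()
lp_mvnormal = stats.multivariate_normal(mean=f, cov=sigma**2 * np.eye(5)).logpdf(y)
print(np.allclose(lp_univariate, lp_mvnormal))  # True
```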

I’d appreciate any insights, especially mathematical ones on #3! Thank you


Unfortunately @leka0024 never got a reply here, but I have more or less the same questions.

Maybe someone can help answer them.

My best guesses would be:

  1. StudentT likelihoods have “nicer” gradients when the prediction is far away from the maximum. As $x \to \infty$, $\log\mathrm{pdf}_T(x)$ flattens out (it decays only logarithmically, so its gradient goes to zero), whereas $\log\mathrm{pdf}_{\mathrm{Normal}}(x)$ keeps curving downwards quadratically. I could imagine this being a reason for the divergences (rough illustration below)?
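
A rough numerical illustration of that guess (made-up residuals, StudentT with df=3 as an example): the slope of the Normal log-pdf grows linearly with the residual, while the StudentT slope stays bounded and goes to zero.

```python
import numpy as np
from scipy import stats

# Central-difference gradients of the log-pdfs at increasingly large residuals
x = np.array([1.0, 5.0, 20.0, 100.0])
eps = 1e-6

def grad(logpdf, x):
    return (logpdf(x + eps) - logpdf(x - eps)) / (2 * eps)

print(grad(stats.norm(0, 1).logpdf, x))  # roughly -x: keeps growing in magnitude
print(grad(stats.t(df=3).logpdf, x))     # bounded, tends to 0 for large residuals
```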

cheers
