Apologies in advance for a potentially naive/incoherent question: I’m a scientist with quite a lot of experience in least-squares-type analysis, but I am new to Bayesian modelling and MCMC methods. Thanks to the many excellent resources around, I am slowly getting the hang of things, but I am struggling to develop an intuitive understanding of how a likelihood distribution is generated from observed data, and how it is used in subsequent sampling.

For example, in the linear regression example from the docs, a likelihood is defined as a normal distribution over the observed data.
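To make the question concrete, here is a small numpy sketch of what I mean (not the actual docs code; the names `alpha`, `beta`, `sigma` and the data are my own): the observed data are fixed, and the likelihood is evaluated as a function of the parameters.

```python
import numpy as np

def log_likelihood(alpha, beta, sigma, x, y):
    """Log-likelihood of observed y under y ~ Normal(alpha + beta*x, sigma).

    x and y are fixed observed data; alpha, beta and sigma are the
    free parameters that the sampler moves around.
    """
    mu = alpha + beta * x
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (y - mu) ** 2 / (2 * sigma**2))

# Fixed observed data, simulated from alpha=1, beta=2, sigma=0.5
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, size=50)

# The same fixed data assign different likelihoods to different parameters:
print(log_likelihood(1.0, 2.0, 0.5, x, y))   # near the true parameters
print(log_likelihood(0.0, 0.0, 0.5, x, y))   # far from the true parameters
```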

I am not sure I fully understand how the observed data are used to generate this distribution that is then sampled from. Is the idea that:

1. We assume the data are normally distributed, with a mean given by a combination of our parameters and a sigma corresponding to (for example) experimental uncertainty?

2. The observed data define this distribution? Is there some initial optimisation/fitting step that works out the normal distribution that best fits our data?

I think what I am struggling with most is that I am used to thinking about these problems from an optimisation/minimisation point of view. To my mind, it feels like there must be a step in which the likelihood is pre-conditioned on our observed data in order to produce a distribution that can be sampled from, but perhaps this is an incorrect way to think about it. I’ve looked at many, many resources, and none that I can find seem to explain this concept in detail, so I suspect it is really obvious and I am just not seeing it!

Any insight anyone can provide as to how to intuitively think of this process (especially as compared to a general least-squares optimisation type approach) would be really appreciated. Not sure I have explained what I mean brilliantly…

The observed values, together with a parametrized likelihood function, define the model likelihood, which is used for posterior sampling. In this case the model is specified as having a normal likelihood parametrized by mu and sigma, and as having observed specific values.

The posterior distribution of the unobserved variables that ultimately parametrize the likelihood will correspond to a compromise between their prior probability and the values under which the model likelihood is highest.

The analogy to vanilla optimization is that you want to optimize the unobserved variables so as to produce the highest likelihood for the observed data (which is roughly analogous to a cost function), while the prior defines a sort of regularization on those variables.
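That analogy can be written down directly (a toy sketch of my own, assuming a normal likelihood with known noise sd and zero-mean normal priors): maximizing log-prior + log-likelihood is the same as minimizing a penalized sum of squares.

```python
import numpy as np

# Simulated data from alpha=0.5, beta=3.0
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 40)
y = 0.5 + 3.0 * x + rng.normal(0, 0.3, size=40)
sigma = 0.3       # assume the noise sd is known
prior_sd = 1.0    # Normal(0, prior_sd) prior on both parameters

def neg_log_posterior(alpha, beta):
    resid = y - (alpha + beta * x)
    nll = np.sum(resid**2) / (2 * sigma**2)          # the "cost function" part
    reg = (alpha**2 + beta**2) / (2 * prior_sd**2)   # the prior as a regularizer
    return nll + reg

# A crude grid "optimizer", just to show the structure of the objective
alphas = np.linspace(-1, 2, 61)
betas = np.linspace(0, 5, 101)
grid = np.array([[neg_log_posterior(a, b) for b in betas] for a in alphas])
i, j = np.unravel_index(grid.argmin(), grid.shape)
print(alphas[i], betas[j])   # MAP estimate of (alpha, beta)
```

MCMC does not stop at this single maximizing point, though; it draws samples from the whole posterior surface that this objective describes.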

The likelihood / observed distribution is a fixed component of the model. Nothing about it is inferred other than its parameter values.

Thanks for the quick reply, that does help a lot. So if I were to try to explain the whole process by analogy to a vanilla optimisation problem (which I foresee I’ll have to do for reviewers in my field!), it would be something like:

Rather than simply minimising the difference between a model and some data to determine the best set of parameters, we are trying to find the set of parameters that maximise the likelihood for the observed data, subject to our prior belief about those parameters.

I suppose the closest analogy within a vanilla optimisation problem would be optimisation with bounds that restrict the “allowed” parameter values to physically reasonable ones matching our prior beliefs? However, that is much clunkier: it gives no information about the wider parameter space, and it allows no subtlety in encoding how likely we believe certain parameter values are.

For context (apologies if this is going off topic!), I’m particularly interested in applying these methods to problems that can end up being relatively ill-behaved under standard minimisation methods (especially common in spectroscopic and microscopic data analysis). A method I’ve previously used is normal least-squares optimisation to fit data to a model, but with the least-squares algorithm run over and over from many different initial starting points (within physically reasonable bounds), so that the entire parameter space is explored and you get a feel for the sensitivity of the fitted results to the starting point. Obviously, these MCMC methods are much better suited to that kind of problem, as they explore the parameter space much more effectively (and are a lot less computationally expensive). Thanks a lot for the quick reply!

Yes. The posterior distribution is directly proportional to prior probability * likelihood. Nothing fancier than that (it’s just damn hard to sample from :D).
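That identity can be seen directly on a grid, at least in one dimension where sampling isn’t needed. A toy example of my own (one unknown mean, normal prior, normal likelihood with known sd):

```python
import numpy as np

data = np.array([2.1, 1.9, 2.3, 2.0])   # observed values
sigma = 0.5                             # assumed known noise sd
mu_grid = np.linspace(-2, 4, 2001)

log_prior = -0.5 * mu_grid**2 / 1.0**2  # mu ~ Normal(0, 1) prior
log_lik = np.sum(-0.5 * (data[:, None] - mu_grid[None, :])**2 / sigma**2,
                 axis=0)

log_post = log_prior + log_lik          # posterior ∝ prior * likelihood
post = np.exp(log_post - log_post.max())
post /= post.sum()                      # normalize on the grid

map_mu = mu_grid[post.argmax()]
print(map_mu)   # sits between the prior mean (0) and the data mean (~2.08)
```

The posterior mode landing between the prior mean and the data mean is exactly the “compromise between prior and likelihood” described above.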

Bounds correspond to cases where either the prior probability or the likelihood is zero. But as you said, you can convey much richer information than just bounds, for instance that prior values in range A are twice as probable as those in range B.
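As a small sketch of both points (the ranges here are hypothetical, chosen just for illustration): a bound is simply a prior that is zero outside the allowed range, and graded beliefs are just unequal prior density.

```python
def prior(theta):
    """Unnormalized prior density: zero outside [0, 10] (a hard bound),
    and twice as probable in [0, 5) as in [5, 10]."""
    if theta < 0 or theta > 10:
        return 0.0           # bound: impossible values, like an optimizer bound
    return 2.0 if theta < 5 else 1.0

print(prior(-1))   # 0.0 -> ruled out entirely
print(prior(3))    # 2.0 -> range A
print(prior(7))    # 1.0 -> range B, half as probable as range A
```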

Perhaps an instructive exercise is to think about the connection between classical ridge regression and Bayesian linear regression. The regularization penalty in ridge regression has the same effect as placing normal priors on your parameters.
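One way to check this numerically (a sketch under my own assumptions: known noise sd sigma and zero-mean normal priors with sd tau on the coefficients) is that the ridge estimate with penalty lam = sigma²/tau² is exactly the posterior mode:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
beta_true = np.array([1.0, -2.0, 0.5])
sigma, tau = 0.4, 1.5
y = X @ beta_true + rng.normal(0, sigma, size=100)

lam = sigma**2 / tau**2   # ridge penalty implied by the prior
# Classical ridge: argmin ||y - X b||^2 + lam ||b||^2
ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# Bayesian MAP under b ~ Normal(0, tau^2 I), y ~ Normal(X b, sigma^2 I):
# maximizing the log posterior gives the same normal equations
map_est = np.linalg.solve(X.T @ X / sigma**2 + np.eye(3) / tau**2,
                          X.T @ y / sigma**2)

print(np.allclose(ridge, map_est))   # True
```

Multiplying the MAP normal equations through by sigma² recovers the ridge equations exactly, which is why the two solves agree.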

So normal priors do have a regularizing effect on your model.

Another example: if you have collinearity in some of your covariates, then in the Bayesian linear regression setting the posteriors for the slopes (effect sizes) of those covariates are heavily influenced by your choice of priors. Unlike classical linear regression (without regularization), you might still get very certain estimates for the effect sizes of collinear covariates if your priors are very informative.
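A toy sketch of that effect (my own example, using the conjugate normal model with known noise sd so the posterior covariance has a closed form): with two nearly identical covariates, the classical OLS standard errors blow up, while an informative prior keeps the posterior sds small.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(0, 0.01, size=n)   # nearly collinear with x1
X = np.column_stack([x1, x2])
sigma = 1.0
y = 1.0 * x1 + 1.0 * x2 + rng.normal(0, sigma, size=n)

# Classical OLS: the individual slopes are barely identified
ols_cov = sigma**2 * np.linalg.inv(X.T @ X)

# Conjugate posterior covariance with informative Normal(0, tau^2) priors
tau = 0.5
post_cov = np.linalg.inv(X.T @ X / sigma**2 + np.eye(2) / tau**2)

print(np.sqrt(np.diag(ols_cov)))    # large standard errors
print(np.sqrt(np.diag(post_cov)))   # much smaller posterior sds
```

Only the sum of the two slopes is well determined by the data here; the prior is what pins down the individual values, which is exactly the “very certain estimates from informative priors” effect described above.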