Question Regarding Predictively Oriented Posteriors

Hey everyone,
I’ve been reading a bit about predictively oriented (PrO) posteriors and trying to understand how they behave beyond the high-level motivation.

From what I understand, the key idea is that instead of concentrating on a single parameter value, the posterior is defined through the induced predictive distribution. As a result, unlike standard Bayesian posteriors, PrO posteriors only collapse to a point mass when the model is exactly well specified; otherwise they stabilise to a non-degenerate distribution, where the remaining spread reflects model misspecification rather than lack of data.

What I’m still unclear about is what this limiting object looks like mathematically in practice. In particular, I’m not sure under what conditions the predictively optimal posterior is unique, or how its variance relates quantitatively to the degree of misspecification. It also isn’t obvious to me how sensitive this behaviour is to the choice of predictive divergence being optimised.

On the computational side, I’ve seen proposals to sample PrO posteriors using mean-field Langevin dynamics, but I’m trying to understand how closely the resulting particle system actually tracks the intended predictive objective, especially in higher-dimensional or misspecified settings.

I’d really appreciate pointers to references, toy examples, or existing implementations that helped others build intuition around these questions. Thanks!


One more thing I wanted to point out: on the GSoC project ideas page where this is mentioned, we are explicitly asked to interact with PyTensor, which is the backend used by PyMC, but the listed link redirects to a page that does not seem to be working (https://www.pymc.io/projects/docs/en/stable/projects/docs/en/v5.0.2/learn/core_notebooks/pymc_pytensor.html). It would be great to get similar sources so I can understand what it was pointing towards.

Hey @Vikram! Interesting topic; I haven’t seen anything on it before, so I can’t help you there. A corrected link to the PyMC and PyTensor notebook is here, though.

@jessegrabowski Thanks for the link! I noticed that the project page mentions potential mentors, but I don’t see their usernames listed here. Could you help tag Osvaldo Martin, Chris Fonnesbeck, and Yann McLatchie?

I think you got the gist of the topic. The PrO posterior is the distribution over parameters whose induced predictive distribution minimizes a chosen proper scoring rule. The limiting PrO posterior is unique if the scoring rule is strictly proper. Its variance reflects irreducible predictive uncertainty due to misspecification: the worse the model can capture the data, the wider the spread. There are still some open questions about the properties of these objects (at least for me), and from my perspective, one goal of the project (besides writing the code) is to better understand some practical consequences and start thinking about good practices and recommendations.
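To build some intuition for the objective described above, here is a minimal toy sketch. To be clear, this is my own construction, not the paper’s algorithm: I approximate the posterior by particles whose induced predictive mixture is fit by plain gradient ascent on the average log score, dropping the entropy term and injected noise that a proper mean-field Langevin scheme would add. The model is a deliberately misspecified single-Gaussian location family fit to bimodal data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Deliberately misspecified setting: data come from a two-component
# mixture, but the model family is a single N(theta, 1).
y = np.concatenate([rng.normal(-2, 1, 200), rng.normal(2, 1, 200)])

K = 50                       # number of particles
theta = rng.normal(0, 3, K)  # particle approximation of the posterior

def predictive_logscore(theta, y):
    """Average log density of y under the induced predictive mixture
    (1/K) * sum_k N(y; theta_k, 1)."""
    d = y[None, :] - theta[:, None]                # shape (K, n)
    comp = -0.5 * d**2 - 0.5 * np.log(2 * np.pi)   # log N(y_i; theta_k, 1)
    m = comp.max(axis=0)
    return (m + np.log(np.exp(comp - m).mean(axis=0))).mean()

def grad(theta, y):
    """Gradient of the average log score w.r.t. each particle."""
    d = y[None, :] - theta[:, None]
    comp = -0.5 * d**2
    w = np.exp(comp - comp.max(axis=0))
    w /= w.sum(axis=0)          # responsibility of particle k for y_i
    return (w * d).mean(axis=1)

before = predictive_logscore(theta, y)
for _ in range(300):            # plain gradient ascent, no noise term
    theta = theta + 0.1 * grad(theta, y)
after = predictive_logscore(theta, y)

# Under misspecification the particles do not collapse to one point:
# they spread out so the predictive mixture covers both modes.
```

The coupling through the shared mixture is what keeps the particles from all chasing the same mode: once one mode is covered, the remaining particles gain more score by covering the other.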


@aloctavodia are there any existing sources that go into this in more detail? I’d really appreciate pointers. In particular, I’m trying to understand whether there are papers, examples, or implementations that show how predictively oriented posteriors are actually constructed or approximated in practice. It would also help to know if there’s already an established way people approach this (for example within probabilistic programming frameworks), or if most of the work is still at the theoretical stage.

We are sailing into uncharted waters.


Hi @Vikram, happy to hear that you are interested in these objects! The paper @aloctavodia and I based the GSoC proposal on is the following ([2510.01915] Predictively Oriented Posteriors), which includes references to where these and similar ideas have been proposed in the past (Section 1.2). In terms of how these have been computed in practice, Section 6 presents an asymptotically exact approach based on Wasserstein gradient flows, and I wrote a short blog post with some code demonstrating how you could implement it in Python, if you are interested.


Hi everyone, thanks a lot for the explanations, references, and help so far — I really appreciate it. I’ve started working on my GSoC proposal and I’m about halfway through it. I’ll send a draft to fonnesbeck+gsoc2026@gmail.com (listed in discourse) by March 20 and would appreciate any feedback before the final submission deadline. Thanks again!

Interesting blog, @yann; it helps a lot!

I don’t have time to read the PrO paper, but I’m familiar with related ideas around optimizing posterior predictive inference rather than just using probability theory over the model.

I want to clarify one point of possible confusion. Standard Bayesian posteriors are distributions and only collapse to delta functions when the prior is a point mass. Were you talking about Bayesian estimates? Those are typically either posterior means (to minimize expected squared error) or posterior medians (to minimize expected absolute error).
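The decision-theoretic fact about means and medians is easy to check numerically: against a fixed set of posterior draws, the draw-average squared error is minimized at the posterior mean and the draw-average absolute error at the posterior median. A quick self-contained check (the gamma draws here are just a stand-in for some skewed posterior):

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-in for 10,000 draws from a skewed posterior.
draws = rng.gamma(2.0, 1.0, 10_000)

grid = np.linspace(0.0, 8.0, 401)   # candidate point estimates
sq_loss = ((draws[None, :] - grid[:, None]) ** 2).mean(axis=1)
abs_loss = np.abs(draws[None, :] - grid[:, None]).mean(axis=1)

best_sq = grid[sq_loss.argmin()]    # minimizer of expected squared error
best_abs = grid[abs_loss.argmin()]  # minimizer of expected absolute error

# best_sq lands at the posterior mean, best_abs at the posterior median
# (up to grid resolution); for a skewed posterior the two differ.
```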

An alternative to changing Bayesian inference to account for model misspecification is to figure out what’s going wrong in the model and fix that. One advantage of the keep-inference-just-probability-theory approach is that you only have one degree of freedom, the model, rather than two, the model and the inference algorithm. I say this after many years of working in ML, where both are often in play, before switching to Bayesian statistics and just following the math.

I skimmed the paper and think it’s confused about Bayesian inference. It begins,

Often motivated through its favourable decision-theoretic properties, the primary goal of Bayesian statistical inference is to quantify uncertainty about a parameter of interest given a set of prior beliefs. All other inferential tasks are derivatives of this primary directive.

Let’s try to unpack this.

  1. This is important. It’s not just prior “beliefs”. We take all of our pre-existing knowledge and use it to define two things: (a) a data generating process p(y \mid \theta), and (b) a prior p(\theta). As a function of the parameters for a fixed data set y, the data generating process defines the likelihood function \mathcal{L}(\theta) = p(y \mid \theta), which is not itself a density.

  2. Downstream inferences conditioned on observed data are usually much more sensitive to the data generating distribution than to the prior distribution. The data generating process and the prior are both “subjective” if you want to use that language. So if you’re going to say that priors involve beliefs, be sure to indicate that the data generating distribution is also based on beliefs. And the beliefs in the data generating distribution are typically much harder to swallow, as they involve clearly false linearity assumptions, independence assumptions, homoskedasticity assumptions, etc.

  3. There is no “primary goal” for Bayesian inference. Exactly the same mathematical process is used to do posterior predictive inference for (a) parameter estimates, (b) event probability estimates, and (c) posterior predictions. Sometimes we might just care about the posterior distribution and not about reductive point estimates.

  4. When we do care about parameter estimation uncertainty, it’s joint, not marginal. We might summarize the marginals, but all of our posterior inference will use the joint distributions. This is important because if you fit a regression with a slope and intercept, they will typically be highly correlated, and you’ll get the wrong inferences if you treat the posterior uncertainty as independent or as non-existent.

  5. Prediction and parameter inference are not fundamentally different. They’re both posterior expectations. For example, a common Bayesian parameter estimate is \widehat{\theta} = \mathbb{E}[\Theta \mid y]. A prediction for new data \tilde{y} is just p(\tilde{y} \mid y) = \mathbb{E}[p(\tilde{y} \mid \Theta) \mid y], and an event probability estimate is just \Pr[\Theta \in A \mid y] = \mathbb{E}[\textrm{I}_A(\Theta) \mid y]. All of these expectations are conditional on the data, hence average over the posterior. I would say if you look at something like the new Bayesian workflow book that’s about to come out, or our workflow paper on which it is based, you’ll see that most of the focus is not on parameter estimation, but on more general forms of prior and posterior predictive inference.
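Point 5 is a few lines of code: given draws from the posterior, the parameter estimate, an event probability, and the predictive density are all Monte Carlo averages over the same draws. A sketch with a conjugate normal toy model (the data values here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Conjugate toy model: theta ~ N(0, 1), y_i | theta ~ N(theta, 1),
# so the posterior is normal in closed form.
y = np.array([0.8, 1.2, 0.5, 1.9, 1.1])
post_var = 1.0 / (1.0 + y.size)
post_mean = post_var * y.sum()
draws = rng.normal(post_mean, np.sqrt(post_var), 50_000)

# (a) parameter estimate: E[theta | y]
theta_hat = draws.mean()

# (b) event probability: Pr[theta > 1 | y] = E[I(theta > 1) | y]
pr_event = (draws > 1.0).mean()

# (c) prediction: p(y_new | y) = E[p(y_new | theta) | y]
y_new = 1.0
p_pred = (np.exp(-0.5 * (y_new - draws) ** 2) / np.sqrt(2 * np.pi)).mean()

# All three are expectations over the same posterior draws.
```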

The paper goes on to say another thing that is not generally true.

If a complete population is observed, then the parameter of interest can be assumed to be known precisely.

  1. First, we can pretty much never measure a whole population of interest. We can try with things like the census, but we never get there. Even if we did, we’re usually interested in superpopulations. That is, even if I measure every hare and lynx in Canada, what I care about is what’s going to happen next year, not to the population at this exact day and time. Even if I did measure that, as soon as an animal eats or dies, the exact population statistics are going to change.

  2. This is also only true if the parameter directly represents some statistic of the data. If I completely observe some measurement over a population, I will know the mean and standard deviation of the population exactly. But if I’m trying to fit a regression, it will be overfit and I won’t know the slope and intercept exactly even if I know the population exactly because there’s error (the measurements of covariates are not sufficient to predict the outcomes).
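The regression point above can be demonstrated with a toy sketch (my own construction): even when the entire simulated “population” is observed, the least-squares coefficients are not the generating ones, because the covariate does not determine the outcome.

```python
import numpy as np

rng = np.random.default_rng(3)

# Treat these N units as the *entire* population: x fully observed,
# outcomes generated with irreducible noise.
N = 1_000
x = rng.uniform(-1, 1, N)
true_a, true_b = 0.5, 2.0
y = true_a + true_b * x + rng.normal(0.0, 1.0, N)

# Least-squares fit on the full population.
X = np.column_stack([np.ones(N), x])
(a_hat, b_hat), *_ = np.linalg.lstsq(X, y, rcond=None)
resid_var = ((y - X @ np.array([a_hat, b_hat])) ** 2).mean()

# The population mean of x and y is known exactly, yet (a_hat, b_hat)
# still differ from (true_a, true_b): the residual variance is not zero,
# so the regression parameters are not a deterministic population statistic.
```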

I couldn’t even figure out what they meant by their next statement, so I’m giving up.

In the context of prediction however, posterior collapse is generally not desirable: it implies that as the sample size increases, marginalising over the Bayes posterior ceases to function as a genuine combination of individual predictive distributions.

I don’t even know what “individual predictive distributions” means. And I’m not sure what they mean by posterior collapse, but I’m guessing it’s what we’d call posterior concentration, which only happens when the preconditions of the Bernstein-von Mises theorem are satisfied. This will be a problem if the model’s wrong. If the model’s right, our posterior inferences are calibrated by construction. When the model’s not right, an alternative strategy is to try to understand where the model is breaking down and fix it. I would suspect this will lead to much better performance than trying to adjust bad inferences. For example, I might be using a linear regression when the results are non-linear. Moving to a non-linear regression is going to be much better than trying to fix the inferences of a linear regression.
