# Common reasons for getting a MAP estimate that is far from the mode of the posterior

In many of the datasets with which I am working, I have observed that the mode of the parameter posterior often differs significantly from the MAP estimate. I am curious to know the common reasons for this phenomenon and what factors I should investigate.

I assume a reasonable answer would be as follows:

The MAP estimate corresponds to the point in the parameter space that maximizes the posterior distribution, while the mode of the posterior distribution is the point in the parameter space where the distribution is at its maximum.

In some cases, the MAP estimate may be far from the mode of the posterior distribution, particularly when the posterior distribution is skewed or has multiple peaks. This can happen when the prior distribution strongly influences the posterior distribution or when there is insufficient data to fully constrain the posterior distribution.

Furthermore, I have also noticed that the MAP estimates demonstrate superior predictive power in rolling-window cross-validation on unseen time-series data when compared to the mode-of-posterior estimates. This observation appears peculiar to me. Can anyone shed some light on this behaviour?
Consider the posteriors below:

and the MAP estimates below:

The MAP is the mode of the posterior distribution. Can you provide a brief example to illustrate what you are seeing? In practice, one must estimate the MAP and thus the estimate may deviate from the true MAP (and from other estimates of the MAP).

Note that the MAP is the mode of the full, multivariate posterior distribution. In your case, this posterior lives in a parameter space that is at least 13-dimensional. It is not possible to visualize this posterior distribution. Instead, we visualize low-dimensional "projections" of this high-dimensional distribution (averaging, or marginalizing, over the other parameters). The 13 distributions you have plotted are (I'm reasonably sure) the marginal posteriors associated with each of the 13 parameters. The mode of the entire distribution (i.e., the MAP) need not be related in any simple way to the modes of these individual, marginal distributions. Perhaps this was what you were referring to when you mentioned multi-modality?
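To see how the joint mode can sit away from the marginal modes, here is a minimal toy example with a hand-built discrete joint pmf (the numbers are invented purely for illustration):

```python
import numpy as np

# Hand-built joint pmf over (x, y), each in {0, 1, 2}.
# Rows index x, columns index y. Probabilities sum to 1.
joint = np.array([
    [0.14, 0.14, 0.14],
    [0.14, 0.00, 0.07],
    [0.14, 0.07, 0.16],
])
assert np.isclose(joint.sum(), 1.0)

# Joint mode: the single most probable (x, y) pair.
joint_mode = tuple(int(i) for i in np.unravel_index(joint.argmax(), joint.shape))
print(joint_mode)  # (2, 2), with probability 0.16

# Marginal modes: maximize each 1-D marginal separately.
marg_x = joint.sum(axis=1)  # approx [0.42, 0.21, 0.37]
marg_y = joint.sum(axis=0)  # approx [0.42, 0.21, 0.37]
print(marg_x.argmax(), marg_y.argmax())  # 0 0

# The point assembled from the marginal modes, (0, 0), has joint
# probability 0.14 -- it is NOT the joint mode (2, 2).
```

The same effect occurs (usually more subtly) with continuous posteriors whenever parameters are correlated, which is why reading modes off the 13 marginal plots does not recover the MAP.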

Does that help?


Ahh, of course, that makes sense. Follow-up question: indeed, it is evident that obtaining the posterior distribution for each variable and then extracting the mode (by employing, for instance, `scipy.stats.mode`) will not yield the maximum a posteriori estimate for that parameter. This is also connected to my second assertion regarding predictive power, and accounts for my reduced predictive ability when relying on mode estimates of the parameters via the aforementioned approach. I wonder if any relevant functions are available for obtaining the maximum a posteriori estimate from a trace, or if ArviZ has conducted any investigations in this direction. Essentially, my aim is to construct a highest density interval around the maximum a posteriori estimate, with appropriate guardrails set at, for instance, 45%-55%. Consequently, I need to acquire the complete posterior distribution but restrict my attention to the 45-55% interval for each parameter surrounding the mode.

Approaches such as this are certainly possible (possibly with some difficulty). However, there is no guarantee that parameter values near the MAP are particularly credible. In fact, there's no particular reason to think that the MAP itself is much more credible than other parameter values. This becomes increasingly true as the number of parameters increases. The easiest way to find credible parameter values is to sample from the posterior. Perhaps if you describe why you want the MAP + nearby parameter values, someone might suggest an easier/more appropriate approach.
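That said, one rough way to approximate the joint MAP from a trace is to evaluate the joint log-posterior at every draw and keep the argmax. In PyMC v5 you could obtain a compiled log-density with `model.compile_logp()`; the sketch below uses a hypothetical `log_post` stand-in and fake draws so it runs on its own:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a compiled joint log-posterior; in PyMC v5 you could
# build one with model.compile_logp(). Here: independent standard normals.
def log_post(theta):
    return -0.5 * np.sum(theta ** 2)

# Stand-in for the flattened posterior draws from a trace,
# shape (n_draws, n_params).
draws = rng.normal(size=(4000, 13))

# Evaluate the joint log-posterior at every draw; keep the best one.
log_ps = np.array([log_post(d) for d in draws])
map_like = draws[log_ps.argmax()]
```

Note this is only as good as the sampler's coverage: it returns the most credible *sampled* point, not the true maximizer of the posterior density.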

The primary reason for my keen interest in the MAP approach stems from two distinct advantages. Firstly, it offers significant speed enhancements. Secondly, it possesses what I refer to as "predictive power" (which I shall elaborate on below).

As part of a larger system, I run a model-selection task evaluating around 100-200 models, with the ultimate aim of capturing the effects of specific inputs on the response, and hence accurately identifying parameter values. With certain datasets, the runtime for specific models with default sampling settings (1k tune, 1k samples, 4 cores) when sampling from the posterior may reach 20-40 minutes, whereas with MAP this can be accomplished within 1-10 seconds, allowing me to run 100-200 more iterations during the model selection.

Moreover, the metric utilized for model selection is a rolling-window cross-validation with mean absolute error (MAE) or mean squared error (MSE) as the score and a lookahead of 10-50 days, because the model must act as a predictor over that horizon in a downstream task. However, what troubles me about sampling from the full posterior and then calculating the mean or mode of each parameter (this seems suboptimal due to independence assumptions; I assume calculating the centre of gravity of this multidimensional space would be better?) is that it usually yields inferior estimates in terms of the metric presented.
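For concreteness, a rolling-window evaluation of the kind described might look like the sketch below; the constant-mean forecaster is a placeholder for whichever MAP- or posterior-based predictor is being scored:

```python
import numpy as np

def rolling_window_mae(y, window=100, lookahead=10):
    """Score a forecaster by rolling-origin cross-validation."""
    errors = []
    for start in range(0, len(y) - window - lookahead + 1, lookahead):
        train = y[start:start + window]
        future = y[start + window:start + window + lookahead]
        # Stand-in predictor: forecast the window mean for every step.
        pred = np.full(lookahead, train.mean())
        errors.append(np.mean(np.abs(future - pred)))
    return float(np.mean(errors))

rng = np.random.default_rng(1)
series = rng.normal(size=500).cumsum()  # toy random-walk series
print(rolling_window_mae(series, window=100, lookahead=10))
```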

While I have contemplated turning to Bayesian model-selection techniques such as the Bayesian information criterion (BIC) or leave-one-out cross-validation (LOO), I feel that the presented metric is more aligned with the ultimate goal. Nonetheless, I remain open to changing my perspective on this matter.

Also, I would like to add that I highly appreciate your effort in teaching us beginners; it's great to have a community such as this one. Thank you.

I think the standard Bayesian approach would be to sample from the posterior and then calculate the MAE/MSE separately for each sample in your posterior. That would yield a distribution of MAE/MSE values. This propagates your uncertainty through the rest of your pipeline in the appropriate manner and allows you to make more meaningful decisions at the end.
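In other words, rather than collapsing the posterior to one parameter vector, push every draw through the prediction step. A toy sketch, where `predict` is a hypothetical forecaster and the draws are simulated:

```python
import numpy as np

rng = np.random.default_rng(2)

y_test = rng.normal(loc=1.0, size=50)                     # held-out observations
posterior_draws = rng.normal(loc=1.0, scale=0.1, size=(4000, 1))

def predict(theta, horizon):
    # Hypothetical forecaster: here a constant-mean model.
    return np.full(horizon, theta[0])

# One MAE per posterior draw -> a distribution of scores,
# not a single number.
maes = np.array([
    np.mean(np.abs(y_test - predict(theta, len(y_test))))
    for theta in posterior_draws
])
print(maes.mean(), np.quantile(maes, [0.05, 0.95]))
```

Comparing models on these score distributions (rather than point scores) makes it visible when two models have similar central performance but very different uncertainty.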

Yeah, this approach is not recommended for the reasons you mention (which are related to the explanation for the modes of the marginal distributions not matching the MAP).

Speed can definitely be a concern. You might consider something like variational inference if you feel like it would be appropriate for the models you are using (e.g., lots of normal posteriors). Alternatively, you can look into the new backends and samplers PyMC has recently made available (JAX, nutpie, etc.). This recent PyMCon Web Series event covered these in some detail.

Using the MAP can be ok, but the important thing to keep in mind is that doing so throws a tremendous amount of information away. You can fit 2 models to the same data set and calculate their MAPs and associated performance metrics. But one of those models may have assigned extremely high credibility to the MAP parameter values whereas the MAP of the other model may have been just as good (or bad) as any other set of parameter values. To put it succinctly: do you want a (likely) good answer slowly or a (potentially) incorrect answer quickly?

Lately I have been trying out the full Bayesian approach and have hit a bottleneck. But let's state the good things first: by using the JAX backend and `sample_blackjax_nuts`, I got the sampling time down by a factor of 10, making it possible for me to go fully Bayesian in the model-selection part. Reading through some of the awesome docs out there on PSIS-LOO, I decided that it is fine in my scenario, but that I will fall back to k-fold CV if supposedly good models produce a lot of strange Pareto-k values (the reloo scheme can come to the rescue). Now to the bottleneck: in my task, I have a downstream optimization problem which needs to evaluate the predictor a LOT. Currently I am setting new data (proposed by the optimizer) via `set_data` followed by `sample_posterior_predictive`, but this is far too slow (3-4 s per evaluation). Do you have any ideas on how to tackle this (using v5)? The obvious ones are reducing the sample size, thinning, etc., but maybe there is something else that I am not aware of?

Every time you call posterior predictive, a lot of things happen, including compiling the forward function and converting your posterior trace into something easier to work with.

If you are calling posterior predictive many times, you should reuse those two.

If you look at the source code of posterior predictive, you want to intercept these variables (you can just copy-paste the code and make it return them to you).

These are the parsed trace and the compiled forward function. Then you can just set your shared data and feed the trace into the function at will (you can also do dynamic thinning at this point by skipping some points in the trace).
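The "compile once, reuse many times" pattern, with optional thinning, can be sketched generically like this (plain Python stand-ins, not the actual PyMC internals):

```python
import numpy as np

rng = np.random.default_rng(3)

def compile_forward():
    """Stand-in for the expensive compile step that
    sample_posterior_predictive otherwise repeats on every call."""
    def forward(theta, x):
        return theta[0] + theta[1] * x  # toy linear predictor
    return forward

trace = rng.normal(size=(4000, 2))   # stand-in for the parsed posterior draws
forward = compile_forward()          # pay the compile cost once

def posterior_predictive(x, thin=1):
    # Reuse the compiled function; optionally thin the trace.
    return np.array([forward(theta, x) for theta in trace[::thin]])

# Many cheap evaluations with new data from the optimizer:
preds_full = posterior_predictive(2.0)
preds_thin = posterior_predictive(2.0, thin=10)
print(preds_full.shape, preds_thin.shape)  # (4000,) (400,)
```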

Another thing you can try is to compile your forward function to JAX (by passing `compile_kwargs=dict(mode="JAX")`), which could be faster.

There are other solutions from this point on, but they become more complex so I would check if this gets you there.
