How to reference posterior mode value without .find_MAP?

Hi,

Foreword: I’m very new to Bayesian statistics and coding.

The values displayed by pm.plot_posterior() when the mode is set as the point estimate are vastly different from those returned by .find_MAP() for my model.

My model is likely rubbish, but for the time being I want to be able to reference the values displayed by pm.plot_posterior(). How can I obtain these values?

For some additional context I am trying to produce an L1 regularised Bayesian linear model for comparison with a classical L1 regularised linear model.

My data have been centred and min-max scaled and the model does not contain an intercept.

I’ve actually gone about extracting the relevant values from L1_blm_trace (see below) and then used sklearn’s KDE (kernel="gaussian", which isn’t a perfect approximating distribution as the lam_sq values are strictly positive) to try to obtain the value of lam_sq that returns the maximum value of the log probability. The value returned by sklearn’s KDE is much closer to the pm.plot_posterior() mode than the one returned by .find_MAP(), but still not close enough.
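
Roughly, the KDE step looked something like this (a simplified sketch rather than my exact code; the bandwidth is an arbitrary placeholder and the trace is the one from the model below):

import numpy as np
from sklearn.neighbors import KernelDensity

# Posterior draws of lam_sq, reshaped to a column for sklearn
lam_sq_samples = L1_blm_trace["lam_sq"].reshape(-1, 1)

# Gaussian KDE over the draws (bandwidth chosen arbitrarily here)
kde = KernelDensity(kernel="gaussian", bandwidth=0.05).fit(lam_sq_samples)

# Evaluate the log-density on a grid and take the argmax as the mode estimate
grid = np.linspace(lam_sq_samples.min(), lam_sq_samples.max(), 1000).reshape(-1, 1)
lam_sq_mode = grid[np.argmax(kde.score_samples(grid)), 0]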

Like I said, my model is probably rubbish, but it would be great to know how pm.plot_posterior() is getting its value for the mode.

import pymc3 as pm
import theano.tensor as tt

L1_blm = pm.Model()

with L1_blm:
    # Park & Casella Bayesian lasso hierarchy
    lam_sq = pm.Gamma("lam_sq", alpha=0.001, beta=0.001)
    sigma_sq = pm.InverseGamma("sigma_sq", alpha=0.001, beta=0.001)
    tau_sq = pm.Exponential("tau_sq", lam=(0.5 * lam_sq), shape=(X_train.shape[1], 1))
    betas = pm.Normal("betas", mu=0, tau=(1 / (sigma_sq * tau_sq)), shape=(X_train.shape[1], 1))

    # Linear predictor and Gaussian likelihood
    mu = tt.dot(X_train, betas)
    likelihood = pm.Normal("likelihood", mu=mu, tau=(1 / sigma_sq), observed=y_train)

    L1_blm_step = pm.NUTS()
    L1_blm_trace = pm.sample(10750, L1_blm_step, chains=2)

THANKS.

Your model isn’t really an L1 regularized regression. Something like this is probably closer to what you want (not tested):

import pymc3 as pm
import theano.tensor as tt

n_samples, n_features = X_train.shape
with pm.Model() as model:
    # Laplace prior on the coefficients is the Bayesian analogue of an L1 penalty
    beta_scale = pm.HalfNormal('beta_scale', sd=0.5)
    betas = pm.Laplace('beta', mu=0, b=beta_scale, shape=n_features)
    mu = tt.dot(X_train, betas)
    sd = pm.HalfNormal('sd', sd=2)
    pm.Normal('y', mu=mu, sd=sd, observed=y_train)

    trace = pm.sample(draws=1000, tune=1000, chains=4)

About the point estimates:

  • The MAP is the point in the posterior that has the highest logp value. It has nothing to do with the trace that pm.sample computes; instead, we get it through optimization of the logp function with a scipy optimizer. If you think of the posterior as an object (say a house), where the mass is the probability, then the MAP is the point in the object where the density of the material is highest (say a gold ring on someone’s finger). Most of the mass of the object might be somewhere completely different (e.g. the walls). Their density isn’t as high as that of the ring, but they are much larger. High-dimensional normal distributions look somewhat like that: the MAP is in the exact middle, but almost all of the mass is in a thin shell some distance from the center. Another case where this might give you strange results is a funnel (which you might have in your original model): in one direction the posterior logp gets higher and higher, but the space where the logp is high gets more and more narrow. The MAP might be at the very tip of that funnel.
  • The posterior median/mean: we use the posterior draws (the trace) for this. For each variable we compute the median or mean independently, and then just concatenate all those point estimates into one vector. The mean would be the balancing point of our object. For a normal posterior both coincide with the MAP. They might still be very far from the actual mass: just imagine a banana, where the balancing point might not be inside of it at all. (A short sketch comparing these estimates follows below.)
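
To make the difference concrete, here is a minimal sketch (assuming the Laplace model and trace above, not tested) that pulls the different kinds of point estimates out side by side:

import numpy as np

with model:
    # MAP: optimize the joint logp with a scipy optimizer, independent of the trace
    map_estimate = pm.find_MAP()

# Posterior mean/median: computed per variable from the draws
beta_mean = trace['beta'].mean(axis=0)
beta_median = np.median(trace['beta'], axis=0)

print(map_estimate['beta'], beta_mean, beta_median, sep='\n')

# plot_posterior's point estimates also come from the draws, not from find_MAP
pm.plot_posterior(trace, point_estimate='mode')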

Keep in mind that all three point estimates depend on the parametrization (often there are different but equivalent ways of writing the same model; these are called parametrizations), so they aren’t really properties of the posterior itself.

All those problems for posteriors that are not normal are one major reason most Bayesians aren’t all that fond of point estimates. :slight_smile:


Hi aseyboldt,

Thank you so much for your reply. I am trying to model the hierarchy presented on page 7 of the Park and Casella paper linked here: http://www.math.chalmers.se/Stat/Grundutb/GU/MSA220/S16/bayeslasso.pdf

The only difference is that instead of an improper prior on Sigma squared I am using a non-informative Inverse Gamma distribution.

Further down, in section 5.2, they discuss the use of a hyperprior on Lambda squared. The benefit of this fully Bayesian model is that, after marginalising Lambda squared out of the joint posterior, you have more information about Lambda squared than would be available if Lambda were chosen through Maximum Likelihood Estimation.

In the end I used sklearn’s KDE function over the posterior trace of Lambda squared to find the value of Lambda squared that gave the greatest logp. Maybe I’ll try the function available in the scipy package. Either way, I’ve been comparing the shrunk beta parameters from my Bayesian model to those produced by sklearn’s Lasso function. For the same value of Lambda, the Bayesian model doesn’t shrink the posterior beta MAP values by as much. I think I need to tweak my parameters.
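
For reference, the scipy version of that mode lookup would be something like this (a sketch, not exactly what I ran):

import numpy as np
from scipy import stats

# Gaussian KDE over the posterior draws of lam_sq
lam_sq_samples = L1_blm_trace["lam_sq"]
kde = stats.gaussian_kde(lam_sq_samples)

# Evaluate the density on a grid and take the argmax as the mode estimate
grid = np.linspace(lam_sq_samples.min(), lam_sq_samples.max(), 1000)
lam_sq_mode = grid[np.argmax(kde(grid))]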

Anyway, my knowledge of what I’m doing here is really only skin deep so thanks again for your input.

In the end I went with the following model and it produces some reasonable results.

import pymc3 as pm
import theano.tensor as tt

with pm.Model() as L1_blm:

    # Hyperprior on the lasso penalty and prior on the noise variance
    lam_sq = pm.Gamma("lam_sq", alpha=1, beta=1)
    sigma_sq = pm.HalfNormal("sigma_sq", tau=1 / 100)

    # Park & Casella hierarchy: exponential mixing scales, conditionally normal betas
    tau_sq = pm.Exponential("tau_sq", lam=(0.5 * lam_sq), shape=(X_train.shape[1], 1))
    betas = pm.Normal("betas", mu=0, tau=(1 / (sigma_sq * tau_sq)), shape=(X_train.shape[1], 1))

    mu = tt.dot(X_train, betas)

    likelihood = pm.Normal("likelihood", mu=mu, tau=(1 / sigma_sq), observed=y_train)

    L1_blm_step = pm.NUTS()
    L1_blm_trace = pm.sample(10750, L1_blm_step, chains=2)
    L1_blm_prior = pm.sample_prior_predictive()

I chose a=1 and b=1 for the parameters of the Gamma hyperprior on the lambda squared RV. Since the mean of a Gamma(a, b) distribution is a/b, this gives a prior expected value of 1, roughly ten times the Lambda value selected by grid search and Monte Carlo CV in the classical framework.
The only reasons the Inverse Gamma was chosen as the prior on Sigma squared in the paper mentioned in my last post were its conjugacy, which allows for efficient computation in a Gibbs sampler, and the fact that with a = b = 0 it reduces to the improper prior recommended by Andrews and Mallows (1974).
Because I’m using the NUTS sampler, I instead chose a flat Half-Normal on Sigma squared.
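
As a quick sanity check on the Gamma(1, 1) choice, the prior mean of lam_sq can be read straight off the prior predictive samples drawn above:

# Prior mean of lam_sq; for Gamma(alpha=1, beta=1) this should be close to alpha/beta = 1
print(L1_blm_prior["lam_sq"].mean())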

Covariate coefficients set to zero by the classical LASSO are not set to zero by the Bayesian LASSO, but their credible intervals do cross zero, indicating they would be worth dropping.
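
For anyone who wants to reproduce that check, something along these lines works (equal-tailed intervals computed straight from the draws; pm.summary() also reports credible intervals):

import numpy as np

# Posterior draws of the coefficients, squeezed from (n_draws, n_features, 1) to (n_draws, n_features)
beta_samples = np.squeeze(L1_blm_trace["betas"])

# Equal-tailed 95% credible interval for each coefficient
lower, upper = np.percentile(beta_samples, [2.5, 97.5], axis=0)

# Coefficients whose interval crosses zero are candidates for dropping
print((lower < 0) & (upper > 0))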

I have no intercept because my response (and covariates) are centred. I still had to use a KDE of the posterior traces to get MAP values; I think my hierarchy might be too complicated for the .find_MAP() function?

Anyway, hopefully this can be useful for someone one day.
