The limits of using the traceplot to update priors?

After the first version of my model has run, I’ve always felt a temptation to update all my variables’ priors to match the results of the pm.traceplot() graphs. Is this a valid way of improving one’s model?

A while back I read a quote from Gelman saying that trace plots can indeed be used to further inform one’s priors, but there must be some limit to this. Otherwise, why wouldn’t the default modelling process always be to

  1. run the model with vague priors, and
  2. if the traceplots show “stable enough” looking variable distributions, fit a kernel density (or similar) to each stable variable to obtain its mean and std, and feed those back into the corresponding priors?
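The two-step loop above can be sketched outside of any particular PPL. Everything below (the fake posterior draws, the variable names) is illustrative, not taken from a real model:

```python
import numpy as np
from scipy import stats

# Stand-in for the posterior samples of one variable, as pulled from a trace.
rng = np.random.default_rng(0)
posterior_samples = rng.normal(loc=2.0, scale=0.5, size=5000)

# Step 2a: summarize the stable-looking marginal posterior by its mean and std...
new_mu = posterior_samples.mean()
new_sigma = posterior_samples.std(ddof=1)

# ...or via a kernel density estimate, if the shape of the marginal matters.
kde = stats.gaussian_kde(posterior_samples)

# Step 2b: these summaries would then be fed back in as the prior for the
# next run, e.g. theta ~ Normal(new_mu, new_sigma) -- exactly the practice
# whose validity this question is asking about.
```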

I feel this is an amateurish question, but would someone please enlighten me on how to properly use traceplots (assuming everything has already converged correctly, so there are no problems there) to better one’s model?

It was recently debated on Twitter that updating priors (besides the mostly uninteresting conjugate cases) under the Bayesian framework is actually not trivial. Personally, I would approximate the posterior with some heavy-tailed distribution (say, a Student-t). Otherwise, Bayesian filters (Kalman filter, particle filter) sound like a promising framework that I would love to explore more.
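To make the heavy-tailed approximation concrete, here is a sketch (toy draws and hypothetical names, not from an actual model) that fits a Student-t to posterior samples with SciPy:

```python
import numpy as np
from scipy import stats

# Stand-in for posterior draws of one parameter pulled from a trace.
rng = np.random.default_rng(1)
draws = rng.standard_t(df=5, size=4000) * 0.3 + 1.0

# Fit a Student-t by maximum likelihood; scipy returns (df, loc, scale).
df_hat, loc_hat, scale_hat = stats.t.fit(draws)

# Using StudentT(df_hat, loc_hat, scale_hat) as the next prior keeps heavier
# tails than a plain normal mean/std summary would, hedging against
# over-confidence in the first fit.
```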

Bayesian filters sound interesting; I will look into them.

For updating one’s priors based only on model results, there must be some best practice that doesn’t take statistical shortcuts and maintains rigor.

I’m naively imagining that best practice to simply be: start from the vaguest priors, and use subsequent traceplot information only to the minimum amount necessary until the model fully converges and displays no serious errors.

But, again, I have no clue, so I will try to research this important question.

Do you have a link to this discussion? This sounded to me like “straightforward” Empirical Bayes, so I’m surprised that there’s much more to it.


I just found a useful example in the docs that matches the idea I had in the first post.

However, very little justification is given for this procedure, or information on where it can go wrong.

That example more closely matches online learning, where an additional (say, kth) set of data comes in and you’d like to update the estimates you have after sets 1, …, k-1. Whereas you’re more interested in making your priors “better” without new data.
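For the online-learning reading, the conjugate normal case makes the mechanics explicit. This is a toy sketch with made-up numbers: feeding each batch’s posterior in as the next batch’s prior gives the same answer as fitting all the data at once, which is what makes sequential updating legitimate in the conjugate setting:

```python
import numpy as np

def update(mu0, tau0_sq, y, sigma_sq):
    """Conjugate normal update: prior N(mu0, tau0_sq), data y ~ N(theta, sigma_sq)."""
    prec = 1.0 / tau0_sq + len(y) / sigma_sq
    mu = (mu0 / tau0_sq + y.sum() / sigma_sq) / prec
    return mu, 1.0 / prec

rng = np.random.default_rng(2)
sigma_sq = 1.0
batches = [rng.normal(0.7, 1.0, size=50) for _ in range(3)]

# Sequential: the posterior after batch k serves as the prior for batch k+1.
mu, tau_sq = 0.0, 10.0
for y in batches:
    mu, tau_sq = update(mu, tau_sq, y, sigma_sq)

# All-at-once on the pooled data yields the identical posterior.
mu_full, tau_full_sq = update(0.0, 10.0, np.concatenate(batches), sigma_sq)
```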

Empirical Bayes is a simple way of doing this, and there are approaches for conjugate, non-conjugate, nonparametric (&c.) settings. In the simplest case, you have some prior \pi(\theta|\xi) which is parameterized by \xi (so N(0, \sigma^2) would be \xi=\sigma^2). You’d assign a hyperprior p(\xi) to \xi (e.g. \sigma^2 \sim \mathrm{HalfNormal}(5.)), generating a full prior \pi(\theta)=\int\pi(\theta|\xi)p(\xi)\,d\xi.

Empirical Bayes (in its simplest version) seeks a point estimate \hat \xi = \mathrm{argmax}_\xi\{\int P(X|\theta)\pi(\theta|\xi)\,d\theta \; p(\xi)\}; i.e. the maximum marginal likelihood (up to the hyperprior term p(\xi)). You could read this off the trace, approximately, as the mean of the sampled \sigma^2.
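As a sketch of the marginal-likelihood idea in a case where it is closed-form (a toy normal-means model of my own, not from this thread): with \theta_j \sim N(0, \tau^2) and y_j \sim N(\theta_j, \sigma^2), \sigma^2 known, the \theta_j integrate out to give y_j \sim N(0, \tau^2 + \sigma^2) marginally, so the marginal MLE of the hyperparameter is \hat\tau^2 = \max(0, \overline{y^2} - \sigma^2):

```python
import numpy as np

rng = np.random.default_rng(3)
sigma2 = 1.0       # known observation variance
tau2_true = 4.0    # hyperparameter xi we want to recover

# Simulate the hierarchy: group means theta_j, one observation y_j per group.
theta = rng.normal(0.0, np.sqrt(tau2_true), size=5000)
y = rng.normal(theta, np.sqrt(sigma2))

# Marginally y_j ~ N(0, tau2 + sigma2), so the maximum marginal likelihood
# estimate of tau2 is mean(y^2) - sigma2, floored at zero.
tau2_hat = max(0.0, np.mean(y**2) - sigma2)

# Empirical Bayes then fixes the prior for the theta_j at N(0, tau2_hat):
# only the hyperparameter was estimated; theta itself was integrated out.
```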

Importantly, the only parameters that update are the hyperparameters, which (unless you over-parameterized your prior) protects you from doing MAP estimation \hat \theta = \mathrm{argmax}\{P(X|\theta)\pi(\theta)\}.

(For example, using \pi(\theta|\xi) = \mathcal{N}(\theta| \mu_\xi, \sigma^2_\xi) with both hyperparameters free will ultimately set \mu_\xi to the MAP \theta, and \sigma^2_\xi to 0.)

When I was doing my own research, Empirical Bayes only seemed to me like

  1. calculating the statistics of interest from your data (for example, if you’re estimating a true mean across sets of observations, just calculating the global mean from combining each set), and
  2. throwing those into the respective priors.

This seemed like serious double-counting at first, but more reasonable upon further thought. I think I will have to dig into the philosophy of priors and more of your post’s math. Thanks for the thoughtful reply!


You should be aware that Empirical Bayes is not the same as continuously updating priors.
Also, one difficulty of Empirical Bayes is that for some parameters it is hard to compute the (sufficient) statistics to use as a prior; for example, the hierarchical prior in a random-effects model.