I’m looking for discussion of how to use an MCMC-based model in what you might call “online” or “closed-loop” mode. That is, taking a trained model, using it to make decisions, and taking information gathered as a result of those decisions to further train the model.

In old school statistical methods, we would have to use conjugate distributions and then we could just update them based on sufficient statistics.
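For concreteness, here is what that classical bookkeeping looks like on a made-up Beta–Bernoulli example (pure Python, no library assumed) — the whole posterior update is just two running counts:

```python
# Conjugate updating via sufficient statistics: a Beta prior on a
# Bernoulli success probability only needs running counts of
# successes and failures -- no refitting from scratch.

def update_beta(alpha, beta, new_data):
    """Fold a new batch of 0/1 observations into Beta(alpha, beta)."""
    successes = sum(new_data)
    failures = len(new_data) - successes
    return alpha + successes, beta + failures

# start from a Beta(1, 1) (uniform) prior and absorb two tranches
a, b = 1, 1
a, b = update_beta(a, b, [1, 1, 0, 1])   # first batch
a, b = update_beta(a, b, [0, 0, 1])      # second batch
posterior_mean = a / (a + b)             # Beta(5, 4) -> 5/9
```

The point of the question is that nothing this cheap exists once the model leaves conjugate territory.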

But when we use MCMC our models are often more complex and non-conjugate. So how do we continuously train them? Retraining them from the priors with the full training set and any additional observations is obviously quite expensive in the general case.

I’m sure there is a literature about this aspect of practice, but I don’t have any idea what jargon words I should be looking for, so I’m not finding anything.

Could anyone point me in the right direction?

See @junpenglao’s relevant commentary here:

or here:

The fundamental issue is that MCMC doesn’t produce posterior distributions, but instead samples from posterior distributions. These can’t be directly “integrated out” as priors for new data – but *approximate* posteriors can be fit to the last model’s posterior samples, and used as the next model’s priors.
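A minimal numpy sketch of that "fit an approximate posterior to the samples" idea, using a histogram-based piecewise-linear density (the same trick PyMC3's `Interpolated` distribution uses, simplified; the samples here are synthetic stand-ins for a real trace):

```python
import numpy as np

rng = np.random.default_rng(0)
old_samples = rng.normal(1.5, 0.5, size=5000)  # stand-in for an MCMC trace

# Fit a piecewise-linear density to the samples; this plays the role of
# the approximate posterior that becomes the next model's prior.
counts, edges = np.histogram(old_samples, bins=50, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])

def approx_prior_logp(theta):
    """Log-density of the fitted approximation, usable as the next prior."""
    dens = np.interp(theta, centers, counts, left=1e-12, right=1e-12)
    return np.log(dens)

# sanity check: the fitted density integrates to 1 over its support
area = (counts * np.diff(edges)).sum()
```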

Thanks for the pointers. This link presents a simple introduction to the issue: https://docs.pymc.io/notebooks/updating_priors.html

I would have expected that there was some discussion of this in the literature, though, since it seems like an important problem for practical application of MCMC and Bayesian methods.

So far, I have just been pretending that my distributions are conjugate, and updating the priors from the means and standard deviations, but I know that’s not right in general.
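As a sketch of exactly where that moment-matching shortcut loses information (a made-up correlated two-parameter trace, numpy only):

```python
import numpy as np

rng = np.random.default_rng(1)
# fake a correlated 2-parameter posterior trace (e.g. slope and intercept)
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
trace = rng.multivariate_normal([0.0, 0.0], cov, size=10000)

# moment matching: summarize each marginal by its mean and sd ...
means, sds = trace.mean(axis=0), trace.std(axis=0)

# ... but independent Normal(mean, sd) priors drop the off-diagonal
# structure entirely:
rho = np.corrcoef(trace.T)[0, 1]   # close to 0.8, and lost by the update
```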

I’m also not familiar with the literature on the subject, but I would like to ask for input from the people here who work on Sequential Monte Carlo (SMC) (for example @aloctavodia, @junpenglao). The basic idea of SMC is to approximate the distribution with an ensemble of particles in parameter space. The positions of the particles are updated through a sequence of steps that rely on the unnormalized posterior density.

- Could a trace sampled with NUTS be used as a hot start for the SMC ensemble?
- Could the update steps take advantage of HMC for the ensemble of particles?
- If both of these were true, could this be used as a form of “update” of the posterior given new information, as what is done by particle filters?

A Sequential Monte Carlo sampler is a particle filter; in fact, AFAIK it originated as a method to solve that kind of problem. But I am not really familiar with particle filters — I have always used SMC in the context of approximating a *static* distribution.

The SMC sampler uses an ensemble of particles, that’s right. But it also uses annealing: the version implemented in PyMC3 performs annealing by increasing beta (the inverse temperature). You start from the prior distribution (beta=0) and move through a series of intermediate distributions (stages) until you reach the posterior (beta=1). You can also anneal the data (in fact I have been thinking about implementing this; I guess it should be faster). In principle you can initialize an SMC sampler from any distribution you want; it just happens that the prior is a very convenient one. One of the selling points of SMC is its ability to sample from multimodal distributions. Starting from the prior makes it easier to avoid getting trapped in local modes. Starting from the prior also makes it possible to estimate the marginal likelihood (for example to compute Bayes factors).

Each particle at a particular stage is perturbed using a *kernel* method. Generally the kernel is an MCMC method, and generally that is Metropolis, but other kernels are allowed, including HMC or NUTS (we have discussed in the past having HMC/NUTS as an option). I think the recent changes to the SMC code make it easier to add other kernel methods.
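If it helps to see the moving parts together, here is a rough numpy-only sketch of that tempering scheme on a toy conjugate model (fixed beta schedule, multinomial resampling, random-walk Metropolis kernel; the step size, schedule, and model are all arbitrary illustrative choices, not what PyMC3 actually does internally):

```python
import numpy as np

rng = np.random.default_rng(42)

# toy model: theta ~ N(0, 1) prior, data ~ N(theta, 1)
data = rng.normal(2.0, 1.0, size=100)

def loglike(theta):
    # summed Gaussian log-likelihood (constants dropped), vectorized over particles
    return -0.5 * ((data[None, :] - theta[:, None]) ** 2).sum(axis=1)

n = 2000
particles = rng.normal(0.0, 1.0, size=n)      # start from the prior (beta = 0)
betas = np.linspace(0.0, 1.0, 11)             # fixed tempering schedule

for b_old, b_new in zip(betas[:-1], betas[1:]):
    # reweight by the incremental power of the likelihood
    logw = (b_new - b_old) * loglike(particles)
    w = np.exp(logw - logw.max())
    w /= w.sum()
    # resample to equalize weights
    particles = particles[rng.choice(n, size=n, p=w)]
    # Metropolis kernel targeting the current tempered posterior
    for _ in range(5):
        prop = particles + rng.normal(0.0, 0.2, size=n)
        logr = (b_new * loglike(prop) - 0.5 * prop ** 2) - (
            b_new * loglike(particles) - 0.5 * particles ** 2)
        accept = np.log(rng.random(n)) < logr
        particles = np.where(accept, prop, particles)

smc_mean = particles.mean()  # should be near the analytic posterior mean
```

For this conjugate toy model the exact posterior mean is sum(data)/(n_data + 1), so the ensemble can be checked directly.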

So if I understand correctly, at each stage of the modeling, we would have a population of particles, and after the first iteration (first set of observations), each element of the population would be a trained PyMC3 model (i.e., a MultiTrace), and we would evolve the particles by posterior sampling?

I’m not sure how that would work, since PyMC3 will only sample tuples of latent values that are already seen in the trace (if I understand how PyMC3 does posterior sampling correctly). That seems like it would be very likely to fail if used repeatedly. So likely I’m not fully following the concept.

There are a couple of cases that work trivially out of the box.

Variational approximations work just fine, since at each presentation of a new dataset we have

\pi_{k+1}(\theta) = \mathrm{argmin}_{q} \mathrm{KL}[q(\theta) || P(D_{k+1}|\theta)\pi_k(\theta)]

so the information propagates. The drawback is that none of the \pi_i will converge to the true posterior.

Resampling methods also work out of the box, since the weights update in the same cumulative way:

w_i^{(k+1)} = w_i^{(k)}P[D_{k+1}|\theta_i]

These are woefully, woefully inefficient.
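To make the inefficiency concrete, a numpy sketch of that cumulative reweighting (done in log space), tracking the effective sample size as tranches arrive; the toy Gaussian model and batch sizes are made up:

```python
import numpy as np

rng = np.random.default_rng(7)

# particles drawn once from the prior N(0, 1); never moved afterwards
particles = rng.normal(0.0, 1.0, size=5000)
logw = np.zeros_like(particles)

def batch_loglike(theta, batch):
    # summed Gaussian log-likelihood for one tranche, vectorized over particles
    return -0.5 * ((batch[None, :] - theta[:, None]) ** 2).sum(axis=1)

ess_history = []
for k in range(5):
    batch = rng.normal(1.0, 1.0, size=20)      # tranche D_{k+1}
    logw += batch_loglike(particles, batch)    # w_i <- w_i * P[D_{k+1}|theta_i]
    w = np.exp(logw - logw.max())
    w /= w.sum()
    ess_history.append(1.0 / np.sum(w ** 2))   # effective sample size

# the ESS collapses as batches accumulate -- hence "woefully inefficient"
```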

Particle filters (SMC) are typically used on ordered data, where t indexes the observations and P(y=y_t|x=x_t) = g(y_t|x_t) (see http://www.irisa.fr/aspi/legland/ref/cappe07a.pdf, particularly figure 1 for the setting and figure 5 for the SMC approach). The annealing approach in pymc3 takes p(y_t|x_t) = p(D|\theta^{(t)})^{\beta(t)}, where \beta(t) is some increasing function \mathbb{N}\rightarrow[0, 1] and D is fixed.

Technically this approach generates an ensemble of \theta, drawn from the distribution

\pi_0(\theta_0)\prod_{i=1}^t P(D|\theta_i)^{\beta(i)}

and associates a separate \theta with every iteration. Trivially indexing the (online) dataset, as in

\pi_0(\theta_0)\prod_{i=1}^t P(D_i|\theta_i)^{1}

would then associate a \theta_i with each dataset. This only represents the true joint likelihood when \theta_0 = \theta_1 = \dots = \theta_t, so this approach cannot provide the posterior p(\theta|D_1, \dots, D_t).

I think the closest setting to the online updating of posteriors is parallel Bayesian computation; where a central aggregator is given access to samples (and likelihoods)

\{\theta_t, P(\theta_t|\mathcal{D}_t)\}_{t=1}^T

for T batches of data. These are termed “subposteriors” in the literature, and there are various methods of combining them into a consensus, see

Even in this setting one needs one or more of the following:

- Smooth approximation of posterior, given samples
- Large sample size for resampling
- Black-box access to every P(D_i|\theta) (not just the most recent)
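As one concrete (and hedged) instance of such a combination rule, here is a consensus-Monte-Carlo-style precision-weighted average of subposterior draws, on a toy conjugate Gaussian model where each worker's subposterior can be sampled exactly; the fractionated prior and all numbers are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

# full dataset, split into T batches processed by separate workers
data = rng.normal(2.0, 1.0, size=300)
batches = np.split(data, 3)

# each worker draws from its subposterior; the model is conjugate
# (N(0, 10^2) prior fractionated across workers, known unit variance),
# so subposterior draws can be generated exactly rather than by MCMC
T, n_draws = len(batches), 5000
sub_draws = []
for batch in batches:
    prior_prec = (1.0 / 10.0 ** 2) / T          # fractionated prior
    post_prec = prior_prec + batch.size
    post_mean = batch.sum() / post_prec
    sub_draws.append(rng.normal(post_mean, post_prec ** -0.5, n_draws))
sub_draws = np.array(sub_draws)                  # shape (T, n_draws)

# consensus step: precision-weighted average of aligned draws
precs = 1.0 / sub_draws.var(axis=1, keepdims=True)
consensus = (precs * sub_draws).sum(axis=0) / precs.sum()
```

For Gaussian subposteriors this weighted average recovers the full-data posterior; outside the Gaussian case it is only an approximation.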

My guess is, if you’re willing to store both P(\theta|D) as well as \nabla P(\theta | D) (and possibly higher-order derivatives) that one could do a bit better. But this is just a guess…

Thank you very much for the in depth and helpful response. I look forward to reading the papers.

@chartl and @junpenglao – I looked over the notebook about Sequential Monte Carlo, and it was very helpful, but it left me with a new question:

The discussion in the notebook is all about sampling the exact same model repeatedly with restarts. But the issue for me, one of those discussed in the “Scaling Bayes” paper, is that I get data in multiple tranches over time.

Conceptually, at least, this is not an issue, because the samples are i.i.d. (well, there’s a complication, but let’s ignore that for now).

However, there’s a programming complication, which is that PyMC3 models have their observations “baked in,” so I can’t just train the model on subset one of the observations, and then move on to subset 2 – for that I would have to rebuild the model, and then the old traces wouldn’t apply to the new model.

My guess is that it might be possible to effectively rip out the observations from the trace for subset 1, build a new model that is the same except for a new observed node, and restart. But I don’t believe the API for traces would easily support this.

Alternatively, could we build a new model and use the end point of the old chains as start points for new chains, and then merge the traces, after removing the observed node?

P.S. Things are a little more complicated, in my actual case, because I have a mixture of Gaussians model with multiple conditioning variables, so the model would need more extensive surgery. But if I can’t figure out how to answer the above questions, I can’t begin to worry about this complication.

Would something like this work for you?

It was your initial question which prompted me to play around with it a bit; and it’s the closest I could come to constraining a new trace by an old trace.

I’ll have a look: thank you very much. Sorry not to have responded more promptly – for some reason I didn’t get the email.

This is a long-delayed response. Looking at this, the critical flaw seems to be that it assumes that the variables to be updated – alpha, beta0, and beta1 – are independent.

This seems like an extremely strong assumption to bring to sequential Monte Carlo, and one that in general will not be justified.

In the case of PyMC3 this limitation seems to be baked into the `Interpolated` class, because it can only interpolate a univariate distribution, but I suppose that one might be able to use a `DensityDist` or something to model multivariate interpolated distributions.

I’ll be looking into this, but if you have any further suggestions, they would be welcome.

> Looking at this, the critical flaw seems to be that it assumes that the variables to be updated – alpha, beta0, and beta1 – are independent.

I’m not sure we’re looking at the same thing. This block:

```
import numpy as np
import pymc3 as pm
import theano.tensor as tt

# `transform_variables` (a helper defined in the linked notebook) maps each
# variable's posterior samples to z-scores via its marginal CDF
blocks, mars, zscores = transform_variables(trace, varnames)
# compute the correlation of the transformed variables ...
pcor = np.corrcoef(zscores.T)
print(pcor)
L_p = np.linalg.cholesky(pcor)
# ... and, inside the new model, use its Cholesky factor to build a
# Gaussian copula over the latent z-scores
pr_lat_z = pm.Normal('pr_copula_z__', 0., 1., shape=(mars.shape[1],))
pr_lat = pm.Deterministic('pr_copula__', tt.dot(L_p, pr_lat_z))
```

explicitly models the correlation of parameters from the traces.

You are right – I clicked the wrong link and got into an earlier stage of the discussion. Sorry: this is exactly the question I’m looking into.

To add a word of caution:

I’d only work with online learning like that if I really had to. If you *can* just rerun the whole model, even if it takes a bit more time, just do that. Especially if you keep updating posteriors in a long chain, I’m not sure the final result would be close to the actual posterior, even if all the individual approximations are reasonable. With each step you introduce an error, and it is not obvious how those errors interact. Small individual errors might add up to something gigantic – or at least I don’t know of any results that would tell me that they don’t.