There are a couple of cases that work trivially out of the box.
Variational approximations work just fine, since at each presentation of a new dataset we have
\pi_{k+1}(\theta) = \mathrm{argmin}_{q} \, \mathrm{KL}[q(\theta) \,||\, P(D_{k+1}|\theta)\pi_k(\theta)]
so the information propagates forward. The drawback is that none of the \pi_i will converge to the true posterior, since each projection onto the variational family discards information and the error accumulates across updates.
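Here is a minimal numerical sketch of that recursion, assuming a one-dimensional Gaussian variational family and a Gaussian likelihood; everything here (the toy model, `fit_batch`, the Nelder-Mead optimiser) is illustrative rather than any particular library's API:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
eps = rng.standard_normal(256)                       # fixed base samples for a reparameterized ELBO

def elbo(params, data, prior_m, prior_s):
    m, log_s = params
    s = np.exp(log_s)
    theta = m + s * eps                              # samples from q(theta) = N(m, s^2)
    loglik = norm.logpdf(data[:, None], loc=theta, scale=1.0).sum(axis=0)
    logprior = norm.logpdf(theta, loc=prior_m, scale=prior_s)
    entropy = log_s + 0.5 * np.log(2 * np.pi * np.e)  # entropy of N(m, s^2)
    return (loglik + logprior).mean() + entropy

def fit_batch(data, prior_m, prior_s):
    """Minimise KL[q || P(D_{k+1}|theta) pi_k(theta)] over (m, log s)."""
    res = minimize(lambda p: -elbo(p, data, prior_m, prior_s),
                   x0=np.array([prior_m, np.log(prior_s)]),
                   method="Nelder-Mead")
    m, log_s = res.x
    return m, np.exp(log_s)

# pi_k becomes the prior for batch k+1
m, s = 0.0, 10.0                                     # pi_0(theta) = N(0, 10^2)
for batch in np.split(rng.normal(3.0, 1.0, size=300), 3):
    m, s = fit_batch(batch, m, s)
print(m, s)                                          # should end up near (3, 1/sqrt(300))
```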
Resampling methods also work out of the box, since the weights update in the same cumulative way:
w_i^{(k+1)} = w_i^{(k)}\,P(D_{k+1}|\theta_i)
These are woefully, woefully inefficient: without resampling and rejuvenation the weights degenerate quickly and most particles end up carrying negligible mass.
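A rough sketch of the cumulative weight update, with an effective-sample-size check and multinomial resampling bolted on (the toy model and the 0.5·N threshold are illustrative assumptions):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
particles = rng.normal(0.0, 10.0, size=5000)         # theta_i ~ pi_0
logw = np.zeros_like(particles)                      # log w_i^(0)

def loglik(batch, theta):
    return norm.logpdf(batch[:, None], loc=theta, scale=1.0).sum(axis=0)

for batch in np.split(rng.normal(3.0, 1.0, size=300), 3):
    logw += loglik(batch, particles)                 # w_i^(k+1) = w_i^(k) * P(D_{k+1}|theta_i)
    w = np.exp(logw - logw.max()); w /= w.sum()
    ess = 1.0 / np.sum(w ** 2)                       # effective sample size
    if ess < 0.5 * len(particles):                   # resample when the weights degenerate
        idx = rng.choice(len(particles), size=len(particles), p=w)
        particles, logw = particles[idx], np.zeros_like(logw)

w = np.exp(logw - logw.max()); w /= w.sum()
print(np.average(particles, weights=w))              # approximate posterior mean
```

Even with resampling, there is no move step here, so the particle set only ever thins out; that is the inefficiency in practice.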
Particle filters (SMC) are typically used with ordered data, where t indexes the observations and P(y=y_t|x=x_t) = g(y_t|x_t) (see http://www.irisa.fr/aspi/legland/ref/cappe07a.pdf, particularly figure 1 for the setting and figure 5 for the SMC approach). The annealing approach in pymc3 takes p(y_t|x_t) = p(D|\theta^{(t)})^{\beta(t)}, where \beta(t) is some increasing function \mathbb{N}\rightarrow[0, 1] and D is fixed.
Technically this approach generates an ensemble of \theta, drawn from the distribution
\pi_0(\theta_0)\prod_{i=1}^t P(D|\theta_i)^{\beta(i)}
and associates a separate \theta with every iteration. Trivially indexing the (online) dataset, as in
\pi_0(\theta_0)\prod_{i=1}^t P(D_i|\theta_i)^{1},
would then associate a \theta_t with each dataset. This only represents the true joint likelihood when \theta_0 = \theta_1 = \dots = \theta_t, so this approach cannot provide the posterior p(\theta|D_1, \dots, D_t).
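For concreteness, here is a rough sketch of the tempering scheme described above on a single fixed dataset D, with an illustrative linear \beta schedule and without the MCMC move step a real SMC sampler would apply:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
D = rng.normal(3.0, 1.0, size=200)                   # one fixed dataset
particles = rng.normal(0.0, 10.0, size=4000)         # theta ~ pi_0
loglik = norm.logpdf(D[:, None], loc=particles, scale=1.0).sum(axis=0)

betas = np.linspace(0.0, 1.0, 11)                    # beta(t): 0 -> 1
for b_prev, b_next in zip(betas[:-1], betas[1:]):
    logw = (b_next - b_prev) * loglik                # incremental weight p(D|theta)^(delta beta)
    w = np.exp(logw - logw.max()); w /= w.sum()
    idx = rng.choice(len(particles), size=len(particles), p=w)
    particles, loglik = particles[idx], loglik[idx]
    # a real SMC sampler would apply an MCMC move step here to rejuvenate the particles

print(particles.mean())                              # approximate posterior mean under the fixed D
```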
I think the closest setting to the online updating of posteriors is parallel Bayesian computation, where a central aggregator is given access to samples (and likelihoods)
\{\theta_t, P(\theta_t|\mathcal{D}_t)\}_{t=1}^T
for T batches of data. These are termed “subposteriors” in the literature, and there are various methods of combining them into a consensus (a rough sketch of one such combination follows the list below).
Even in this setting one needs one or more of the following:
- Smooth approximation of posterior, given samples
- Large sample size for resampling
- Black-box access to every P(D_i|\theta) (not just the most recent)
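As a heavily simplified illustration of the consensus step, here is a sketch that pools subposterior samples by fitting a Gaussian to each and multiplying them (precision-weighted averaging); it assumes each subposterior carries only its share of the prior and that a Gaussian is adequate, which is exactly the “smooth approximation” requirement above:

```python
import numpy as np

def combine_gaussian(subposteriors):
    """Pool subposterior samples assuming each is roughly Gaussian."""
    means = np.array([s.mean(axis=0) for s in subposteriors])
    precs = np.array([1.0 / s.var(axis=0) for s in subposteriors])
    prec = precs.sum(axis=0)
    mean = (precs * means).sum(axis=0) / prec
    return mean, 1.0 / prec                          # pooled mean and variance

rng = np.random.default_rng(3)
# four batches' worth of (hypothetical) subposterior samples over one parameter
subs = [rng.normal(3.0 + 0.1 * t, 0.5, size=(2000, 1)) for t in range(4)]
print(combine_gaussian(subs))
```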
My guess is that if you’re willing to store both P(\theta|D) and \nabla P(\theta|D) (and possibly higher-order derivatives), one could do a bit better. But this is just a guess…
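For what it’s worth, a rough sketch of what that could look like is a recursive Laplace approximation: carry forward the posterior mode and the Hessian of the log posterior (i.e., stored second derivatives) and use that quadratic approximation as the prior for the next batch. The model and function names below are hypothetical:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_lik(theta, batch):
    # Gaussian likelihood with unit noise, purely illustrative
    return 0.5 * np.sum((batch - theta) ** 2)

def laplace_update(batch, prior_mode, prior_hess):
    """One online step: new log posterior = log lik + quadratic expansion of the old one."""
    def neg_log_post(theta):
        d = theta - prior_mode
        return neg_log_lik(theta, batch) + 0.5 * d @ prior_hess @ d
    res = minimize(neg_log_post, x0=prior_mode)
    # Hessian of the negative log posterior at the new mode:
    # quadratic prior term plus likelihood curvature (here n * I for the unit-noise Gaussian)
    new_hess = prior_hess + len(batch) * np.eye(len(prior_mode))
    return res.x, new_hess

rng = np.random.default_rng(4)
mode, hess = np.zeros(1), 0.01 * np.eye(1)           # vague pi_0
for batch in np.split(rng.normal(3.0, 1.0, size=300), 3):
    mode, hess = laplace_update(batch, mode, hess)
print(mode, np.linalg.inv(hess))                     # approximate posterior mean and covariance
```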