# Marginalization of variables in an incremental update setup

Hi all,

I’m very new to the PyMC3 community and also inexperienced with Bayesian modeling, so apologies in advance for any blunders in this post. I’m trying to implement a hierarchical Bayesian model and am struggling to figure out how to calculate the marginal probability of some discrete or continuous parameters in order to update them incrementally.

More specifically, this is how I’m doing the inference:

\begin{equation} P(l, t_{k}, \theta \, | \, D_{k}) \propto P(l) \, P(t_{k} \, | \, \theta) \, P(\theta) \, \sum_{i=1}^{len(D_{k})} P(D_{ki} \, | \, l, t_{k}) \end{equation}

where:

• D_{k} is the set of data associated with submodel k
• \theta is a continuous variable, the over-hypothesis about the distribution of t_{k} across multiple submodels
• l is a discrete variable with about 1k possible values
• t_{k} is a discrete variable with 4 possible values

And the following distributions:

\theta \sim Dirichlet(a), \, \, \, len(a)=3, \, a \, fixed
t_{k} \sim Categorical(\theta)
l \sim Categorical(p), \, \, \, len(p)=1024, \,p \, fixed
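For concreteness, here is a small numpy sketch of the generative side of the model, with placeholder values for a and p (the real ones are fixed but unspecified here); I'm assuming len(a) matches the 4 categories of t_k so that Categorical(\theta) is well defined:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sizes from the description: l has 1024 possible values, t_k has 4.
n_l, n_t = 1024, 4

# Fixed hyperparameters (placeholder values; a and p are fixed in the model).
# Assumption: len(a) equals the number of t categories so that
# t_k ~ Categorical(theta) is well defined.
a = np.ones(n_t)
p = np.full(n_l, 1.0 / n_l)

# One prior draw for N submodels.
N = 10
theta = rng.dirichlet(a)              # over-hypothesis shared across submodels
t = rng.choice(n_t, size=N, p=theta)  # one t_k per submodel
l = rng.choice(n_l, p=p)              # the global discrete variable
```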

I have no problem calculating my posterior P(l, t_{1}, \theta \, | \, D_{1}), but after fitting a small number of D_{1} observations (about 100) I need to be able to update P(l) and P(\theta) in some way, so I can then replace them in the inference above and fit a new set of observations D_{2} for submodel 2. This needs to be repeated for about 10 submodels, and I’m particularly interested in seeing how the marginal posteriors over l and t change after fitting each submodel.

Generally, after fitting N submodels, I should be able to estimate these two distributions:

P(l \, | \, D) = \sum_{t} \int_{\theta} P(l, t, \theta \, | \, D) \, d\theta
P(\theta \, | \, D) = \sum_{l, t} P(l, t, \theta \, | \, D)
D = \cup_{k=1}^{N} D_{k}
t = t_{1} \times t_{2} \times ... \times t_{N}
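Since l and t are discrete (and \theta could be put on a grid for illustration), the marginals above are just sums over axes of the joint. A toy numpy sketch, assuming the joint posterior were available as a normalized array:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy joint over (l, t, theta-grid); small sizes purely for illustration.
n_l, n_t, n_grid = 8, 4, 5
joint = rng.random((n_l, n_t, n_grid))
joint /= joint.sum()                  # normalize to a proper joint

# P(l | D): sum out t and the theta grid.
p_l = joint.sum(axis=(1, 2))

# P(theta | D) on the grid: sum out l and t.
p_theta = joint.sum(axis=(0, 1))
```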

However, I’m not really sure how to approach these two things:

1. My first issue would be being able to sample from these two distributions in the first place. I read the post here, but I’m not sure how the solution would be best adapted to my situation, since some of my prior parameters (t and \theta) are not conditionally independent. I also have significantly less data to fit.

2. Using this approach, even if I were able to sample from the two distributions, I would still need to estimate them so I could incorporate information from every submodel into the global model of \theta. However, I’m not sure KDE would provide sufficiently accurate estimates. Since each set of data D_{k} is no larger than about 100 points, I would prefer to refit all previous observations for every new submodel k and avoid incremental updates entirely, but I am not sure whether that is possible.

Thank you very much for reading this! Any help or tips would be extremely appreciated.

(1) Your ability to marginalize depends on the structure of P(D|l, t). The general idea would be to replace l by a Dirichlet, introduce a parameter \alpha to specify its shape, and use a surrogate \tilde P(D_k|t_k, \alpha) = \int P(D_k | l, t_k)\,\mathrm{Dir}(l;\alpha)\,dl. Importantly, this integral would need a closed-form expression. However, given the dimensionality of \alpha, I would expect the posterior sampling to be fiddly at best. Do you need the full joint distribution, or can you approximate with marginals like P(D|l,t_k) \approx \prod_j P(D[:, j] | l_j, t_k)? (You would need to re-normalize the posterior means of l_j.)

(2)

P^{(\omega)}(l, t_k, \theta | D_k) \propto P(l)P(t_\omega|\theta)P(\theta)\sum P(D_{k\omega}|l,t_\omega)

Why not define a sequence of likelihoods

Q^{k}(l, t_1, \dots, t_k, \theta | D_1, \dots, D_k) = \prod_{\omega=1}^k P^{(\omega)}(l, t_\omega, \theta | D_\omega)

The posterior of \theta under Q^k would be the exact posterior of the “sequential update” over the first k submodels.
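Concretely, evaluating such a joint in log space could look like the sketch below, with a placeholder `log_lik` standing in for your P(D_\omega|l, t_\omega) and toy sizes; note the shared priors on l and \theta enter once, while the per-submodel terms repeat:

```python
import numpy as np
from math import lgamma

n_l, n_t = 16, 4                        # toy sizes; the real l has 1024 values
p_l = np.full(n_l, 1.0 / n_l)           # fixed prior P(l)
a = np.ones(n_t)                        # fixed Dirichlet concentration a

def log_dirichlet(theta, a):
    # log of the Dirichlet(a) density evaluated at theta
    return (lgamma(a.sum()) - sum(lgamma(ai) for ai in a)
            + np.sum((a - 1.0) * np.log(theta)))

def log_lik(D_omega, l, t_omega):
    # Placeholder for log P(D_omega | l, t_omega); an assumption standing
    # in for the model's real set-based likelihood.
    return -0.5 * len(D_omega)

def log_Q(l, t, theta, data_sets):
    """Unnormalized log joint over all k submodels: the shared priors on
    l and theta appear once, the per-submodel terms once per omega."""
    logp = np.log(p_l[l]) + log_dirichlet(theta, a)
    for omega, D_omega in enumerate(data_sets):
        logp += np.log(theta[t[omega]]) + log_lik(D_omega, l, t[omega])
    return logp
```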


Thanks a lot for the tips!

(2)

That makes sense; it did seem like I was over-complicating things by trying to update the parameters directly. Just to check: is the first formula supposed to represent the inference under submodel \omega? I’m not sure why k and \omega get mixed in P^{(\omega)}(l,t_k,\theta|D_k), or what D_{k\omega} represents. I used \sum_{i} P(D_{ki}|l,t_k) above to mean summing over all datapoints D_{ki} \in D_{k}. Was it supposed to be this instead:

P^{(\omega)}(l,t_\omega,\theta|D_\omega)\propto P(l) P(t_\omega | \theta) P(\theta) P(D_\omega | l, t_\omega)

(1)

I’m not exactly sure what approximating with marginals would entail. What I am interested in explicitly calculating is just P(l | D_1, \dots, D_k) and P(t | D_1, \dots, D_k) after each k.

If this is of any use, this is roughly what the likelihood looks like:

\begin{equation} P((m,u)=D_{ki} | l, t_k) \propto \sum_{c_i \in C(t_k)}^{} \frac{1}{|C(t_k)|} \, L(m \, | \, u, c_i, l) \end{equation}
\begin{equation} L(m \, | \, u, c_i, l) = \begin{cases} g(u,l,c_i), & f(m, u, l) > 1 \\ h(u,l,c_i), & f(m, u, l) < 0 \end{cases} \end{equation}

But I think approximating using marginals might be my only viable solution.

Yes

This doesn’t really make the dependence on l clear. You can write

L(m|u, c_i, l) = L(m|u, c_i, 0)\,l + L(m|u, c_i, 1)\,(1-l), but if the L(\cdot) terms don’t simplify it will be hard to integrate away l.

Ah, I wasn’t sure which details would be useful… In essence, each of the 1024 possible values of the categorical variable l corresponds to a set of ordered pairs:
l = \{(m, u)\, \in \mathbb{N} \times \mathbb{N}\}
And to a set of sets in the case of t:
t = \{c_i \,|\, c_i = \{m \in \mathbb{N} \}\}
So the likelihood depends on what these sets look like:
P((m,u)=D_{ki}|l,t)\propto \sum_{c_i\in t}{\frac{L(m|u,c_i,l)}{|t|}}
L(m \,|\, u,c_i,l)= \begin{cases} g(u,l,c_i), & (m, u) \in l \ \text{and} \ m \in c_i \\ h(u,l,c_i), & (m, u) \notin l \ \text{and} \ m \in c_i \\ 0, & m \notin c_i \end{cases}
where g is based on counting the number of pairs in l of the form (n, u), n \in c_i, and h is chosen so that everything sums to 1.
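In code, I picture the likelihood roughly like this (the stand-in definitions of g and h below are toy assumptions; the real g does the counting described above, and the real h handles the normalization):

```python
import numpy as np

def likelihood(m, u, l_pairs, t_sets, g, h):
    """P((m, u) | l, t) following the piecewise definition above.

    l_pairs: the set of (m, u) pairs for the current value of l
    t_sets:  the list of sets c_i making up the current value of t
    g, h:    the two branch functions (their real definitions are
             model-specific, so they are passed in)
    """
    total = 0.0
    for c_i in t_sets:
        if m not in c_i:
            continue                      # the L = 0 branch
        if (m, u) in l_pairs:
            total += g(u, l_pairs, c_i)
        else:
            total += h(u, l_pairs, c_i)
    return total / len(t_sets)            # the 1/|t| mixture weight

# Toy stand-ins: g counts the pairs (n, u) in l with n in c_i, as described;
# h here is just a placeholder constant, not the real normalizing choice.
def g(u, l_pairs, c_i):
    return sum(1 for (n, uu) in l_pairs if uu == u and n in c_i)

def h(u, l_pairs, c_i):
    return 0.5
```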

But in a more general case, I am quite confused about what steps I would need to follow to get to (or estimate) the marginals Q^k(l|D_1,...,D_k) and Q^k(t_k|D_1,...,D_k) from my joint posterior Q^k(l,t_1,...,t_k,\theta|D_1,...,D_k).
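For instance, if I had samples from the joint posterior Q^k, would estimating these marginals just amount to tabulating each variable's samples on its own? Something like this (with fake stand-in samples in place of a real trace):

```python
import numpy as np

rng = np.random.default_rng(2)

# Fake stand-ins for joint posterior samples of (l, t_1..t_k, theta).
n_samples, n_l, n_t, k = 5000, 16, 4, 3
l_samples = rng.integers(0, n_l, size=n_samples)
t_samples = rng.integers(0, n_t, size=(n_samples, k))
theta_samples = rng.dirichlet(np.ones(n_t), size=n_samples)

# Marginal over l: a normalized histogram of the l samples.
p_l = np.bincount(l_samples, minlength=n_l) / n_samples

# Marginal over t_2 (the second submodel): same idea, column 1.
p_t2 = np.bincount(t_samples[:, 1], minlength=n_t) / n_samples

# For theta, the samples themselves are draws from its marginal;
# e.g. a point summary is just their mean.
theta_mean = theta_samples.mean(axis=0)
```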

Sorry if this thread is getting too long, and I really appreciate the help so far!