Mixture model using pm.Mixture or writing in parts

I’m starting to look at mixture models. I have a toy linear regression where I’m trying to infer the same latent parameter by choosing the linear operator with the least noise.
Using pm.Mixture, it works as expected:

import numpy as np
import pymc3 as pm

np.random.seed(2)
m = 60
n = 3
eps = 0.01
siga = 1
H = np.random.random(size=(m,n))
xt = np.expand_dims(np.arange(1,n+1),1)
y = H @ xt + np.expand_dims(np.random.normal(loc=0,scale=eps, size=m),1)
H1 = 0.1 + H + np.random.normal(0,eps*10, size=(m,n))
H2 = H + np.random.normal(0,eps, size=(m,n))

nit = 3000 
with pm.Model() as model:
    x = pm.Normal("x", mu=xt.max()/2, sd = siga, shape = (n,1)) 
    likemix = []
    for Hi in [H1, H2]:
        likemix.append(pm.Normal.dist(mu=pm.math.dot(Hi, x), sd=eps, shape=m))
    w = pm.Dirichlet('w', a=np.array([2, 2]))
    Y = pm.Mixture('like', w=w, comp_dists = likemix, observed=y)
    trace = pm.sample(nit, tune=1000)
pm.plot_trace(trace)

But if I write out the problem in parts, as a sum of weighted likelihoods, the weighting (this time using an equivalent Beta distribution) just reverts to its prior.

with pm.Model() as model2:
    x = pm.Normal("x", mu=xt.max()/2, sd = siga, shape = (n,1)) 
    a1 = pm.Beta(f"alpha", alpha=2, beta=2)
    Y = a1*pm.Normal(f'y1', mu = pm.math.dot(H1, x), sd=eps, observed=y)  + \
        (1-a1)*pm.Normal(f'y2', mu = pm.math.dot(H2, x), sd=eps, observed=y) 
    trace = pm.sample(nit, chains=3, tune=1000)
pm.plot_trace(trace)

What’s going on here? I want to build a more complex mixture model, which the pm.Mixture functionality doesn’t accommodate. This is why I want to construct it as above.

When you write down a PyMC model you are specifying a generative process, not a probability expression (as you would in Stan).

So this line

Y = a1*pm.Normal(f'y1', mu = pm.math.dot(H1, x), sd=eps, observed=y)  + \
        (1-a1)*pm.Normal(f'y2', mu = pm.math.dot(H2, x), sd=eps, observed=y) 

is adding two observed variables on the same dataset. The Y variable itself is not doing anything (because it is not used anywhere else), and your model is actually equivalent to:

with pm.Model() as model2:
    x = pm.Normal("x", mu=xt.max()/2, sd = siga, shape = (n,1)) 
    a1 = pm.Beta(f"alpha", alpha=2, beta=2)
    pm.Normal(f'y1', mu = pm.math.dot(H1, x), sd=eps, observed=y)
    pm.Normal(f'y2', mu = pm.math.dot(H2, x), sd=eps, observed=y) 
    trace = pm.sample(nit, chains=3, tune=1000)
pm.plot_trace(trace)

This explains why a1 matches the prior: when we convert the model to the probability graph, a1 is not linked in any way to the likelihoods coming from y1 and y2.

More subtly, it also explains why the uncertainty about x is “halved” relative to your first model: you double-counted the data!
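
A quick way to check both claims (a sketch, assuming a recent PyMC version where Model.point_logps is available) is to print the per-variable log-probability terms of the equivalent model above:

# Sketch: inspect the log-probability terms of the equivalent model above.
print(model2.point_logps())
# e.g. {'x': ..., 'alpha': ..., 'y1': ..., 'y2': ...}
# y1 and y2 are two separate likelihood terms over the same data,
# and alpha only appears through its own prior term.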

If you want something that Mixture cannot offer, you need to use a CustomDist, or work on the probability side directly with a Potential. I suggest the first, because it lets you retain the whole Bayesian workflow (prior and posterior predictive sampling included).

It would look something like (untested code and math):

def dist_fn(a, mu1, mu2, sd, size):
    # Draw a component index per observation: 0 with probability a, 1 with probability 1 - a
    idxs = pm.Categorical.dist(p=[a, 1 - a], size=size)
    y1 = pm.Normal.dist(mu1, sd, size=size)
    y2 = pm.Normal.dist(mu2, sd, size=size)
    # Pick y1 where the index is 0 and y2 where it is 1
    return pm.math.switch(idxs, y2, y1)

def logp_fn(value, a, mu1, mu2, sd):
    # Log-density of each component evaluated at the observed value
    y1_logp = pm.logp(pm.Normal.dist(mu1, sd), value)
    y2_logp = pm.logp(pm.Normal.dist(mu2, sd), value)
    # Marginalize over the component indicator
    logp = pm.math.logaddexp(y1_logp + pm.math.log(a), y2_logp + pm.math.log(1 - a))
    return logp
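
# Note: logp_fn is just the marginal mixture density in log space,
#   log p(y) = logaddexp(log a + log N(y; mu1, sd), log(1 - a) + log N(y; mu2, sd)),
# i.e. the component indicator is summed out, which is the same marginalization
# that pm.Mixture performs internally.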

with pm.Model() as model2:
    x = pm.Normal("x", mu=xt.max()/2, sd = siga, shape = (n,1)) 
    a1 = pm.Beta(f"alpha", alpha=2, beta=2)
    Y = pm.CustomDist(
        "y",
        a1,  # parameters are passed to dist_fn / logp_fn in this order: a, mu1, mu2, sd
        pm.math.dot(H1, x),
        pm.math.dot(H2, x),
        eps,
        dist=dist_fn,
        logp=logp_fn,
        observed=y)
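
If you preferred the Potential route mentioned above, a rough sketch (equally untested; the names model3 and mix_logp are just illustrative) is to add the marginalized mixture log-likelihood as a raw log-probability term. The trade-off is that Potential terms are ignored by prior and posterior predictive sampling:

with pm.Model() as model3:
    x = pm.Normal("x", mu=xt.max()/2, sigma=siga, shape=(n, 1))
    a1 = pm.Beta("alpha", alpha=2, beta=2)
    # Component log-densities evaluated at the observed data
    y1_logp = pm.logp(pm.Normal.dist(pm.math.dot(H1, x), eps), y)
    y2_logp = pm.logp(pm.Normal.dist(pm.math.dot(H2, x), eps), y)
    # Add the marginalized mixture log-likelihood directly to the model's logp
    pm.Potential(
        "mix_logp",
        pm.math.logaddexp(y1_logp + pm.math.log(a1), y2_logp + pm.math.log(1 - a1)).sum(),
    )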

Are you sure the Mixture class is too inflexible, though? You can pass CustomDists as component distributions, which makes it about as flexible as it can be. The only real restriction of Mixture has to do with the underlying marginalization: observations for univariate components are always treated as completely exchangeable. If this is not the problem you are trying to overcome, you should be able to use Mixture.

For example, the kind of model in this issue is now supported (in that case, a Mixture of loc/scale-modified Beta distributions): Mixture with CustomDist with inferred logp fails · Issue #6728 · pymc-devs/pymc · GitHub
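
As a rough illustration of that pattern (an untested sketch in the spirit of the issue, not the code from it; the names and numbers are made up), the components can be CustomDists whose logp PyMC infers automatically:

import numpy as np
import pymc as pm

def shifted_scaled_beta(alpha, beta, loc, scale, size):
    # A Beta variable shifted by `loc` and stretched by `scale`
    return loc + scale * pm.Beta.dist(alpha=alpha, beta=beta, size=size)

with pm.Model():
    w = pm.Dirichlet("w", a=np.ones(2))
    comps = [
        pm.CustomDist.dist(2.0, 2.0, -1.0, 2.0, dist=shifted_scaled_beta),
        pm.CustomDist.dist(5.0, 1.0, 0.0, 3.0, dist=shifted_scaled_beta),
    ]
    # Dummy observed data, just so the sketch is self-contained
    obs = pm.Mixture("obs", w=w, comp_dists=comps, observed=np.random.uniform(-1, 3, size=50))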

Great, thank you. That’s very helpful.
The reason I think the Mixture class is too inflexible is that I want to apply different weights to the likelihood for different chunks of the observed data for each model in the mixture, e.g., the likelihood for the first half of the observed data y[:k] will have a different weight than the second half y[k:] for a given model.

I’m still not sure how to do this by specifying a generative process, but I think your example gets me closer.

You can do that with Mixture.

You can specify one set of weights and Mixture will broadcast it to every observation, or you can define one set of weights per observation, pretty much like any other parameter in PyMC distributions.

For example:

pm.Mixture(
  "y",
  w=[[0.9, 0.1], [0.1, 0.9]], 
  comp_dists=[pm.Normal.dist(-1), pm.Normal.dist(1)],
  observed=[-1, 1],
)

Here I wrote the weights manually for each observation, just for illustration. You can use advanced indexing, concatenate the two kinds of weights, or whatever other form feels most natural for defining your weights programmatically.

For your case, I would probably do something like:

import pytensor.tensor as pt

# t is the split index between the two chunks; w1 and w2 are their weights
weights = pt.empty((len(y), 2))
weights = pt.set_subtensor(weights[:t], [w1, 1 - w1])
weights = pt.set_subtensor(weights[t:], [w2, 1 - w2])
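
Or, if the concatenation route mentioned earlier feels more natural (same caveat, untested; w1, w2 and the split index t are as above):

weights = pt.concatenate(
    [
        pt.tile(pt.stack([w1, 1 - w1]), (t, 1)),          # first chunk weighted (w1, 1 - w1)
        pt.tile(pt.stack([w2, 1 - w2]), (len(y) - t, 1)),  # second chunk weighted (w2, 1 - w2)
    ],
    axis=0,
)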

Great! Thank you very much.

Apologies, another quick question. Is there some nuance when using the tensor object to assign weights drawn from a distribution?
I.e., something like,

    w = [pm.Dirichlet(f'w{i}', a=np.array([2, 2])) for i in range(2)]
    weights = pt.empty((len(y), 2))
    weights = pt.set_subtensor(weights[:t], w[0])
    weights = pt.set_subtensor(weights[t:], w[1])

leads to the weights w0 and w1 being zero or one. Or should the tensor not be used here?

What do you mean they are all zero or one? After inference? That could happen if there is almost no uncertainty about the right weights. The prior is certainly not zero/one:

import numpy as np
import pymc as pm
import pytensor.tensor as pt

with pm.Model() as m:
    w = [pm.Dirichlet(f'w{i}', a=np.array([2, 2])) for i in range(2)]
    weights = pt.empty((10, 2))
    weights = pt.set_subtensor(weights[:5], w[0])
    weights = pt.set_subtensor(weights[5:], w[1])
pm.draw(weights, random_seed=1)
array([[0.4449662 , 0.5550338 ],
       [0.4449662 , 0.5550338 ],
       [0.4449662 , 0.5550338 ],
       [0.4449662 , 0.5550338 ],
       [0.4449662 , 0.5550338 ],
       [0.93481359, 0.06518641],
       [0.93481359, 0.06518641],
       [0.93481359, 0.06518641],
       [0.93481359, 0.06518641],
       [0.93481359, 0.06518641]])