Discrete RV and marginalisation

Hi everybody!

I made two versions of a model, one in which I explicitly have a categorical variable and another in which it is marginalised. There was some tension between the final results of the two models (below one sigma, so nothing I was really worried about). At first I thought it was just Monte-Carlo error, because sampling discrete variables is difficult, but it persisted. Eventually I reproduced the issue with a minimal example; here it is:

import numpy as np
import pymc3 as pm

toydata = np.ones(10)

# explicit latent categorical variable
nonmargtoy_model = pm.Model()
with nonmargtoy_model:
    w = pm.Dirichlet('w', a=np.ones(5), shape=5)
    t = pm.Categorical('t', p=w, testval=1)
    Y_obs = pm.Normal("Y_obs", mu=t, sigma=1, observed=toydata)
    nonmargtoy_trace = pm.sample()

# same model with t marginalised out via a mixture
margtoy_model = pm.Model()
with margtoy_model:
    w = pm.Dirichlet('w', a=np.ones(5), shape=5)
    # t = pm.Categorical('t', p=w, testval=1)
    Ndists = [pm.Normal.dist(mu=t, sigma=1) for t in range(5)]
    Y_obs = pm.Mixture("Y_obs", w=w, comp_dists=Ndists, observed=toydata)
    margtoy_trace = pm.sample()

Do you agree that the two models are the same modulo a marginalisation and should give the same w posterior distribution?
Here are the results for the non-marginalised model:
(plot: nonmargtoy)
and for the marginalised one:
(plot: margtoy)
Both have some preference for 1 (although in my more complicated model the preferred value does change a bit), but not with the same strength at all. The non-marginalised version seems to be wrong: its w gives a weaker preference to 1, staying quite close to its prior (which has mean 0.2 for each component) despite my data being 1 everywhere, and it doesn’t select 0 and 2 as more likely than 3 or 4.

What am I missing?
I tried several samplers, in particular Metropolis, in case there was an issue with the automatic differentiation used to compute the gradients for HMC (since w only impacts the likelihood through t, I am not sure how those derivatives could even be computed). It didn’t change anything.
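For completeness, the kind of thing I tried looks roughly like this (just a sketch against the toy model above; as far as I understand, by default pm.sample would already assign NUTS to w and a categorical Gibbs/Metropolis step to t):

with nonmargtoy_model:
    # gradient-free run: let Metropolis-type steps handle everything
    step = pm.Metropolis()
    nonmargtoy_trace_metropolis = pm.sample(step=step)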

This is on PyMC3 v3.11.4, and I’m not willing to update it right now: apart from this detail the project is quite advanced, and I’m about to submit a paper in the next few days.

Yes, the models should lead to equivalent conclusions for w, and the non-marginalized one does seem biased.

I would suggest you try:

  1. Many more samples with the non-marginalized model (a minimal sketch follows after this list)

  2. pymc4 on Google Colab, quickly (it’s installed there by default), as there was an occasional weird bug in Theano when the data was all ones or zeros, which I think was fixed after pymc3
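A minimal sketch of (1), against your toy model above (the draw and tune counts are arbitrary, just large enough to make Monte-Carlo error implausible as the culprit):

with nonmargtoy_model:
    # much longer run than the defaults, several chains
    nonmargtoy_trace = pm.sample(draws=20000, tune=5000, chains=4)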

Thanks for your answer.

I did try (1) quite extensively, with something like 20,000 draws, to the point where even extremely unlikely values of t get sampled. There wasn’t the slightest change.

Regarding (2), that’s a good suggestion; I wasn’t familiar with Google Colab. But the result is the same. Here are the plots I get with pymc4 (v4.1.4):
(plot: nonmargtoy_pymc4)
and for the other model:
(plot: margtoy_pymc4)

So you think I should try to install pymc5 now? :frowning:

By the way, I should also add what happens when I use 100 data points (still all ones) instead of 10: the non-marginalised result barely moves, so the difference becomes even more pronounced.
Non-marginalised:
(plot: nonmargtoy_pymc4-100)
Marginalised:
(plot: margtoy_pymc4-100)

Hmmm…
Sometimes writing things down in a message is just what you need to clear your mind. I think I just found the issue, the thing that has been bothering me for weeks.

If I change the code to

import arviz as az

toydata = np.ones(10)
nonmargtoy_model = pm.Model()
with nonmargtoy_model:
    w = pm.Dirichlet('w', a=np.ones(5), shape=5)
    # shape=10 gives one latent t_i per data point instead of a single shared t
    t = pm.Categorical('t', p=w, testval=1, shape=10)
    Y_obs = pm.Normal("Y_obs", mu=t, sigma=1, observed=toydata)
    nonmargtoy_trace = pm.sample()
    display(az.summary(nonmargtoy_trace))

Then I get
(plot: nonmargtoy_pymc4-shape)
and now the agreement looks perfect.

Is it that I didn’t understand how shapes should be specified, or was it a bug?
It looks like it interpreted toydata as one data point of dimension 10, rather than as 10 data points of dimension 1.
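For the record, a quick way to see the difference between the two parameterisations is to look at the shape of t in the trace (this assumes the InferenceData object that pm.sample returns by default in pymc4; in pymc3 I suppose nonmargtoy_trace['t'].shape would show the same thing):

print(nonmargtoy_trace.posterior['t'].shape)
# without shape=10: (chains, draws)      -> a single t shared by all observations
# with shape=10:    (chains, draws, 10)  -> one t_i per data point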


Oh yeah, that makes total sense. The Mixture of Normals treats each observation independently and, as such, marginalizes over a separate latent categorical variable for each one.

The first non-marginalized model would be more akin to a mixture of 5 multivariate normal components: comp_dists=[pm.MvNormal.dist(mu=np.full(10, i), cov=np.eye(10)) for i in range(5)]

I don’t recommend testing that out in v3 as multivariate mixtures were a bit flaky back then.

Indeed.

toydata = np.ones(100)

# marginalised version with a single t shared by all observations:
# a mixture of 5 multivariate normal components
marg2toy_model = pm.Model()
with marg2toy_model:
    w = pm.Dirichlet('w', a=np.ones(5), shape=5)
    Ndists = [pm.MvNormal.dist(mu=t, cov=np.diag(np.ones_like(toydata)), shape=toydata.shape[0])
              for t in range(5)]
    Y_obs = pm.Mixture("Y_obs", w=w, comp_dists=Ndists, observed=toydata)
    marg2toy_trace = pm.sample()
    az.plot_trace(marg2toy_trace, legend=True)

(plot: marg2toy_pymc4-100)

Now it’s clear. There were indeed two different models involved: either one t shared by all observations, or one t_i for each observation. In one case the probability is \sum_t \left[\prod_i P(y_i|t)\right]P(t) and in the other it is \prod_i \sum_{t_i} P(y_i|t_i) P(t_i), which is obviously not the same, regardless of how the summation is implemented. If there is one t_i per observation, then not much can be inferred about each of those t_i, except through the pooling, so w has to carry most of the information. If t is shared, then w matters less, because \prod_i P(y_i|t) already makes a strong choice.
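Just to double-check the algebra, here is a quick numerical comparison of the two expressions on the toy data (plain numpy/scipy; I simply plug in a uniform P(t) = 1/5 instead of integrating over the Dirichlet, so the numbers are only illustrative):

import numpy as np
from scipy.stats import norm

toydata = np.ones(10)
ts = np.arange(5)
p_t = np.full(5, 1 / 5)                              # illustrative uniform P(t)

dens = norm.pdf(toydata[:, None], loc=ts, scale=1)   # P(y_i | t), shape (10, 5)

shared_t = np.sum(p_t * dens.prod(axis=0))           # sum_t [ prod_i P(y_i|t) ] P(t)
per_obs_t = np.prod(np.sum(p_t * dens, axis=1))      # prod_i sum_{t_i} P(y_i|t_i) P(t_i)

print(shared_t, per_obs_t)                           # two clearly different numbers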
