Understanding how the logp of a mixed collection of UV Normals differs from that of MV Normals

Hello,

I am continuing my series on better understanding how mixture distributions work in pymc. Based on the earlier discussion (thanks to @ricardoV94 and @jessegrabowski) I now understand the difference between how pymc samples when one uses an MvNormal versus a set of UvNormals of the same shape. In summary, with MvNormal components the mixing happens only across the components, whereas with UvNormal components the mixing also happens independently along the batch dimensions. So, for instance, if you equally mix two normals with centers [0.5, 0.5] and [2, 2], in the Uv case you can get points near [0.5, 2] and [2, 0.5] as well as near [0.5, 0.5] and [2, 2], whereas in the Mv case you only get points near the two cluster centers (see the code below).

The code for the models (as a bonus, it also includes the other solution suggested by @ricardoV94, to verify that it samples in the same way as the MvNormal mixture):

import pymc as pm
import numpy as np
import pytensor.tensor as pt
import pymc_experimental as pme

mu = [0.5, 0.5, 2, 2]
sd = [0.1, 0.1, 0.1, 0.1]
w = [0.5, 0.5]
n_clusters = 2
lower = 0
upper = 3

coords = {
    "cluster": range(n_clusters),
    "ndim": range(2),
    "obs": range(1),
}


with pm.Model() as model_mv:

  components = [pm.MvNormal.dist(mu[:2], sd[:2]*np.eye(2), shape=(2)),
                pm.MvNormal.dist(mu[2:4], sd[2:4]*np.eye(2), shape=(2))]

  samples_mv = pm.Mixture("mix", w, components)


with pm.Model() as model_uv:

  components = [pm.Normal.dist(mu[:2], sd[:2]),
                pm.Normal.dist(mu[2:4], sd[2:4])]

  samples_uv = pm.Mixture("mix", w, components)


with pme.MarginalModel(coords=coords) as m3:

    idx = pm.Categorical("idx", p=np.ones(n_clusters) / n_clusters, dims=("obs",))

    mu_x = np.array([mu[0], mu[2]])
    mu_y = np.array([mu[1], mu[3]])

    mu = pm.math.concatenate([mu_x[..., None], mu_y[..., None]], axis=-1)
    sigma = 0.01

    def dist(idx, mu, sigma, _):
        # Define a mixture where the output x, y depends on the value of idx
        rv = pm.Normal.dist(mu[0][:, None], sigma, shape=[2,1])

        for i in range(n_clusters - 1):
            rv = pm.math.where(
                pm.math.le(idx, i),
                rv,
                pm.Normal.dist(mu[i + 1][:, None], sigma, shape=[2,1])
            )
        return rv


    samples_uv_fixed = pm.CustomDist(
        "mix",
        idx, mu, sigma,
        dist=dist,
        dims=("ndim", "obs"),
    )


draws_uv = pm.draw(samples_uv, 1000)
draws_mv = pm.draw(samples_mv, 1000)
draws_uv_fixed = pm.draw(samples_uv_fixed, 1000)[:,:,0]


value_uv = pt.vector("value_uv")
rv_logp = pm.logp(samples_uv, value_uv)
print("UV logP(0.5, 0.5):" + str(rv_logp.eval({value_uv: [0.5, 0.5]})))
print("UV logP(2, 2):" + str(rv_logp.eval({value_uv: [2, 2]})))
print("UV logP(0.5, 2):" + str(rv_logp.eval({value_uv: [0.5, 2]})))
print("UV logP(2, 0.5):" + str(rv_logp.eval({value_uv: [2, 0.5]})), end="\n\n")

value_mv = pt.vector("value_mv")
rv_logp = pm.logp(samples_mv, value_mv)
print("MV logP(0.5, 0.5):" + str(rv_logp.eval({value_mv: [0.5, 0.5]})))
print("MV logP(2, 2):" + str(rv_logp.eval({value_mv: [2, 2]})))
print("MV logP(0.5, 2):" + str(rv_logp.eval({value_mv: [0.5, 2]})))
print("MV logP(2, 0.5):" + str(rv_logp.eval({value_mv: [2, 0.5]})))

This prints

UV logP(0.5, 0.5):[0.69049938 0.69049938]
UV logP(2, 2):[0.69049938 0.69049938]
UV logP(0.5, 2):[0.69049938 0.69049938]
UV logP(2, 0.5):[0.69049938 0.69049938]

MV logP(0.5, 0.5):-0.2284391538060554
MV logP(2, 2):-0.2284391538060554
MV logP(0.5, 2):-10.7852919734153
MV logP(2, 0.5):-10.7852919734153
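
As a quick sanity check of the sampling claim (an added snippet that reuses the draws_uv and draws_mv arrays from the script above; the 0.3 window is an arbitrary choice), one can count how many draws land near the “mixed” corner [0.5, 2]:

# fraction of draws within 0.3 of the off-cluster corner [0.5, 2]
near_corner_uv = np.mean(np.all(np.abs(draws_uv - np.array([0.5, 2.0])) < 0.3, axis=-1))
near_corner_mv = np.mean(np.all(np.abs(draws_mv - np.array([0.5, 2.0])) < 0.3, axis=-1))
print("UV draws near [0.5, 2]:", near_corner_uv)  # a sizeable fraction
print("MV draws near [0.5, 2]:", near_corner_mv)  # essentially zero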

One sees that the likelihood of [0.5, 2] is much lower than that of [2, 2] in the MV case, but the same in the UV case. To me this suggests that a mixture of MvNormals acts somehow differently from how mixture models are classically defined. For a classical mixture of MvNormals, the likelihood is (MV() stands for the pdf of a multivariate normal):

\log P(obs) = \log\big(0.5 \cdot MV(obs; \mu=[0.5, 0.5], sd=0.1 \cdot Id)\big) + \log\big(0.5 \cdot MV(obs; \mu=[2, 2], sd=0.1 \cdot Id)\big)
\sim (obs_x - 0.5)^2 + (obs_y - 0.5)^2 + (obs_x - 2)^2 + (obs_y - 2)^2

This would evaluate to the same value for [0.5, 2] and [0.5, 0.5]. So does that somehow imply that, in pymc’s case, the logp of mixed MvNormals is different from the classical mixture model? Does it mean that when pymc evaluates the MV normal mixture model’s logp, it really chooses a component randomly (based on the weights) and evaluates the logp only with respect to the chosen cluster? I guess that would mean this is not a classical mixture model in some sense, right?

Also, this is perhaps a question about me not understanding dimensions correctly, but why is the logp in the UV case coming out as two positive numbers? (I would have expected a single negative number.)

Thanks

Yes, the generative process is about drawing a component according to the weights and then taking a whole draw from that component. In an MvNormal mixture a component draw is a whole vector of values. In a univariate normal mixture it’s a scalar.
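
For instance, here is a hand-rolled sketch of that process (just an illustration using the centers [0.5, 0.5] and [2, 2] and sd 0.1 from the models above; it is not what PyMC does internally):

import numpy as np

rng = np.random.default_rng(0)
w = [0.5, 0.5]
mus = np.array([[0.5, 0.5], [2.0, 2.0]])
sd = 0.1

# MvNormal-style mixture: one component index selects the whole vector
idx = rng.choice(2, p=w)
draw_mv = rng.normal(mus[idx], sd)  # lands near [0.5, 0.5] or [2, 2]

# Batched univariate mixture: an independent index for every batch element
idx_per_elem = rng.choice(2, p=w, size=2)
draw_uv = rng.normal(mus[idx_per_elem, [0, 1]], sd)  # can also land near [0.5, 2] or [2, 0.5]

print(draw_mv, draw_uv)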

I don’t see what’s unclassical about it?

If you passed two numbers/means, you have a batch vector of two independent univariate mixtures, so you get two logps. For thinking about dimensions I suggest the Distribution Dimensionality — PyMC 5.10.4 documentation page.

We don’t state it explicitly, but the batch/core distinction also clarifies when multiple values should map to a single logp (core dims) versus multiple logps (batch dims).
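
For example (a small sketch with illustrative values, not taken from the models above), a length-2 batch of univariate Normals returns two logps while an MvNormal with a length-2 core dimension returns a single one:

import numpy as np
import pymc as pm

value = np.array([0.5, 2.0])
batched = pm.Normal.dist(mu=[0.5, 2.0], sigma=0.1)            # batch shape (2,), core shape ()
core = pm.MvNormal.dist(mu=[0.5, 2.0], cov=0.01 * np.eye(2))  # batch shape (), core shape (2,)

print(pm.logp(batched, value).eval().shape)  # (2,): one logp per batch element
print(pm.logp(core, value).eval().shape)     # (): a single joint logp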

Re: positive/negative, don’t forget pdfs are density functions. They can be larger than one, but must always integrate to at most 1 over any region of the domain. Logps can therefore be positive.
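
To make that concrete with the numbers above (a small check, using the sigma = 0.1 from your models): with such a narrow normal the density at the mean already exceeds 1, which is exactly where the printed 0.69049938 comes from:

import numpy as np
from scipy import stats

# mixture density of the first batch element, evaluated at its own center 0.5
p = 0.5 * stats.norm.pdf(0.5, loc=0.5, scale=0.1) + 0.5 * stats.norm.pdf(0.5, loc=2.0, scale=0.1)
print(p, np.log(p))  # ~1.9947, ~0.6905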


Thanks for the reply. I guess my confusion stems from the fact that I don’t quite understand what the log likelihood of the mixture looks like in the two cases (especially the batched-dimension case with UV normals). Let’s assume f_{u}(x; \mu) is the pdf of a univariate normal and f_{m}(x, y; \mu_x, \mu_y) is the pdf of a multivariate normal (I don’t show \sigma for brevity and assume the covariance of the multivariate is diagonal, so it can be written as a product of two normals as I do below). In short, I will write z = (x, y), \mu = (\mu_x, \mu_y) and the multivariate pdf as f_{m}(z; \mu).

1- Mixture of two multivariate normals with centers \mu^1=(0,0) and \mu^2=(1,1)

l(z \mid \mu^1, \mu^2) \sim \log\big(f_{m}(z \mid \mu^1) + f_{m}(z \mid \mu^2)\big) = \log\big(f_{u}(x \mid 0) \cdot f_u(y \mid 0) + f_{u}(x \mid 1) \cdot f_u(y \mid 1)\big)

2- Mixture of 4 normals created as two uv distributions with batch dimension 2 (the first one has \mu^1=(0,0), the second \mu^2=(1,1))

l(z \mid \mu^1, \mu^2) \sim \log\big(f_u(x \mid 0) + f_u(y \mid 0) + f_u(x \mid 1) + f_u(y \mid 1)\big)

So basically the difference is that, in the univariate case, when I equally mix two univariate normals with batch dimension 2, I am actually mixing 4 univariate distributions with weights 1/4, right? That is, the weight of a batched distribution is equally distributed along its batch dimensions. Whereas in the multivariate case I am mixing two multivariate distributions with weights 1/2, which breaks the symmetry between \mu^1 and \mu^2.

Re: positive/negative, don’t forget pdfs are density functions. They can be larger than one, but must always integrate to at most 1 over any region of the domain. Logps can therefore be positive.

Oy, that was a rookie mistake on my side, but thanks for answering it anyway :slight_smile:

To make the univariate vs multivariate distinction more obvious, think of an extreme case like Mixture(w=[0.5, 0.5], comp_dists=[MvNormal([-1, -1], S), MvNormal([1, 1], S)]), where each component has a tiny diagonal covariance matrix S. This distribution can easily generate the vectors [-1, -1] or [1, 1] but not [1, -1] or [-1, 1], whereas a batch of two univariate normals could easily do it.

The pdf of the mvnormal mixture of a vector x = [x1, x2] looks like 0.5 * MvNormal_pdf(x; [-1, -1], S) + 0.5 * MvNormal_pdf(x; [1, 1], S).

The pdf of the univariate normal mixture of a batched vector x = [x1, x2] looks like [0.5 * Normal_pdf(x1; -1, s) + 0.5 * Normal_pdf(x1; 1, s), 0.5 * Normal_pdf(x2; -1, s) + 0.5 * Normal_pdf(x2; 1, s)]. For the joint probability you would multiply the two pdf entries together. When you do this you end up squaring the weights, but that’s not the only difference between the two cases.
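
Both expressions can be checked directly against pm.logp; a minimal sketch, assuming tiny made-up scales S = 0.01 * I and s = 0.1:

import numpy as np
import pymc as pm
from scipy import stats

S = 0.01 * np.eye(2)
s = 0.1
x = np.array([-1.0, 1.0])  # the "forbidden" corner for the MvNormal mixture

# MvNormal mixture: a single joint logp, matching log(0.5 * MvNormal_pdf + 0.5 * MvNormal_pdf)
mix_mv = pm.Mixture.dist(w=[0.5, 0.5],
                         comp_dists=[pm.MvNormal.dist([-1, -1], S), pm.MvNormal.dist([1, 1], S)])
print(pm.logp(mix_mv, x).eval(),
      np.log(0.5 * stats.multivariate_normal.pdf(x, [-1, -1], S)
             + 0.5 * stats.multivariate_normal.pdf(x, [1, 1], S)))  # both very negative

# Batched univariate mixture: one logp per element, matching the bracketed expression above
mix_uv = pm.Mixture.dist(w=[0.5, 0.5],
                         comp_dists=[pm.Normal.dist([-1, -1], s), pm.Normal.dist([1, 1], s)])
print(pm.logp(mix_uv, x).eval(),
      np.log(0.5 * stats.norm.pdf(x, [-1, -1], s) + 0.5 * stats.norm.pdf(x, [1, 1], s)))  # both comfortable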


Okay, got it. And when you multiply the entries of the expression above together:

(0.5 \times N(x_1; -1, s) + 0.5 \times N(x_1; 1, s))\times(0.5 \times N(x_2; -1, s) + 0.5 \times N(x_2; 1, s))
=0.25\times\sum_{i,j \in \{1,-1\}} N(x_1; i, s)\times N(x_2; j, s)

So you still get something that evaluates to the same value whether you supply [-1, -1], [-1, 1], [1, -1], or [1, 1] (the quick numeric check below confirms this). Thanks.
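
A tiny numeric confirmation of that symmetry (again assuming s = 0.1):

from scipy import stats

s = 0.1

def joint_uv_mixture_pdf(x1, x2):
    # product of the two per-element mixture pdfs
    p1 = 0.5 * stats.norm.pdf(x1, -1, s) + 0.5 * stats.norm.pdf(x1, 1, s)
    p2 = 0.5 * stats.norm.pdf(x2, -1, s) + 0.5 * stats.norm.pdf(x2, 1, s)
    return p1 * p2

print(joint_uv_mixture_pdf(-1, -1), joint_uv_mixture_pdf(-1, 1))  # identical values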
