Mixture of Linear models

I am trying to build the following model:

(screenshot of the model: a K-component mixture of linear regressions, y ~ sum_k pi_k * Normal(w_k * x, sigma^2))

With some of the sampled data looking like the following:

import numpy as np

N = 100000
x = 2*np.random.rand(N, 1) - 1      # inputs uniform on [-1, 1]
idx = np.random.rand(N) < 0.5       # random component assignment
y = 3*x                             # component 1: slope 3
y[idx] = -x[idx]                    # component 2: slope -1
y = y + 0.1*np.random.randn(N, 1)   # additive noise with sd 0.1

The PyMC3 model looks like the following, and I fit it using minibatches for variational Bayes (so as to avoid / minimise label switching).

import theano
import theano.tensor as tt
import pymc3 as pm

def logp_normal(mu, sigma, value):
    # log-probability of the samples, summed over the batch
    delta = (value - mu).ravel()
    k = delta.shape[0]
    return -0.5 * (k * tt.log(2 * np.pi) + tt.log(sigma) + tt.square(delta).sum()/tt.square(sigma))

# Log-likelihood of the Gaussian mixture distribution
def logp_gmix(mus, pi, sigma):
    def logp_(value):
        # one weighted log-probability per component, combined with logsumexp
        logps = [tt.log(pi[i]) + logp_normal(mu, sigma, value) for i, mu in enumerate(mus)]

        return tt.sum(pm.math.logsumexp(tt.stacklists(logps), axis=0))

    return logp_

K = 2  # number of mixture components
D = 1  # dimensionality of x
X = theano.shared(x)
Y = theano.shared(y.ravel())

with pm.Model() as model:
    pi = pm.Dirichlet('pi', np.ones(K))
    ws = [pm.Normal('w_%d' % i, mu=0, sd=2, shape=(D,)) for i in range(K)]
    sigma = pm.Uniform('sigma', 0, 1)
    mus = [tt.squeeze(tt.dot(X, w)) for w in ws]   # one linear predictor per component
    y_obs = pm.DensityDist('y_obs', logp_gmix(mus, pi, sigma), observed=Y)

With the inference part looking like:

batch_size = 128
X_m = pm.Minibatch(x, batch_size)
Y_m = pm.Minibatch(y.ravel(), batch_size)

with model:
    approx = pm.fit(10000,
                    more_replacements={X: X_m, Y: Y_m},   # swap in minibatches during fitting
                    callbacks=[pm.callbacks.CheckParametersConvergence(tolerance=1e-4)],
                    method='fullrank_advi')

# Draw samples from the fitted approximation
n = 1000
trace = approx.sample(n)

However, when I look at the posteriors in the trace, the fit unfortunately does not seem to converge on the weights (3 and -1 in our case).


The posterior of sigma also concentrates on values closer to 1 than to the 0.1 used to generate the data.
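
For reference, I am inspecting the approximate posterior with the standard PyMC3 helpers, roughly like this:

# inspect the approximate posterior drawn above
print(pm.summary(trace))   # expect w_0 / w_1 near 3 and -1, sigma near 0.1
pm.traceplot(trace)        # marginal densities and traces for pi, w_0, w_1, sigma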

So my questions are:

  1. Have I specified the model wrong?
  2. Is there anything else wrong here?

Actually, VI does not avoid label switching (it does not have that concept to begin with): if you don't restrict the likelihood/posterior geometry, VI will try to approximate the same multimodal space (for a simple demonstration, see https://github.com/junpenglao/All-that-likelihood-with-PyMC3/blob/master/Notebooks/Normal_mixture_logp.ipynb). Our current VI (i.e. mean-field and full-rank ADVI) does not do what you intend (i.e. approximate only a single mode) - even when it sometimes displays that behaviour, it is down to chance. You can see some examples of the current VI having difficulty at the very end of this notebook: https://github.com/junpenglao/Planet_Sakaar_Data_Science/blob/master/WIP/[WIP]%20Bayesian%20GMM.ipynb
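
As a rough illustration of what restricting the posterior geometry can look like (this is just my own sketch, not taken from the notebooks above - it breaks the symmetry with a hard ordering constraint on the two weights):

# sketch: same model as above, plus an ordering constraint w_0 < w_1
with pm.Model() as ordered_model:
    pi = pm.Dirichlet('pi', np.ones(K))
    ws = [pm.Normal('w_%d' % i, mu=0, sd=2, shape=(D,)) for i in range(K)]
    sigma = pm.Uniform('sigma', 0, 1)
    # -inf log-potential whenever the ordering is violated, so only one of
    # the two symmetric modes survives in the posterior
    pm.Potential('order', tt.switch(ws[1][0] - ws[0][0] > 0, 0., -np.inf))
    mus = [tt.squeeze(tt.dot(X, w)) for w in ws]
    y_obs = pm.DensityDist('y_obs', logp_gmix(mus, pi, sigma), observed=Y)

Whether a hard constraint like this plays nicely with ADVI is a separate question, but for sampling it is a common way to pin down a single labelling.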

My suggestion is to use sampling with a single chain and study the fit first, and then take it from there. A similar mixture-regression example you can have a look at is the Gaussian mixture of regressions (notebooks here: https://github.com/junpenglao/Planet_Sakaar_Data_Science/blob/master/PyMC3QnA/mixture/Mixture_discourse.ipynb and https://github.com/junpenglao/Planet_Sakaar_Data_Science/blob/master/PyMC3QnA/mixture/Mixture_discourse_order.ipynb; the second one also uses VI, and you can see it is not doing a good job).
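
Concretely, by sampling with a single chain I mean something along these lines (the numbers of draws and tuning steps are just placeholders):

# NUTS on the full data (X and Y still hold the complete dataset), one chain
with model:
    trace_nuts = pm.sample(1000, tune=1000, chains=1)

pm.traceplot(trace_nuts)
pm.summary(trace_nuts)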

Last note - VI inference on mixture models classically uses conjugate, coordinate-wise updates, and my intuition is that this forces the VI approximation to converge toward a single mode. Mean-field and full-rank ADVI, which update all parameters at the same time during inference, do not achieve that.

Thanks for that!

Just a quick note while I go through the material you sent. The reason I talked about being able to avoid label switching was the fact that VI is mode-seeking. Is this not the case because of the way VI is implemented, or am I just wrong about this?

Mode-seeking behaviour is a common phenomenon but not a guarantee.