Mixture of multivariate Bernoullis


I’m trying to model a mixture of multi-variate Bernoullis, so that my dataset X is an NxP binary matrix and I’m saying that there are K latent groups in the data with different probabilities for each variable, i.e. each X_{i} is drawn from a mixture of K Bernoullis so that there are K x P Bernoulli probability parameters \lambda.

The prior on K is a Dirichlet process using Austin Rochford’s tutorial on univariate Gaussian mixtures as a guide.

I can’t quite get it to work however. I try to manually create a multi-variate Bernoulli for each component as below but I get the error theano.tensor.var.AsTensorError: ('Cannot convert <bound method FreeRV.mean of K_0> to TensorType', <type 'instancemethod'>) which is thrown on the line where I define obs.

import numpy as np
import pandas as pd
import pymc3 as pm
from matplotlib import pyplot as plt
import seaborn as sns
import theano
import sys
from theano import tensor as tt
from theano.printing import Print

if __name__ == "__main__":

    N = 100
    P = 5

    # Simulate 5 variables with 100 observations of each that fit into 3 groups
    mu_actual = np.array([[0.7, 0.8, 0.2, 0.1, 0.1],
                          [0.3, 0.5, 0.9, 0.8, 0.6],
                          [0.1, 0.2, 0.5, 0.4, 0.9]])
    cluster_ratios = [0.4, 0.3, 0.3]  

    df = np.concatenate([np.random.binomial(1, mu_actual[0, :], size=(int(N*cluster_ratios[0]), P)),
                         np.random.binomial(1, mu_actual[1, :], size=(int(N*cluster_ratios[1]), P)),
                         np.random.binomial(1, mu_actual[2, :], size=(int(N*cluster_ratios[2]), P))])

    # Deterministic function for stick breaking
    def stick_breaking(beta):
        portion_remaining = tt.concatenate([[1], tt.extra_ops.cumprod(1 - beta)[:-1]])
        return beta * portion_remaining

    K_THRESH = 20

    with pm.Model() as model:
        # The DP priors to obtain w, the cluster weights
        alpha = pm.Gamma('alpha', 1., 1.)
        beta = pm.Beta('beta', 1, alpha, shape=K_THRESH)
        w = pm.Deterministic('w', stick_breaking(beta))

        mu = pm.Beta('mu', 1, 1, shape=(K_THRESH, P))
        # Manually create the bernoulli distributions for each component
        bernoulli_dists = [pm.Bernoulli('K_'+str(i), mu[i, :], shape=P) for i in range(K_THRESH)]
        obs = pm.Mixture('obs', w, bernoulli_dists, observed=df)

    n_samples = 100
    burn = 10
    thin = 1

    with model:
        #step1 = pm.Metropolis(vars=[alpha, beta, w, mu])
        #step2 = pm.ElemwiseCategorical([component], np.arange(K_THRESH))
        #trace_ = pm.sample(n_samples, [step1, step2], random_seed=17)
        trace_ = pm.sample(n_samples, random_seed=17)

    trace = trace_[burn::thin]

I then tried to manually specify the log-likelihood which compiles but doesn’t quite seem to do the job. The model is now the same as above with the component labels manually sampled and a DensityDist used for the custom likelihood function, and I use the 2 step Metropolis + ElemwiseCategorical sampler commented out above . However, as I posted on StackOverflow the discovered components don’t seem to model those that generated the data.

    # Draw discrete component labels using weights
    component = pm.Categorical('component', w, shape=N)

     # Manually specify log-likelihood of a mixture of bernoullis 
    def bernoulli_mixture_loglh(mus, components):
        def logp(value):
            # K = maximum number clusters
            # N = number observations (839 here) # P = number predictors (114 here) # Shape of tensors:
            #   components: K
            #   mus: K, P
            #   value (data): N, P
            mus_comp = mus[components, :]
            pos = value * tt.log(mus_comp)
            neg = (1-value) * tt.log((1-mus_comp))
            comb = pos + neg
            overall_sum = tt.sum(comb)
            return overall_sum
        return logp

    obs = pm.DensityDist('obs', bernoulli_mixture_loglh(mu, component), observed=df)

I also have 2 other questions:

  • I get a warning that the likelihood function can’t be pickled so the sampling runs single-threaded. I tried the solution presented in this post but it passes the component variables as floats rather than ints and I get an error when trying to use them to index mus.
  • How can I generate the probability that a new data instance belongs to a cluster K?


The problem is that you are providing random variables instead of a probability distribution as the mixture comp_dist. You can either get the list of distributions like this:

bernoulli_dists = [pm.Bernoulli('K_'+str(i), mu[i, :], shape=P).distribution for i in range(K_THRESH)]

Or you can change the obs line like this:

obs = pm.Mixture('obs', w, [b.distribution for b in bernoulli_dists], observed=df)

About the likelihood function not being pickleable, did you define it as a top level function? There are many python restrictions on what can and cannot be pickled, and you have to make sure that you write the function in a way that let’s it be pickled.

About the cluster membership, @junpenglao answered this elsewhere but I can’t seem to find the particular thread to link with. You can basically compute the obs.comp_dist.logp values of each row of observed, for each mixture component separately. This gives you an intuition about the component to which the observation likely belongs to.


I believe this is the link you are referring to Get probability of parameter given new data


Thanks for the link @junpenglao, that is exactly what I need.
Also the pickling issue was occurring on a Linux VM and native Windows install - when I ran it on a native Linux install it was fine.

In terms of the mixture modelling, @lucianopaz, running the mixture of the .distributions still seems to have an issue with the broadcasting of the observed values, i.e. it throws the error ValueError: Input dimension mis-match. (input[0].shape[2] = 30, input[1].shape[2] = 5)

Where 30 is the number of components K and 5 is the number of predictor variables P, i.e. df is NxP but it seems to expect NxK.

I couldn’t get this to work so I wrote the Bernoulli mixture likelihood out and use a DensityDist, and this seems to work on this small simulated dataset, but struggles when I apply it to a larger dataset (900 x 100), only finding 2 components despite standard clustering techniques finding at least 4 distinct groups.

I’m going to play around with NUTS’ parameters to see if I can improve it, but is it possible using the Mixture method would do better if I could get it working?

    def bernoulli_mixture_loglh(mus, weights, K, N):
        def logp_(value):
            logps = []
            value_neg = 1 - value
            logmus = tt.log(mus)
            neglogmus = tt.log(1-mus)

            # N*K matrix of likelihood contributions from X_i with mu_k
            betas = tt.tensordot(value, logmus, axes=[1, 1]) + tt.tensordot(value_neg, neglogmus, axes=[1, 1])

            ## Form alphas, NxK matrix with the component weights included
            alphas = (betas + tt.log(weights)).T

            # Take LSE rowise to get N vector
            # and add alpha_cluster1 to get the total likelihood across all
            # components for each X_i
            lse_clust = pm.math.logsumexp(alphas - alphas[0, :], axis=0) + alphas[0,:]

            # Final overall sum
            return tt.sum(lse_clust)
        return logp_

    obs = pm.DensityDist('obs', bernoulli_mixture_loglh(mu, w, K, N),


I’m now quite keen to get this working on Windows (where I have a machine with more cores) so have returned back to the pickling issue (no such problem on Linux). I see I have 2 options:

  1. Get my current custom logp method picklable
  2. Get the Mixture method working (as I assume this won’t have any issues pickling)

For issue 1) I’ve tried:

  • Using a single function rather than a closure for logp, i.e. def bernoulli_mixture_loglh(mus, weights, value) then passing observed=dict(mus=mu, weights=w, value=f)
  • Using a callable class instead as this should be picklable
  • Trying the above at both the highest level (before if __name__ == "__main__") and one indent in

However, I get the same exception everytime from multiprocessing:

ValueError: must use protocol 4 or greater to copy this object; 
since __getnewargs_ex__ returned keyword arguments.

In terms of 2) and getting the mixture to work, I’ve seen that I can’t pass in a list of Bernoulli.distribution since Mixture expects a list of univariate distributions from the docs.
I’m trying with just a single Bernoulli('foo', mu, shape=(K_THRESH,P)).distribution passed into Mixture but then I get the error about incompatible shapes as in my last post.

I’m using pymc 3.6 on Python 3.5.6 on Windows 10.


I can’t test this right now but I’ll try to give you some pointers to help implement the mixture.

When you give Mixture a multivariate distribution as comp_dists, the last axis is assumed to be the mixture components.

This means that you should make a (P, K) shaped binomial distribution and pass that as the comp_dists.

The problematic part is defining the Mixture's shape. If you did not give it an observed input, the distribution’s shape should be set to (P,). Then if you drew N samples from the prior, you’d end up with an (N, P) array. I would start out trying if shape=P works.

However, when you pass observed, it is used to set the RV’s shape (not the distribution’s shape). Providing the observed array that has shape (N, P) should be fine because that should broadcast well with the shape implied by the mixture’s comp_dists: (P,). The thing is that I’m not sure if theano will complain because the observed and the distribution would have different ndims. My recommendation is to try out a shape=(P,) mixture with a (P, K) shaped binomial as comp_dists.


Ok I tried that and it still complains about the shape of the inputs, it seems to be expecting a PxK matrix in obs.

I.e. I’m using:

obs = pm.Mixture('obs', w,  
                 pm.Bernoulli.dist(mu, shape=(P, K)),

And get error message ValueError: Input dimension mis-match. (input[0].shape[0] = N, input[1].shape[0] = P).
If I pass in df.T instead to observed so that the first axis is P for both then ValueError: Input dimension mis-match. (input[0].shape[1] = N, input[1].shape[1] = K), hence it expecting a PxK observed array.

I managed to get my pickling issue working under Windows by upgrading to Python 3.6 and using the callable Class method at the top-level (before if __name__ == "__main__"). It runs extremely slowly with NUTS so I’m having to use Metropolis but that’s the next issue…

class Foo():
    def __init__(self, mus, weights):
        self.mus = mus
        self.weights = weights
    def __call__(self, value):
        # Likelihood function goes here using self.mus and self.weights


Great that you managed to get the picking working!

A shame that the mixture still complains about shape. I see now why, mixture’s logp delegates the logp computation to the distribution comp_dists. That is where the shape raises errors. One last hackish attempt you could try is to reshape your observed to (N, P, 1). It’s ugly and shouldn’t be necessary. I’m currently working on a PR that involves mixture’s random method. I will try to improve the multidimensional distribution support there.


Adding an extra dimension has got it working! Thanks so much for your help!


Glad to hear it! Nevertheless, I think that adding this extra dimension is hackish and kind of a defect of the current implementation. I’ll see if something can be done about it.