I’m trying to model a mixture of multivariate Bernoullis: my dataset X is an N x P binary matrix, and I assume there are K latent groups in the data with different probabilities for each variable, i.e. each row X_i is drawn from a mixture of K multivariate Bernoullis, so there are K x P Bernoulli probability parameters \lambda.
The prior on the number of components is a Dirichlet process, using Austin Rochford’s tutorial on univariate Gaussian mixtures as a guide.
I can’t quite get it to work, however. I try to manually create a multivariate Bernoulli for each component as below, but I get the error

theano.tensor.var.AsTensorError: ('Cannot convert <bound method FreeRV.mean of K_0> to TensorType', <type 'instancemethod'>)

which is thrown on the line where I define obs.
import numpy as np
import pandas as pd
import pymc3 as pm
from matplotlib import pyplot as plt
import seaborn as sns
import theano
import sys
from theano import tensor as tt
from theano.printing import Print

if __name__ == "__main__":
    N = 100
    P = 5

    # Simulate 5 variables with 100 observations of each that fit into 3 groups
    mu_actual = np.array([[0.7, 0.8, 0.2, 0.1, 0.1],
                          [0.3, 0.5, 0.9, 0.8, 0.6],
                          [0.1, 0.2, 0.5, 0.4, 0.9]])
    cluster_ratios = [0.4, 0.3, 0.3]
    df = np.concatenate([np.random.binomial(1, mu_actual[0, :], size=(int(N*cluster_ratios[0]), P)),
                         np.random.binomial(1, mu_actual[1, :], size=(int(N*cluster_ratios[1]), P)),
                         np.random.binomial(1, mu_actual[2, :], size=(int(N*cluster_ratios[2]), P))])

    # Deterministic function for stick breaking
    def stick_breaking(beta):
        portion_remaining = tt.concatenate([[1], tt.extra_ops.cumprod(1 - beta)[:-1]])
        return beta * portion_remaining

    K_THRESH = 20
    with pm.Model() as model:
        # The DP priors to obtain w, the cluster weights
        alpha = pm.Gamma('alpha', 1., 1.)
        beta = pm.Beta('beta', 1, alpha, shape=K_THRESH)
        w = pm.Deterministic('w', stick_breaking(beta))

        mu = pm.Beta('mu', 1, 1, shape=(K_THRESH, P))

        # Manually create the Bernoulli distributions for each component
        bernoulli_dists = [pm.Bernoulli('K_' + str(i), mu[i, :], shape=P) for i in range(K_THRESH)]
        obs = pm.Mixture('obs', w, bernoulli_dists, observed=df)

    n_samples = 100
    burn = 10
    thin = 1
    with model:
        #step1 = pm.Metropolis(vars=[alpha, beta, w, mu])
        #step2 = pm.ElemwiseCategorical([component], np.arange(K_THRESH))
        #trace_ = pm.sample(n_samples, [step1, step2], random_seed=17)
        trace_ = pm.sample(n_samples, random_seed=17)
    trace = trace_[burn::thin]
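As a sanity check on the stick-breaking construction, here is a plain-NumPy mirror of the `stick_breaking` function above (my own sketch, not part of the model), showing that the weights it produces are non-negative and sum to at most 1, with the remaining mass belonging to the truncated tail:

```python
import numpy as np

def stick_breaking_np(beta):
    # Mirror of the Theano version: w_k = beta_k * prod_{j<k} (1 - beta_j)
    portion_remaining = np.concatenate([[1.0], np.cumprod(1.0 - beta)[:-1]])
    return beta * portion_remaining

rng = np.random.default_rng(17)
beta = rng.beta(1.0, 1.0, size=20)  # stand-in for the Beta draws in the model
w = stick_breaking_np(beta)

assert np.all(w >= 0)
# The telescoping identity: sum_k w_k = 1 - prod_k (1 - beta_k) <= 1
assert abs(w.sum() + np.prod(1.0 - beta) - 1.0) < 1e-9
```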
I then tried to manually specify the log-likelihood, which compiles but doesn’t quite seem to do the job. The model is the same as above, except the component labels are now sampled explicitly and a DensityDist is used for the custom likelihood, with the two-step Metropolis + ElemwiseCategorical sampler that is commented out above. However, as I posted on StackOverflow, the discovered components don’t seem to match those that generated the data.
    # Draw discrete component labels using weights
    component = pm.Categorical('component', w, shape=N)

    # Manually specify log-likelihood of a mixture of Bernoullis
    def bernoulli_mixture_loglh(mus, components):
        def logp(value):
            # Shapes of the tensors involved:
            #   components: N     (a component label for each observation)
            #   mus:        K, P  (K = max number of clusters, P = number of predictors)
            #   value:      N, P  (the observed data)
            mus_comp = mus[components, :]
            pos = value * tt.log(mus_comp)
            neg = (1 - value) * tt.log(1 - mus_comp)
            comb = pos + neg
            overall_sum = tt.sum(comb)
            return overall_sum
        return logp

    obs = pm.DensityDist('obs', bernoulli_mixture_loglh(mu, component), observed=df)
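For reference, the quantity I believe this custom logp should be targeting, if the labels were marginalised out instead of sampled, is the usual mixture log-likelihood, log p(x_i) = logsumexp_k [log w_k + sum_p (x_ip log mu_kp + (1 - x_ip) log(1 - mu_kp))]. A NumPy sketch of that marginalised version (function name is my own), checked against a brute-force evaluation of the mixture density on a tiny example:

```python
import numpy as np

def bernoulli_mixture_loglh_marginal(w, mus, X):
    """Marginalised mixture log-likelihood: sum_i log sum_k w_k p(x_i | mu_k)."""
    # (N, K) matrix of log p(x_i | component k)
    log_comp = X @ np.log(mus).T + (1 - X) @ np.log(1 - mus).T
    a = np.log(w) + log_comp          # (N, K): joint log p(x_i, z_i = k)
    m = a.max(axis=1, keepdims=True)  # stable log-sum-exp over the component axis
    return (m[:, 0] + np.log(np.exp(a - m).sum(axis=1))).sum()

# Tiny brute-force check: mixture density computed term by term
w = np.array([0.6, 0.4])
mus = np.array([[0.9, 0.1],
                [0.2, 0.8]])
X = np.array([[1, 0],
              [0, 1]])
# p([1,0]) = 0.6*0.9*0.9 + 0.4*0.2*0.2 = 0.502; p([0,1]) = 0.6*0.1*0.1 + 0.4*0.8*0.8 = 0.262
assert np.isclose(bernoulli_mixture_loglh_marginal(w, mus, X),
                  np.log(0.502) + np.log(0.262))
```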
I also have 2 other questions:

- I get a warning that the likelihood function can’t be pickled, so the sampling runs single-threaded. I tried the solution presented in this post, but it passes the component variables as floats rather than ints, and I get an error when trying to use them to index mus.
- How can I generate the probability that a new data instance belongs to a cluster k?
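On the last question, to make concrete what I’m after: I believe the quantity is the posterior responsibility r_k ∝ w_k p(x_new | mu_k), normalised over k, evaluated per posterior sample of w and mu (or at their posterior means as a rough approximation). A NumPy sketch of that calculation (function name is my own):

```python
import numpy as np

def cluster_probabilities(w, mus, x_new):
    """Responsibility of each cluster k for a new binary vector x_new:
    r_k = w_k * p(x_new | mu_k) / sum_j w_j * p(x_new | mu_j)."""
    # (K,) vector of log p(x_new | component k)
    log_comp = x_new @ np.log(mus).T + (1 - x_new) @ np.log(1 - mus).T
    a = np.log(w) + log_comp
    a -= a.max()  # subtract the max for numerical stability before exponentiating
    r = np.exp(a)
    return r / r.sum()

w = np.array([0.6, 0.4])
mus = np.array([[0.9, 0.1],
                [0.2, 0.8]])
r = cluster_probabilities(w, mus, np.array([1, 0]))
assert np.isclose(r.sum(), 1.0)
# brute force: 0.6*0.81 / (0.6*0.81 + 0.4*0.04) = 0.486 / 0.502
assert np.isclose(r[0], 0.486 / 0.502)
```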