Dear all,
I am building a mixture model as explained here http://www.datalab.uci.edu/papers/webcanvas.pdf to cluster sequences of states.
The probability of the data d_{train} is expressed as:
p(d_{train} | \theta) = \prod_{i=1}^{N} p(x^i | \theta) = \prod_{i=1}^{N} \sum_{k=1}^{K} \pi_k \theta^I_{k,x_1^i} \prod_{t=2}^{L_i} \theta^T_{k,x_{t-1}^i,x_t^i}
where N is the number of sequences,
\pi_k are the mixture weights, i.e. the parameters of the multinomial from which cluster assignments are drawn (a vector of length K),
\theta^I is a set of K initial-state probability vectors (K vectors of length M),
\theta^T is a set of K transition matrices between states (each M x M),
L_i is the length of sequence i, and x_t^i is the state at position t of sequence i.
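Taking logs gives the quantity that log_data would need to return. This is only a restatement of the equation above in log space, assuming every probability involved is strictly positive; the inner sum over k is written as a log-sum-exp so that it can be evaluated stably:

```latex
\log p(d_{train} \mid \theta)
  = \sum_{i=1}^{N} \log \sum_{k=1}^{K}
    \exp\Big( \log \pi_k
            + \log \theta^I_{k,x_1^i}
            + \sum_{t=2}^{L_i} \log \theta^T_{k,x_{t-1}^i,x_t^i} \Big)
```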
So far, I have created a model as follows:
import numpy as np
import pymc3 as pm
# assumed import path for t_stick_breaking (PyMC3 3.x)
from pymc3.distributions.transforms import t_stick_breaking

with pm.Model() as model:
    # the posterior of a multinomial with a Dirichlet prior is another Dirichlet
    pi = pm.Dirichlet('pi', a=pm.floatX((1.0 / num_clusters) * np.ones(num_clusters)),
                      shape=(num_clusters,), transform=t_stick_breaking(1e-9))
    theta_i = pm.Dirichlet('theta_i', a=pm.floatX((1.0 / num_values) * np.ones((num_clusters, num_values))),
                           shape=(num_clusters, num_values), transform=t_stick_breaking(1e-9))
    # can we have a distribution with 3 dimensions?
    theta_t = [pm.Dirichlet('theta_t_%d' % idx,
                            a=pm.floatX((1.0 / num_values) * np.ones((num_values, num_values))),
                            shape=(num_values, num_values), transform=t_stick_breaking(1e-9))
               for idx in range(num_clusters)]
    # need to define log_data
    obs = pm.DensityDist('obs', log_data(pi, theta_i, theta_t), observed=data)
I would like to write a log_data function of the parameters pi, \theta^I, and \theta^T, but I am unsure how to do so. In particular, I don't know how to handle each observed sequence inside log_data, or how to use Theano operators to express the likelihood equation above.
The log-likelihood function would look like:
def log_data(pi, theta_init, theta_trans):
    def log_data_(x):
        # code to be completed here....
    return log_data_
An example of the data could be:
data = [[2, 2, 0, 1, 2], [1, 1, 0, 1, 0], [2, 2, 0, 1, 2], [1, 1, 2, 0, 1], [1, 1, 2, 2, 2], [2, 2, 0, 0, 1], [1, 1, 2, 1, 2], [2, 2, 0, 1, 2], [2, 2, 2, 2, 0], [2, 2, 0, 1, 0]]
with
num_clusters = 3
num_values = 3
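For reference, here is a plain NumPy sketch of the value I would expect log_data_ to return for fixed parameter values. It is not the Theano version I am asking about. It assumes all sequences have the same length (as in the example data), that the transition parameters are held in a single (K, M, M) array indexed as [k, previous state, next state], and that all probabilities are strictly positive. The names log_sum_exp and mixture_markov_loglik are made up for this illustration.

```python
import numpy as np


def log_sum_exp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    a_max = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(a_max, axis=axis) + np.log(np.sum(np.exp(a - a_max), axis=axis))


def mixture_markov_loglik(data, pi, theta_init, theta_trans):
    """Log-likelihood of equal-length sequences under a mixture of Markov chains.

    data:        (N, L) integer array of states
    pi:          (K,) mixture weights
    theta_init:  (K, M) initial-state probabilities
    theta_trans: (K, M, M) transition probabilities, indexed [k, prev, next]
    """
    data = np.asarray(data)
    # log theta^I_{k, x_1^i} for every cluster k and sequence i -> (K, N)
    log_init = np.log(theta_init)[:, data[:, 0]]
    # consecutive state pairs (x_{t-1}, x_t) for every sequence -> (N, L-1) each
    prev, nxt = data[:, :-1], data[:, 1:]
    # sum over t of log theta^T_{k, x_{t-1}, x_t} -> (K, N)
    log_trans = np.log(theta_trans)[:, prev, nxt].sum(axis=-1)
    # log of pi_k * initial term * transition terms -> (K, N)
    per_cluster = np.log(pi)[:, None] + log_init + log_trans
    # log-sum-exp over clusters, then sum over sequences
    return log_sum_exp(per_cluster, axis=0).sum()


# Small check with uniform parameters (K = M = 3) on three of the sequences.
data = np.array([[2, 2, 0, 1, 2], [1, 1, 0, 1, 0], [2, 2, 0, 1, 2]])
K, M = 3, 3
pi = np.full(K, 1.0 / K)
theta_init = np.full((K, M), 1.0 / M)
theta_trans = np.full((K, M, M), 1.0 / M)

ll = mixture_markov_loglik(data, pi, theta_init, theta_trans)
# With uniform parameters every length-5 sequence has probability (1/3)**5,
# so ll should equal 3 * 5 * log(1/3).
print(ll)
```

Sequences of different lengths would need padding plus a mask, or a loop over sequences, which this sketch does not attempt.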
Also, do you know if theta_t could have shape=(num_clusters, num_values, num_values), instead of being a list of matrices?
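To make the layout I have in mind concrete, here is a small NumPy illustration of stacking K row-stochastic matrices into one (K, M, M) array. It only shows the intended indexing; it says nothing about whether pm.Dirichlet accepts a 3-dimensional shape, which is exactly what I am unsure about. The random matrices are placeholders, not model output.

```python
import numpy as np

num_clusters, num_values = 3, 3
rng = np.random.default_rng(0)

# One (M, M) matrix per cluster, each row a probability vector,
# mirroring the list-of-matrices version of theta_t.
theta_t_list = [rng.dirichlet(np.ones(num_values), size=num_values)
                for _ in range(num_clusters)]

# Stack into a single array of shape (K, M, M).
theta_t_3d = np.stack(theta_t_list, axis=0)

# theta_t_3d[k, a, b] is the probability of moving from state a to state b
# in cluster k, i.e. the same entry as theta_t_list[k][a, b].
print(theta_t_3d.shape)
```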