How to use theano to create a custom log-likelihood for an array of arrays

Dear all,

I am building a mixture model as explained here http://www.datalab.uci.edu/papers/webcanvas.pdf to cluster sequences of states.

The probability of the data d_{train} is expressed as:

p(d_{train} \mid \theta) = \prod_{i=1}^{N} p(x^i \mid \theta) = \prod_{i=1}^{N} \sum_{k=1}^{K} \pi_k \, \theta^I_{k, x_1^i} \prod_{t=2}^{L_i} \theta^T_{k, x_{t-1}^i, x_t^i}

Where N is the number of sequences,
\pi_k is the probability of mixture component k (cluster assignments are drawn from a multinomial over the K components),
\theta^I is a set of K initial-state probability vectors (K vectors of length M),
\theta^T is a set of K state-transition matrices (each M x M), and
L_i is the length of sequence i.
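
To make the formula concrete, here is a minimal numpy sketch of p(x^i | \theta) for a single sequence; the parameter values below are hypothetical placeholders, just to illustrate the indexing.

import numpy as np

K, M = 3, 3                                # num_clusters, num_values
pi = np.full(K, 1.0 / K)                   # mixture weights pi_k
theta_I = np.full((K, M), 1.0 / M)         # K initial-state probability vectors
theta_T = np.full((K, M, M), 1.0 / M)      # K transition matrices (M x M)

x = [2, 2, 0, 1, 2]                        # one observed sequence x^i

p_x = 0.0
for k in range(K):
    p_k = pi[k] * theta_I[k, x[0]]         # pi_k * theta^I_{k, x_1}
    for t in range(1, len(x)):
        p_k *= theta_T[k, x[t - 1], x[t]]  # theta^T_{k, x_{t-1}, x_t}
    p_x += p_k                             # marginalise over the mixture components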

So far, I have created a model as follows:

with pm.Model() as model:
    # with a Dirichlet prior on the multinomial weights, the posterior is again a Dirichlet
    pi = pm.Dirichlet('pi', a=pm.floatX((1.0 / num_clusters) * np.ones(num_clusters)),
                      shape=num_clusters, transform=t_stick_breaking(1e-9))
    theta_i = pm.Dirichlet('theta_i', a=pm.floatX((1.0 / num_values) * np.ones((num_clusters, num_values))),
                           shape=(num_clusters, num_values), transform=t_stick_breaking(1e-9))
    ## can we have a distribution with 3 dimensions?
    theta_t = [pm.Dirichlet('theta_t_%d' % idx,
                            a=pm.floatX((1.0 / num_values) * np.ones((num_values, num_values))),
                            shape=(num_values, num_values), transform=t_stick_breaking(1e-9))
               for idx in range(num_clusters)]

    # need to define log_data
    obs = pm.DensityDist('obs', log_data(pi, theta_i, theta_t), observed=data)

I would like to create the log_data function from the parameters pi, \theta^I, and \theta^T, but I am a bit confused about how to do so. In particular, I don't know how to treat each observed array of my sequences inside log_data, and how to use theano operators to build the likelihood equation above.

The log likelihood would look like:

def log_data(pi, theta_init, theta_trans):
    def log_data_(x):
        # code to be completed here...
        pass

    return log_data_

An example of the data could be:

 data = [[2, 2, 0, 1, 2], [1, 1, 0, 1, 0], [2, 2, 0, 1, 2], [1, 1, 2, 0, 1], [1, 1, 2, 2, 2], [2, 2, 0, 0, 1], [1, 1, 2, 1, 2], [2, 2, 0, 1, 2], [2, 2, 2, 2, 0], [2, 2, 0, 1, 0]]

with

num_clusters = 3
num_values = 3

Also, do you know if theta_t could have shape=(num_clusters, num_values, num_values), instead of creating an array of matrices?

I am a bit confused about the l in the formula (\theta^I_l, for example). Maybe you can try writing it in python/numpy first and we can see how to make it into a theano function.

As for the theta part, if you are careful about the shape of the RV and the shape of the parameter it should work; otherwise you can always do tt.stack(theta_t), e.g.:
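
(A minimal sketch, assuming theta_t is the list of per-cluster Dirichlet matrices from your model above.)

import theano.tensor as tt

# stack the K (num_values x num_values) matrices along a new leading axis, so the
# result can be indexed as theta_t_stacked[k, i, j] inside the likelihood
theta_t_stacked = tt.stack(theta_t)  # shape: (num_clusters, num_values, num_values)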

Hi @junpenglao, thanks for the response.

Indeed there was an error in the equation. The term should be \theta_k and not \theta_l. I have modified the equation above.

I decided to go for the option of using a three-dimensional theta_trans, hoping that I will be able to manage it accordingly:

theta_t =  pm.Dirichlet('theta_t', a=pm.floatX((1.0 / num_values) * np.ones((num_clusters,num_values, num_values))),
                 shape=(num_clusters,num_values, num_values), transform=t_stick_breaking(1e-9))

Below is my version of the log likelihood of the data using regular numpy/python code. I am computing the log likelihood of the full observed data using the log sum exp trick:

\log p(d_{train} \mid \theta) = \sum_{n=1}^{N} \log p(x^n \mid \theta) = \sum_{n=1}^{N} \log \sum_{k=1}^{K} \exp\left[ \log \pi_k + \log \theta^I_{k, x_1^n} + \sum_{t=2}^{L_n} \log \theta^T_{k, x_{t-1}^n, x_t^n} \right]

from scipy.misc import logsumexp
import numpy as np

l_data = []
for didx in range(len(data)):
    l_obs = []
    for pidx in range(len(pi)):
        l_obs.append(np.log(pi[pidx]) + np.log(theta_init[pidx][data[didx][0]]) +
                     np.sum(np.log(theta_trans[pidx, data[didx][:-1], data[didx][1:]])))
    l_data.append(logsumexp(l_obs))
np.sum(l_data)
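
(For completeness: to test this loop outside of PyMC3, the parameters can be drawn with numpy's Dirichlet sampler; the setup below is a hypothetical example.)

import numpy as np

num_clusters, num_values = 3, 3
pi = np.random.dirichlet(np.ones(num_clusters))                                       # mixture weights, shape (K,)
theta_init = np.random.dirichlet(np.ones(num_values), size=num_clusters)              # shape (K, M)
theta_trans = np.random.dirichlet(np.ones(num_values), size=(num_clusters, num_values))  # shape (K, M, M)
data = np.array([[2, 2, 0, 1, 2], [1, 1, 0, 1, 0], [2, 2, 0, 1, 2]])                  # a few observed sequences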

How would you write this in theano code? Please, let me know if you have any questions about the computation.

I was able to write a numpy implementation of the above code as follows:

l_obs = logsumexp(np.log(pi) * np.ones((10, 1)) +
                  np.column_stack(np.log(theta_init[:, data[:, 0]])) +
                  np.sum(np.log(theta_trans[:, data[:, :-1], data[:, 1:]]), axis=2).T, axis=1)
np.sum(l_obs)

And a theano implementation as:

import theano
import theano.tensor as tt
import pymc3.math as pmmath

theano.config.compute_test_value = "ignore"
theano.config.on_unused_input = 'ignore'

tdata = tt.imatrix('tdata')
ttheta_i = tt.fmatrix('ttheta_i')
ttheta_t = tt.tensor3('ttheta_t')
tpi = tt.fvector('tpi')

out = tt.sum(pmmath.logsumexp(tt.log(tpi) * tt.ones((10, 1)) +
                              tt.log(ttheta_i[:, tdata[:, 0]]).T +
                              tt.sum(tt.log(ttheta_t[:, tdata[:, :-1], tdata[:, 1:]]), axis=2).T,
                              axis=1))
f = theano.function([tpi, ttheta_i, ttheta_t, tdata], out)
f(pi.astype("float32"), theta_i.astype("float32"), theta_t.astype("float32"), data.astype("int32"))

and I verified that it gives the correct value for given values of theta_t, theta_i, pi, and data. So far so good. Note that I converted data to a matrix:

 data = np.matrix(data)

So far, my log_data looks like:

def log_data(pi, theta_init, theta_trans):
    def log_data_(docs):
        t1 = tt.log(pi) * tt.ones((10, 1))
        t2 = tt.log(theta_init[:, docs[:, 0]]).T
        t3 = tt.sum(tt.log(theta_trans[:, docs[:, :-1], docs[:, 1:]]), axis=2).T
        return tt.sum(pmmath.logsumexp(t1 + t2 + t3, axis=1))

    return log_data_

However, I still have some issues:

  1. The theano computation in my previous message returns an array containing a single value rather than a scalar. How should I deal with that?

  2. I would like to get the 10 in (10, 1) from the dimensions of the data. How can I do that?

  3. Last, and most important, once I compile the model with the given log-likelihood log_data, I get the following error:


TypeError                                 Traceback (most recent call last)
<ipython-input> in <module>()
     15                  shape=(num_clusters, num_values, num_values), transform=t_stick_breaking(1e-9))
     16 # how do we aggregate the likelihood of all the data
---> 17 obs = pm.DensityDist('obs', log_data(pi, theta_i, theta_t), observed=data)
     18
     19 #start = pm.find_MAP(fmin=optimize.fmin_powell)

/Users/bieljoanisaac/Library/Python/2.7/lib/python/site-packages/pymc3/distributions/distribution.pyc in __new__(cls, name, *args, **kwargs)
     35             total_size = kwargs.pop('total_size', None)
     36             dist = cls.dist(*args, **kwargs)
---> 37             return model.Var(name, dist, data, total_size)
     38         else:
     39             raise TypeError("Name needs to be a string but got: {}".format(name))

/Users/bieljoanisaac/Library/Python/2.7/lib/python/site-packages/pymc3/model.pyc in Var(self, name, dist, data, total_size)
    830             var = ObservedRV(name=name, data=data,
    831                              distribution=dist,
--> 832                              total_size=total_size, model=self)
    833             self.observed_RVs.append(var)
    834             if var.missing_values:

/Users/bieljoanisaac/Library/Python/2.7/lib/python/site-packages/pymc3/model.pyc in __init__(self, type, owner, index, name, data, distribution, total_size, model)
   1286
   1287         self.missing_values = data.missing_values
-> 1288         self.logp_elemwiset = distribution.logp(data)
   1289         # The logp might need scaling in minibatches.
   1290         # This is done in Factor.

<ipython-input> in log_data_(docs)
      2     def log_data_(docs):
      3         t1 = tt.log(pi)*tt.ones((10,1))
----> 4         t2 = tt.log(theta_init[:,docs[:,0]]).T
      5         t3 = tt.sum(theta_trans[:,docs[:,:-1],docs[:,1:]],axis=2).T
      6         return tt.sum(pmmath.logsumexp(t1+t2+t3,axis=1))

/Users/bieljoanisaac/Library/Python/2.7/lib/python/site-packages/theano/tensor/var.pyc in __getitem__(self, args)
    568                              TensorVariable, TensorConstant,
    569                              theano.tensor.sharedvar.TensorSharedVariable))):
--> 570                 return self.take(args[axis], axis)
    571             else:
    572                 return theano.tensor.subtensor.advanced_subtensor(self, *args)

/Users/bieljoanisaac/Library/Python/2.7/lib/python/site-packages/theano/tensor/var.pyc in take(self, indices, axis, mode)
    612
    613     def take(self, indices, axis=None, mode='raise'):
--> 614         return theano.tensor.subtensor.take(self, indices, axis, mode)
    615
    616     # COPYING

/Users/bieljoanisaac/Library/Python/2.7/lib/python/site-packages/theano/tensor/subtensor.pyc in take(a, indices, axis, mode)
   2438         shuffle[axis] = 0
   2439         return advanced_subtensor1(
-> 2440             a.dimshuffle(shuffle), indices).dimshuffle(shuffle)
   2441     if axis is None:
   2442         shape = indices.shape

/Users/bieljoanisaac/Library/Python/2.7/lib/python/site-packages/theano/gof/op.pyc in __call__(self, *inputs, **kwargs)
    613         """
    614         return_list = kwargs.pop('return_list', False)
--> 615         node = self.make_node(*inputs, **kwargs)
    616
    617         if config.compute_test_value != 'off':

/Users/bieljoanisaac/Library/Python/2.7/lib/python/site-packages/theano/tensor/subtensor.pyc in make_node(self, x, ilist)
   1701         ilist_ = theano.tensor.as_tensor_variable(ilist)
   1702         if ilist_.type.dtype not in theano.tensor.integer_dtypes:
-> 1703             raise TypeError('index must be integers')
   1704         if ilist_.type.ndim != 1:
   1705             raise TypeError('index must be vector')

TypeError: index must be integers

Any idea of what may be happening?


Would you mind putting your data and code in a Gist notebook? I will try to have a look tomorrow.

@junpenglao I put my code here: https://github.com/jibietr/pymc_examples/blob/master/pymc_first_order_markov_mixture_model.ipynb

When I run the model, it looks like data is being converted to TensorType(float64, matrix) when it should be an integer type.

So I finally decided to create a new distribution, MarkovMixtureModel, based on pm.Categorical. This makes the observed data int instead of float. A sketch of the idea is below.
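
Roughly along these lines (a hedged sketch only; the actual class in the notebook may differ, and pmmath here refers to pymc3.math):

import pymc3 as pm
import pymc3.math as pmmath
import theano.tensor as tt

class MarkovMixtureModel(pm.Discrete):
    """Custom discrete distribution: mixture of first-order Markov chains."""

    def __init__(self, pi, theta_init, theta_trans, *args, **kwargs):
        super(MarkovMixtureModel, self).__init__(*args, **kwargs)
        self.pi = pi
        self.theta_init = theta_init
        self.theta_trans = theta_trans

    def logp(self, docs):
        # docs: integer matrix, one sequence of states per row
        t1 = tt.log(self.pi) * tt.ones((docs.shape[0], 1))
        t2 = tt.log(self.theta_init[:, docs[:, 0]]).T
        t3 = tt.sum(tt.log(self.theta_trans[:, docs[:, :-1], docs[:, 1:]]), axis=2).T
        return tt.sum(pmmath.logsumexp(t1 + t2 + t3, axis=1))

# usage inside the model block (shape and dtype are taken from the observed data):
# obs = MarkovMixtureModel('obs', pi, theta_i, theta_t, observed=data)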

The notebook is here: https://github.com/jibietr/pymc_examples/blob/master/pymc_first_order_markov_mixture_model.ipynb

I am very happy I managed to get this far!
However, as you can see in the gist notebook, I need to find a way to make the model converge.
@junpenglao, do you have any recommendations on which MCMC algorithm and which initialisation would be most appropriate?

Oh cool! Alternatively, you can use a Potential, which is sometimes also quite convenient when you are working with a custom logp function (see the last cell in https://github.com/junpenglao/Planet_Sakaar_Data_Science/blob/master/PyMC3QnA/discourse_1524.ipynb). For example:
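
(A minimal sketch of that idea, reusing the log_data closure from above; the linked notebook may do it differently.)

import numpy as np
import theano
import pymc3 as pm

# keep the observed sequences as an integer shared variable so the indexing stays int
data_shared = theano.shared(np.asarray(data, dtype="int32"))

with model:
    pm.Potential('obs_logp', log_data(pi, theta_i, theta_t)(data_shared))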

As for model convergence, I would first try more tuning:

trace = pm.sample(1000, tune=5000, cores=4) # multichain is very important

Then the next step is to use a more informative prior. Looking at the trace, the samples are very concentrated for some parameters (either 0 or 1 in the trace).
Also, I would try a softmax model; it sometimes works better than the Dirichlet. A sketch of that parameterisation is below.
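
(A sketch of what I mean by a softmax model: unconstrained Normal priors on logits, mapped to the simplex with a softmax. The variable names are just illustrative.)

import pymc3 as pm
import theano.tensor as tt

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = tt.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

with pm.Model() as softmax_model:
    pi_raw = pm.Normal('pi_raw', 0., 1., shape=num_clusters)
    theta_i_raw = pm.Normal('theta_i_raw', 0., 1., shape=(num_clusters, num_values))
    theta_t_raw = pm.Normal('theta_t_raw', 0., 1.,
                            shape=(num_clusters, num_values, num_values))

    pi = pm.Deterministic('pi', softmax(pi_raw))
    theta_i = pm.Deterministic('theta_i', softmax(theta_i_raw, axis=1))
    theta_t = pm.Deterministic('theta_t', softmax(theta_t_raw, axis=2))
    # ...then plug pi, theta_i, theta_t into the custom logp / Potential as before

Note that the softmax is invariant to adding a constant to the logits, so in practice one logit per row is often pinned to zero to keep the parameters identified.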

Hi @junpenglao, I was looking at your example, and I saw that you changed

data = np.matrix(data)

for

data = np.asarray(data)

Is there any specific reason for that?

No particular reason, but I don't use np.matrix very often, so I changed it to a numpy array so that I am more familiar with the behavior.

OK. I thought there might have been a performance reason.

I am having some challenges getting the inference to work. I am using your version with the potential.
After running

with model:
    start = pm.find_MAP()
    trace = pm.sample(1000, start=start, njobs=1)

I get the error:


ValueError                                Traceback (most recent call last)
<ipython-input> in <module>()
      2 start = pm.find_MAP()
      3 #trace = pm.sample(1000, tune=5000, cores=4)
----> 4 trace = pm.sample(1000, start=start, njobs=1)
      5
      6 #start

/Users/bieljoanisaac/Library/Python/2.7/lib/python/site-packages/pymc3/sampling.pyc in sample(draws, step, init, n_init, start, trace, chain_idx, chains, cores, tune, nuts_kwargs, step_kwargs, progressbar, model, random_seed, live_plot, discard_tuned_samples, live_plot_kwargs, compute_convergence_checks, use_mmap, **kwargs)
    460         _log.info('Sequential sampling ({} chains in 1 job)'.format(chains))
    461         _print_step_hierarchy(step)
--> 462         trace = _sample_many(**sample_args)
    463
    464     discard = tune if discard_tuned_samples else 0

logp = 13,046, ||grad|| = 8.8755: 100% 16/16 [00:00<00:00, 56.69it/s]
Auto-assigning NUTS sampler...
INFO:pymc3:Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
INFO:pymc3:Initializing NUTS using jitter+adapt_diag...
Sequential sampling (2 chains in 1 job)
INFO:pymc3:Sequential sampling (2 chains in 1 job)
NUTS: [theta_t_stickbreaking__, theta_i_stickbreaking__, pi_stickbreaking__]
INFO:pymc3:NUTS: [theta_t_stickbreaking__, theta_i_stickbreaking__, pi_stickbreaking__]
  0%|          | 0/1500 [00:00<?, ?it/s]

ValueError                                Traceback (most recent call last)
<ipython-input> in <module>()
      2 start = pm.find_MAP()
      3 #trace = pm.sample(1000, tune=5000, cores=4)
----> 4 trace = pm.sample(1000, start=start, njobs=1)
      5
      6 #start

    115             self.potential.raise_ok()
    116             raise ValueError('Bad initial energy: %s. The model '
--> 117                              'might be misspecified.' % start.energy)
    118
    119         adapt_step = self.tune and self.adapt_step_size

ValueError: Bad initial energy: inf. The model might be misspecified.

This only happens if I initialize with MAP. Any idea what is happening?

Yeah, don't initialize with MAP; that's usually a bad idea. I was doing it just to check the model.
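
For example, something like this (a sketch; check_test_point() prints the log-probability of each term at the default starting point, which helps locate the source of the inf):

with model:
    # inspect the logp of every variable at the model's test point
    print(model.check_test_point())
    # let pm.sample use its own initialisation (jitter+adapt_diag) instead of find_MAP
    trace = pm.sample(1000, tune=5000, cores=4)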