RV-Dependent Observations from Irregularly-Shaped Dataset

Hello.
I'm working on a model for learning a sequence of integers, and I've run into a bit of a snag. I've pored over the docs, but I can't find anything that addresses this issue.
The model is as follows:
Training data comes as sequences of integers drawn from a bounded set; for example's sake, let's say 0 through 14. The sequences have varying lengths.
Every sequence contains a 'motif' of fixed length somewhere within it, but the training data doesn't say where each motif starts. I want to train a distribution to find the best starting positions for the motifs, based only on the sequences themselves. I also train a 'background distribution': a position- and sequence-independent distribution over these ints.
What I would like to be able to do is supply a pymc3 RV as a list index, the way you can index component parameters with a latent category variable in a Gaussian mixture model. I'm not sure this is possible, and my first pass did not work.
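For reference, this is the pattern I mean, here in a toy 1-D mixture (made-up data and priors, just to show the RV-as-index trick):

import numpy as np
import pymc3 as pm

y = np.random.randn(100)   # toy observations
K = 3                      # number of mixture components

with pm.Model():
    w = pm.Dirichlet('w', a=np.ones(K))              # mixture weights
    mu = pm.Normal('mu', mu=0.0, sd=10.0, shape=K)   # component means
    z = pm.Categorical('z', p=w, shape=len(y))       # latent component per point
    # the latent RV z indexes into the parameter tensor mu
    obs = pm.Normal('obs', mu=mu[z], sd=1.0, observed=y)

Here's a minimal example illustrating what I'm trying to do with my own data: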

import numpy as np
import pymc3 as pm
import random

MOTIF_LENGTH = 4
data = []
maxlength = 0
minlength = 201   # one more than the largest possible length below

# dummy data: 1000 variable-length sequences of ints in [0, 14]
for i in range(1000):
    length = random.randint(5, 200)
    data.append([random.randint(0, 14) for _ in range(length)])
    maxlength = max(length, maxlength)
    minlength = min(length, minlength)

# flatten all sequences into one long list for the background distribution
all_data_together = [item for line in data for item in line]

with pm.Model() as model:
    # Dirichlet priors for the background and motif emission distributions
    # (15 possible symbols, 0 through 14)
    bg_p = pm.Dirichlet('bg_p', a=np.ones(15), shape=15)
    motif_p = pm.Dirichlet('motif_p', a=np.ones(15), shape=15)

    # the index where the motif starts; it must be an integer to serve
    # as an index, hence DiscreteUniform rather than Uniform
    motif_start = pm.DiscreteUniform('motif_start', lower=0,
                                     upper=minlength - MOTIF_LENGTH)

    # this does not work: data is a ragged list of lists, and motif_start
    # is a random variable rather than a plain integer, so this slice
    # can't be evaluated
    motif_dist = pm.Categorical('motif_dist', p=motif_p,
                                shape=MOTIF_LENGTH,
                                observed=data[:, motif_start:motif_start + MOTIF_LENGTH])
    
    # this works: the background distribution sees every symbol from
    # every sequence as one long stream
    bg_dist = pm.Categorical('bg_dist', p=bg_p,
                             shape=len(all_data_together),
                             observed=all_data_together)

    # sampling; Metropolis because the model contains discrete variables
    trace = pm.sample(1000, step=pm.Metropolis())
    ppc = pm.sample_ppc(trace)

This code produces an error from numpy, presumably because pymc3 first casts the ragged list of lists to a one-dimensional numpy array of objects, which is why the two-dimensional slice doesn't go through.
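The casting is easy to reproduce on its own (toy values, same ragged shape as my data):

import numpy as np

ragged = np.array([[1, 2, 3], [4, 5]], dtype=object)  # ragged rows
print(ragged.shape, ragged.dtype)                     # (2,) object
ragged[:, 0]                                          # IndexError: too many indices

Since the array is one-dimensional, the two-dimensional slice fails before motif_start is ever evaluated.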

My question is this:
Is it possible in pymc3 to have an observation depend on a pymc3 RV like this at all? And if so, how should I go about doing it?
An alternative solution would be to pad the shorter sequences and then ignore the padding. Is there a way to accomplish that?
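For concreteness, the kind of padding I have in mind would be something like this sketch (untested; the hand-written masked likelihood via pm.Potential is my own guess, not something from the docs):

import numpy as np
import pymc3 as pm
import theano.tensor as tt

# pad every sequence to maxlength and remember which entries are real
# (data and maxlength as defined above; 0 is also a real symbol, but the
# mask removes padded entries from the likelihood anyway)
padded = np.zeros((len(data), maxlength), dtype='int64')
mask = np.zeros((len(data), maxlength))
for i, seq in enumerate(data):
    padded[i, :len(seq)] = seq
    mask[i, :len(seq)] = 1.0

with pm.Model():
    bg_p = pm.Dirichlet('bg_p', a=np.ones(15), shape=15)
    # categorical log-likelihood written out by hand; the mask zeroes
    # out the padded positions so they contribute nothing
    loglike = tt.sum(tt.log(bg_p[padded]) * mask)
    pm.Potential('bg_loglike', loglike)

The motif term would still need the RV index on top of this, perhaps by slicing the padded array with theano's advanced indexing, but I haven't gotten that to work.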

This is a very interesting problem. I am not sure how to approach it, but I suspect there are mature algorithms in bioinformatics and genetics for these kinds of problems. Maybe sequence alignment or sequence matching?

I'm afraid sequence alignment isn't a good fit for this problem because of the end goal. I probably should have specified this beforehand: the aim is to draw from the fitted distributions and predictively generate new sequences. Sequence alignment would only give a MAP estimate for the sequence/motif pairings. That's great for what sequence alignment is used for, but it's not what we need here, unfortunately.

Maybe a higher-order hidden Markov model? It might not be trivial in PyMC3, though…

I think the difficulty there would be ensuring that the motif state always 'looks at' four sequence elements at once, i.e. the transition probability between the 'motif' and 'background' states would have to depend on the position within the motif, so the chain only leaves the motif state once it has emitted a complete motif. In principle the transition structure can be written down by unrolling the motif (see the sketch below); the hard part would be doing inference over it in PyMC3…
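Here's what I mean by unrolling, as a numpy sketch with a made-up entry probability p_enter:

import numpy as np

MOTIF_LENGTH = 4
p_enter = 0.05   # hypothetical probability of entering the motif

# states: 0 = background, 1..MOTIF_LENGTH = position inside the motif
n_states = MOTIF_LENGTH + 1
T = np.zeros((n_states, n_states))
T[0, 0] = 1.0 - p_enter            # stay in background
T[0, 1] = p_enter                  # enter the motif at position 1
for k in range(1, MOTIF_LENGTH):   # march through the motif deterministically
    T[k, k + 1] = 1.0
T[MOTIF_LENGTH, 0] = 1.0           # after the last position, back to background

assert np.allclose(T.sum(axis=1), 1.0)   # every row is a valid distribution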

Is the motif known, or do you want to model it as a latent variable as well?

The motif is not known, so its distribution would be latent, I suppose.

I cannot think of a good solution, but as food for thought: you could try embedding each integer sequence into a 2D image (each sequence becomes a 15 * sequence-length one-hot image) and running an unsupervised recurrent neural net on it. You could use Lasagne to build the neural net and gelato to put priors on the weights for Bayesian inference.
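The embedding step itself would just be one-hot encoding, something like this small sketch (15 rows, one per symbol 0 through 14):

import numpy as np

def to_image(seq, n_symbols=15):
    # one column per position, with a 1 in the row of that position's symbol
    img = np.zeros((n_symbols, len(seq)))
    img[np.asarray(seq), np.arange(len(seq))] = 1.0
    return img

print(to_image([3, 0, 14]).shape)   # (15, 3)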

Yeah, I was afraid this might not be tractable the way I want it to be. I've got an ugly workaround, so that'll do for now. Thanks for humoring me.