RV-Dependent Observations from Irregularly-Shaped Dataset

Hello.
I'm working on a model for learning a sequence of integers, and I've run into a bit of a snag. I've pored over the docs, but I can't find anything that addresses this issue.
The model is as follows:
Training data comes as sequences of integers drawn from a bounded set; for example's sake, let's say 0 through 14. The sequences have varying lengths.
Every sequence contains a 'motif' of fixed length somewhere within it, but the training data doesn't say where each motif starts. I want to train a distribution to find the best starting positions for the motifs, based only on the sequences themselves. I also train a 'background distribution': a position- and sequence-independent distribution over these ints.
What I would like to be able to do is supply a pymc3 RV as a list index, the way you can index component parameters with a latent category variable in a Gaussian mixture model. I'm not sure this is possible, and my first pass did not work.
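For reference, this is the pattern I mean, here in a toy 1-D mixture (made-up data and priors, just to show the RV-as-index trick):

import numpy as np
import pymc3 as pm

y = np.random.randn(100)   # toy observations
K = 3                      # number of mixture components

with pm.Model():
    w = pm.Dirichlet('w', a=np.ones(K))              # mixture weights
    mu = pm.Normal('mu', mu=0.0, sd=10.0, shape=K)   # component means
    z = pm.Categorical('z', p=w, shape=len(y))       # latent component per point
    # the latent RV z indexes into the parameter tensor mu
    obs = pm.Normal('obs', mu=mu[z], sd=1.0, observed=y)

Here's a minimal example illustrating what I'm trying to do with my own data: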

import numpy as np
import pymc3 as pm
import random

MOTIF_LENGTH = 4
data = []
maxlength = 0
minlength = 201   # one more than the largest possible length below

# dummy data: 1000 variable-length sequences of ints in [0, 14]
for i in range(1000):
    length = random.randint(5, 200)
    data.append([random.randint(0, 14) for _ in range(length)])
    maxlength = max(length, maxlength)
    minlength = min(length, minlength)

# flatten all sequences into one long list for the background distribution
all_data_together = [item for line in data for item in line]

with pm.Model() as model:
    # Dirichlet priors for the background and motif emission distributions
    # (15 possible symbols, 0 through 14)
    bg_p = pm.Dirichlet('bg_p', a=np.ones(15), shape=15)
    motif_p = pm.Dirichlet('motif_p', a=np.ones(15), shape=15)

    # the index where the motif starts; it must be an integer to serve
    # as an index, hence DiscreteUniform rather than Uniform
    motif_start = pm.DiscreteUniform('motif_start', lower=0,
                                     upper=minlength - MOTIF_LENGTH)

    # this does not work: data is a ragged list of lists, and motif_start
    # is a random variable rather than a plain integer, so this slice
    # can't be evaluated
    motif_dist = pm.Categorical('motif_dist', p=motif_p,
                                shape=MOTIF_LENGTH,
                                observed=data[:, motif_start:motif_start + MOTIF_LENGTH])
    
    # this works: the background distribution sees every symbol from
    # every sequence as one long stream
    bg_dist = pm.Categorical('bg_dist', p=bg_p,
                             shape=len(all_data_together),
                             observed=all_data_together)

    # sampling; Metropolis because the model contains discrete variables
    trace = pm.sample(1000, step=pm.Metropolis())
    ppc = pm.sample_ppc(trace)

This code produces an error from numpy, presumably because pymc3 first casts the ragged list of lists to a one-dimensional numpy array of objects, which is why the two-dimensional slice doesn't go through.
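The casting is easy to reproduce on its own (toy values, same ragged shape as my data):

import numpy as np

ragged = np.array([[1, 2, 3], [4, 5]], dtype=object)  # ragged rows
print(ragged.shape, ragged.dtype)                     # (2,) object
ragged[:, 0]                                          # IndexError: too many indices

Since the array is one-dimensional, the two-dimensional slice fails before motif_start is ever evaluated.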

My question is this:
Is it possible in pymc3 to have an observation depend on a pymc3 RV like this at all? And if so, how should I go about doing it?
An alternative solution would be to pad the shorter sequences and then ignore the padding. Is there a way to accomplish that?
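For concreteness, the kind of padding I have in mind would be something like this sketch (untested; the hand-written masked likelihood via pm.Potential is my own guess, not something from the docs):

import numpy as np
import pymc3 as pm
import theano.tensor as tt

# pad every sequence to maxlength and remember which entries are real
# (data and maxlength as defined above; 0 is also a real symbol, but the
# mask removes padded entries from the likelihood anyway)
padded = np.zeros((len(data), maxlength), dtype='int64')
mask = np.zeros((len(data), maxlength))
for i, seq in enumerate(data):
    padded[i, :len(seq)] = seq
    mask[i, :len(seq)] = 1.0

with pm.Model():
    bg_p = pm.Dirichlet('bg_p', a=np.ones(15), shape=15)
    # categorical log-likelihood written out by hand; the mask zeroes
    # out the padded positions so they contribute nothing
    loglike = tt.sum(tt.log(bg_p[padded]) * mask)
    pm.Potential('bg_loglike', loglike)

The motif term would still need the RV index on top of this, perhaps by slicing the padded array with theano's advanced indexing, but I haven't gotten that to work.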

This is a very interesting problem. I am not sure how to approach it, but I suspect there are mature algorithms in bioinformatics and genetics for these kinds of problems. Maybe sequence alignment or sequence matching?

I'm afraid sequence alignment isn't a good fit for this problem because of the end goal. I probably should have specified this beforehand: the aim is to draw from the fitted distributions and predictively generate new sequences. Sequence alignment would only give a MAP estimate for the sequence/motif pairings. That's great for what sequence alignment is used for, but it's not what we need here, unfortunately.

Maybe a higher-order hidden Markov model? It might not be trivial in PyMC3, though…

I think the difficulty there would be ensuring that the motif state always 'looks at' four sequence elements at once, i.e. the transition probability between the 'motif' and 'background' states would have to depend on the position within the motif, so the chain only leaves the motif state once it has emitted a complete motif. In principle the transition structure can be written down by unrolling the motif (see the sketch below); the hard part would be doing inference over it in PyMC3…
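Here's what I mean by unrolling, as a numpy sketch with a made-up entry probability p_enter:

import numpy as np

MOTIF_LENGTH = 4
p_enter = 0.05   # hypothetical probability of entering the motif

# states: 0 = background, 1..MOTIF_LENGTH = position inside the motif
n_states = MOTIF_LENGTH + 1
T = np.zeros((n_states, n_states))
T[0, 0] = 1.0 - p_enter            # stay in background
T[0, 1] = p_enter                  # enter the motif at position 1
for k in range(1, MOTIF_LENGTH):   # march through the motif deterministically
    T[k, k + 1] = 1.0
T[MOTIF_LENGTH, 0] = 1.0           # after the last position, back to background

assert np.allclose(T.sum(axis=1), 1.0)   # every row is a valid distribution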

Is the motif known, or do you want to model it as a latent variable as well?

The motif is not known, so its distribution would be latent, I suppose.

I cannot think of a good solution, but as food for thought: you could try embedding each integer sequence into a 2D image (each sequence becomes a 15 * sequence-length one-hot image) and running an unsupervised recurrent neural net on it. You could use Lasagne to build the neural net and gelato to put priors on the weights for Bayesian inference.
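The embedding step itself would just be one-hot encoding, something like this small sketch (15 rows, one per symbol 0 through 14):

import numpy as np

def to_image(seq, n_symbols=15):
    # one column per position, with a 1 in the row of that position's symbol
    img = np.zeros((n_symbols, len(seq)))
    img[np.asarray(seq), np.arange(len(seq))] = 1.0
    return img

print(to_image([3, 0, 14]).shape)   # (15, 3)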

Yeah, I was afraid this might not be tractable the way I want it to be. I've got an ugly workaround, so that'll do for now. Thanks for humoring me.