Hello.
I’m working on a model for learning a sequence of integers, and I’ve run into a bit of a snag. I’ve pored through the docs, but I can’t find anything that seems to address this issue.
The model is as follows:
Training data comes as sequences of integers. The set of possible integers is bounded. For example’s sake, let’s say 0 through 14. The sequences have varying length.
Every sequence contains a ‘motif’ of fixed length somewhere within, but we aren’t told in our training data where the motifs start. I want to train a distribution to find the best starting positions for the motifs, based only on the sequences themselves. I also train a ‘background distribution’ that is a position- and sequence-independent distribution of these ints.
What I would like to be able to do is supply a pymc3 RV as a list index, like you can do with a category variable in Gaussian mixture models. I’m not sure this is possible, and my first pass did not work. Here’s a minimal example illustrating what I’m trying to do:
import numpy as np import pymc3 as pm import random MOTIF_LENGTH = 4 data = [] maxlength = 0 minlength = 201 #dummy data for i in range(1000): length = random.randint(5, 200) data.append([random.randint(0,15) for _ in range(length)]) maxlength = max(length, maxlength) minlength = min(length, minlength) all_data_together = [] for line in data: for item in line: all_data_together.append(item) model = pm.Model() with model: #Dirichlet priors for background and motif distributions bg_p = pm.Dirichlet('bg_p', a=np.ones(15), shape=(15)) motif_p = pm.Dirichlet('motif_p', a=np.ones(15), shape=(15)) #this is the index where the motif starts motif_start = pm.Uniform('motif_start', lower=0, upper=(minlength-MOTIF_LENGTH)) #this does not work motif_dist = pm.Categorical('motif_dist', p = motif_p, shape = MOTIF_LENGTH, observed = data[:, motif_start]) #this works bg_dist = pm.Categorical('bg_dist', p = bg_p, shape=(max(lengths)), observed=all_data_together) #sampling trace = pm.sample(1000, step=pm.Metropolis()) ppc = pm.sample_ppc(trace)
This code produces an error caused by numpy, presumably because pymc3 casts the data to a numpy array of objects first (which is why slicing doesn’t go through).
My question is this:
Is it possible in pymc3 to have an observation depend on a pymc3 RV like this at all? And if so, how should I go about doing it?
An alternative solution would be to pad the smaller sequences and then ignore the padding. Is there a way to accomplish that?