How to deal with lists as independent data

Hi Jeff!

Ragged lists (which is what I think you’re dealing with – a list of lists that aren’t all the same length) generally require special handling. You can’t represent the ensemble of objects as a single PyTensor/Aesara/Theano symbolic tensor, because tensors need a well-defined shape (what would the second dimension of a list of uneven-length lists be?)
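To make that concrete, here’s a tiny illustration with made-up numbers: the sub-lists have different lengths, so there’s no rectangular array (and hence no fixed-shape tensor) that can hold them directly.

```python
import numpy as np

# A made-up ragged list: three "observations" with different numbers of items
ragged = [[1.0, 2.0], [3.0, 4.0, 5.0], [6.0]]

# The sub-lists have different lengths...
print([len(x) for x in ragged])  # [2, 3, 1]

# ...so there is no rectangular array to build: what would its second
# dimension be? Recent numpy versions raise a ValueError here.
try:
    arr = np.array(ragged)
except ValueError as e:
    print('cannot build a rectangular array:', e)
```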

But not all hope is lost, and I can see two ways forward.

One option would be to transform your list of lists into a list of pytensor tensors, then iterate over that list and apply the per-list computation to each element. That would look something like this:

import pymc as pm
import pytensor.tensor as pt

with pm.Model() as risk_model:
    y = pm.MutableData('Outcome', df.y)
    # You can't use pm.Data for x anymore, because we end up with a list of arrays
    x_list = list(map(pt.as_tensor_variable, df.x))

    a = pm.Normal('a')

    # scan can't iterate over a Python list of uneven-length tensors, so loop
    # in Python and stack the per-list results into a single tensor
    y_hat = pt.stack([(x ** a).mean() for x in x_list])

The downside of this approach is that swapping out the data later (to make predictions, for example) becomes extremely awkward.

A second, and perhaps more elegant, approach would be to compute the length of each list, then pad the lists so that they are all the same length. This way your data becomes rectangular, and you can vectorize everything:

import numpy as np

def pad_and_stack_ragged_list(x_lists):
    '''Zero-pad each list in the ragged list x_lists (of length n) out to the
    longest length, returning a matrix of shape (n, max(map(len, x_lists)))'''
    lengths = [len(x) for x in x_lists]
    max_len = max(lengths)
    x_padded = [np.r_[x, np.zeros(max_len - n)] for x, n in zip(x_lists, lengths)]

    return np.stack(x_padded)

x_lengths = df.x.apply(len)
X_mat = pad_and_stack_ragged_list(df.x.values)
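As a quick sanity check on the helper (using made-up numbers rather than df.x), the short lists get zero-padded out to the longest length:

```python
import numpy as np

def pad_and_stack_ragged_list(x_lists):
    '''Same zero-padding helper as above.'''
    lengths = [len(x) for x in x_lists]
    max_len = max(lengths)
    x_padded = [np.r_[x, np.zeros(max_len - n)] for x, n in zip(x_lists, lengths)]
    return np.stack(x_padded)

ragged = [[1.0, 2.0], [3.0, 4.0, 5.0]]
X = pad_and_stack_ragged_list(ragged)
print(X.shape)  # (2, 3)
print(X)
# [[1. 2. 0.]
#  [3. 4. 5.]]
```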

with pm.Model() as risk_model:
    y = pm.MutableData('Outcome', df.y)
    x = pm.MutableData('Dependent', X_mat)
    lengths = pm.MutableData('x_lengths', x_lengths)

    a = pm.Normal('a')
    y_hat = (x ** a).sum(axis=-1) / lengths

The padded zeros won’t contribute* to the sum computed in y_hat, and the correct length will be used to compute the mean since we saved it before padding things out.
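You can verify that claim with plain numpy, using made-up data and a fixed positive value for a:

```python
import numpy as np

a = 0.7  # any positive exponent
rows = [np.array([1.0, 2.0]), np.array([3.0, 4.0, 5.0])]

# Zero-pad to a common length, as pad_and_stack_ragged_list does
X = np.stack([np.r_[x, np.zeros(3 - len(x))] for x in rows])
lengths = np.array([len(x) for x in rows])

# 0 ** a == 0 for a > 0, so the padding adds nothing to each row's sum,
# and dividing by the *original* lengths recovers the true means
padded_means = (X ** a).sum(axis=-1) / lengths
true_means = np.array([(x ** a).mean() for x in rows])
print(np.allclose(padded_means, true_means))  # True
```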

In this model, changing the data wouldn’t be so bad: you would just have to compute the lengths of the out-of-sample data, call the pad_and_stack_ragged_list function, then use pm.set_data.
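Concretely, that out-of-sample step might look like the sketch below. new_lists is hypothetical hold-out data, and the commented pm.set_data call assumes the variable names from the model above (plus a trace you'd have sampled earlier):

```python
import numpy as np

def pad_and_stack_ragged_list(x_lists):
    '''Same zero-padding helper as above.'''
    lengths = [len(x) for x in x_lists]
    max_len = max(lengths)
    return np.stack([np.r_[x, np.zeros(max_len - n)] for x, n in zip(x_lists, lengths)])

# Hypothetical hold-out data
new_lists = [[0.5, 1.5, 2.5], [2.0]]
new_lengths = [len(x) for x in new_lists]
X_new = pad_and_stack_ragged_list(new_lists)

# Then, inside the model context (names as in the model above):
# with risk_model:
#     pm.set_data({'Dependent': X_new, 'x_lengths': new_lengths})
#     ppc = pm.sample_posterior_predictive(trace)
```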

*Unless a == 0.00 exactly (0 ** 0 evaluates to 1, so the padded entries would each contribute 1 to the sum), which shouldn’t matter since that exact value is drawn with probability zero. It could get awkward, though, if a were distributed with support over the unit interval and the posterior ended up piled against zero (e.g. a beta or logit-normal with a very small mean/std). Note also that a negative a would turn the padded zeros into infinities (0 ** a → inf for a < 0), so this trick assumes a stays positive.