How to deal with lists as independent data

Hello,
In my model I am trying to input a list as an independent variable. I think the dependent variable should be, for example, the mean of every item in the list raised to some power that I'm trying to estimate.

For example, the forward model might look like:

import numpy as np

a = [0.5, 0.2, 0.1]
x = [[0, .1, .1, .2, .2, .5, .2, .8, .7, .5, .6, 1],
     [.1, .2, .1, .1, .1],
     [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 5, 5, .1]]
y = []

for i in range(len(x)):
    y.append(np.mean(np.array(x[i]) ** a[i]))

These are stored in a pandas DataFrame like so:

| a  | x              | y  |
|----|----------------|----|
| .5 | [0, .1, .1, …] | 12 |

While x and y are observed, I’m trying to estimate a. However, if I try to use

with pm.Model() as risk_model:
    # Data
    y = pm.Data("outcome", df.y)
    x = pm.Data("Dependent", df.x)

I get:

    666     # `convert_observed_data` takes care of parameter `value` and
    667     # transforms it to something digestible for Aesara.
--> 668     arr = convert_observed_data(value)
    669 
    670     if mutable is None:

~/anaconda3/lib/python3.7/site-packages/pymc/aesaraf.py in convert_observed_data(data)
    139         # otherwise, assume float:
    140         else:
--> 141             return floatX(ret)
    142     # needed for uses of this function other than with pm.Data:
    143     else:

~/anaconda3/lib/python3.7/site-packages/pymc/aesaraf.py in floatX(X)
    457     """
    458     try:
--> 459         return X.astype(aesara.config.floatX)
    460     except AttributeError:
    461         # Scalar passed

ValueError: setting an array element with a sequence.

Any help is much appreciated

Hi Jeff!

In general, ragged lists (which is what I think you're dealing with: a list of lists that aren't all the same length) require some special handling. You can't represent the ensemble of objects as a PyTensor/Aesara/Theano symbolic tensor, because these need to have a defined number of dimensions (what is the dimensionality of a list of items of uneven length?).
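You can see the problem in plain numpy, which is essentially where the ValueError in your traceback comes from:

import numpy as np

# A ragged list has no rectangular shape, so numpy can't build a float array
# from it (recent numpy versions raise; older ones fall back to dtype=object)
np.array([[0, .1, .1], [.1, .2]])  # ValueError: setting an array element with a sequence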

But not all hope is lost, and I can see two ways forward.

One option would be to transform your list of lists into a list of pytensor tensors, then loop over that list in Python, applying the computation to each element and stacking the results. That would look something like this:

import pymc as pm
import pytensor.tensor as pt

with pm.Model() as risk_model:
    y = pm.MutableData('Outcome', df.y)
    # You can't use pm.Data anymore, because we end up with a list of arrays
    x_list = list(map(pt.as_tensor_variable, df.x))

    a = pm.Normal('a')

    # Plain Python loop over the ragged tensors; stack the scalar means into a vector
    y_hat = pt.stack([(x ** a).mean() for x in x_list])

The downside of this is that if you want to swap out the data (to do predictions, for example), it will be extremely hard.

A second, and perhaps more elegant, approach would be to compute the length of each list, then pad the lists so that they are all the same length. This way your data becomes non-ragged, and you can vectorize everything:

import numpy as np

def pad_and_stack_ragged_list(x_lists):
    '''Transform a ragged list x_lists of length n into a matrix of shape n x max(lengths), padding with zeros'''
    lengths = [len(x) for x in x_lists]
    max_len = max(lengths)
    x_padded = [np.r_[x, np.full(max_len - l, 0.0)] for x, l in zip(x_lists, lengths)]

    return np.stack(x_padded)

x_lengths = df.x.apply(len)
X_mat = pad_and_stack_ragged_list(df.x.values)

with pm.Model() as risk_model:
    y = pm.MutableData('Outcome', df.y)
    x = pm.MutableData('Dependent', X_mat)
    lengths = pm.MutableData('x_lengths', x_lengths)

    a = pm.Normal('a')
    y_hat = (x ** a).sum(axis=-1) / lengths

The padded zeros won’t contribute* to the sum computed in y_hat, and the correct length will be used to compute the mean since we saved it before padding things out.
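You can convince yourself of this with a quick check in plain numpy (the row and exponent below are just illustrative):

import numpy as np

x_row = np.array([.1, .2, .1, .1, .1])  # one ragged row, true length 5
a_val = 0.5                             # hypothetical value for the exponent
padded = np.r_[x_row, np.zeros(7)]      # padded out to a max length of 12

# Dividing the padded sum by the *true* length recovers the original mean
assert np.isclose((padded ** a_val).sum() / len(x_row), (x_row ** a_val).mean())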

In this model, changing the data wouldn't be so bad: you would just have to compute the lengths of the out-of-sample data, call the pad_and_stack_ragged_list function, then use pm.set_data.
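Something like this, where df_new is a hypothetical out-of-sample frame with the same columns as df:

X_new = pad_and_stack_ragged_list(df_new.x.values)

with risk_model:
    # Swap in the padded out-of-sample data under the same names as above
    pm.set_data({
        'Outcome': df_new.y,
        'Dependent': X_new,
        'x_lengths': df_new.x.apply(len),
    })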

*Unless you have exactly a == 0.0, in which case each padded zero contributes 0 ** 0 == 1 to the sum. That shouldn't matter, since a continuous prior draws exactly zero with probability zero, but it could be awkward if a were distributed with support over the unit interval and the posterior ended up extremely skewed (e.g. beta or logit normal with a very small mean/std).
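A one-line numpy check of that edge case:

import numpy as np

# 0 ** 0 evaluates to 1, so each padded zero would add 1 to the sum when a == 0;
# for any positive exponent the padding contributes 0, as intended
print(np.array(0.0) ** np.array([0.0, 0.5, 1.0]))  # [1. 0. 0.]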

Hey Jesse,
Thanks for your input. I didn't expect such quick interactions and took my eye off this post for a couple of days. In the meantime, my solution was much less elegant than either of these, but more similar to solution 2: I basically made the counts of every distinct value their own column, since I happen to have a limited number of distinct values. This probably made it much less efficient to sample, however.
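Concretely, the idea was something like this sketch (the values and counts are just illustrative, not my actual data):

import numpy as np
import pymc as pm
import pytensor.tensor as pt

# The small set of distinct values that appear in x, plus per-row counts of each
values = np.array([.1, .2, .5, 1.0, 5.0])
counts = np.array([[1, 2, 3, 1, 0],
                   [4, 1, 0, 0, 0]])

with pm.Model() as count_model:
    a = pm.Normal('a')
    # For each row, mean(x ** a) == sum_v count_v * v ** a / sum_v count_v
    y_hat = pt.dot(counts, pt.as_tensor_variable(values) ** a) / counts.sum(axis=1)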
