What's the recommended way to split data in PyTensor?

Hi,

I have some data in the form of a matrix, with missing values coded as -999 in the first column. I have one function that calculates the log-likelihood of all rows without missing data, and another that calculates the log-likelihood of all rows with missing data. After the calculation, I want to combine the results into a single vector as the output of a likelihood function. It seems that you can’t use something like data[data[:, 0] == -999, :], which is what you would usually do in numpy, to subset the data in PyTensor. What is the recommended way to do this in PyTensor?

Thanks!

Can you provide a bit more context for your indexing operation? PyTensor has all the standard numpy operations (e.g., where(), etc.). Something should work.
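For example, a where()-based version might look like this (just a sketch; the two summed expressions are hypothetical stand-ins for your two likelihood functions):

import pytensor.tensor as pt

data = pt.matrix("data")
mask = pt.eq(data[:, 0], -999)  # elementwise comparison (see below for why == does not work)

# Hypothetical stand-ins for the two per-row log-likelihoods
loglik_missing = data[:, 1:].sum(axis=1)
loglik_full = data.sum(axis=1)

# where() picks elementwise between the two full-length vectors
result = pt.where(mask, loglik_missing, loglik_full)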

For setting values in tensors you need to use set_subtensor: Basic Tensor Functionality — PyTensor dev documentation
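A minimal example (note that set_subtensor returns a new variable rather than modifying its input):

import pytensor.tensor as pt

x = pt.vector("x")
y = pt.set_subtensor(x[:2], 0.0)  # y is x with its first two entries set to 0; x is unchanged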

Thank you @cluhmann and @ricardoV94!

So I thought about using pt.where(), but that performs the computation on the full dataset in both branches, which doesn’t seem like the most efficient way to do this.

So what I want to do, if this were numpy, is something like this:

def logp(data, ...):
    # Split rows with and without missing values in the first column
    split1 = data[data[:, 0] == -999, :]
    split2 = data[data[:, 0] != -999, :]

    result1 = func1(split1, ...)
    result2 = func2(split2, ...)

    # Recombine the two results in the original row order
    result = np.zeros(data.shape[0])
    result[data[:, 0] == -999] = result1
    result[data[:, 0] != -999] = result2
    return result

It seems that index assignment is not supported, so I will have to use pt.set_subtensor(). However, I can’t use boolean indexing in set_subtensor. How do I get around this?

You have to use pt.eq and pt.neq instead of == and !=. It’s one of the annoying things about working with PyTensor variables, and has to do with Python’s constraints on equality, inequality, and hashing.
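A quick illustration of the difference:

import pytensor.tensor as pt

x = pt.vector("x")

mask = pt.eq(x, -999)  # a symbolic elementwise comparison, usable as a boolean index
# x == -999 would instead return a plain Python bool (an identity check),
# because PyTensor variables have to remain hashable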

So this

result = np.zeros(data.shape[0])
result[data[:, 0] == -999] = result1
result[data[:, 0] != -999] = result2

can be re-written as

result = pt.zeros(data.shape[0])
result = pt.set_subtensor(result[pt.eq(data[:, 0], -999)], result1, inplace=True)
result = pt.set_subtensor(result[pt.neq(data[:, 0], -999)], result2, inplace=True)

correct? I know that the dimensions of result1 and result2 will always be correct, but does PyTensor know about this when compiling the op?


You don’t need to use the inplace flag; PyTensor will add inplace Ops itself during compilation. You can also initialize the tensor with pt.empty or pt.empty_like instead of pt.zeros, since every entry gets overwritten.

Do note that such optimizations may not result in a faster graph. Sometimes indexing is actually slower, as it breaks loop fusion and memory layouts (and indexing itself can be slow). If the graphs of func1/func2 are Elemwise, the compiler (after PyTensor) may even avoid the useless branch without you knowing it.
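For completeness, here is a minimal end-to-end sketch of the split-and-recombine approach (func1/func2 are hypothetical stand-ins for the two likelihood functions):

import numpy as np
import pytensor
import pytensor.tensor as pt

def func1(rows):  # hypothetical logp for rows with a missing first column
    return rows[:, 1:].sum(axis=1)

def func2(rows):  # hypothetical logp for fully observed rows
    return rows.sum(axis=1)

data = pt.matrix("data")
missing = pt.eq(data[:, 0], -999)
observed = pt.neq(data[:, 0], -999)

# Every row is written exactly once, so pt.empty is enough
result = pt.empty((data.shape[0],))
result = pt.set_subtensor(result[missing], func1(data[missing]))
result = pt.set_subtensor(result[observed], func2(data[observed]))

logp = pytensor.function([data], result)

x = np.array([[-999.0, 1.0, 2.0], [0.5, 1.0, 2.0]])
print(logp(x))  # [3.  3.5]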

This worked really well. Thank you so much, @ricardoV94 and @cluhmann!
