What's the recommended way to split data in PyTensor?

Hi,

I have some data in the form of a matrix, with missing values coded as -999 in the first column. I have one function that calculates the log-likelihood of all rows without missing data, and another that calculates the log-likelihood of all rows with missing data. After the calculation, I want to combine the results into a single vector as the output of a likelihood function. It seems that you can’t use something like data[data[:, 0] == -999, :], which is what you would usually do in numpy, to subset the data in PyTensor. What is the recommended way to do this in PyTensor?

Thanks!

Can you provide a bit more context for your indexing operation? PyTensor has all the standard numpy operations (e.g., where(), etc.). Something should work.
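For example, a where()-based version might look like this (just a sketch; the two summed expressions are hypothetical stand-ins for your two likelihood functions):

import pytensor.tensor as pt

data = pt.matrix("data")
mask = pt.eq(data[:, 0], -999)  # elementwise comparison (see below for why == does not work)

# Hypothetical stand-ins for the two per-row log-likelihoods
loglik_missing = data[:, 1:].sum(axis=1)
loglik_full = data.sum(axis=1)

# where() picks elementwise between the two full-length vectors
result = pt.where(mask, loglik_missing, loglik_full)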

For setting values in tensors you need to use set_subtensor: Basic Tensor Functionality — PyTensor dev documentation
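A minimal example (note that set_subtensor returns a new variable rather than modifying its input):

import pytensor.tensor as pt

x = pt.vector("x")
y = pt.set_subtensor(x[:2], 0.0)  # y is x with its first two entries set to 0; x is unchanged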

Thank you @cluhmann and @ricardoV94!

So I thought about using pt.where(), but that performs the computation on the full dataset in both branches, which doesn’t seem like the most efficient way to do this.

So what I want to do, if this were numpy, is something like this:

def logp(data, ...):
    # Split rows with and without missing values in the first column
    split1 = data[data[:, 0] == -999, :]
    split2 = data[data[:, 0] != -999, :]

    result1 = func1(split1, ...)
    result2 = func2(split2, ...)

    # Recombine the two results in the original row order
    result = np.zeros(data.shape[0])
    result[data[:, 0] == -999] = result1
    result[data[:, 0] != -999] = result2
    return result

It seems that index assignment is not supported, so I will have to use pt.set_subtensor(). However, I can’t use boolean indexing in set_subtensor. How do I get around this?

You have to use pt.eq and pt.neq instead of == and !=. It’s one of the annoying things about working with PyTensor variables, and has to do with Python’s constraints on equality, inequality, and hashing.
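A quick illustration of the difference:

import pytensor.tensor as pt

x = pt.vector("x")

mask = pt.eq(x, -999)  # a symbolic elementwise comparison, usable as a boolean index
# x == -999 would instead return a plain Python bool (an identity check),
# because PyTensor variables have to remain hashable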

So this

result = np.zeros(data.shape[0])
result[data[:, 0] == -999] = result1
result[data[:, 0] != -999] = result2

can be re-written as

result = pt.zeros(data.shape[0])
result = pt.set_subtensor(result[pt.eq(data[:, 0], -999)], result1, inplace=True)
result = pt.set_subtensor(result[pt.neq(data[:, 0], -999)], result2, inplace=True)

correct? I know that the dimensions of result1 and result2 will always be correct, but does PyTensor know about this when compiling the op?


You don’t need to use the inplace flag; PyTensor will add inplace Ops itself during compilation. You can also initialize the tensor with pt.empty or pt.empty_like instead of pt.zeros, since every entry gets overwritten.

Do note that such optimizations may not result in a faster graph. Sometimes indexing is actually slower, as it breaks loop fusion and memory layouts (and indexing itself can be slow). If the graphs of func1/func2 are Elemwise, the compiler (after PyTensor) may even avoid the useless branch without you knowing it.
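For completeness, here is a minimal end-to-end sketch of the split-and-recombine approach (func1/func2 are hypothetical stand-ins for the two likelihood functions):

import numpy as np
import pytensor
import pytensor.tensor as pt

def func1(rows):  # hypothetical logp for rows with a missing first column
    return rows[:, 1:].sum(axis=1)

def func2(rows):  # hypothetical logp for fully observed rows
    return rows.sum(axis=1)

data = pt.matrix("data")
missing = pt.eq(data[:, 0], -999)
observed = pt.neq(data[:, 0], -999)

# Every row is written exactly once, so pt.empty is enough
result = pt.empty((data.shape[0],))
result = pt.set_subtensor(result[missing], func1(data[missing]))
result = pt.set_subtensor(result[observed], func2(data[observed]))

logp = pytensor.function([data], result)

x = np.array([[-999.0, 1.0, 2.0], [0.5, 1.0, 2.0]])
print(logp(x))  # [3.  3.5]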

This worked really well. Thank you so much, @ricardoV94 and @cluhmann!
