You’re right, the definition of “points to go” should be R_{i,t} = S_i - \sum_{k=0}^t s_{i,k}, my bad!
The shape of the feature matrix depends on the shape of R_{i,t}. If you keep it as a T \times N matrix, then X_{i,t} needs to be 3d (T \times N \times k, one feature vector per player per time step). If you instead reshape R_{i,t} to NT \times 1, you also need to reshape X_{i,t} to NT \times k. As I mentioned, it’s much more common to have the labels as a 1d vector. But the point is that everything just needs to match: theta has to end up the same shape as the labels R_{i,t}.
Finally, t just needs to be T \times 1, as in np.arange(100)[:, None], and PyMC will broadcast it across theta for you.
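For concreteness, here’s a minimal shape-checking sketch of the flattened layout (all the sizes and names here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, k = 20, 100, 3                    # players, time steps, features

R = rng.poisson(5, size=(T, N))         # fake "points to go", T x N
y = R.ravel()                           # flattened to a (N*T,) label vector

X = rng.normal(size=(N * T, k))         # one row of features per label
beta = np.zeros(k)                      # one weight per feature

theta = X @ beta                        # shape (N*T,) -- matches the labels
assert theta.shape == y.shape

t = np.arange(T)[:, None]               # T x 1; broadcasts over a T x N theta
```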
I suggest testing all this by generating some labels with scipy.stats.poisson and a known theta, and making sure you can recover the right answers in a small test (see the end-to-end sketch below).
All PyMC does on the back-end is plug everything into the formula for the Poisson log likelihood and sum it down to a single number. As long as theta and the labels are arranged the same way, you’re golden.
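That “sum it down to a single number” step is easy to check by hand with scipy:

```python
import numpy as np
from scipy import stats

theta = np.array([2.0, 5.0, 1.5])   # rates, one per observation
y = np.array([1, 6, 2])             # observed counts

# elementwise Poisson log pmf, summed to a single scalar --
# the same likelihood term the sampler sees
print(stats.poisson.logpmf(y, theta).sum())
```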
Also, don’t forget to exponentiate theta, so the Poisson rate is always positive.
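Putting the last few points together, an end-to-end test might look something like this (a sketch with made-up sizes and a standard log link, not a drop-in for your model):

```python
import numpy as np
import pymc as pm
from scipy import stats

rng = np.random.default_rng(0)
n, k = 1000, 3

X = rng.normal(size=(n, k))
beta_true = np.array([0.5, -0.3, 1.0])
theta_true = np.exp(X @ beta_true)            # known, always-positive rates
y = stats.poisson.rvs(theta_true, random_state=rng)

with pm.Model():
    beta = pm.Normal("beta", 0, 1, shape=k)
    theta = pm.math.exp(pm.math.dot(X, beta))  # exponentiate to keep theta > 0
    pm.Poisson("obs", mu=theta, observed=y)
    idata = pm.sample()

# posterior means should land close to beta_true
print(idata.posterior["beta"].mean(dim=("chain", "draw")).values)
```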
My last comment is that, as it’s written, there isn’t really any way for your model to connect one time step to the next. Each time step is modeled independently, and doesn’t incorporate information about the previous steps. A few things you could try:

- Add the t vector to your X matrix as another feature. What does it mean to multiply theta by (100 - t)? Why not give the time remaining in the match a weight via \beta and let the model decide how important it is?
- Add the first lag of R_{i,t}, i.e. R_{i,t-1}, to the feature matrix, since if there are 4 goals to go at time t-1, there are probably also 4 goals to go at time t (alternatively, model s_{i,t} instead of R_{i,t}).
- Put a Gaussian Random Walk prior on your betas, to let the weights of the features change as the game goes on (see the sketch below).
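For that last idea, here’s a sketch of what a Gaussian Random Walk prior on the betas could look like (assuming, just to keep it short, one shared feature row per time step rather than per player):

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(1)
T, k = 100, 3
X = rng.normal(size=(T, k))      # placeholder features
y = rng.poisson(3.0, size=T)     # placeholder labels

with pm.Model():
    # one independent random walk per feature; the last axis is time
    beta = pm.GaussianRandomWalk(
        "beta",
        sigma=0.1,
        init_dist=pm.Normal.dist(0, 1),
        shape=(k, T),
    )
    theta = pm.math.exp(pm.math.sum(X * beta.T, axis=-1))  # (T,) rates
    pm.Poisson("obs", mu=theta, observed=y)
```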