Variable-dependent features

Hello Guys.
I’m pretty new to pymc3 and probabilistic programming in general. I hope this is the right place for the question and that it isn’t too obvious; I’m so new that I don’t have the right terms to google yet :smiley: .
Now to the question: I defined the following model:
[Screenshot of the model definition; from the discussion below it includes theta = pm.invlogit((at · xt_theano) + beta) and like = pm.Poisson("like", (100 - t) * theta, observed=labels).]

My data depends on time t, which can take values from 0–99, and has 8 features; that is where the shape of at originates. xt_theano is a tensor of dimensions (t × features × training_vals), in numbers (100 × 8 × 3000). The idea is that at[t] dot xt_theano[t] resolves to 3000 scalar values, which matches the observation size of 3000. My problem is how to tell pymc3 about t. I do have a t vector of size 3000 that maps each observation to its time t.

I hope I made my problem clear. If you don’t understand what I want to accomplish, please ask.

I’m a bit confused by “maps the actual features to time t”. If I want the value of feature j at time t, isn’t that the value at xt_theano[t, j, :]?

If you want a tensor dot product, you have to do it “by hand” by multiplying the tensors and then summing over the dimension you want to eliminate. In this case, that’s the feature dimension on axis 1. You will also need to expand the dimensions of at so they match:

import theano.tensor as tt

at = tt.ones((100, 8))               # (time, features)
xt_theano = tt.ones((100, 8, 3000))  # (time, features, observations)

# Broadcast at over the observation axis, then sum out the feature axis
(at[:, :, None] * xt_theano).sum(axis=1).shape.eval()
>>> array([ 100, 3000], dtype=int64)

As you can see, this will give you a (100, 3000) tensor where each row is a linear combination of the features at each timestep, weighted by the values of at at each time.

Thank you for your answer.

I don’t think the logic works the way you suggested, or at least I don’t get it :D.
If I compute theta with

theta = pm.invlogit((at[:, :, None] * xt_theano).sum(axis=1) + beta)

I get a result of size (100, 3000): the dot-product values of the features at every time step t. But in the last line of the model

like = pm.Poisson("like", (100-t)*theta, observed=labels)

the label belongs to, for example, time step 30, so it should only count for the 30th entry of theta. But with the model as it stands, the label would impact the entries of theta for all time steps from 0–99, correct?

That’s why I originally wanted scalar values for theta.

This is what I meant when I said I didn’t understand exactly what you want. Could you clarify what you are modeling? How many outputs do you want?

My logic is that you get, for every unit and time, a single value that summarizes the features with time-varying parameters, \theta_{i,t} = \alpha + \beta_t X_{i,t}. I’m not sure what you want to do with it from there. If you want one theta per time and unit, you can just ravel theta and you’re done. But when you say you have a t vector of size 3000 that maps the features to time t:

Do you mean that time is an index that counts up from 0 to 99? Do you just want a deterministic slope term in your model? In that case you would want to add the time index to your feature matrix and give it a coefficient.
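
For concreteness, a sketch of what I mean, assuming the labels are flattened to length T·N and X is a (T·N, k) matrix (the sizes and names here are just illustrative):

import numpy as np

T, N, k = 100, 30, 8
X = np.ones((T * N, k))              # stand-in feature matrix
t_idx = np.tile(np.arange(T), N)     # time bucket of each row
X_aug = np.column_stack([X, t_idx])  # (T*N, k+1): time gets its own coefficient
print(X_aug.shape)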

Thanks again for your answer. I will try to make clear what I am trying to accomplish.

My model tries to estimate how many goals a football team will score in the remainder of a game. For this, every game is split into 100 timeframes, so t can take values from 0–99. There are features for every timeframe, and the label is the number of goals the team actually scored in the remaining time. My idea was that theta is some kind of “scoring intensity” for each timeframe.
Your approach, which results in a (100, 3000) tensor, would mean that an observed label, for example 3, would be the label for all t, although it should only be the observed result for t=30.

I hope that makes clear what I am actually trying to model. I apologize for the lack of domain-specific wording.

Thanks for explaining a bit more. Your wording is great, it’s just that working with models that include time is a huge pain in the ass, so you have to be careful. Let me set up the problem as I now understand it to see if we’re on the same page.

Let’s index games by i = 1, \dots, N, and time by t = 0, \dots, T-1. We chop the games into T = 100 even time “buckets” and record the number of points scored in that interval only as s_{i,t}. This means that we can get the total score of the game by summing over the buckets: S_i = \sum_{t=0}^{T-1} s_{i,t}.

More importantly, we can take the cumsum over the s_{i,t} to get the “points remaining to be scored”. Let’s call that R_{i,t} = \sum_{k=0}^t s_{i,k}. The goal of the exercise is to predict \hat{R}_{i,t} as a function of some features, i.e:

\hat{R}_{i,t} = f(X_{i,t} \beta_t + \epsilon_{i,t}) (where f is some link function).

The shape of R_{i,t} should be T \times N, since it has a value in each of the N games and each of the T time buckets. Further, I assume that the feature matrix X has shape T \cdot N \times k, where k is the number of features. Theta, therefore, is exactly what you are looking for: it has shape T \times N, and its (t, i) entry contains the estimated score-to-go (in logit space, since it hasn’t been passed through f yet), starting from time t, for the i-th game in your sample.

One final note: usually labels are column vectors, so you might prefer to have R_{i,t} be a T \cdot N \times 1 column vector, organized as (i=0, t=0), (i=0, t=1), \dots, (i=0, t=99), (i=1, t=0), \dots, (i=2999, t=99). If you put your labels in this format, you can ravel theta to match; note that since theta is T \times N with rows indexing time, this game-major ordering corresponds to theta.T.ravel() (equivalently, theta.ravel(order='F')).
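
A toy array makes the ordering easy to check:

import numpy as np

T, N = 3, 2
theta = np.arange(T * N).reshape(T, N)  # rows = time, columns = games
print(theta.T.ravel())                  # all times for game 0, then game 1, ...
print(theta.ravel(order="F"))           # identical result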

Let me know if I’m still not getting something and we can try to work it out together!

Ok, yes, I think we are on the same page now. The only difference is that I don’t really care about the points scored in one “bucket” time interval, only about the points that are still going to be scored. If S_i = 4, the label/outcome at t=0 would be 4 and at t=99 it would be 0.

But I think with your definition of R_{i,t} and the cumsum we are talking about the same thing.

Ok, so you helped me understand that I had a big error when modeling the labels / R_{i,t}. I will change that to T \times N. The feature matrix has the shape T \times k \times N, which is practically the same information as T \cdot N \times k, correct?

So to calculate theta, I follow your first suggestion and calculate

theta = pm.invlogit((at[:, :, None] * xt_theano).sum(axis=1) + beta)

So in the last line:

like = pm.Poisson("like", (100-t)*theta, observed=labels)

Theta is of size T \times N, and with the new/correct modeling, the labels / R_{i,t} are also T \times N. For the t part, I guess I will also use a T \times N matrix containing the values from 0 to 99, repeated N times?

And pymc3 “understands” that entry [x][y] of the observed corresponds to entry [x][y] of theta and entry [x][y] of t? Because then I think I’ve got it :slight_smile:

You’re right, the definition of “points to go” should be R_{i,t} = S_i - \sum_{k=0}^t s_{i,k}, my bad!

The shape of the feature matrix will depend on the shape of R_{i,t}. If you have it as a T \times N matrix, then X_{i,t} should be 3d. If you reshape R_{i,t} to be NT \times 1, you also need to reshape X_{i,t} to be NT \times k. As I mentioned, it’s much more common to have the labels as a 1d vector. But the point is that everything just needs to match, and theta needs to end up the same shape as the labels R_{i,t}.

Finally, t just needs to be T \times 1, as in np.arange(100)[:, None], and PyMC will broadcast it across theta for you.

I suggest testing all this by generating some labels using scipy.stats.poisson with a known theta, and making sure you can get the right answers in a small test.
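
Something like this, for example (a minimal sketch; the sizes and names here are made up, with T=5 buckets, N=10 games, k=2 features):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
T, N, k = 5, 10, 2

X = rng.normal(size=(T, k, N))         # features: (time, feature, game)
beta_true = rng.normal(size=(T, k))    # known time-varying weights
t = np.arange(T)[:, None]              # (T, 1), broadcasts across games

eta = (beta_true[:, :, None] * X).sum(axis=1)    # linear predictor, (T, N)
lam = (T - t) * np.exp(eta)                      # positive Poisson rate, (T, N)
labels = stats.poisson.rvs(lam, random_state=0)  # fake "points to go" labels
print(labels.shape)  # (5, 10) -- must match theta's shape in the model

Then fit your model to labels and check that the posterior on the weights covers beta_true.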

All it does on the back-end is plug everything into the formula for the log likelihood and sum it down to a single number. As long as theta and labels are all arranged the same way, you’re golden.

Also, don’t forget to exponentiate theta to ensure it’s always positive.

My last comment is that, as it’s written, there isn’t really any way for your model to connect one time step to the next. Each time step is modeled independently and doesn’t incorporate information about the previous steps. I would consider adding the t vector to your X matrix as another feature: what does it mean to multiply theta by (100 - t)? Why not give the time remaining in the match a weight via \beta and let the model decide how important it is? You might also consider adding the first lag of R_{i,t}, namely R_{i,t-1}, to the feature matrix, since if there are 4 goals to go at time t-1, there are probably also 4 goals to go at time t (alternatively, model s_{i,t} instead of R_{i,t}). Finally, you could consider putting a Gaussian Random Walk prior on your betas, to let the weights of the features change as the game goes on; see the sketch below.
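
A rough sketch of that last idea, assuming pymc3 with the T=100 buckets and k=8 features from above (the priors and names are placeholders, not a definitive spec):

import pymc3 as pm

T, k = 100, 8
with pm.Model() as model:
    sigma = pm.HalfNormal("sigma", 1.0)
    # k independent random walks of length T: each feature's weight
    # drifts smoothly from one time bucket to the next
    at = pm.GaussianRandomWalk("at", sigma=sigma, shape=(T, k))
    # ... build theta from at and xt_theano as before ...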

First of all, huge thank you again!

It seems like the model, or at least the model definition, works.

That is probably a very good idea. How do I get a known theta? Do I just make it up?

I actually thought I could model this through \alpha, which has the shape T \times k. I read somewhere that you can model a temporal process that way.

Would that alone be enough to model the dependency between the timesteps?
It probably can’t hurt to do it either way, apart from an even bigger feature space. So I will add t to X.

Because I wanted to start easy (or what I once thought was “easy” :smiley: ).

I don’t understand what you mean.

That’s the same as the point before: I wanted to start with things I at least half understood, and the Gaussian Random Walk isn’t one of them yet. But I will probably look into it next.

Yeah, just make it up. Use the distributions in scipy.stats to write down the data-generating process you want to model, generate some random data, then see if you can recover the original values.

I guess it depends on what is inside the alpha. I assumed it was just a constant term.

It will model a very specific type of dependency between the time steps, a deterministic linear trend. That is, it will account for the fact that scores go up over time.

I would say your model is quite complex. I would begin with a dirt simple slope-intercept model:

R_{i,t} \sim \text{Poisson}(\exp(\alpha + \beta \cdot t))

Pool all the data and just estimate two parameters. Make sure things work the way you expect and that you aren’t getting -infinity in your predictions. After that works, add in your feature matrix, but only estimate one set of parameters for all the games and all the times, so you go up to k+1 parameters. Then, if things are still looking good, try to get fancier.
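
As a rough sketch of that starting point (assuming the labels are flattened to one long vector with a matching t vector; the priors and sizes here are made up):

import numpy as np
import pymc3 as pm

T, N = 100, 30
t = np.tile(np.arange(T), N)                        # time index per observation
labels = np.random.poisson(np.exp(1.0 - 0.02 * t))  # fake data for the demo

with pm.Model() as pooled:
    alpha = pm.Normal("alpha", 0.0, 1.0)
    beta = pm.Normal("beta", 0.0, 1.0)
    lam = pm.math.exp(alpha + beta * t)  # exp link keeps the rate positive
    like = pm.Poisson("like", lam, observed=labels)
    idata = pm.sample(return_inferencedata=True)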

I mean that if I’m watching the match and 1 goal has been scored, and I get up to get another beer, by the time I come back my best prediction of the number of goals (given that I didn’t see the last couple of minutes) is 1. Maybe 2, but definitely not 3, 4, or 5, and certainly not 0! The score at the previous time step informs my guess at the current time step. I should definitely not make a guess independent of the previous score; in that case I might guess 0 when I get back. That would be a pretty terrible guess.

That’s true. That dude would suck at sports betting :smiley:

I think the model works?
At least, I think I am at the point where I can make progress with googling and trial and error.

Thank you very much for your help. It is greatly appreciated :heart:

Dear S-Morten,

Quick question regarding your link function:

According to your problem description and the model’s premises, it is assumed that your dependent variable has a Poisson distribution. Nevertheless, you have applied an inverse-logit link function to your Poisson distribution.

The Poisson distribution usually has a log link function (whose inverse is the exponential) for effectively connecting the linear terms (B·X + A) to E(Y).

Is there any particular reason why you are using this other link function in your model?

Sincerely,

Here I would like to broaden my previous question to a more general understanding of the PyMC3 library.

What I would like to know is the following: once one has defined which distribution family to use (in this post’s case, the Poisson distribution), should the likelihood function be defined in terms of the inverse of the link function, or the link function itself?

For instance, for the current problem (the Poisson distribution), I wonder whether one should use the exponential of the dot product of the matrices B and X (betas and feature matrix, respectively), or the log of this dot product.

Please, let me know your thoughts in this regard.

Sincerely

You should use a function that sends the numbers your model generates to a domain that satisfies the constraints of the likelihood function. Don’t get too caught up on names. The logit function (or log-odds function) sends (0,1) \to \mathbb R, while inverse logit (or expit, or sigmoid, or standard logistic) sends \mathbb R \to (0, 1). As you can see, the names are all terrible. Better to focus on what they do. Are you producing numbers between 0 and 1 and you need to get real values? Use logit. Are you producing real numbers and need to constrain them to be between 0 and 1? Use sigmoid (inverse logit, whatever).

The same can be said of log and exp. Log sends \mathbb R^+ \to \mathbb R, while exp sends \mathbb R \to \mathbb R^+. You need to think about what your needs are and reach for the right tool.

The rate parameter \lambda of a Poisson must be strictly positive. If you use \lambda = \log(X\beta), you will have \lambda < 0 for any elements X\beta < 1, which will cause the sampler to error out. Unless you have a very good reason, and you’re positive (pun intended?) that X\beta is never less than 1, you should just use an exponential function.
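
A quick numeric illustration (plain numpy, made-up values):

import numpy as np

eta = np.array([0.5, 1.0, 2.0])  # hypothetical values of the linear predictor X @ beta
print(np.log(eta))  # [-0.693  0.     0.693] -> negative rate when X @ beta < 1
print(np.exp(eta))  # [1.649  2.718  7.389] -> always strictly positive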

Dear Jessegrabowski,

Thank you for your reply. Nevertheless, S-Morten’s decision for the problem at hand is still unclear to me.

Since, in his/her case, the data are counts (Poisson distributed), and the Poisson distribution conventionally uses a log link (with the exponential as its inverse), I see no reason for applying an inverse-logit function in his/her likelihood. As you described, the constraint of the problem at hand is strict positivity, i.e. a mapping \mathbb{R} \to (0, +\infty). By applying the inverse-logit function, the output is instead constrained to (0, 1), only a subset of the broader range (0, +\infty).

With respect to the parameter \lambda, depending on the implementation of each statistical package, it is common practice to pass the inverse of the link function to the likelihood (in order to facilitate the integration steps); this was the reason for my original question. Regardless, I see now that in PyMC we must apply the inverse link to the linear predictor ourselves and provide the result directly to the likelihood. If I got it wrong here, please let me know.