Help with Model Structure in PyMC

Hi All,

I am still in the process of learning my way through PyMC3 and Bayesian Modelling, and I’m at the point now in my journey where I should be able to build myself some simple models. I tend to learn best when applying myself towards problems I am deeply interested in, and I’ve found myself a topic along with a quality data source to back that up.

I’m looking to model probable speed of Greyhounds over different distances given past results.

To add a bit of spice to the mixture, specifically I’d like to include a factor that considers age in 6 month increments.

So I have 3 years worth of Historical data that includes the dogs age in months, individual finish times for each runner and the distance of the race.

I have tried a Hierarchical approach as below, with the variables defined as:

n_distances = Number of Unique race distances
n_agesex = Combined Age Bins and the Sex of the dog, this variable is unique number of combinations
n_dogs = Number of unique dogs in Dataset
dog_idx = an index number created from dataframe column using pd.factorize to identify individual dogs
agesex_idx = Similar to above except it’s an index number for individual AgeSex labels.

with pm.Model() as Model:
    #Shared Hyperpriors
    shared_intercept = pm.Normal('shared_intercept', mu=0, sd=1.5)
    shared_slope = pm.Normal('shared_slope', mu=0, sd=1.5)
    sigma_a = pm.HalfCauchy('sigma_a', 5)
    sigma_b = pm.HalfCauchy('sigma_b', 5)

    #Offset for distances, attempted non-centered parameterization as per T Wiecki
    distance_offset = pm.Normal('distance_offset', mu=shared_intercept, sd=sigma_a, shape=n_distances)
    distance_a = pm.Deterministic('distance_a', shared_intercept + distance_offset * sigma_a)

    #Offset for Age+Sex, attempted non-centered parameterization as per T Wiecki
    agesex_offset = pm.Normal('agesex_offset', mu=shared_intercept, sd=sigma_b, shape=n_agesex)
    agesex_a = pm.Deterministic('agesex_a', shared_intercept + agesex_offset * sigma_b)

    #Offset for Individual Dog, attempted non-centered parameterization as per T Wiecki
    dog_offset = pm.Normal('agesex_offset', mu=shared_intercept, sd=sigma_a, shape=n_dogs)
    dog_a = pm.Deterministic('dog_a', shared_intercept + horse_offset * sigma_a)

    estimate = pm.Deterministic('estimate', horse_a[dog_idx] + agesex_a[agesex_idx])

    y = pm.Normal("y", estimate, observed=horseTime_log)
    trace = pm.sample(1000, tune=500, cores=4, return_inferencedata=True, init="advi+adapt_diag")

Now, an expert will look at this model and see right away that it doesn’t work. My novice eyes seem to think it SHOULD work, but i’m failing to see the fault. I think it may be in the way that i’ve structured the model all together. I’ll see if I can boil down my thoughts:

We have a Dog D
D is a Age and Sex
D has competed in r races over d distances
For each r_{d} the dog clocks a Time t

What I would like to do as an end goal, is have a function where I can feed in an ID for the Dog (Name, for example), it’s current Age+Sex Category and a Distance, and return a trace where the model takes into account all previous times for that Dog over the given distance, conditioned by finish times for all Dogs in that Age+Sex category over the given distance

Given D , a and d :
     Get all t for D over d = dog
     Get all t over d for a = all
         Condition a random variable on all Data
         Condition a random variable on dog Data and all RV
Return Distribution of Probable outcomes

Hope that makes sense, I’m not math trained so don’t cringe too hard at my trying to use notation!

Also, not sure if this is allowed here (correct me if so) but at this point I’m willing to pay someone for a one on one consultation session to ask these questions! Probably only take an hour or so to figure out what I need for now. I’m Australian time zone is preferable!

Dear @vaanderal ,
don’t worry, the process of learning is completely normal.
And I particularly sense some confusion about terminology in you model. Things to consider:

  • an “intercept” is not a “hyperprior”. The intercept should be defined separately and add to the estimate.
  • the hyperprior for different slopes (=“offsets”) may not be the same; each has their own “population level slope”
  • the “dog” offset might be problematic, because it can be 1:1 related to “sex” (maybe also to “age”, unless you have longitudinal data of dogs). If the relation is indeed 1:1, the sampler will have trouble “choosing” one slope for a potential effect, and might fail to converge.

At first glance, the rest looks rather plausible, but you start with a lot at once. I suggest you break down the model and construct it one by one: first, a model with only the intercept, if it works, you already have something. Then maybe add the “sex” slope, then “age” and so forth.

And just a suggestion, maybe check again with the pymc docs examples and the many blog posts out there, try to download the data and re-construct the model. You’re certainly capable to learn yourself!

And good luck with the dog betting :slight_smile:
Cheers!

Thanks Falk, much appreciated for your feedback.

To your first point, I believe what I’ve done is confused the configuration from this tutorial . I’ve re-read my reference material and simplified.

The individual dog calculations are still tripping me up, so I’m testing the following to just estimate a time based on age/sex + Distance

with pm.Model() as Model:
    #Shared Hyperpriors
    mu_dist = pm.Normal('mu_dist', mu=0, sd=1.5)
    sigma_dist = pm.HalfCauchy('sigma_dist', 5)
    mu_age = pm.Normal('mu_age', mu=0, sd=1.5)
    sigma_age = pm.HalfCauchy('sigma_age', 5)

    #Offset for distances, attempted non-centered parameterization as per T Wiecki
    distance_offset = pm.Normal('distance_offset', mu=0, sd=1.5, shape=n_distances)
    distance_a = pm.Deterministic('distance_a', mu_dist + distance_offset * sigma_dist)

    #Offset for Age+Sex, attempted non-centered parameterization as per T Wiecki
    agesex_offset = pm.Normal('agesex_offset', mu=0, sd=1.5, shape=n_agesex)
    agesex_a = pm.Deterministic('agesex_a', mu_age + agesex_offset * sigma_age)

    estimate = pm.Deterministic('estimate', distance_a[distance_idx] + agesex_a[agesex_idx])

    y = pm.Normal("y", estimate, observed=dog_times)
    trace = pm.sample(1000, tune=500, cores=4, return_inferencedata=True, init="advi+adapt_diag")

I am still unsure how to add in individual dog entries to this, would it just be a simple matter of adding another offset in a similar format to above? Intuitively, that doesn’t seem right to me.

Also, if it makes you feel better I shan’t be using this for betting, this is purely a learning experience!

There’s no need for me to give a moral judgement on betting :slight_smile: The modeling is fun, and any data is good to learn. In my opinion, “doing” is the best way to approach programming. So it is good that you’re here!

The reduced model looks a bit better! But I think there is still a confusion with how you model “slopes”, and how you model “hierarchical effects”.
If “agesex” is a discrete/categorical class, it might work. “Sex” certainly is. But I will assume that “age” is continuous. Imagine that male and female dogs would have a different “age function”: imagine females run at the same speed, no matter what age, whereas young males are quick and old males are sluggish and they age linearly.
You would model this as follows:

n_sexes = 2 #just an example.
sex_idx = data['is_male'].values #depends on your data
ages = data['age'].values
with Model:
    agesex_slope = pm.Normal('age|sex', mu = 0, sd = 1.5, shape = n_sexes)

    estimator = ... + agesex_slope[sex_idx]*ages

Critical here:

  • you have to multiply in the actual values!
  • there are two independent slopes, one for each sex. This is what’s called “multilevel”.
  • a hyperprior makes no sense imo because n_sexes = 2 and you can hardly infer a distribution from two observations. (think about this to understand more how multiple hierarchical levels interact)

For “distance”, you probably do not have any hierarchical effect at all. Just a plain distance slope. No “shape” keyword required. But note again: you have to multiply the slope with the data.

About the “dogs”:
How about adding a multilevel intercept for the dogs?

population_intercept = pm.Normal('population_intercept', mu = 0, sd = 1.)
dog_intercept = pm.Normal('dog_intercept', mu = population_intercept, sd = 1., shape = n_dogs)

estimate = pm.Deterministic('estimate', dog_intercept[dog_idx] + distance_a[distance_idx] + agesex_a[agesex_idx])

This is conventionally called a “random intercept model” (though of course, in the world of MCMC everything is “random”). If indeed you have evidence to think that each dog has an individual distance slope, or that the “age function” of each dog varies, you’ll have to construct a multi-level slope.

You will see that this is a “sampling” parametrization and not a “non-centered/offset” parametrization. Just giving you some options here, I’m sure you can produce the other one :slight_smile:

Hope this helps!