Help with Model Structure in PyMC

There’s no need for me to pass moral judgement on betting :slight_smile: The modeling is fun, and any data is good to learn from. In my opinion, “doing” is the best way to approach programming, so it is good that you’re here!

The reduced model looks a bit better! But I think there is still some confusion about how you model “slopes” and how you model “hierarchical effects”.
Indexing by “agesex” might work if it is a discrete/categorical class. “Sex” certainly is, but I will assume that “age” is continuous. Imagine that male and female dogs have different “age functions”: say females run at the same speed no matter their age, whereas males start out quick and slow down linearly as they get older.
You would model this as follows:

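import pymc as pm  # or `import pymc3 as pm` on older installs

n_sexes = 2  # just an example
sex_idx = data['is_male'].values  # must be an integer index (0/1) per row; depends on your data
ages = data['age'].values  # continuous predictor
with pm.Model() as model:
    # one age slope per sex
    agesex_slope = pm.Normal('age|sex', mu=0, sigma=1.5, shape=n_sexes)

    # pick each row's slope by sex, then multiply by the observed age
    estimator = ... + agesex_slope[sex_idx] * ages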

Critical here:

  • You have to multiply the slope by the actual data values (here, the ages)!
  • There are two independent slopes, one for each sex. This is what’s called “multilevel”.
  • A hyperprior makes no sense imo, because n_sexes = 2 and you can hardly infer a distribution from two observations. (Think about this to better understand how multiple hierarchical levels interact.)

For “distance”, you probably do not have any hierarchical effect at all. Just a plain distance slope. No “shape” keyword required. But note again: you have to multiply the slope with the data.
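
To make the “multiply the slope with the data” point concrete, here is a minimal NumPy sketch of the linear predictor, outside of any PyMC model. The toy data and the slope values are made up for illustration; in the real model the slopes would be the random variables, and `sex_idx`, `ages` and `distances` (a name I am assuming here) would come from your data frame.

```python
import numpy as np

# Toy data, invented for illustration (4 runs).
sex_idx = np.array([0, 1, 1, 0])        # 0 = female, 1 = male
ages = np.array([2.0, 3.0, 6.0, 5.0])   # continuous predictor
distances = np.array([300.0, 300.0, 500.0, 500.0])

# Hypothetical stand-ins for sampled parameter values.
agesex_slope = np.array([0.0, -0.5])    # one age slope per sex
distance_slope = 0.01                   # a single plain slope, no shape needed

# Indexing only selects the right slope for each run ...
slope_per_run = agesex_slope[sex_idx]

# ... it becomes an effect once it is multiplied by the data.
age_effect = slope_per_run * ages
distance_effect = distance_slope * distances

estimator = age_effect + distance_effect
print(estimator)
```

Note that `agesex_slope[sex_idx]` alone has one entry per run but knows nothing about how old each dog is; the age only enters through the multiplication.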

About the “dogs”:
How about adding a multilevel intercept for the dogs?

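# inside the `with` block of your model
population_intercept = pm.Normal('population_intercept', mu=0, sigma=1.)
# one intercept per dog, centered on the population intercept
dog_intercept = pm.Normal('dog_intercept', mu=population_intercept, sigma=1., shape=n_dogs)

# dog_idx maps each observation to its dog
estimate = pm.Deterministic('estimate', dog_intercept[dog_idx] + distance_a[distance_idx] + agesex_a[agesex_idx])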

This is conventionally called a “random intercept model” (though of course, in the world of MCMC everything is “random”). If indeed you have evidence to think that each dog has an individual distance slope, or that the “age function” of each dog varies, you’ll have to construct a multi-level slope.

You will see that this is a direct (“centered”) parametrization and not a “non-centered/offset” parametrization. Just giving you some options here; I’m sure you can produce the other one :slight_smile:
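
As a starting point, the offset idea can be sketched in plain NumPy before you translate it to PyMC: instead of drawing each dog’s intercept directly around the population intercept, you draw a standard-normal offset and then shift and scale it. The numbers below are made up; in a model, `population_intercept` would be a random variable, the offset would be a standard-normal `pm.Normal` with `shape = n_dogs`, and the dog intercept would be a `pm.Deterministic` built from the two.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_draws = 200_000

# Hypothetical fixed values standing in for the parameters.
population_intercept = 1.2
dog_sd = 0.7

# Direct form: draw the intercept around the population value.
centered = rng.normal(loc=population_intercept, scale=dog_sd, size=n_draws)

# Offset form: standard-normal offset, then shift and scale.
offset = rng.normal(loc=0.0, scale=1.0, size=n_draws)
non_centered = population_intercept + dog_sd * offset

# Both forms describe the same distribution of intercepts.
print(centered.mean(), non_centered.mean())
print(centered.std(), non_centered.std())
```

Both forms imply the same prior on the intercepts; the difference is only in what the sampler has to explore, and the offset form is often reported to sample more easily when there are few observations per dog.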

Hope this helps!