# Repeating a model for two genders

Hello! I’m following the Statistical Rethinking course in a study group at work and doing the problem for week 2, question 3: stat_rethinking_2023/week02.pdf at main · rmcelreath/stat_rethinking_2023 · GitHub

We have the following dataset (showing 5 rows) (csv link):

``````   height     weight   age  male
0  121.92  19.617854  12.0     1
1  105.41  13.947954   8.0     0
2   86.36  10.489315   6.5     0
3  109.22  15.989118   7.0     0
4  114.30  17.860185  11.0     1
``````

The question is concerned with modelling the total impact of age on weight. My solution for the previous question - ignoring gender - is the following:

``````import pymc as pm
import arviz as az
import pandas as pd

df = pd.read_csv("Howell1.csv", delimiter=";").query("age < 13").reset_index(drop=True)

with pm.Model() as m:
age = df["age"].to_numpy()
weight = df["weight"].to_numpy()
intercept = pm.Normal(name="intercept", mu=5, sigma=1)
scale_age_weight = pm.Normal(name="scale_age_weight", mu=2, sigma=1)
sigma = pm.LogNormal(name="sigma", mu=0, sigma=1)
likelihood = pm.Normal(
name="likelihood",
mu=intercept + scale_age_weight * age,
sigma=sigma,
observed=weight,
)

trace = pm.sample()
print(az.summary(trace))

#                    mean     sd  hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
# intercept         7.150  0.335   6.539    7.795      0.008    0.006    1810.0    2441.0    1.0
# scale_age_weight  1.379  0.051   1.284    1.477      0.001    0.001    1745.0    2229.0    1.0
# sigma             2.548  0.142   2.294    2.821      0.003    0.002    2335.0    2402.0    1.0

#                    mean     sd  hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
# intercept         7.150  0.335   6.539    7.795      0.008    0.006    1810.0    2441.0    1.0
# scale_age_weight  1.379  0.051   1.284    1.477      0.001    0.001    1745.0    2229.0    1.0
# sigma             2.548  0.142   2.294    2.821      0.003    0.002    2335.0    2402.0    1.0
``````

For question 3, we’re interested in two things:

• Computing the above but for each gender
• Finding the contrast distribution (subtracting the female posterior from the male posterior) and calculating the mean.

I can easily solve the first point by filtering the input dataframe for `male == 1` and `male == 0`. Is there a way I can instead corporate the this into my model?

For the second point, I’m feeling a bit hacky in what I’m doing:

``````contrast_trace = male_trace.copy()
contrast_trace.posterior = male_trace.posterior - female_trace.posterior
az.plot_posterior(contrast_trace);
az.summary(contrast_trace);
``````
``````                   mean     sd  hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
intercept         0.459  0.647  -0.799    1.637      0.015    0.010    1936.0    2179.0    1.0
scale_age_weight  0.144  0.102  -0.045    0.338      0.002    0.002    1965.0    2450.0    1.0
sigma             0.438  0.305  -0.137    0.993      0.006    0.005    2358.0    2484.0    1.0
``````

This seems to work okay, but hardly feels like the right way to do it.

Does anyone have any suggestions to improve my solutions?

I like the book/course, but never found time to hack in all the examples. Keep on the work! You’ll definitely benefit from it.

I would think this is a “hierarchical” problem and solve it like this:

``````import pymc as pm
import arviz as az
import pandas as pd

df = pd.read_csv("Howell1.csv", delimiter=";").query("age < 13").reset_index(drop=True)

is_male = df['male'].values.astype(int)
n_sexes = 2

with pm.Model() as m:
age = df["age"].to_numpy()
weight = df["weight"].to_numpy()
intercept = pm.Normal(name="intercept", mu=5, sigma=1, size = n_sexes)
scale_age_weight = pm.Normal(name="scale_age_weight", mu=2, sigma=1, size = n_sexes)
sigma = pm.LogNormal(name="sigma", mu=0, sigma=1)
likelihood = pm.Normal(
name="likelihood",
mu=intercept[is_male] + scale_age_weight[is_male] * age,
sigma=sigma,
observed=weight,
)

trace = pm.sample()
print(az.summary(trace))

``````

You see that this introduces what is classically called a “random” intercept and a “random” slope, for the sex dimension. That old terminology is nonsense. But what it does is that it computes one intercept for each sex, one slope for each sex, or both. The outcome is, for example, `intercept` for females, and `intercept` for males.

All I had to change from your solution (and this is the beauty of `pymc`!) was

• giving the `intercept` and `slope` variables a new dimension (with `size`)
• indexing them in the `estimator`/`mu` calculation with…
• the indexing vector `is_male`, which works as indexer to the pymc variables, therefore must be integer.

The way I think of this indexer is that it just takes a different (Normal) distribution for each of the sexes, and samples them independently. You can even sample them dependently by drawing from a hyper-distribution, making this a truly hierarchical model. But that’s another story, there are great example notebooks for it But maybe this is not what that exercise intended you to figure out. Maybe it just wanted you to simply add another slope for the sex? Like so:

``````
import pymc as pm
import arviz as az
import pandas as pd

df = pd.read_csv("Howell1.csv", delimiter=";").query("age < 13").reset_index(drop=True)

is_male = df['male'].values.astype(int)

with pm.Model() as m:
age = df["age"].to_numpy()
weight = df["weight"].to_numpy()
intercept = pm.Normal(name="intercept", mu=5, sigma=1)
scale_age_weight = pm.Normal(name="scale_age_weight", mu=2, sigma=1)
sex_is_male = pm.Normal(name="sex_is_male", mu=0, sigma=1)
sigma = pm.LogNormal(name="sigma", mu=0, sigma=1)
likelihood = pm.Normal(
name="likelihood",
mu=intercept + scale_age_weight * age + is_male * sex_is_male,
sigma=sigma,
observed=weight,
)

trace = pm.sample()
print(az.summary(trace))

``````

Cheers and good luck with the course!

2 Likes

Thanks for such an encouraging and in-depth answer!

I fully agree with your way of thinking on the indexer - I get how we’re essentially using numpy advanced indexing to associate the model parameters with to the appropriate observed data.

Your first answer matches the approach used in R in the solutions (page 4, first code block).