Repeating a model for two genders

Hello! I’m following the Statistical Rethinking course in a study group at work and doing the problem for week 2, question 3: stat_rethinking_2023/week02.pdf at main · rmcelreath/stat_rethinking_2023 · GitHub

We have the following dataset (first 5 rows shown; csv link):

   height     weight   age  male
0  121.92  19.617854  12.0     1
1  105.41  13.947954   8.0     0
2   86.36  10.489315   6.5     0
3  109.22  15.989118   7.0     0
4  114.30  17.860185  11.0     1

The question is concerned with modelling the total impact of age on weight. My solution for the previous question - ignoring gender - is the following:

import pymc as pm
import arviz as az
import pandas as pd

df = pd.read_csv("Howell1.csv", delimiter=";").query("age < 13").reset_index(drop=True)

with pm.Model() as m:
    age = df["age"].to_numpy()
    weight = df["weight"].to_numpy()
    intercept = pm.Normal(name="intercept", mu=5, sigma=1)
    scale_age_weight = pm.Normal(name="scale_age_weight", mu=2, sigma=1)
    sigma = pm.LogNormal(name="sigma", mu=0, sigma=1)
    likelihood = pm.Normal(
        name="likelihood",
        mu=intercept + scale_age_weight * age,
        sigma=sigma,
        observed=weight,
    )

    trace = pm.sample()
print(az.summary(trace))

#                    mean     sd  hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
# intercept         7.150  0.335   6.539    7.795      0.008    0.006    1810.0    2441.0    1.0
# scale_age_weight  1.379  0.051   1.284    1.477      0.001    0.001    1745.0    2229.0    1.0
# sigma             2.548  0.142   2.294    2.821      0.003    0.002    2335.0    2402.0    1.0


For question 3, we’re interested in two things:

  • Computing the above but for each gender
  • Finding the contrast distribution (subtracting the female posterior from the male posterior) and calculating the mean.

I can easily solve the first point by filtering the input dataframe for male == 1 and male == 0. Is there a way I can instead incorporate this into the model itself?

For the second point, my approach feels a bit hacky:

contrast_trace = male_trace.copy()
contrast_trace.posterior = male_trace.posterior - female_trace.posterior
az.plot_posterior(contrast_trace);
az.summary(contrast_trace);
                   mean     sd  hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat  
intercept         0.459  0.647  -0.799    1.637      0.015    0.010    1936.0    2179.0    1.0  
scale_age_weight  0.144  0.102  -0.045    0.338      0.002    0.002    1965.0    2450.0    1.0  
sigma             0.438  0.305  -0.137    0.993      0.006    0.005    2358.0    2484.0    1.0  

This seems to work okay, but hardly feels like the right way to do it.

Does anyone have any suggestions to improve my solutions?

I like the book/course, but never found the time to work through all the examples myself. Keep up the work! You’ll definitely benefit from it.

I would think this is a “hierarchical” problem and solve it like this:

import pymc as pm
import arviz as az
import pandas as pd

df = pd.read_csv("Howell1.csv", delimiter=";").query("age < 13").reset_index(drop=True)

is_male = df['male'].values.astype(int)
n_sexes = 2

with pm.Model() as m:
    age = df["age"].to_numpy()
    weight = df["weight"].to_numpy()
    intercept = pm.Normal(name="intercept", mu=5, sigma=1, size=n_sexes)
    scale_age_weight = pm.Normal(name="scale_age_weight", mu=2, sigma=1, size=n_sexes)
    sigma = pm.LogNormal(name="sigma", mu=0, sigma=1)
    likelihood = pm.Normal(
        name="likelihood",
        mu=intercept[is_male] + scale_age_weight[is_male] * age,
        sigma=sigma,
        observed=weight,
    )

    trace = pm.sample()
print(az.summary(trace))

You see that this introduces what is classically called a “random” intercept and a “random” slope for the sex dimension. That old terminology is nonsense, but what it does is compute one intercept and one slope per sex. The outcome is, for example, intercept[0] for females and intercept[1] for males.
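A nice side effect of fitting both sexes in one model is that the contrast from your second question falls out of a single trace: you subtract the two slices of the per-sex parameter draw-by-draw, with no copying of InferenceData. Here is a minimal sketch using synthetic draws as a stand-in for the real posterior (with a real trace, `trace.posterior["intercept"]` is an xarray DataArray and the same subtraction works via integer indexing or `.sel` on whatever name PyMC gave the extra dimension):

```python
import numpy as np

# Synthetic stand-in for the posterior draws of the indexed model:
# shape (chain, draw, sex), with sex index 0 = female, 1 = male.
rng = np.random.default_rng(0)
intercept = rng.normal(loc=[6.9, 7.4], scale=0.3, size=(2, 1000, 2))

# The contrast is computed draw-by-draw: male minus female.
contrast = intercept[..., 1] - intercept[..., 0]
print(contrast.mean())  # should be close to 7.4 - 6.9 = 0.5
```

The result is itself a set of posterior samples, so you can summarize or plot it like any other parameter.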

All I had to change from your solution (and this is the beauty of pymc!) was

  • giving the intercept and slope variables a new dimension (with size)
  • indexing them in the estimator/mu calculation with…
  • …the indexing vector is_male, which must be an integer array because it indexes into the PyMC variables.

The way I think of this indexer is that it just takes a different (Normal) distribution for each of the sexes, and samples them independently. You can even sample them dependently by drawing from a hyper-distribution, making this a truly hierarchical model. But that’s another story, there are great example notebooks for it :slight_smile:

But maybe this is not what the exercise intended you to figure out. Maybe it just wanted you to add a coefficient for the sex indicator, like so:


import pymc as pm
import arviz as az
import pandas as pd

df = pd.read_csv("Howell1.csv", delimiter=";").query("age < 13").reset_index(drop=True)

is_male = df['male'].values.astype(int)

with pm.Model() as m:
    age = df["age"].to_numpy()
    weight = df["weight"].to_numpy()
    intercept = pm.Normal(name="intercept", mu=5, sigma=1)
    scale_age_weight = pm.Normal(name="scale_age_weight", mu=2, sigma=1)
    sex_is_male = pm.Normal(name="sex_is_male", mu=0, sigma=1)
    sigma = pm.LogNormal(name="sigma", mu=0, sigma=1)
    likelihood = pm.Normal(
        name="likelihood",
        mu=intercept + scale_age_weight * age + is_male * sex_is_male,
        sigma=sigma,
        observed=weight,
    )

    trace = pm.sample()
print(az.summary(trace))

Cheers and good luck with the course!

Thanks for such an encouraging and in-depth answer!

I fully agree with your way of thinking on the indexer - I get how we’re essentially using NumPy advanced indexing to associate the model parameters with the appropriate observed data.
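For anyone else reading along, the indexing trick can be seen in two lines of plain NumPy (the values here are just illustrative): one parameter value per sex gets spread out to one value per observation.

```python
import numpy as np

# One value per sex: index 0 = female, 1 = male.
intercept = np.array([6.9, 7.4])
# Sex indicator for the first five rows of the data.
is_male = np.array([1, 0, 0, 0, 1])
print(intercept[is_male])  # -> [7.4 6.9 6.9 6.9 7.4]
```

This is exactly what `intercept[is_male]` does inside the model, except there the values are random variables instead of fixed numbers.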

Your first answer matches the approach used in R in the solutions (page 4, first code block).

Your answer also helps me with my understanding on how to think of the shapes and dimensions in the model (I asked a question about this a few weeks ago).
