Syntax to mix continuous covariates with dummy categoricals

Hi <3 I am playing with the penguins dataset.

I have learned some syntax to estimate all three penguin species individually, like an ANOVA. Here, I use shape=3 to set 3 sigma priors and 3 mu priors (one for each of the three species). Then, I use [penguin_species.codes] to tell the likelihood which rows come from which species.

with pm.Model() as model_0:
    
    # Priors of shape 3, because 3 species.
    sigma = pm.HalfStudentT("sigma", 100, 2000, shape=3)
    mu = pm.Normal("mu", 4000, 3000, shape=3)
    
    # {y} is penguin mass.
    y = pm.Normal(
        "mass",
        mu=mu[penguin_species.codes],
        sigma=sigma[penguin_species.codes],
        observed=penguins["body_mass_g"]
    )

    idata_0 = pm.sample()
    idata_0.extend(pm.sample_prior_predictive(samples=5000))
    pm.sample_posterior_predictive(idata_0, extend_inferencedata=True)

This gives me six parameter distributions: 3 sigmas and 3 mus. Perfect.

But, when I want to extend this to a proper linear regression with a continuous covariate, I don’t know the best syntax for combining the two kinds of predictors. If I only have continuous covariates, I can construct the design matrix X, put a prior on the coefficient vector B, and then take pm.math.dot(X, B):

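For concreteness, here is one way a design matrix like that can be built (a sketch with made-up stand-in rows; I am assuming flipper length as the continuous covariate):

```python
import numpy as np
import pandas as pd

# Toy stand-in for a few penguin rows (illustrative values only).
penguins = pd.DataFrame({
    "flipper_length_mm": [181.0, 195.0, 210.0],
    "body_mass_g": [3750.0, 3800.0, 4200.0],
})

# Column of ones for the intercept, plus the continuous covariate,
# so that pm.math.dot(X, B) = B[0] + B[1] * flipper_length.
X = np.column_stack([
    np.ones(len(penguins)),
    penguins["flipper_length_mm"].to_numpy(),
])
```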

with pm.Model() as model_1:
    
    # Set priors.
    sigma = pm.HalfStudentT("sigma", 100, 2000)
    B = pm.Normal("B", 0, 3000, shape=2)

    # Define {mu} with the linear model.
    mu = pm.math.dot(X, B)

    # {y} is penguin mass.
    y = pm.Normal(
        "y",
        mu=mu,
        sigma=sigma,
        observed=penguins["body_mass_g"]
    )

    idata_1 = pm.sample()
    idata_1.extend(pm.sample_prior_predictive(samples=5000))
    pm.sample_posterior_predictive(idata_1, extend_inferencedata=True)

But, what if I want to combine the two methods and use BOTH continuous and categorical predictors? Do I have to manually construct a design matrix using something like pd.get_dummies, and then follow my 2nd method using pm.math.dot?

That seems like a lot of manual work, so I suspect there is an easier way???

Yes I think so.

That’s pretty much the purpose of the Bambi library: BAyesian Model-Building Interface (Bambi) in Python — Bambi 0.12.0.dev documentation

I would say that for categorical covariates, it’s better to index into a vector of variables than to use a design matrix. See the multilevel modeling example for implementation details, or this Statistical Rethinking video for a higher-level explanation.

Thanks, both. Gonna check out these links.

I was able to replicate both methods! I really love Bambi; hope to focus on that for a while.

But, I first wanted to make the same no-pooling model with both methods and check that they matched. Alas, Bambi really really really wants to do a partial-pooling model (Getting Started — Bambi 0.12.0.dev documentation). I can’t figure out how to make Bambi do no pooling.

Seems like a new question so I’ll make another post!


Duh, it’s literally so easy to do a no pooling model in Bambi that I didn’t even think of it. Just do “y ~ x + non_pooled_group” :smiley:
