# Including a Distribution as a Predictior in Regression

Say I am trying to model blood alcohol concentration (BAC) in individuals after consuming alcohol. For each individual I have data describing:

1. `n_drinks`: number of drinks consumed
2. `sex_is_female_01`: 1 when F, 0 when M
3. `bac`: BAC as measured x minutes after consumed alcohol (x is fixed for all participants)

I make the simple model below to predict BAC from `sex_is_female_01` and `n_drinks`.

``````with pm.Model() as model:
mu_intercept = Normal('mu_intercept', mu = 0.0, tau = 0.0001)
sigma_intercept = HalfCauchy('sigma_intercept', 5)

a = Normal('a', mu = mu_intercept, sigma = sigma_intercept)

b = Normal('b', mu = 0.0, sigma = 1e5, shape = 2)

sd_y = HalfCauchy('sd_y', 5)

y_hat = a + b[0]*n_drinks+ b[1]*sex_is_female_01

y_like = Normal('y_like', mu = y_hat, sigma = sd_y, observed = bac)

``````

Fortunately this works quite well and I can show BAC independently depends on `sex_is_female_01` and `n_drinks`. However, I also know that males tend to be heavier than females, and I start to wonder if `sex_is_female_01` may actually be a proxy for weight. Sadly, I do not have weight data for my participants. . . I can however find data online describing the distribution of male and female weights in the population from which my participants were drawn (https://jech.bmj.com/content/40/4/319 – yes I realize people have gotten fatter since, but let’s say these are still accurate. Alternatively say it is 19, 19, 1985 (youtube), the year the data in the referenced paper was collected). I code up another model including that distribution as a predictor.

Edit: I am realized since posting that I was attempting to implement a bayesian approach to simply creating an additional independent variable and filling it with the average female weight for female subjects and average male weight for male subjects. Thus, the created variable containing mean weights would be completely dependent on sex, and should not be included in the model (obviously correct me if I am wrong). That being said, I would still like to know if the model I implemented below does what I intended. Thank you.

``````with pm.Model() as vary_intercept:
mu_intercept = Normal('mu_intercept', mu = 0.0, tau = 0.0001)
sigma_intercept = HalfCauchy('sigma_intercept', 5)

#female weight distribution from referenced paper
weight_f = Normal('weight_f', mu = 61, sigma = 11)
#male weight distribution from referenced paper
weight_m = Normal('weight_m', mu = 76, sigma = 11.5)
#distribution of weight for a given subject, depending on sex
weight = weight_m*(1 - subject_is_female_01) + weight_f*(subject_is_female_01)

a = Normal('a', mu = mu_intercept, sigma = sigma_intercept)

b = Normal('b', mu = 0.0, sigma = 1e5, shape = 3)

sd_y = HalfCauchy('sd_y', 5)

y_hat = a + b[0]*n_drinks+ b[1]*subject_is_female_01 + b[2] * weight

y_like = Normal('y_like', mu = y_hat, sigma = sd_y, observed = bac)

``````

I was worried that the model would not run, as there would seem to a dimensional mismatch in `y_hat` – we only have one sex per participant, but a whole distribution of possible weights for each. But model sampled as expected.

# My questions are:

• What have I misunderstood about the shape of the terms in `y_hat` that led me to believe this would not work?
• Does the model represent what I intended?

Thank you