# Prior on binary and ordinal variables in multiple linear regression

Hi, guys.

I have a question about building a linear model.

I don't know how to set prior distributions on binary (boolean), categorical, or ordinal variables.

Let's say I have dataframes x_train and y_train (IQ score).
x_train consists of 4 variables: age, sex, self-esteem, and favorite fruit.

age is a continuous variable, sex is binary, self-esteem is ordinal (1 to 5), and fruit is categorical (1 to 6).

How should I build a linear model with non-informative or weak priors?

Would this work?

```python
with Model() as linear_model:
    # priors
    sigma = HalfCauchy("sigma", beta=5)
    alpha = Normal("alpha", mu=0, sigma=10)  # intercept

    beta0 = Normal("beta0", mu=0, sigma=20)             # age
    beta1 = Bernoulli("beta1", p=0.5)                   # sex
    beta2 = DiscreteUniform("beta2", lower=1, upper=5)  # self-esteem
    beta3 = DiscreteUniform("beta3", lower=1, upper=6)  # fruit

    # likelihood
    likelihood = Normal("likeli",
                        mu=alpha
                        + beta0 * x_train['age']
                        + beta1 * x_train['sex']
                        + beta2 * x_train['selfesteem']
                        + beta3 * x_train['fruit'],
                        sigma=sigma, observed=y_train)

    trace = sample(1000, return_inferencedata=True, chains=3)
```

Thank you!


I suspect that your priors are not what you want/intended. For example, assuming sex is dummy-coded (e.g., male=0, female=1), your model implies that the difference in IQ between males and females is at most 1 point, because the Bernoulli-distributed `beta1` can only take on the values 0 and 1. Similar logic applies to the coefficients for the categorical and ordinal predictors.
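To see this concretely, you can inspect the support of the prior directly. A minimal sketch using NumPy (rather than sampling the PyMC model) to draw from the same Bernoulli(p=0.5) distribution:

```python
import numpy as np

rng = np.random.default_rng(42)

# Draws from the Bernoulli(p=0.5) prior placed on beta1
samples = rng.binomial(n=1, p=0.5, size=10_000)

# Every draw is 0 or 1, so under this prior the implied
# male/female difference in IQ can never exceed 1 point
print(sorted(set(samples)))  # [0, 1]
```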

A conventional approach would be to dummy-code categorical variables such as sex and fruit like this:

```python
import pandas as pd

x_train = pd.DataFrame({'sex': ['f', 'm', 'f'],
                        'selfesteem': [1, 2, 3],
                        'fruit': ['apple', 'orange', 'banana'],
                        'age': [1, 2, 3]})
y_train = [1, 2, 3]

x_train_dmy = pd.get_dummies(x_train,
                             columns=['sex', 'fruit'],
                             prefix=['sex', 'fruit'],
                             drop_first=True)

# select the dummy columns for fruit
fruit_pred = x_train_dmy.loc[:, x_train_dmy.columns.str.startswith('fruit_')].to_numpy()
```
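For reference, here is a quick check of which columns this produces. With `drop_first=True`, `get_dummies` drops the alphabetically first level of each encoded column (`'f'` for sex, `'apple'` for fruit), which becomes the baseline absorbed by the intercept:

```python
import pandas as pd

x_train = pd.DataFrame({'sex': ['f', 'm', 'f'],
                        'selfesteem': [1, 2, 3],
                        'fruit': ['apple', 'orange', 'banana'],
                        'age': [1, 2, 3]})

x_train_dmy = pd.get_dummies(x_train,
                             columns=['sex', 'fruit'],
                             prefix=['sex', 'fruit'],
                             drop_first=True)

# Reference levels 'f' and 'apple' have no column of their own
print(sorted(x_train_dmy.columns))
# ['age', 'fruit_banana', 'fruit_orange', 'selfesteem', 'sex_m']
```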

Then you would have a separate parameter for each dummy-coded level:

```python
with pm.Model() as linear_model:
    # priors
    sigma = pm.HalfCauchy("sigma", beta=5)
    # intercept
    alpha = pm.Normal("alpha", mu=0, sigma=10)
    # age
    beta0 = pm.Normal("beta0", mu=0, sigma=20)
    # sex
    beta1 = pm.Normal("beta1", mu=0, sigma=20)
    # self-esteem
    beta2 = pm.Normal("beta2", mu=0, sigma=20)
    # fruit (one coefficient per non-reference level)
    beta3 = pm.Normal("beta3", mu=0, sigma=20,
                      shape=len(x_train['fruit'].unique()) - 1)

    # likelihood
    likelihood = pm.Normal("likeli",
                           mu=alpha
                           + beta0 * x_train_dmy['age']
                           + beta1 * x_train_dmy['sex_m']
                           + beta2 * x_train_dmy['selfesteem']
                           + pm.math.dot(beta3, fruit_pred.T),
                           sigma=sigma, observed=y_train)
```

Here I have treated the ordinal `selfesteem` as if it were continuous, which it is not (though such rating-scale measurements are often treated that way). You could instead treat the 5 responses as categorical and model them as I have done with `fruit`; doing so implies that the effect of `selfesteem` need not be monotonic. To explore options for modeling them as truly ordinal, I would recommend Austin Rochford's blog post.
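As one sketch of an ordinal option (my own illustration, not taken from that post): a "thermometer" (cumulative) coding gives each step up the scale its own increment, and constraining the increments to be non-negative (e.g., with `HalfNormal` priors in PyMC) makes the overall effect monotonic. The encoding itself in NumPy:

```python
import numpy as np

selfesteem = np.array([1, 2, 3, 5])  # example ordinal responses on a 1-5 scale
n_levels = 5

# Thermometer coding: a response at level k switches on indicators
# for every step 2..k ("did the response reach step j or not")
X_ord = (selfesteem[:, None] >= np.arange(2, n_levels + 1)[None, :]).astype(float)
print(X_ord)
# [[0. 0. 0. 0.]
#  [1. 0. 0. 0.]
#  [1. 1. 0. 0.]
#  [1. 1. 1. 1.]]

# With hypothetical non-negative per-step increments delta_j,
# the contribution X_ord @ delta is monotone in selfesteem
delta = np.array([0.5, 0.2, 0.1, 0.3])
effect = X_ord @ delta
print(effect)  # [0.  0.5 0.7 1.1]
```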


I really appreciate it! Thank you so much!