Prior on binary and ordinal variables in multiple linear regression

Hi, guys.

I have a question about building a linear model.

I’ve read some examples of linear models with continuous predictors, but I don’t know how to set prior distributions on binary (Boolean), categorical, or ordinal variables.

Let’s say I have DataFrames x_train and y_train (IQ scores).
x_train consists of four variables: age, sex, self-esteem, and favorite fruit.

age is a continuous variable, sex is binary, self-esteem is ordinal (1 to 5), and fruit is categorical (1 to 6).

How should I build a linear model with noninformative or weakly informative priors?

Would this work?


from pymc import Model, HalfCauchy, Normal, Bernoulli, DiscreteUniform, sample

with Model() as linear_model:
    # priors
    sigma = HalfCauchy("sigma", beta=5)
    alpha = Normal("alpha", mu=0, sigma=10)  # intercept

    beta0 = Normal("beta0", mu=0, sigma=20)             # age
    beta1 = Bernoulli("beta1", p=0.5)                   # sex
    beta2 = DiscreteUniform("beta2", lower=1, upper=5)  # self-esteem
    beta3 = DiscreteUniform("beta3", lower=1, upper=6)  # fruit

    # likelihood
    likelihood = Normal("likeli",
                        mu=alpha
                           + beta0 * x_train['age']
                           + beta1 * x_train['sex']
                           + beta2 * x_train['selfesteem']
                           + beta3 * x_train['fruit'],
                        sigma=sigma, observed=y_train)

    trace = sample(1000, return_inferencedata=True, chains=3)

Thank you!


I suspect that your priors are not what you want/intended. For example, assuming sex is dummy-coded (e.g., male=0, female=1), your model implies that the difference in IQ between males and females is at most 1 point (because the Bernoulli-distributed beta1 can only take the values 0 or 1). Similar logic applies to the coefficients for the categorical and ordinal predictors.
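
To see this directly, you can draw from that prior on its own (a quick sketch, assuming a recent PyMC where pm.draw is available):

import pymc as pm

# samples from a Bernoulli(0.5) prior are only ever 0 or 1,
# so a coefficient with this prior can never exceed 1
print(pm.draw(pm.Bernoulli.dist(p=0.5), draws=10))
# e.g. [0 1 1 0 1 0 0 1 1 0]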

A conventional approach would be to dummy-code categorical variables such as sex and fruit like this:

import pandas as pd

x_train = pd.DataFrame({'sex': ['f', 'm', 'f'],
                        'selfesteem': [1, 2, 3],
                        'fruit': ['apple', 'orange', 'banana'],
                        'age': [1, 2, 3]})
y_train = [1, 2, 3]

# dtype=float avoids the boolean dummy columns newer pandas produces by default
x_train_dmy = pd.get_dummies(x_train,
                             columns=['sex', 'fruit'],
                             prefix=['sex', 'fruit'],
                             drop_first=True,
                             dtype=float)

# grab the dummy-coded fruit columns as a (rows x levels-1) array
fruit_pred = x_train_dmy.loc[:, x_train_dmy.columns.str.startswith('fruit_')].to_numpy()
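
For reference, with the toy data above I would expect the result to look something like this (give or take column order):

print(x_train_dmy.columns.tolist())
# ['selfesteem', 'age', 'sex_m', 'fruit_banana', 'fruit_orange']

sex_f and fruit_apple are dropped as the reference levels, so fruit_pred ends up as a 3x2 array.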

Then you would have a separate parameter for each dummy-coded level:

import pymc as pm

with pm.Model() as linear_model:
    # priors
    sigma = pm.HalfCauchy("sigma", beta=5)
    # intercept
    alpha = pm.Normal("alpha", mu=0, sigma=10)
    # age
    beta0 = pm.Normal("beta0", mu=0, sigma=20)
    # sex
    beta1 = pm.Normal("beta1", mu=0, sigma=20)
    # self-esteem
    beta2 = pm.Normal("beta2", mu=0, sigma=20)
    # fruit (one coefficient per non-reference level)
    beta3 = pm.Normal("beta3", mu=0, sigma=20, shape=len(x_train['fruit'].unique()) - 1)

    # likelihood
    likelihood = pm.Normal("likeli",
                           mu=alpha +
                              beta0 * x_train_dmy['age'] +
                              beta1 * x_train_dmy['sex_m'] +
                              beta2 * x_train_dmy['selfesteem'] +
                              pm.math.dot(beta3, fruit_pred.T),
                           sigma=sigma, observed=y_train)
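
Sampling then works just as in your original snippet; inside the model context you would call, e.g.:

    trace = pm.sample(1000, chains=3, return_inferencedata=True)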

Here I have treated the ordinal selfesteem as if it were continuous, which it is not (though such rating-scale measurements are often treated as if they were). You could instead treat the 5 responses as categorical and model them as I have done here with fruit; doing so implies that the effect of selfesteem need not be monotonic. To explore options for modeling it as truly ordinal, I would recommend Austin Rochford’s blog post.
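
In case a concrete starting point helps, here is a rough sketch of one such option, a monotonic-effects parameterization in the spirit of Bürkner & Charpentier: a single overall coefficient is multiplied by the cumulative sum of a Dirichlet-distributed simplex, so the effect rises monotonically with the rating while the spacing between levels is learned. This assumes PyMC v5 (with pytensor), and the names delta, beta_se, selfesteem_idx, and mu_se are just mine; mu_se would stand in for the beta2 * x_train_dmy['selfesteem'] term above:

import numpy as np
import pymc as pm
import pytensor.tensor as pt

# map the 1-5 ratings to 0-4 indices
selfesteem_idx = x_train['selfesteem'].to_numpy() - 1

with pm.Model() as monotonic_sketch:
    # simplex of increments between adjacent self-esteem levels
    delta = pm.Dirichlet("delta", a=np.ones(4))
    # overall effect of going from the lowest to the highest level
    beta_se = pm.Normal("beta_se", mu=0, sigma=20)
    # per-level effect: 0 at level 1, rising monotonically to 1 at level 5
    level_effect = pt.concatenate([pt.zeros(1), pt.cumsum(delta)])
    mu_se = beta_se * level_effect[selfesteem_idx]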


I really appreciate it! Thank you so much! :+1:
