# Prediction by Bayesian linear regression

Hello all,

I am quite new to Python and Bayesian Linear Regression. I could write a code for the Bayesian Linear Regression model (Although I don’t know if it is good or not) to predict a SINGLE “x”. However, my data consists of 3 independent variables and 1 dependent variable and I like to predict my dependent variable based on 3 inputs, not just 1 input as you can see in the following code.

I also could not understand how the model trained itself because in the model context (pymc3) there is no sign of using X_train or X_test! That is confusing to me.

Thank you all.

``````import numpy as np
import pandas as pd
import scipy
import theano
import pymc3 as pm

from sklearn.model_selection import train_test_split

X=dataset.iloc[:,:-1].values
y=dataset.iloc[:,3].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42)

with pm.Model() as linear_model:
# Intercept
intercept = pm.Normal('Intercept', mu = 0, sd = 10)

# Slope
slope = pm.Normal('slope', mu = 0, sd = 10)

# Standard deviation
sigma = pm.HalfNormal('sigma', sd = 10)

# Estimate of mean
mean = intercept + slope * dataset.iloc[:,0].values

# Observed values
Y_obs = pm.Normal('Y_obs', mu = mean, sd = sigma, observed = y)

# Sampler
step = pm.NUTS()

# Posterior distribution
linear_trace = pm.sample(1000, step)

pm.traceplot(linear_trace, figsize = (12, 12))

pm.plot_posterior(linear_trace, figsize = (12, 10))

bayes_prediction = linear_trace['Intercept'] + linear_trace['slope'] * "Any number for ONE IV"
``````
1 Like

@OriolAbril Thanks.

1 Like

``````    # Slopes
slope1 = pm.Normal('slope1', mu = 0, sd = 10)
slope2 = pm.Normal('slope2', mu = 0, sd = 10)
slope3 = pm.Normal('slope3', mu = 0, sd = 10)
``````

Then use them:

``````    # Estimate of mean
mean = (intercept +
(slope1 * dataset['x1']) +
(slope2 * dataset['x2']) +
(slope3 * dataset['x3']) )
``````

There are more compact ways of specifying this, but it’s hard to go wrong with verbose and explicit.

2 Likes

As for prediction, your code is currently training on all the data. It is not using the train/test split you constructed at the beginning. To do so, you would use the training data instead of the dataset itself.

``````# Estimate of mean
mean = intercept + slope * X_train['var1']
``````

and

``````Y_obs = pm.Normal('Y_obs', mu = mean, sd = sigma, observed = y_train)
``````
2 Likes

Hi, welcome!

The predictors are used in the `mean = intercept + slope * dataset.iloc[:,0].values`, here you are using the first column of `dataset` as predictor, doing intercept + slope * x_0. You can define this in any way you want, for a univariate linear regression with multiple predictors, you can use for example

``````slope = pm.Normal('slope', mu = 0, sd = 10, shape=3)
mean = intercept + slope[0] * dataset.iloc[:,0].values + slope[1] * dataset.iloc[:,1].values + slope[2] * dataset.iloc[:,2].values
``````

in order to use 3 predictors, each with it’s own slope. You could also set different priors for the slopes…

I don’t know much about the context here, but depending on what you eventually plan to work on and which extensions could the model need, I’d recommend taking a look at the `pymc3.glm` module and/or at `bambi`.

The predictors are used as I commented above in defining `mean`, the observed values, `y` in this case are passed to PyMC3 with the `observed` keyword, this is done in `Y_obs` which defines the normal likelhood of the model.

Again, I don’t know what you plan to use the model for, but in general there is no need to split the data into train and test subsets for fitting, you can evaluate the goodness of fit by other means, and overfitting is not generally an issue due to the probabilistic nature of the modeling plus the effect of the priors.

To give a concrete example, you can actually estimate the leave-one-out cross-validation with `az.loo`, as done in this notebook

2 Likes

Now, I want to predict the DV with the following code but I get an error:

`bayes_prediction = linear_trace['Intercept'] + linear_trace['slope'] * [16,31,35]`

[16,31,35] are three random numbers for slope [0], slope [1] and slope [2].
I don’t know how to put three slopes together to get the prediction. I would like to put my X_test values for my 3 IVs.

I appreciate it if you could help me with this as well.

Thank you.

I’m not sure I understand what this is supposed to mean.

Here is a model:

``````with pm.Model() as linear_model:
# Intercept
intercept = pm.Normal('Intercept', mu = 0, sd = 10)

# Slope
slope1 = pm.Normal('slope1', mu = 0, sd = 10)
slope2 = pm.Normal('slope2', mu = 0, sd = 10)
slope3 = pm.Normal('slope3', mu = 0, sd = 10)

# Standard deviation
sigma = pm.HalfNormal('sigma', sd = 10)

# Estimate of mean
mean = (intercept +
slope1 * X_train[:,0] +
slope2 * X_train[:,1] +
slope3 * X_train[:,2])

# Observed values
Y_obs = pm.Normal('Y_obs',
mu = mean,
sd = sigma,
observed = y_train)

# Posterior distribution
linear_trace = pm.sample()
``````

Now you can now use `X_test` make predictions about `y_test`. But how you do this will strongly depend on what exactly you are looking for. Presumably, you are interested in the predictions associated with the individual observations in the test set.

``````# which observation from the test
# set do we want to inspect?
obs_index = 0

ppc = (linear_trace['Intercept'] +
(linear_trace['slope1'] * X_test[obs_index,0]) +
(linear_trace['slope2'] * X_test[obs_index,1]) +
(linear_trace['slope3'] * X_test[obs_index,2]) )
``````

The variable `ppc` will contain a full posterior distribution over values of `y` for the relevant observation (i.e., the value found in `y_test[obs_index]`). What you do from there depends on your needs.

1 Like

Thank you so much.