Multiple Linear bayesian regression: Improve posterior predictive distribution

nipnipj · February 28, 2020, 11:48pm

Using yt.csv (4.8 KB) dataset, I’m trying to fit a multiple linear regression model with the followin code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import arviz as az
import pymc3 as pm

df = pd.read_csv('C:\yt.csv')

MultiRegress=pm.Model()
with MultiRegress:    
    # Paranmeter
    alpha = pm.Normal('alpha', mu=np.mean(df['sales']), sd=1)
    beta1 = pm.Normal('beta1', mu=0, sd=1)
    beta2 = pm.Normal('beta2', mu=0, sd=1)
    beta3 = pm.Normal('beta3', mu=0, sd=1)
    # variables
    x_1 = pm.Data("x_1", df['youtube']-np.mean(df['youtube']))
    x_2 = pm.Data("x_2", df['newspaper']-np.mean(df['newspaper']))
    x_3 = pm.Data("x_3", df['facebook']-np.mean(df['facebook']))
    # Linear regression
    mu = alpha + beta1 * x_1 + beta2 * x_2 + beta3 * x_3
    e = pm.HalfCauchy('e', 5)
    # Data likelihood
    Y_obs = pm.Normal('Y_obs', mu=mu, sd=e, observed=df['sales'])
    trace = pm.sample(2000, tune=1000)

ppc = pm.sample_posterior_predictive(trace, samples=2000, model=MultiRegress)

data_ppc = az.from_pymc3(trace=trace, posterior_predictive=ppc)
ax = az.plot_ppc(data_ppc, figsize=(12, 6), mean=False)
ax[0].legend(fontsize=15)

This is the result:

Could please suggest some corrections to the code in order to get a better fit arround the observed Y variable.

Thanks & Regards.

Gon_F · February 29, 2020, 2:24am

Hey, nipnipj. Quite an impressive knot of hair (I cannot not see it) you plotted there. It seems like you are knowledgeable with Pymc3’s methodology, so the next step is to think about what kind of model would best fit Y_obs.

Have you thought about what distributions may better fit your data? The T-distribution, some Gamma distribution, a mixture? Think about and play with these concepts for a bit and see how it goes.

sammosummo · March 4, 2020, 1:20am

Your posterior predictive distribution looks pretty good to me. The real data falls somewhere in the middle of it, which is good.

Schimidt99 · March 5, 2020, 4:18am

Hello nipnipj! I think it’s very important that you normalize your input data. The best method for your data would be to divide the values of each row of the ‘yt.csv’ by the average of the respective column. This will make a big difference in the distribution of input variables. You only made a substrate of the average of the columns, so in my opinion, it is not the best normalization. Good luck!

Topic		Replies	Views
Prediction by Bayesian linear regression Questions	7	1771	April 12, 2021
Multivariate linear regression with different prior distributions for coefficients v3 modeling	0	533	May 7, 2022
Could somebody provide a minimal example for sample_posterior_predictive() Questions	2	412	April 15, 2021
Tutorial real work about simple and multiple linear regression bayesian Questions	13	2160	September 20, 2018
Modeling of multiple regression model with array form v5 development , linear_model , shape_issue , modeling	12	1204	December 17, 2024

Multiple Linear bayesian regression: Improve posterior predictive distribution

Related topics