Hi, I’m new here. There are plenty of examples online of different models you can build with PyMC, but I keep running into trouble when I try something on my own. I have tried PyMC several times, and I think I always end up stuck on an issue with dims, so I thought I would ask for help for once instead of going through endless trial and error.
I made a silly toy dataset for a pretty basic linear regression that answers a simple question: are trucks more fuel efficient than cars? Fuel use per second is determined by the speed and weight of the vehicle: more speed and more weight mean more fuel use. Cars and trucks drive at the same speeds, but trucks are heavier, so they use more fuel. But what if we account for the effect of weight itself on fuel use? Maybe trucks have more efficient engines, so that they would use less fuel per second than cars if they weighed the same.
Below is the code that generates the data. The numbers are total nonsense, of course; it is just about the principle. It also includes some plotting and a linear regression with statsmodels as a sanity check, followed by the PyMC model, sampling, and plotting of the results.
My main problem is that I get a posterior for each data point individually, but I simply want one for cars and trucks separately, and ideally also one for their difference, so that I can conclude something like: “After accounting for speed and weight, I am 90% sure the difference in fuel use is more than 5, 80% sure it is more than 10, and 60% sure it is more than 20.” (At the end of the post I sketch the kind of computation I have in mind.)
Thank you for your time.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pymc as pm
import arviz as az
import statsmodels.api as sm
import statsmodels.formula.api as smf
def generate_data():
    np.random.seed(49213)
    n = 100
    npart = int(n/4) # 25 trucks, 75 cars
    speed = np.random.normal(100, 5, n)
    weight = np.random.normal(1000, 5, n)
    weight[:npart] += 100 # trucks are heavier
    fuelpersec = 0.5 * weight + 2 * speed + np.random.normal(0, 1, n)
    fuelpersec[:npart] -= 20 # trucks have more efficient engines
    df = pd.DataFrame({'weight':weight, 'speed':speed, 'fuelpersec':fuelpersec})
    df['vehicle'] = 'car'
    df.loc[:npart-1, 'vehicle'] = 'truck'
    return df
df = generate_data()
f, axes = plt.subplots(1,2, figsize=[8,3])
sns.scatterplot(df, x='speed', y='fuelpersec', hue='vehicle', ax=axes[0])
sns.scatterplot(df, x='weight', y='fuelpersec', hue='vehicle', ax=axes[1])
# encode vehicle as car=0 / truck=1 and center the predictors
df['vehicle'] = df['vehicle'].replace({'car':0, 'truck':1})
cols = ['weight', 'speed', 'vehicle']
df[cols] = df[cols] - df[cols].mean(axis=0)
model = smf.ols(formula='fuelpersec ~ vehicle + speed + weight', data=df)
res = model.fit()
print(res.summary())
The coefficients come out as expected from the data generation: trucks use about 20 less fuel per second once speed and weight are accounted for.
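In case it helps, the single number I am comparing against is the vehicle coefficient from that fit; this just pulls it out of the res object created above, together with its 95% confidence interval.

print(res.params['vehicle'])          # should be close to -20
print(res.conf_int().loc['vehicle'])  # 95% confidence interval for the vehicle effect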
with pm.Model(coords={'idx':df.index}) as model:
    # data
    speed = pm.MutableData('speed', df['speed'], dims='idx')
    weight = pm.MutableData('weight', df['weight'], dims='idx')
    vehicle = pm.MutableData('vehicle', df['vehicle'], dims='idx')
    fuel = pm.MutableData('fuel', df['fuelpersec'], dims='idx')
    # priors
    intercept = pm.Normal("intercept", mu=700, sigma=10)
    speed_mu = pm.Normal("speed_mu", mu=0, sigma=1)
    weight_mu = pm.Normal("weight_mu", mu=0, sigma=1)
    vehicle_mu = pm.Normal("vehicle_mu", mu=0, sigma=1)
    sigma = pm.HalfNormal("sigma", sigma=1)
    mu = pm.Deterministic('mu', intercept + speed_mu*speed + weight_mu*weight + vehicle_mu*vehicle)
    # likelihood
    Y_obs = pm.Normal("Y_obs", mu=mu, sigma=sigma, observed=fuel)
    # sampling
    idata = pm.sample_prior_predictive(samples=100, random_seed=0)
    idata.extend(pm.sample(draws=1000, tune=1000, chains=2, random_seed=0))
    idata.extend(pm.sample_posterior_predictive(idata, random_seed=0))
display(pm.model_to_graphviz(model))
display(az.plot_ppc(idata, group='prior'));
display(az.plot_ppc(idata, num_pp_samples=100));
display(az.summary(idata));
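To make concrete what I mean by “a posterior for the difference”, this is roughly the kind of summary I imagine computing from the trace. It is only a sketch of my intent: I am assuming the posterior of vehicle_mu is the relevant quantity (vehicle is coded car = 0, truck = 1, so I read -vehicle_mu as the fuel a truck saves compared to a car at the same speed and weight), and I am not sure this is the right way to do it.

# Sketch only: assumes vehicle_mu captures the truck-vs-car effect after
# adjusting for speed and weight (vehicle coded car=0, truck=1)
saving = -idata.posterior['vehicle_mu'].values.ravel()  # pool all chains and draws
for threshold in [5, 10, 20]:
    prob = (saving > threshold).mean()
    print(f"P(trucks use more than {threshold} less fuel/sec) = {prob:.2f}")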

