# Basic linear regression on fuel efficiency of cars/trucks. Trouble with dims

Hi, I’m new here. While there are many examples online of different models you can build with pymc, I keep running into trouble when I try something on my own. I have tried pymc multiple times, and I think I eventually always hit an issue with dims. So instead of more endless trial and error, I thought I’d ask for help once. I made a silly toy dataset for a pretty basic linear regression that answers a simple question: are trucks more fuel efficient than cars? Fuel use per second is determined by the speed and weight of the vehicle: more speed and more weight mean more fuel use. Cars and trucks drive at the same speeds, but trucks are heavier, so they use more fuel. But what if we account for the effect of weight itself on fuel use? Maybe trucks have more efficient engines, so they would use less fuel per second than cars if they weighed the same?

Below is code that generates the data. The numbers are total nonsense of course, it’s just about the principle. It also includes some plotting and a linear regression with statsmodels as a sanity check. Then the pymc model, sampling and plotting of the results.

My main problem is that I get a posterior for each data point individually. But I simply want one for cars and one for trucks, and ideally also one for their difference, so that I can conclude something like “After accounting for speed and weight, I am 90% sure the difference in fuel use is more than 5, 80% sure the difference is more than 10, 60% sure the difference is more than 20”.
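To make that concrete, this is the kind of summary I’m after, sketched with numpy. The draws here are made up, just standing in for whatever posterior samples of the truck-minus-car difference the model would produce:

``````
import numpy as np

# Made-up draws standing in for posterior samples of the
# truck-minus-car difference in fuel use (negative = trucks use less).
rng = np.random.default_rng(0)
diff_draws = rng.normal(-20, 10, size=4000)

# "X% sure the difference is more than t" is just the fraction
# of posterior draws beyond that threshold:
for t in (5, 10, 20):
    p = (diff_draws < -t).mean()
    print(f"P(trucks use more than {t} less fuel/sec) = {p:.2f}")
``````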

``````
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pymc as pm
import arviz as az

import statsmodels.api as sm
import statsmodels.formula.api as smf

def generate_data():
    np.random.seed(49213)
    n = 100
    npart = int(n / 4)  # 25 trucks, 75 cars

    speed = np.random.normal(100, 5, n)
    weight = np.random.normal(1000, 5, n)
    weight[:npart] += 100  # trucks are heavier

    fuelpersec = 0.5 * weight + 2 * speed + np.random.normal(0, 1, n)
    fuelpersec[:npart] -= 20  # trucks have more efficient engines

    df = pd.DataFrame({'weight': weight, 'speed': speed, 'fuelpersec': fuelpersec})
    df['vehicle'] = 'car'
    df.loc[:npart-1, 'vehicle'] = 'truck'
    return df

df = generate_data()

f, axes = plt.subplots(1, 2, figsize=[8, 3])
sns.scatterplot(df, x='speed', y='fuelpersec', hue='vehicle', ax=axes[0])
sns.scatterplot(df, x='weight', y='fuelpersec', hue='vehicle', ax=axes[1])
``````

``````
# center the data
df['vehicle'] = df['vehicle'].replace({'car':0, 'truck':1})
cols = ['weight', 'speed', 'vehicle']
df[cols] = df[cols] - df[cols].mean(axis=0)

model = smf.ols(formula='fuelpersec ~ vehicle + speed + weight', data=df)
res = model.fit()
print(res.summary())
``````

The coefficients are as expected from the data generation: trucks use 20 less fuel per second after accounting for speed and weight.

``````
with pm.Model(coords={'idx': df.index}) as model:
    # data
    speed = pm.MutableData('speed', df['speed'], dims='idx')
    weight = pm.MutableData('weight', df['weight'], dims='idx')
    vehicle = pm.MutableData('vehicle', df['vehicle'], dims='idx')
    fuel = pm.MutableData('fuel', df['fuelpersec'], dims='idx')

    # priors
    intercept = pm.Normal("intercept", mu=700, sigma=10)
    speed_mu = pm.Normal("speed_mu", mu=0, sigma=1)
    weight_mu = pm.Normal("weight_mu", mu=0, sigma=1)
    vehicle_mu = pm.Normal("vehicle_mu", mu=0, sigma=1)
    sigma = pm.HalfNormal("sigma", sigma=1)

    mu = pm.Deterministic('mu', intercept + speed_mu*speed + weight_mu*weight + vehicle_mu*vehicle)

    # likelihood
    Y_obs = pm.Normal("Y_obs", mu=mu, sigma=sigma, observed=fuel)

    # sampling
    idata = pm.sample_prior_predictive(samples=100, random_seed=0)
    idata.extend(pm.sample(draws=1000, tune=1000, chains=2, random_seed=0))
    idata.extend(pm.sample_posterior_predictive(idata, random_seed=0))

display(pm.model_to_graphviz(model))
display(az.plot_ppc(idata, group='prior'));
display(az.plot_ppc(idata, num_pp_samples=100));
display(az.summary(idata));
``````

After sampling, you can set your mutable data to the two cases you care about and run posterior predictive sampling with the fitted trace. You will then get predictive draws directly for the “2 observations” corresponding to those cases.

If you want to compare the predictive effect of a single variable (“holding the others constant”), you have to decide what that means. You could set the other variables to their means. Or you could add a couple of observations covering the observed range of the “other variables” to both groups (so now you have 2 * n cases instead of just 2 * 1), and then ignore those and look only at the “car” vs “truck” contrast, or whatever you care about.
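As a sketch of the “set the others to their mean” version: with centered predictors you can build two counterfactual rows (an average car and an average truck) and push every posterior coefficient draw through them. The coefficient draws below are simulated placeholders with roughly the values your statsmodels fit recovered; in practice you would stack the corresponding arrays out of `idata.posterior`:

``````
import numpy as np

rng = np.random.default_rng(1)
ndraws = 4000
# Placeholder draws for (intercept, speed_mu, weight_mu, vehicle_mu);
# in practice stack the corresponding arrays from idata.posterior.
beta = np.column_stack([
    rng.normal(700, 1.0, ndraws),    # intercept
    rng.normal(2.0, 0.1, ndraws),    # speed_mu
    rng.normal(0.5, 0.05, ndraws),   # weight_mu
    rng.normal(-20.0, 2.0, ndraws),  # vehicle_mu
])

# Two counterfactual rows at the mean of every centered predictor
# (speed = weight = 0); vehicle was coded 0/1 then centered with 25% trucks.
X = np.array([[1.0, 0.0, 0.0, 0.0 - 0.25],   # average car
              [1.0, 0.0, 0.0, 1.0 - 0.25]])  # average truck

mu = beta @ X.T             # shape (ndraws, 2): expected fuel use per draw
diff = mu[:, 1] - mu[:, 0]  # truck minus car, holding the rest constant
print(f"P(trucks use at least 10 less fuel/sec) = {(diff < -10).mean():.2f}")
``````

The contrast `diff` here collapses to the `vehicle_mu` draws themselves (the other columns cancel), which is exactly why a centered dummy coefficient already answers the “difference between groups” question.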

In general, there is no single way to ask questions of a fitted model.
