Bambi vs pymc model specification

Hi, I’m struggling to fit a model in pymc that is equivalent to one specified using bambi. My code is below. The bambi model produces different (and correct) results compared with the pymc model, and I don’t understand how my model specification in pymc differs structurally from the one in bambi. Any insight would be most appreciated!

'''
White Tail Deer Dataset
Hefley et al. 2013
Available at https://www.stat.colostate.edu/~hooten/other/BBM2L_Code.zip
'''

import pandas as pd
import pymc as pm
import bambi as bmb
import arviz as az

deer_df = pd.read_csv('deer.csv', names=['body_mass', 'crop_size', 'sex', 'age', 'age_sq'],
                      header=0, index_col=0)
deer_df['intercept'] = 1.
# Replace crop_size with categories
deer_df['crop_size_cat'] = deer_df.crop_size.replace({599:0, 614:1})

# Define the x,y data
deer_x = deer_df[['crop_size_cat', 'sex', 'age', 'age_sq']].to_numpy()
deer_y = deer_df[['body_mass']].to_numpy()

# data dimensions
n = deer_x.shape[0] # data rows
p = deer_x.shape[1] # regression parameters
print(n, p)
%%time

# Run model using pymc
with pm.Model() as lm_deer1:
    
    ϵ = pm.HalfNormal('ϵ', 10.)

    α = pm.Normal('α', mu=39., sd=100.)

    β = pm.Normal('β', mu=0., sd=100., shape=p)
   
    # Linear predictor
    μ = pm.Deterministic('μ', α + pm.math.dot(deer_x, β))
    
    # Gaussian Likelihood
    y_pred = pm.Normal('y_pred',
                       mu=μ,
                       sd=ϵ,
                       observed=deer_y)
    
    # Sample
    idata_lm_deer1 = pm.sample(1000)

# Numerical summary of parameters
az.summary(idata_lm_deer1, var_names=['ϵ', 'α', 'β'], round_to=2)
       mean    sd  hdi_5.5%  hdi_94.5%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
ϵ     15.36  0.02     15.33      15.40        0.0      0.0   3561.41   2350.36    1.0
α     39.53  0.08     39.40      39.65        0.0      0.0   1932.08   1855.40    1.0
β[0]   0.00  0.07     -0.11       0.11        0.0      0.0   3064.31   2528.50    1.0
β[1]  -0.00  0.07     -0.11       0.11        0.0      0.0   2577.86   2322.27    1.0
β[2]   0.00  0.05     -0.08       0.08        0.0      0.0   1955.27   2208.42    1.0
β[3]  -0.00  0.01     -0.01       0.01        0.0      0.0   2155.84   2205.65    1.0
# Fit the model with Bambi

lm_deer2 = bmb.Model("body_mass ~ crop_size_cat + sex + age + age_sq", deer_df)
idata_lm_deer2 = lm_deer2.fit(draws=2000)

# Numerical summary of traces
az.summary(idata_lm_deer2, round_to=2)
                  mean    sd  hdi_5.5%  hdi_94.5%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
Intercept        13.03  0.80     11.76      14.30       0.01     0.01   2892.27   2941.11    1.0
crop_size_cat     0.74  0.71     -0.47       1.81       0.01     0.01   4291.72   2460.97    1.0
sex               8.20  0.71      7.07       9.32       0.01     0.01   3736.76   2665.11    1.0
age              17.46  0.51     16.67      18.28       0.01     0.01   2178.99   2324.65    1.0
age_sq           -1.64  0.07     -1.75      -1.52       0.00     0.00   2161.45   2329.67    1.0
body_mass_sigma   7.29  0.24      6.92       7.66       0.00     0.00   4147.96   2625.83    1.0

The values for α (intercept), β[0] (crop_size_cat), β[1] (sex), β[2] (age), and β[3] (age_sq) are all very different between bambi and pymc, and I’m not sure how to reproduce the bambi result using pymc. Thanks for any insight anyone can provide!

Last updated: Tue Feb 22 2022

Python implementation: CPython
Python version       : 3.9.10
IPython version      : 8.0.1

pandas    : 1.4.1
pymc      : 4.0.0b2
scipy     : 1.7.3
matplotlib: 3.5.1
theano    : 1.1.2
aesara    : 2.3.8
arviz     : 0.11.4
numpy     : 1.22.2

Watermark: 2.3.0

deer.csv (12.5 KB)


I think it’s a broadcasting issue with deer_y. Can you make the modification

# Gaussian Likelihood
    y_pred = pm.Normal('y_pred',
                       mu=μ,
                       sd=ϵ,
                       observed=deer_y.squeeze())

and tell me if you get results that line up with Bambi?


Hi, I just gave this a run and found the issue, I think. Your variable deer_y is an n × 1 array. It should instead be a 1D array of shape (n,). I believe that the way you wrote it results in some unwanted broadcasting somewhere. In any case, when I run:

# Run model using pymc
with pm.Model() as lm_deer1:
    
    ϵ = pm.HalfNormal('ϵ', 10.)

    α = pm.Normal('α', mu=39., sd=100.)

    β = pm.Normal('β', mu=0., sd=100., shape=p)
   
    # Linear predictor
    μ = pm.Deterministic('μ', α + pm.math.dot(deer_x, β))
    
    # Gaussian Likelihood
    y_pred = pm.Normal('y_pred',
                       mu=μ,
                       sd=ϵ,
                       # The only change is here!
                       observed=deer_y.reshape(-1))
    
    # Sample
    idata_lm_deer1 = pm.sample(1000, return_inferencedata=True)

I get very similar results to the ones you post for bambi:

       mean    sd  hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
ϵ      7.29  0.23    6.83     7.70       0.00     0.00   3228.05   2385.95    1.0
α     13.01  0.80   11.58    14.57       0.02     0.01   2059.10   2307.27    1.0
β[0]   0.74  0.70   -0.59     2.06       0.01     0.01   3757.76   3028.90    1.0
β[1]   8.19  0.68    6.87     9.45       0.01     0.01   2990.54   2458.38    1.0
β[2]  17.47  0.51   16.52    18.43       0.01     0.01   2035.88   1994.96    1.0
β[3]  -1.64  0.07   -1.78    -1.51       0.00     0.00   2225.92   2121.04    1.0

I have made this mistake, and I’ve seen others make it too – I think it’s a potential downside to how numpy and other libraries handle broadcasting. Hope this helps!
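For anyone curious about the mechanism: a minimal NumPy sketch (with made-up shapes, not the actual deer data) shows the silent broadcast. A (n,) mean vector combined with (n, 1) observations produces an (n, n) matrix, so the likelihood is evaluated over n² pseudo-observations instead of n:

```python
import numpy as np

n = 5
mu = np.arange(n, dtype=float)  # linear predictor, shape (n,)
y_col = np.ones((n, 1))         # observations as a column, shape (n, 1)

# (n, 1) broadcast against (n,) silently yields an (n, n) matrix
print((y_col - mu).shape)               # (5, 5)

# Flattening restores elementwise alignment
print((y_col.reshape(-1) - mu).shape)   # (5,)
```

This is why `deer_y.reshape(-1)` (or `.squeeze()`, or `.flatten()`) fixes the model: the residuals become a proper 1D vector again.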

PS: Thank you for posting the code and inputs so clearly, it made it easy to run!


This is working - thanks so much!


Thank you for this!

Oh, my apologies, somehow I missed that you also found the problem @ckrapu! Glad we could help you @Jake_Wall :slight_smile:
