Bambi vs pymc model specification

Hi, I’m struggling to fit a model in pymc that is equivalent to one specified using bambi. My code is below. The bambi model produces different (and correct) results compared with the pymc model, and I don’t understand how my model specification in pymc differs structurally from the one in bambi. Any insight would be most appreciated!

'''
White Tail Deer Dataset
Hefley et al. 2013
Available at https://www.stat.colostate.edu/~hooten/other/BBM2L_Code.zip
'''

import pandas as pd
import pymc as pm
import bambi as bmb
import arviz as az

deer_df = pd.read_csv('deer.csv', names=['body_mass', 'crop_size', 'sex', 'age', 'age_sq'],
                      header=0, index_col=0)
deer_df['intercept'] = 1.
# Replace crop_size with categories
deer_df['crop_size_cat'] = deer_df.crop_size.replace({599:0, 614:1})

# Define the x,y data
deer_x = deer_df[['crop_size_cat', 'sex', 'age', 'age_sq']].to_numpy()
deer_y = deer_df[['body_mass']].to_numpy()

# data dimensions
n = deer_x.shape[0] # data rows
p = deer_x.shape[1] # regression parameters
print(n, p)
%%time

# Run model using pymc
with pm.Model() as lm_deer1:
    
    ϵ = pm.HalfNormal('ϵ', 10.)

    α = pm.Normal('α', mu=39., sd=100.)

    β = pm.Normal('β', mu=0., sd=100., shape=p)
   
    # Linear predictor
    μ = pm.Deterministic('μ', α + pm.math.dot(deer_x, β))
    
    # Gaussian Likelihood
    y_pred = pm.Normal('y_pred',
                       mu=μ,
                       sd=ϵ,
                       observed=deer_y)
    
    # Sample
    idata_lm_deer1 = pm.sample(1000)

# Numerical summary of parameters
az.summary(idata_lm_deer1, var_names=['ϵ', 'α', 'β'], round_to=2)
       mean    sd  hdi_5.5%  hdi_94.5%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
ϵ     15.36  0.02     15.33      15.40        0.0      0.0   3561.41   2350.36    1.0
α     39.53  0.08     39.40      39.65        0.0      0.0   1932.08   1855.40    1.0
β[0]   0.00  0.07     -0.11       0.11        0.0      0.0   3064.31   2528.50    1.0
β[1]  -0.00  0.07     -0.11       0.11        0.0      0.0   2577.86   2322.27    1.0
β[2]   0.00  0.05     -0.08       0.08        0.0      0.0   1955.27   2208.42    1.0
β[3]  -0.00  0.01     -0.01       0.01        0.0      0.0   2155.84   2205.65    1.0
# Fit the model with Bambi

lm_deer2 = bmb.Model("body_mass ~ crop_size_cat + sex + age + age_sq", deer_df)
idata_lm_deer2 = lm_deer2.fit(draws=2000)

# Numerical summary of traces
az.summary(idata_lm_deer2, round_to=2)
                  mean    sd  hdi_5.5%  hdi_94.5%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
Intercept        13.03  0.80     11.76      14.30       0.01     0.01   2892.27   2941.11    1.0
crop_size_cat     0.74  0.71     -0.47       1.81       0.01     0.01   4291.72   2460.97    1.0
sex               8.20  0.71      7.07       9.32       0.01     0.01   3736.76   2665.11    1.0
age              17.46  0.51     16.67      18.28       0.01     0.01   2178.99   2324.65    1.0
age_sq           -1.64  0.07     -1.75      -1.52       0.00     0.00   2161.45   2329.67    1.0
body_mass_sigma   7.29  0.24      6.92       7.66       0.00     0.00   4147.96   2625.83    1.0

The values for α (intercept), β[0] (crop_size_cat), β[1] (sex), β[2] (age), and β[3] (age_sq) are all very different between bambi and pymc, and I’m not sure how to reproduce the bambi result using pymc. Thanks for any insight anyone can provide!

Last updated: Tue Feb 22 2022

Python implementation: CPython
Python version       : 3.9.10
IPython version      : 8.0.1

pandas    : 1.4.1
pymc      : 4.0.0b2
scipy     : 1.7.3
matplotlib: 3.5.1
theano    : 1.1.2
aesara    : 2.3.8
arviz     : 0.11.4
numpy     : 1.22.2

Watermark: 2.3.0

deer.csv (12.5 KB)


I think it’s a broadcasting issue with deer_y. Can you make the modification

# Gaussian Likelihood
    y_pred = pm.Normal('y_pred',
                       mu=μ,
                       sd=ϵ,
                       observed=deer_y.squeeze())

and tell me if you get results that line up with Bambi?


Hi, I just gave this a run and found the issue, I think. Your variable deer_y is an n × 1 array. It should instead be a 1D array of shape (n,). I believe that the way you wrote it results in some unwanted broadcasting somewhere. In any case, when I run:

# Run model using pymc
with pm.Model() as lm_deer1:
    
    ϵ = pm.HalfNormal('ϵ', 10.)

    α = pm.Normal('α', mu=39., sd=100.)

    β = pm.Normal('β', mu=0., sd=100., shape=p)
   
    # Linear predictor
    μ = pm.Deterministic('μ', α + pm.math.dot(deer_x, β))
    
    # Gaussian Likelihood
    y_pred = pm.Normal('y_pred',
                       mu=μ,
                       sd=ϵ,
                       # The only change is here!
                       observed=deer_y.reshape(-1))
    
    # Sample
    idata_lm_deer1 = pm.sample(1000, return_inferencedata=True)

I get very similar results to the ones you post for bambi:

       mean    sd  hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
ϵ      7.29  0.23    6.83     7.70       0.00     0.00   3228.05   2385.95    1.0
α     13.01  0.80   11.58    14.57       0.02     0.01   2059.10   2307.27    1.0
β[0]   0.74  0.70   -0.59     2.06       0.01     0.01   3757.76   3028.90    1.0
β[1]   8.19  0.68    6.87     9.45       0.01     0.01   2990.54   2458.38    1.0
β[2]  17.47  0.51   16.52    18.43       0.01     0.01   2035.88   1994.96    1.0
β[3]  -1.64  0.07   -1.78    -1.51       0.00     0.00   2225.92   2121.04    1.0

I have made this mistake, and I’ve seen others make it too – I think it’s a potential downside to how numpy and other libraries handle broadcasting. Hope this helps!
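For anyone curious about the mechanism: a minimal NumPy sketch (with made-up shapes, not the actual deer data) shows the silent broadcast. A (n,) mean vector combined with (n, 1) observations produces an (n, n) matrix, so the likelihood is evaluated over n² pseudo-observations instead of n:

```python
import numpy as np

n = 5
mu = np.arange(n, dtype=float)  # linear predictor, shape (n,)
y_col = np.ones((n, 1))         # observations as a column, shape (n, 1)

# (n, 1) broadcast against (n,) silently yields an (n, n) matrix
print((y_col - mu).shape)               # (5, 5)

# Flattening restores elementwise alignment
print((y_col.reshape(-1) - mu).shape)   # (5,)
```

This is why `deer_y.reshape(-1)` (or `.squeeze()`, or `.flatten()`) fixes the model: the residuals become a proper 1D vector again.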

PS: Thank you for posting the code and inputs so clearly, it made it easy to run!


This is working - thanks so much!


Thank you for this!

Oh, my apologies, somehow I missed that you also found the problem @ckrapu! Glad we could help you @Jake_Wall :slight_smile:
