I am new to Bayesian modeling and trying to "replicate" and compare the results of a hierarchical linear model in PyMC3 to the results of fitting the same model in R (lmer). I believe I am properly setting up a case where there are fixed effects for the intercept, exper, and hgc.9; there are random effects for each of these; and all three random effects are allowed to be correlated. In case anyone is familiar with R, the code is below.
The model is not converging (the fixed-effect estimates especially are horrible). Am I doing something wrong? It seems like this should be an easy model to fit. Thanks!
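Written out (I hope I have the Singer/Willett notation right), the model I am trying to fit is:

$$\text{lnw}_{ij} = \pi_{0i} + \pi_{1i}\,\text{exper}_{ij} + \pi_{2i}\,\text{hgc.9}_{ij} + \varepsilon_{ij}, \qquad \varepsilon_{ij} \sim N(0, \sigma_\varepsilon^2)$$

$$\pi_{ki} = \gamma_k + u_{ki} \;\;(k = 0, 1, 2), \qquad (u_{0i}, u_{1i}, u_{2i})^\top \sim \text{MVN}(\mathbf{0}, \Sigma)$$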
R code:
library(lme4)
wages <- read.table("https://stats.idre.ucla.edu/stat/r/examples/alda/data/wages_pp.txt", header=T, sep=",")
#center
wages$exper_c<-(wages$exper-mean(wages$exper))
wages$hgc.9_c<-(wages$hgc.9-mean(wages$hgc.9))
summary(lmer(lnw ~ hgc.9_c + exper_c + (exper_c + hgc.9_c | id), data=wages))
PyMC3 code:
%matplotlib inline
import pymc3 as pm
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf
from theano import shared, tensor as tt
%config InlineBackend.figure_format = 'retina'
plt.style.use(['seaborn-colorblind', 'seaborn-darkgrid'])

#Load and prepare data####################################################
##########################################################################
wages = pd.read_csv('https://stats.idre.ucla.edu/stat/r/examples/alda/data/wages_pp.txt')
wages = wages.sort_values('id')

N_ids = len(wages.id.unique())  #number of distinct ids

#remap id to 0, 1, 2, ...
id_lookup = pd.DataFrame(list(zip(np.arange(N_ids), wages.id.unique())), columns=['new_id', 'id'])
wages = pd.merge(wages, id_lookup, right_on='id', left_on='id')

#center predictors
wages['hgc.9'] = wages['hgc.9'] - wages['hgc.9'].mean()
wages['exper'] = wages['exper'] - wages['exper'].mean()

#Fit model################################################################
##########################################################################

id_idx = wages.new_id.values  #array of the group ID for each row
N_ids = len(wages.id.unique())  #number of unique group IDs

dim_corr = 3  #dimension of the correlation matrix (intercept, slope for exper, slope for hgc.9)
with pm.Model() as m_wages:

    #sd_dist is the prior for the standard deviations; LKJCholeskyCov draws as many as it needs
    sd_dist = pm.HalfCauchy.dist(beta=2)

    #eta controls how informative the prior on the correlations is (figure 13.3 of Statistical Rethinking)
    #n is the dimension of the covariance matrix (here 3 x 3)
    packed_chol = pm.LKJCholeskyCov('chol_cov', eta=2, n=dim_corr, sd_dist=sd_dist)

    #n: the number of rows of the triangular matrix - matches the dimension of the corr matrix above
    #lower: if True, assume the matrix is lower triangular
    chol = pm.expand_packed_triangular(n=dim_corr, packed=packed_chol, lower=True)

    #the Cholesky decomposition only gives L where S = L L.T, so this recovers the full covariance matrix
    cov = tt.dot(chol, chol.T)

    #Extract the standard deviations and rhos#################################
    ##########################################################################

    #the diagonal of the covariance matrix holds the variances, so their square roots are the standard deviations
    sigma_id = pm.Deterministic('sigma_id', tt.sqrt(tt.diag(cov)))
    #these are the variance components (intercept and the 2 slopes)
    variance_components = pm.Deterministic('variance_components', tt.power(sigma_id, 2))

    #this converts the covariance matrix to a correlation matrix
    corr = tt.diag(sigma_id**-1).dot(cov.dot(tt.diag(sigma_id**-1)))
    #np.triu_indices(dim_corr, k=1) pulls out the indices of the off-diagonal correlations
    r = pm.Deterministic('Rho', corr[np.triu_indices(dim_corr, k=1)])

    #prior for the mean of the varying effects (intercept and two slopes), all three independent N(0, 1)
    ab = pm.Normal('ab', mu=0, sd=1, shape=dim_corr)

    #can use chol or the covariance matrix directly according to the docstring
    #this is the distribution of random effects: an intercept and two slopes for each id
    #shape needs to be (N_ids, dim_corr) because we get back an intercept and two slopes per group
    ab_id = pm.MvNormal('ab_id', mu=ab, chol=chol, shape=(N_ids, dim_corr))

    #the grand intercept and the 2 other fixed effects
    #these are assumed independent of the random effects
    fixedEffects = pm.Normal('fixedEffects', 0., 10., shape=dim_corr)

    #linear model
    #order here dictates the row/column order of the covariance matrix:
    #           intercept  exper  hgc.9
    #intercept
    #exper
    #hgc.9
    A = fixedEffects[0] + ab_id[id_idx, 0]   #pi_0i in the Singer/Willett book (fixed intercept + random intercepts)
    BE = fixedEffects[1] + ab_id[id_idx, 1]  #pi_1i
    BH = fixedEffects[2] + ab_id[id_idx, 2]  #pi_2i

    mu = A + BE * wages['exper'].values + BH * wages['hgc.9'].values  #linear model
    sd = pm.HalfCauchy('sigma', beta=2)  #within-id stddev (the error term sigma_epsilon in Singer/Willett)
    resid_var = pm.Deterministic('resid_var', tt.power(sd, 2))

    lnw = pm.Normal('lnw', mu=mu, sd=sd, observed=wages['lnw'])  #likelihood
    trace_wages = pm.sample(5000, tune=20000)
Error messages:
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [sigma, fixedEffects, ab_id, ab, chol_cov]
Sampling 4 chains: 100%|██████████| 100000/100000 [56:45<00:00, 7.48draws/s]
There were 258 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.7042628219991673, but should be close to 0.8. Try to increase the number of tuning steps.
There were 378 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.8837509766365375, but should be close to 0.8. Try to increase the number of tuning steps.
There were 72 divergences after tuning. Increase `target_accept` or reparameterize.
The gelman-rubin statistic is larger than 1.4 for some parameters. The sampler did not converge.
The estimated number of effective samples is smaller than 200 for some parameters.
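Following the warnings, this is how I am checking the divergence count, and how I understand the suggestion to raise the target acceptance rate (a sketch; I am assuming a PyMC3 version where target_accept is forwarded through pm.sample to NUTS):

#count the divergent transitions recorded by NUTS
divergent = trace_wages['diverging']
print('Divergent transitions: %d' % divergent.sum())

#re-run with a higher target acceptance rate, as the warning suggests
with m_wages:
    trace_wages = pm.sample(5000, tune=20000, target_accept=0.95)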
Summary:
pm.summary(trace_wages, varnames=['resid_var', 'variance_components', 'Rho', 'fixedEffects'])
Traceplot:
pm.traceplot(trace_wages, varnames=['sigma', 'Rho', 'fixedEffects'], combined=True)
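Based on the "reparameterize" hint, one thing I am considering is a non-centered version of the varying effects. Below is a minimal sketch of what I mean, reusing dim_corr, N_ids, id_idx, and wages from above (m_wages_nc, z, and trace_nc are my own names). It also keeps the grand intercept/slopes only in fixedEffects, since I was unsure whether having them in both ab and fixedEffects double-counts the means:

with pm.Model() as m_wages_nc:
    sd_dist = pm.HalfCauchy.dist(beta=2)
    packed_chol = pm.LKJCholeskyCov('chol_cov', eta=2, n=dim_corr, sd_dist=sd_dist)
    chol = pm.expand_packed_triangular(n=dim_corr, packed=packed_chol, lower=True)

    #non-centered trick: sample standard-normal offsets, then scale and
    #correlate them with the Cholesky factor instead of sampling the MvNormal directly
    z = pm.Normal('z', 0., 1., shape=(dim_corr, N_ids))
    ab_id = pm.Deterministic('ab_id', tt.dot(chol, z).T)  #(N_ids, dim_corr) zero-mean offsets

    fixedEffects = pm.Normal('fixedEffects', 0., 10., shape=dim_corr)

    A = fixedEffects[0] + ab_id[id_idx, 0]
    BE = fixedEffects[1] + ab_id[id_idx, 1]
    BH = fixedEffects[2] + ab_id[id_idx, 2]

    mu = A + BE * wages['exper'].values + BH * wages['hgc.9'].values
    sd = pm.HalfCauchy('sigma', beta=2)
    lnw = pm.Normal('lnw', mu=mu, sd=sd, observed=wages['lnw'])

    trace_nc = pm.sample(2000, tune=2000, target_accept=0.9)

Is this the right way to go, or is there something more fundamental wrong with my setup?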