Random Intercepts and Slopes (Correlated) Convergence Issues


#1

I am new to Bayesian modeling and trying to “replicate” and compare the results of a hierarchical linear model in Pymc3 to the results fitting this same model in R (lmer). I believe I am properly setting up a case where there are fixed effects for the intercept, exper and hgc.9, there are random effects for each of these and all three are allowed to be correlated. In case anyone is familiar with R code it looks like below.

The model is not converging (the fixed effects especially are horrible). Am I doing something wrong? It seems like this should be an easy model to fit. Thanks!

R code:
wages <- read.table(“https://stats.idre.ucla.edu/stat/r/examples/alda/data/wages_pp.txt”, header=T, sep=",")

#center
wages$exper_c<-(wages$exper-mean(wages$exper))
wages$hgc.9_c<-(wages$hgc.9-mean(wages$hgc.9))

summary(lmer(lnw~ hgc.9_c + exper_c + (exper_c+ hgc.9_c | id)  ,  data=wages))

PYMC3:

> %matplotlib inline
> import pymc3 as pm
> import numpy as np
> import pandas as pd
> from scipy import stats
> import matplotlib.pyplot as plt
> import seaborn as sns
> import statsmodels.formula.api as smf
> from theano import shared, tensor as tt
> %config InlineBackend.figure_format = 'retina'
> plt.style.use(['seaborn-colorblind', 'seaborn-darkgrid'])
> 
> 
> 
> #Load and prepare data####################################################
> #######################################################################
> wages=pd.read_csv('https://stats.idre.ucla.edu/stat/r/examples/alda/data/wages_pp.txt')
> wages=wages.sort_values('id')
> 
> N_ids=len(wages.id.unique()) # number of distinct ids
> 
> #remap id to 0,1,2....
> id_lookup=pd.DataFrame(list(zip(np.arange(N_ids),wages.id.unique())), columns=['new_id','id'])
> wages=pd.merge(wages,id_lookup, right_on='id', left_on='id')
> 
> #Center predictors
> wages['hgc.9']=wages['hgc.9']-wages['hgc.9'].mean()
> wages['exper']=wages['exper']-wages['exper'].mean()
> 
> 
> #Fit Model##############################################################
> #######################################################################
> 
> id_idx=wages.new_id.values  #array of the group IDs
> N_ids=len(wages.id.unique()) # number of unqiue group IDs
> 
> dim_corr=3 #dimension of the correlation matrix (int, slope for exper and slope for hgc.9)
> 
> 
> 
> with pm.Model() as m_wages:
>     
>     #distribution LKJCholeskyCov pulls the needed number of standard deviations
>     #LKJCholeskyCov will extract as many as needed
>     sd_dist = pm.HalfCauchy.dist(beta=2) 
>     
>     #eta is the parameter of how informative the prior is (figure 13.3 of rethinking) for the correlation
>     # n is the dimension of the covariance matrix (here 3 x 3)
>     packed_chol = pm.LKJCholeskyCov('chol_cov', eta=2, n=dim_corr, sd_dist=sd_dist)
>     
>     
>     #n  The number of rows of the triangular matrix - corresponds to dimension of the corr matrix above
>     #lower : If true, assume that the matrix is lower triangular.
>     chol = pm.expand_packed_triangular(n=dim_corr, packed=packed_chol, lower=True)
> 
>     #The cholesky decomposition only includes L where S=LL.T so this gets to the full covariance matrix
>     cov = tt.dot(chol, chol.T)
>     
>     
>     # Extract the standard deviations and rhos#################################################################
>     ##########################################################################################################
>     
>     #The diagonal of the covariance matrix are the variances
>     sigma_id = pm.Deterministic('sigma_id', tt.sqrt(tt.diag(cov)))
>     #these are the variance components (intercept and the 2 slopes)
>     variance_components=pm.Deterministic('variance_components',tt.power(sigma_id,2))
>     
>     #This converts to correlation matrix
>     corr = tt.diag(sigma_id**-1).dot(cov.dot(tt.diag(sigma_id**-1)))
>     # the '3' (dim_corr) here says how many indicies to pull out
>     r = pm.Deterministic('Rho', corr[np.triu_indices(dim_corr, k=1)])
>     
>     # this is the prior for intercept and two slopes, both independent N(0,1)
>     ab = pm.Normal('ab', mu=0, sd=1, shape=dim_corr)  # prior for average intercept and slope
>     
>     
>     #can use chol or covariance matrix directly according to docstring
>     #This is the distribution of random effects : intercepts and slopes for all the observations in the data
>     ab_id = pm.MvNormal('ab_id', mu=ab, chol=chol, shape=(N_ids, dim_corr)) # Population of varying effects
>     # Shape needs to be (N_ids, 3) because we're getting back both a and b for each group
>     
>     #the grand intercept and the 2 other fixed effects
>     # these are assumed independent from the random effects
>     fixedEffects = pm.Normal('fixedEffects', 0., 10., shape=3) 
>     
>     
>     #linear model
>     #order here will dictate order for covariance matrix
>     '''
>                 intercept  exper  hgc.9
>     intercept
>     exper
>     hgc.9
>     
>     '''
>     A =  fixedEffects[0] + ab_id[id_idx, 0] #this is PIE_0_i in Singer/Willett book (fixed intercept + random intercepts)
>     BE = fixedEffects[1] + ab_id[id_idx, 1] #this is PIE_1_i
>     BH = fixedEffects[2] + ab_id[id_idx, 2] #this is PIE_2_i
>     
>     
>     mu = A + BE * wages['exper'].values + BH * wages['hgc.9'].values   # linear model
>     sd = pm.HalfCauchy('sigma', beta=2)      # prior stddev within ids (the error term sigma_e in Singer / Willet)
>     resid_var=pm.Deterministic('resid_var',tt.power(sd,2)) 
>     
>     wait = pm.Normal('lnw', mu=mu, sd=sd, observed=wages['lnw'])  # likelihood
>     trace_wages = pm.sample(5000, tune=20000)

Error messages:

Multiprocess sampling (4 chains in 4 jobs) NUTS: [sigma, fixedEffects, ab_id, ab, chol_cov] Sampling 4 chains: 100%|██████████| 100000/100000 [56:45&lt;00:00, 7.48draws/s] There were 258 divergences after tuning. Increasetarget_acceptor reparameterize. The acceptance probability does not match the target. It is 0.7042628219991673, but should be close to 0.8. Try to increase the number of tuning steps. There were 378 divergences after tuning. Increasetarget_acceptor reparameterize. The acceptance probability does not match the target. It is 0.8837509766365375, but should be close to 0.8. Try to increase the number of tuning steps. There were 72 divergences after tuning. Increasetarget_acceptor reparameterize. The gelman-rubin statistic is larger than 1.4 for some parameters. The sampler did not converge. The estimated number of effective samples is smaller than 200 for some parameters.

Summary:
pm.summary(trace_wages,varnames=['resid_var','variance_components','Rho','fixedEffects'])

Trace

pm.traceplot(trace_wages,varnames=['sigma','Rho','fixedEffects'],combined=True)


#2

There was a thread on something similar here. It may be helpful in troubleshooting.


#3

Im not sure, I essentially copied the approach from 13.22 of
rethinking example


#4

I did find one mistake in the code. But even with this correction, the estimates (especially for the correlations) are off (the first rho should be -0.3 ish not positive 0.4). I can get “correct” estimates using brms or rstanarm so I must be doing something wrong in terms of the pymc3 code?

id_idx=wages.new_id.values  #array of the group IDs
N_ids=len(wages.id.unique()) # number of unqiue group IDs

dim_corr=3 #dimension of the correlation matrix (int, slope for exper and slope for hgc.9)



with pm.Model() as m_wages:
    
    #distribution LKJCholeskyCov pulls the needed number of standard deviations
    #LKJCholeskyCov will extract as many as needed
    sd_dist = pm.HalfCauchy.dist(beta=2) 
    
    #eta is the parameter of how informative the prior is (figure 13.3 of rethinking) for the correlation
    # n is the dimension of the covariance matrix (here 3 x 3)
    packed_chol = pm.LKJCholeskyCov('chol_cov', eta=1, n=dim_corr, sd_dist=sd_dist)
    
    
    #n  The number of rows of the triangular matrix - corresponds to dimension of the corr matrix above
    #lower : If true, assume that the matrix is lower triangular.
    chol = pm.expand_packed_triangular(n=dim_corr, packed=packed_chol, lower=True)

    #The cholesky decomposition only includes L where S=LL.T so this gets to the full covariance matrix
    cov = tt.dot(chol, chol.T)
    
    
    # Extract the standard deviations and rhos#################################################################
    ##########################################################################################################
    
    #The diagonal of the covariance matrix are the variances
    sigma_id = pm.Deterministic('sigma_id', tt.sqrt(tt.diag(cov)))
    #these are the variance components (intercept and the 2 slopes)
    variance_components=pm.Deterministic('variance_components',tt.power(sigma_id,2))
    
    #This converts to correlation matrix
    corr = tt.diag(sigma_id**-1).dot(cov.dot(tt.diag(sigma_id**-1)))
    # the '3' (dim_corr) here says how many indicies to pull out
    r = pm.Deterministic('Rho', corr[np.triu_indices(dim_corr, k=1)])
    
   
    
    #can use chol or covariance matrix directly according to docstring
    #This is the distribution of random effects : intercepts and slopes for all the observations in the data
    ab_id = pm.MvNormal('ab_id', mu=0, chol=chol, shape=(N_ids, dim_corr)) # Population of varying effects
    # Shape needs to be (N_ids, 3) because we're getting back both a and b for each group
    
    #the grand intercept and the 2 other fixed effects
    # these are assumed independent from the random effects
    fixedEffects = pm.Normal('fixedEffects', 0., 10., shape=3) 
    
    
    #linear model
    #order here will dictate order for covariance matrix
    '''
                intercept  exper  hgc.9
    intercept
    exper
    hgc.9
    
    '''
    A =  fixedEffects[0] + ab_id[id_idx, 0] #this is PIE_0_i in Singer/Willett book (fixed intercept + random intercepts)
    BE = fixedEffects[1] + ab_id[id_idx, 1] #this is PIE_1_i
    BH = fixedEffects[2] + ab_id[id_idx, 2] #this is PIE_2_i
    
    
    mu = A +  BE * wages['exper'].values +BH * wages['hgc.9'].values # linear model
    sd = pm.HalfCauchy('sigma', beta=2)      # prior stddev within ids (the error term sigma_e in Singer / Willet)
    resid_var=pm.Deterministic('resid_var',tt.power(sd,2)) 
    
    lnw = pm.Normal('lnw', mu=mu, sd=sd, observed=wages['lnw'])  # likelihood
    trace_wages = pm.sample(10000, tune=10000)


#5

Could you print the brm stan generated code?


#6

Sure, it looks like this (I am not a stan programmer so I cant say that I follow it completely):

Here is the R code:

#brms
wages <- read.table("https://stats.idre.ucla.edu/stat/r/examples/alda/data/wages_pp.txt", header=T, sep=",")

#center
wages$exper_c<-(wages$exper-mean(wages$exper))
wages$hgc.9_c<-(wages$hgc.9-mean(wages$hgc.9))

modbrms<-brm(lnw~ hgc.9_c + exper_c + (exper_c+ hgc.9_c | id) , data=wages
             , family = gaussian(),cores = 8)

The output of summary(modbrms):

  Links: mu = identity; sigma = identity 
Formula: lnw ~ hgc.9_c + exper_c + (exper_c + hgc.9_c | id) 
   Data: wages (Number of observations: 6402) 
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup samples = 4000

Group-Level Effects: 
~id (Number of levels: 888) 
                       Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sd(Intercept)              0.23      0.01     0.22     0.25        493 1.01
sd(exper_c)                0.04      0.00     0.04     0.05        518 1.01
sd(hgc.9_c)                0.02      0.01     0.00     0.05        103 1.05
cor(Intercept,exper_c)     0.39      0.06     0.27     0.50       1222 1.00
cor(Intercept,hgc.9_c)     0.35      0.32    -0.37     0.90       1163 1.00
cor(exper_c,hgc.9_c)       0.02      0.39    -0.74     0.81        721 1.01

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
Intercept     1.90      0.01     1.88     1.91        966 1.00
hgc.9_c       0.04      0.01     0.02     0.05       1102 1.00
exper_c       0.05      0.00     0.04     0.05       1717 1.00

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sigma     0.31      0.00     0.30     0.31       1656 1.00

The stan code:

// generated with brms 2.7.0
functions { 
} 
data { 
  int<lower=1> N;  // total number of observations 
  vector[N] Y;  // response variable 
  int<lower=1> K;  // number of population-level effects 
  matrix[N, K] X;  // population-level design matrix 
  // data for group-level effects of ID 1
  int<lower=1> J_1[N];
  int<lower=1> N_1;
  int<lower=1> M_1;
  vector[N] Z_1_1;
  vector[N] Z_1_2;
  vector[N] Z_1_3;
  int<lower=1> NC_1;
  int prior_only;  // should the likelihood be ignored? 
} 
transformed data { 
  int Kc = K - 1; 
  matrix[N, K - 1] Xc;  // centered version of X 
  vector[K - 1] means_X;  // column means of X before centering 
  for (i in 2:K) { 
    means_X[i - 1] = mean(X[, i]); 
    Xc[, i - 1] = X[, i] - means_X[i - 1]; 
  } 
} 
parameters { 
  vector[Kc] b;  // population-level effects 
  real temp_Intercept;  // temporary intercept 
  real<lower=0> sigma;  // residual SD 
  vector<lower=0>[M_1] sd_1;  // group-level standard deviations
  matrix[M_1, N_1] z_1;  // unscaled group-level effects
  // cholesky factor of correlation matrix
  cholesky_factor_corr[M_1] L_1;
} 
transformed parameters { 
  // group-level effects 
  matrix[N_1, M_1] r_1 = (diag_pre_multiply(sd_1, L_1) * z_1)';
  vector[N_1] r_1_1 = r_1[, 1];
  vector[N_1] r_1_2 = r_1[, 2];
  vector[N_1] r_1_3 = r_1[, 3];
} 
model { 
  vector[N] mu = temp_Intercept + Xc * b;
  for (n in 1:N) { 
    mu[n] += r_1_1[J_1[n]] * Z_1_1[n] + r_1_2[J_1[n]] * Z_1_2[n] + r_1_3[J_1[n]] * Z_1_3[n];
  } 
  // priors including all constants 
  target += student_t_lpdf(temp_Intercept | 3, 2, 10); 
  target += student_t_lpdf(sigma | 3, 0, 10)
    - 1 * student_t_lccdf(0 | 3, 0, 10); 
  target += student_t_lpdf(sd_1 | 3, 0, 10)
    - 3 * student_t_lccdf(0 | 3, 0, 10); 
  target += normal_lpdf(to_vector(z_1) | 0, 1);
  target += lkj_corr_cholesky_lpdf(L_1 | 1); 
  // likelihood including all constants 
  if (!prior_only) { 
    target += normal_lpdf(Y | mu, sigma);
  } 
} 
generated quantities { 
  // actual population-level intercept 
  real b_Intercept = temp_Intercept - dot_product(means_X, b); 
  corr_matrix[M_1] Cor_1 = multiply_lower_tri_self_transpose(L_1);
  vector<lower=-1,upper=1>[NC_1] cor_1;
  // take only relevant parts of correlation matrix
  cor_1[1] = Cor_1[1,2]; 
  cor_1[2] = Cor_1[1,3]; 
  cor_1[3] = Cor_1[2,3]; 
}

#7

There are quite a few place the Stan model and the pymc3 model differ, but what makes the most differences for model convergence is usually the standardization of the predictor matrix:

transformed data { 
  int Kc = K - 1; 
  matrix[N, K - 1] Xc;  // centered version of X 
  vector[K - 1] means_X;  // column means of X before centering 
  for (i in 2:K) { 
    means_X[i - 1] = mean(X[, i]); 
    Xc[, i - 1] = X[, i] - means_X[i - 1]; 
  } 
} 

You should do the same for your input X in python


#8

I center the two predictors in the code I pasted above:

wages['hgc.9']=wages['hgc.9']-wages['hgc.9'].mean() 
wages['exper']=wages['exper']-wages['exper'].mean()

#9

It seems really odd to me that any of the priors make much difference given the amount of data. Seeing completely flipped signs for the correlation component…

From the standpoint of the code for pymc3 does this look correct?


#10

Parameterization does make a huge difference, and programmatically generated model like the one from brm often apply many heuristics in preprocessing etc to make it work well.


#11

Thanks. It seems like unless there is a mistake in the code, in a pretty simple case like this I should be able to get close to what comes from lmer, brm etc. Frustrating for us newbies :slight_smile:


#12

Yeah, automatic tools are dangerous because there is just so much going on under the hood


#13

To your knowledge are there any pymc3 examples out there for a correlated random intercept and slope model? I would find it really helpful to see an example of how to build a hierarchical Bayes model where the results are reasonably close to the results from using an approach like lmer.


#14

Yes, see @lwahedi’s reply above:


#15

Thanks. I didnt actually see an example there of actual available data where correlated random intercepts and slopes was created that could be compared to say lmer. Ill look closer I guess.


#16

In the thread there is a repository from @Jack_Caster, where he has some great example comparing exactly brm and pymc3 models on correlated coefficients.


#17

But it looks like the R code you posted has the first estimate as +0.4?

cor(Intercept,exper_c) 0.39 0.06 0.27 0.50 1222 1.00
cor(Intercept,hgc.9_c) 0.35 0.32 -0.37 0.90 1163 1.00
cor(exper_c,hgc.9_c) 0.02 0.39 -0.74 0.81 721 1.01
looks consistent with

Rho_0 (0.40 +/- 0.055)
Rho_1 (0.29 +/- 0.18)
Rho_2 (0.00 +/1 0.30)

If anything the variances are a little off. What’s actually wrong here? Just the convergence of the fixed effects?