I’m running a simple logistic regression with a single categorical independent variable that has two levels. I build a design matrix with one constant column and one column of zeroes and ones (dummy coding). In a classical GLM context, I know that the choice of which of the two levels I code as 1 and which as 0 should not really matter: the coefficient for the dummy column (and the associated t-score etc.) simply switches sign depending on that choice.
Weirdly, though, when I fit the same type of regression with ADVI, the choice of which group to code as 1 and which as 0 seems to matter a lot. The parameter z-score (absolute value of the posterior mean divided by the posterior s.d.) of the dummy regressor coefficient tends to be much higher when I code the higher-mean group as 1 than when I code the lower-mean group as 1.
I’ve been struggling to understand this phenomenon for more than a day now without making any substantial progress. Any input would be highly appreciated!
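For reference, here is a minimal standalone sketch (separate from the simulation code below, with illustrative group sizes and probabilities that I made up for this check) of the classical sign-flip behaviour with statsmodels:

import numpy as np
import statsmodels.api as sm

# illustrative data: 500 trials per group, success probabilities 0.2 and 0.1 (assumed values)
rng = np.random.RandomState(0)
y = np.concatenate((rng.binomial(1, 0.2, 500), rng.binomial(1, 0.1, 500)))
group = np.concatenate((np.ones(500), np.zeros(500)))  # 1 = higher-mean group

X_hi = sm.add_constant(group)      # higher-mean group coded as 1
X_lo = sm.add_constant(1 - group)  # lower-mean group coded as 1

fit_hi = sm.Logit(y, X_hi).fit(disp=False)
fit_lo = sm.Logit(y, X_lo).fit(disp=False)

# dummy coefficient and t-value: same magnitude, opposite sign under the two codings
print(fit_hi.params[1], fit_hi.tvalues[1])
print(fit_lo.params[1], fit_lo.tvalues[1])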
The following code illustrates what I mean.
import numpy as np
import pymc3 as mc
import statsmodels.api as sm
from tqdm import trange
import matplotlib.pyplot as plt
import theano.tensor as T
import itertools as it
# how many times to simulate and analyze
niter = 50
# number of Bernoulli trials per simulation
ntri = 1000
# base prob of success of single Bernoulli trial
baseprob = 0.1
# the increase in prob of success for one level of the indep variable
incrprob = 0.1
# results of the simulation runs
# (note that only the second element of the coef dimension is relevant; the
# first element gives the coefficient for the constant)
# iter X ncoef X dummyscheme X mle/advi/nuts X coef/zscore
allcoefs = np.zeros((niter, 2, 2, 3, 2))
# given data and design, estimate parameters in three different ways
def fit_mle_advi_nuts(dat, design):
    # MLE using statsmodels
    logmodel = sm.Logit(dat, design)
    mle_results = logmodel.fit(disp=False)
    model = mc.Model()
    with model:
        # two coefficients, one for each column of design matrix
        coefs = mc.Normal('coefs', mu=0, sd=1, shape=2)
        # the latent factor for the Bernoulli process...
        bernfactor = T.sum(design.astype('float32') * coefs[np.newaxis, :],
                           axis=1)
        # ...converted into probability
        bernprob = mc.math.invlogit(bernfactor)
        ymodelled = mc.Bernoulli('ymodelled', p=bernprob, observed=dat)
        # advi parameters
        params = mc.advi(n=1500, learning_rate=0.05, start=model.test_point,
                         progressbar=False)
        # sample using NUTS
        trace = mc.sample(1000, step=mc.NUTS(), start=model.test_point,
                          njobs=1, progressbar=False)
    # return parameter estimates and associated t/z-scores
    # as whichcoef X mle/advi/nuts X coef/zscore
    return np.asarray((
        (mle_results.params, mle_results.tvalues),
        (params.means['coefs'], params.means['coefs'] / params.stds['coefs']),
        (np.mean(trace['coefs'][500::2], 0),
         np.mean(trace['coefs'][500::2], 0) / np.std(trace['coefs'][500::2], 0))
    )).transpose((2, 0, 1))
for k in trange(niter):
    # generate data
    dat = np.concatenate((
        np.random.choice((0, 1), size=ntri//2, p=(1 - (baseprob + incrprob),
                                                  baseprob + incrprob)),
        np.random.choice((0, 1), size=ntri//2, p=(1 - baseprob, baseprob))
    )).astype('int8')
    # design coding the higher-mean group as 1
    design = np.zeros((ntri, 2))
    design[:, 0] = 1
    design[:ntri//2, 1] = 1
    allcoefs[k, :, 0, :, :] = fit_mle_advi_nuts(dat, design)
    # design coding the lower-mean group as 1
    design = np.zeros((ntri, 2))
    design[:, 0] = 1
    design[ntri//2:, 1] = 1
    allcoefs[k, :, 1, :, :] = fit_mle_advi_nuts(dat, design)
## check
method_labels = ('mle', 'advi', 'nuts')
score_labels = ('raw', 'z')
# we care about differences between absolute coefficients and t/zscores,
# because the sign flip between the two coding schemes is expected
diffs = np.abs(allcoefs[:,1,0,:,:]) - np.abs(allcoefs[:,1,1,:,:])
fig, allax = plt.subplots(2, 3)
for k, l in it.product(range(3), range(2)):
    allax[l, k].hist(diffs[:, k, l], bins='auto')
    allax[l, k].set_xlabel(method_labels[k] +
                           ' diff(abs(' + score_labels[l] + '))')
    allax[l, k].axvline(c='r')
This generates the figure shown here: histograms, across simulation runs, of the differences between the two dummy coding schemes (0/1 vs. 1/0) in the absolute coefficients and t/z-scores, as a function of whether the estimation was done with maximum likelihood (statsmodels), ADVI, or NUTS.
Note that there are no real differences between the dummy coding schemes when using MLE. Differences are considerable when using ADVI, especially in the parameter z-scores. Differences are small when using NUTS, but still present. (Unfortunately for the real problem I’m working on, which is much bigger than this, sampling is not an option due to computational time.)