I’m doing a simple logistic regression with a single categorical independent variable with two levels. I’m coding up a design matrix with one constant column and one column of zeroes and ones (dummy coding). Now I know that, in a classical GLM context, the choice of which of the two levels I code as 1 and which as 0 should not really matter: the coefficient for the dummy column (and the associated t-score etc.) simply switches sign depending on how I make this choice. Weirdly, though, when I try to do the same type of regression with ADVI, the choice of which group to code as 1 and which as 0 seems to matter a lot. The parameter z-score (absolute value of posterior mean divided by posterior s.d.) of the dummy coefficient tends to be a lot higher when I code the higher-mean group as 1 than when I code the lower-mean group as 1. I’ve been struggling to understand this phenomenon for more than a day now, and am not making any substantial progress. Any input would be highly appreciated!
The following code illustrates what I mean.
```python
import numpy as np
import pymc3 as mc
import statsmodels.api as sm
from tqdm import trange
import matplotlib.pyplot as plt
import theano.tensor as T
import itertools as it

# how many times to simulate and analyze
niter = 50
# number of Bernoulli trials per simulation
ntri = 1000
# base prob of success of single Bernoulli trial
baseprob = 0.1
# the increase in prob of success for one level of the indep variable
incrprob = 0.1

# results of the simulation runs
# (note that only the second element of the coef dimension is relevant; the
# first element gives the coefficient for the constant)
# iter X ncoef X dummyscheme X mle/advi/nuts X coef/zscore
allcoefs = np.zeros((niter, 2, 2, 3, 2))

# given data and design, estimate parameters in three different ways
def fit_mle_advi_nuts(dat, design):
    # MLE using statsmodels
    logmodel = sm.Logit(dat, design)
    mle_results = logmodel.fit(disp=False)

    model = mc.Model()
    with model:
        # two coefficients, one for each column of design matrix
        coefs = mc.Normal('coefs', mu=0, sd=1, shape=2)
        # the latent factor for the Bernoulli process...
        bernfactor = T.sum(design.astype('float32') * coefs[np.newaxis, :], axis=1)
        # ...converted into probability
        bernprob = mc.math.invlogit(bernfactor)
        ymodelled = mc.Bernoulli('ymodelled', p=bernprob, observed=dat)
        # advi parameters
        params = mc.advi(n=1500, learning_rate=0.05, start=model.test_point,
                         progressbar=False)
        # sample using NUTS
        trace = mc.sample(1000, step=mc.NUTS(), start=model.test_point,
                          njobs=1, progressbar=False)

    # return parameter estimates and associated t/z-scores,
    # as whichcoef X mle/advi/nuts X coef/zscore
    # (posterior mean first, then mean/sd, to match the MLE and ADVI rows)
    return np.asarray((
        (mle_results.params, mle_results.tvalues),
        (params.means['coefs'], params.means['coefs'] / params.stds['coefs']),
        (np.mean(trace['coefs'][500::2], 0),
         np.mean(trace['coefs'][500::2], 0) / np.std(trace['coefs'][500::2], 0))
    )).transpose((2, 0, 1))

for k in trange(niter):
    # generate data
    dat = np.concatenate((
        np.random.choice((0, 1), size=ntri // 2,
                         p=(1 - (baseprob + incrprob), baseprob + incrprob)),
        np.random.choice((0, 1), size=ntri // 2,
                         p=(1 - baseprob, baseprob))
    )).astype('int8')

    # design with coding higher-mean group as 1
    design = np.zeros((ntri, 2))
    design[:, 0] = 1
    design[:ntri // 2, 1] = 1
    allcoefs[k, :, 0, :, :] = fit_mle_advi_nuts(dat, design)

    # design with coding lower-mean group as 1
    design = np.zeros((ntri, 2))
    design[:, 0] = 1
    design[ntri // 2:, 1] = 1
    allcoefs[k, :, 1, :, :] = fit_mle_advi_nuts(dat, design)

## check
method_labels = ('mle', 'advi', 'nuts')
score_labels = ('raw', 'z')
# we care about differences between absolute coefficients and t/z-scores,
# because the sign flip between the two coding schemes is expected
diffs = np.abs(allcoefs[:, 1, 0, :, :]) - np.abs(allcoefs[:, 1, 1, :, :])
fig, allax = plt.subplots(2, 3)
for k, l in it.product(range(3), range(2)):
    allax[l, k].hist(diffs[:, k, l], 'auto')
    allax[l, k].set_xlabel(method_labels[k] + ' diff(abs(' + score_labels[l] + '))')
    allax[l, k].axvline(c='r')
```
This generates the figure below: histograms, across simulation runs, of the differences between the two dummy coding schemes (0/1 vs. 1/0) in the absolute coefficients and t/z-scores, as a function of whether estimation was done using maximum likelihood (statsmodels), ADVI, or NUTS.
Note that there are no real differences between the dummy coding schemes when using MLE. Differences are considerable when using ADVI, especially in the parameter z-scores. Differences are small when using NUTS, but still present. (Unfortunately for the real problem I’m working on, which is much bigger than this, sampling is not an option due to computational time.)
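For completeness, the MLE sign-flip symmetry I’m taking for granted above can be verified independently of statsmodels. With an intercept plus a single binary dummy, the logistic MLE is just the group log-odds: the intercept is the log-odds of the 0-coded group and the dummy coefficient is the log-odds difference, so swapping the coding exactly negates the coefficient and shifts the intercept. A minimal sketch (the Newton solver, group sizes, probabilities, and seed are placeholders, not part of my actual code):

```python
import numpy as np

def logreg_mle(X, y, iters=25):
    """Logistic-regression MLE via plain Newton-Raphson (toy solver)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (y - p)                       # score vector
        H = X.T @ (X * (p * (1 - p))[:, None])     # observed information
        beta = beta + np.linalg.solve(H, grad)
    return beta

rng = np.random.default_rng(1)
n = 1000
# first half: success prob 0.2 (higher-mean group); second half: 0.1
y = np.concatenate((rng.random(n // 2) < 0.2,
                    rng.random(n // 2) < 0.1)).astype(float)

dummy_hi = np.r_[np.ones(n // 2), np.zeros(n // 2)]   # higher-mean group coded 1
XA = np.column_stack((np.ones(n), dummy_hi))
XB = np.column_stack((np.ones(n), 1 - dummy_hi))      # lower-mean group coded 1

betaA = logreg_mle(XA, y)
betaB = logreg_mle(XB, y)

# swapping the coding negates the dummy coefficient and shifts the intercept
assert np.isclose(betaA[1], -betaB[1])
assert np.isclose(betaB[0], betaA[0] + betaA[1])
```

So at the MLE the two codings are exact reparameterizations of each other, which is why I expected the Bayesian fits to behave (almost) symmetrically as well.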