Dummy coding scheme matters in ADVI logistic regression, but not in MLE

I’m doing a simple logistic regression with a single categorical independent variable with two levels. I’m coding up a design matrix with one constant column and one column that has zeroes and ones (dummy coding). Now I know that the exact choice of which of the two levels of the independent variable I code as 1 and which as 0 should not really matter in a classical GLM context: the coefficient for the dummy column (and the associated t-score etc.) will simply switch sign depending on how I make this choice. Weirdly, though, when I try to do the same type of regression with ADVI, the choice of which group to code as 1 and which as 0 seems to matter a lot. The parameter z-score (absolute value of posterior mean divided by s.d.) of the task regressor coefficient tends to be a lot higher when I code the higher-mean group as 1, compared to when I code the lower-mean group as 1. I’ve been struggling to understand this phenomenon for more than a day now, and am not making any substantial progress. Any input would be highly appreciated!

The following code illustrates what I mean.

import numpy as np
import pymc3 as mc
import statsmodels.api as sm
from tqdm import trange
import matplotlib.pyplot as plt
import theano.tensor as T
import itertools as it

# how many times to simulate and analyze
niter = 50

# number of Bernoulli trials per simulation
ntri = 1000

# base prob of success of single Bernoulli trial
baseprob = 0.1

# the increase in prob of success for one level of the indep variable
incrprob = 0.1

# results of the simulation runs
# (note that only the second element of the coef dimension is relevant; the
# first element gives the coefficient for the constant)
# iter X ncoef X dummyscheme X mle/advi/nuts X coef/zscore
allcoefs = np.zeros((niter, 2, 2, 3, 2))

# given data and design, estimate parameters in three different ways
def fit_mle_advi_nuts(dat, design):
    # MLE using statsmodels
    logmodel = sm.Logit(dat, design)
    mle_results = logmodel.fit(disp=False)

    model = mc.Model()
    with model:
        # two coefficients, one for each column of design matrix
        coefs = mc.Normal('coefs', mu=0, sd=1, shape=2)
        # the latent factor for the Bernoulli process...
        bernfactor = T.sum(design.astype('float32') * coefs[np.newaxis,:],
        # ...converted into probability
        bernprob = mc.math.invlogit(bernfactor)
        ymodelled = mc.Bernoulli('ymodelled', p=bernprob, observed=dat)
        # advi parameters
        params = mc.advi(n=1500, learning_rate=0.05, start=model.test_point,
        # sample using NUTS
        trace = mc.sample(1000, step=mc.NUTS(), start=model.test_point,
            njobs=1, progressbar=False)
    # return parameter estimates and associated t/zscores
    # as whichcoef X mle/advi/nuts X coef/zscore
    return asarray((
        (mle_results.params, mle_results.tvalues),
        (params.means['coefs'], params.means['coefs']/params.stds['coefs']),
        (np.mean(trace['coefs'][500::2], 0) / np.std(trace['coefs'][500::2], 0),
            np.mean(trace['coefs'][500::2], 0))
    )).transpose((2, 0, 1))

for k in trange(niter):
    # generate data
    dat = np.concatenate((
        np.random.choice((0,1), size=ntri//2, p=(1-(baseprob+incrprob),
        np.random.choice((0,1), size=ntri//2, p=(1-baseprob, baseprob))
    # design with coding higher-mean group as 1
    design = np.zeros((ntri,2))
    design[:,0] = 1
    design[:ntri//2,1] = 1
    allcoefs[k,:,0,:,:] = fit_mle_advi_nuts(dat, design)
    # design with coding lower-mean group as 1
    design = np.zeros((ntri,2))
    design[:,0] = 1
    design[ntri//2:,1] = 1
    allcoefs[k,:,1,:,:] = fit_mle_advi_nuts(dat, design)

## check

method_labels = ('mle', 'advi', 'nuts')
score_labels = ('raw', 'z')

# we care about differences between absolute coefficients and t/zscores,
# because the sign flip between the two coding schemes is expected
diffs = np.abs(allcoefs[:,1,0,:,:]) - np.abs(allcoefs[:,1,1,:,:])

fig, allax = plt.subplots(2, 3)
for k, l in it.product(range(3), range(2)):
    allax[l,k].hist(diffs[:,k,l], 'auto')
    allax[l,k].set_xlabel(method_labels[k] +
        ' diff(abs(' + score_labels[l] + '))')

Which generates this figure. I am plotting histograms across simulation runs of differences between dummy coding schemes (0/1 vs 1/0) in absolute coefficient and t/zscores, as a function of whether the estimation was done using maximum likelihood estimation (statsmodels), ADVI, or NUTS.

Note that there are no real differences between the dummy coding schemes when using MLE. Differences are considerable when using ADVI, especially in the parameter z-scores. Differences are small when using NUTS, but still present. (Unfortunately for the real problem I’m working on, which is much bigger than this, sampling is not an option due to computational time.)

1 Like

I did not have time to look at it yet, but just some information you might find helpful:

you should look at the posterior prediction using pm.sample_ppc. I have a similar observation that the different model fitting method does make a difference in parameter estimation (e.g., see https://github.com/junpenglao/GLMM-in-Python/blob/master/GLMM_in_python.ipynb in the last figure, which compares different design matrix and fitting method).

Also, in general, there is bias in ADVI, so use with caution (see eg explanation here).

I didn’t go through this in detail, but I don’t think the two models (changing around the dummy coding) actually are equivalent. There are many cases where things that are equivalent for a maximum likelihood estimate are not the same as soon as you add priors or regularization. It seems this is what’s happening here. If you just look at the lme estimates for a moment you can see that clearly. Say we have lme fits mle1 and mle2 for both version of dummy coding. Then I get

mle1.params, mle2.params
(array([-2.49797873,  1.02176504]), array([-1.47621369, -1.02176504]))

The parameters for the dummy predictor are the same up to the sign. But the intercept value is very different, and in the bayesian model we have a prior on this term.

You can get something where the order doesn’t matter if you use a difference-to-mean coding:

design = np.r_[[-0.5] * (n // 2), [0.5] * (n // 2)]

with pm.Model() as model:
    intercept = pm.Cauchy('intercept', alpha=0, beta=2)
    effect = pm.StudentT('effect', mu=0, nu=3, sd=2.5)
    pred = intercept + design * effect
    p = mc.math.invlogit(pred)
    ymodelled = mc.Bernoulli('ymodelled', p=p, observed=dat)

Bayesian methods are independent of the parametrization in a useful sense, but if you change the parametrization, that usually invites different choices for priors. This takes a bit getting-used-to, but can be quite useful.


Thanks; I hadn’t quite appreciated the interaction between choice of prior and the coding of the design (including intercept).