Logistic ANOVA: different results with statsmodels and PyMC3

This is probably more of a model specification question than a technical question about PyMC3 itself, but I’m hoping someone here can help. Basically, multi-group logistic ANOVA in PyMC3 is giving different results than other Bayesian models and frequentist methods, and I’d like to figure out why.

I’m providing as much background as I can, but you can probably skip the reference analyses if you prefer.

Experimental background

I have a roughly split-plot design for a memory experiment. Participants (~40) from two different conditions perform a repeated measures task (~100 trials), where each trial has two different categories. On each trial, the participant needs to recall the presence/absence of a marker at each of ~40 grid locations (typically 10-15 markers per trial).

I want to predict the probability of a participant forgetting to place a marker at grid position, given a participant’s condition and the type of trial they saw. (Therefore there are several ways to represent the target quantity: as binary indicators at each grid location (~160000 observations), as counts per trial (~4000 observations), or as counts per participant (~80 observations).

Analyses

Reference analyses

TLDR: these all indicate large effect of trial type, weak effect of condition.

Permutation tests
  • (Original frequentist analysis)
  • Not ideal, as treats condition and trial type as totally independent, and uses per-subject point estimates instead of full data
  • Results: trial type has large and reliable (p << .001) effect, condition has small and unreliable (p ~=.3) effect
MLE logistic ANOVA

2 MLE logistic ANOVA

  • Uses sum coding, matching Krushcke’s sum-to-zero constraint.
  • Consistent with permutation tests

Notes:

Using statsmodels for MLE on the regression equation gives roughly \beta_0 = -2.2, \beta_{condition} = .07, and \beta_{trial\_type} = .21.

No \beta_{subject} are estimated well (confidence interval is extreme to either side of 0), as is \beta_{condition}.

When excluding \beta_{subject}, \beta_{condition} has a similar value but much tighter confidence interval. Everything else remains the same.

\beta_{interaction} is identical and tiny regardless of whether \beta_{subject} is included.

Hierarchical binomial models
  • Structurally similar to permutation tests; assumes independence of trial type and condition, but doesn’t
    – Unlike permutation tests, however, can be done at level of individual trials or grid locations within trials.
  • See Krushcke Ch 9 (p252 for model and JAGS)
  • Result is similar to permutation tests and MLE: small effect of condition, with some non-negligible posterior mass on opposite side of 0; large effect of trial type, with all posterior mass on one side of 0

Bayesian logistic ANOVA

  • Model from Krushcke DBDA 2e Ch 21 (p642)
    – Rough example also in this notebook.
  • I want to eventually report this analysis, but I’m not sure it’s working correctly, as it gives qualitatively different results from the other analyses.

Model

y \sim Binomial(\theta, n)
\theta \sim Beta(a, b)
a = \omega \cdot (\kappa - 2) + 1
b = (1 - \omega) \cdot (\kappa -2) + 1
\omega = \mathit{logistic}(\mu)
\mu = \beta_{0} + \beta_{condition}[x_{condition}] + \beta_{trial\_type}[x_{trial\_type}] + \beta_{interaction}[x_{interaction}] + \beta_{subject}[x_{subject}]
\beta_0 \sim N(0, \tau=\frac{1}{4})
\beta_i \sim N(0, \sigma_i)
\sigma_i \sim Gamma(1.64, .32)

Sample code
# Standard split-plot BANOVA design
# More or less follows
# https://github.com/JWarmenhoven/DBDA-python/blob/master/Notebooks/Chapter%2021.ipynb

# I've tried a bunch of variations on this, including changing to sum-to-zero encoding for inputs
# but these changes mostly don't seem to have that much impact

def create_coeff(name, shape):
    # Shared variance, sample coefficients
    sigma = pm.Gamma(f'sigma_{name}', 1.64, .32)
    a = pm.Normal(f'a_{name}', mu=0, tau=1/sigma**2, shape=shape)
    return a


def create_model(trial_index, condition_index, interaction_index, subject_index, 
                 y, n, batch_size,
                 num_trial_levels, num_condition_levels, num_interaction_levels, 
                 num_subjects): 
    """Create variables and return context manager.

       index args are integers indicating which level of variable
       y is error counts, n is number of possible errors
    """
    
    with pm.Model() as model:
        a0 = pm.Normal('intercept', 0, tau=1/2**2)
        a_position = create_coeff('trial', num_trial_levels)
        a_condition = create_coeff('condition', num_condition_levels)
        a_interaction = create_coeff('interaction', num_interaction_levels)
        a_subject = create_coeff('subject', num_subjects)
        
        mu = a0
        mu += a_position[trial_index]
        mu += a_condition[condition_index]
        mu += a_interaction[interaction_index]
        mu += a_subject[subject_index]
        mu = pm.Deterministic('mu', mu)

        omega = pm.Deterministic('omega', T.nnet.sigmoid((mu)))
        
        kappa = pm.Gamma('beta_variance', .01, .01)
        alpha = omega * (kappa - 2) + 1
        beta = (1 - omega) * (kappa - 2) + 1
        
        theta = pm.Beta('theta', alpha=alpha, beta=beta, shape=batch_size)
        y = pm.Binomial('targets', p=theta, n=n, observed=y)

        a = T.concatenate([a_position, a_condition, a_interaction, a_subject])
        m = pm.Deterministic('m', a0 + a)
        bb0 = pm.Deterministic('bb0', T.mean(m))
        bb = pm.Deterministic('bb', m - bb0)
        
        position_contrast = pm.Deterministic('bb_pos', bb[1] - bb[0])
        condition_contrast = pm.Deterministic('bb_con', bb[3] - bb[2])
        
    return model

Results

The results deviate from the other three analyses. No reliable effect of trial type, and condition sometimes has a reliable and large effect depending on model alterations. Some subject coefficients are estimated with tight posteriors around relatively large values.

Further details

There are two ways to implement this model. Krushcke uses an overparametrized scheme, where each level of each variable gets a coefficient, and these are corrected by substracting the level-wise mean from each coefficient. The alternative is to use sum-to-zero coding. There are advantages and drawbacks to both approaches.

In Krushcke’s approach, no dot product is necessary, which helps to accelerate sampling, especially when using trial or within-trial results as the target outcome. (Theano seems inefficient at these dot products, and PyMC3 doesn’t work very well with GPU compute for me, so this becomes a problem.) In practice, however, it seems to perform poorly when sampling.

In the sum-to-zero coding approach, the dot product is computationally limiting, but it simplifies the computations downstream by avoiding the need to adjust all regression params by level-wise means. Computing condition/trial type contrasts is as simple as doubling the coefficient value for a given level, since each has only two levels.

(The condition in which experiment condition has a large effect is using sum-to-zero coding with an additional correction. That correction subtracts the per-condition means of all \beta_{subject_i} and adding to the corresponding \beta_{condition}. This made sense to me because each participant belonged to only one condition, so any tendency across participants within a condition can be attributed to condition instead of participant coefficient.)

In either approach, I do notice that the sigma_i tend to be quite large.

Questions

Main question is, why am I getting a radically different result from the Bayesian ANOVA approach? I wouldn’t be that surprised if coefficients were just on different scales, but to find the relative presence/absence of effects basically reversed seems unusual. (Yes, I checked that I was passing the data correctly :wink: ).

My suspicion is that this has to do with prior on per-variable variance \sigma_i being on the wrong scale, but I’m not too sure - changing it hasn’t really changed the posteriors too much. I’ve also tried initializing the coefficient means to the outcomes from MLE point estimates, which helps some if the variance prior is also tightened. But inference seems to pull these towards the same values as zero initialization regardless.

Anyway, I’ve been working on this for weeks with no real progress - any suggestions, hints, or clues would be welcome!

After fiddling with some implementation details, I managed to get this running with the full data using the Bernoulli likelihood assumed by other models. Using the Bernoulli likelihood restored the expected behavior wrt the other analyses.

However, it has a huge drawback, in that it becomes computationally very expensive to compute the regression equation over 160k observations when including subjects. (~4 coefficients -> 41 coefficients; on my 16-core AMD, this takes close to a week to tune and take 8k samples).


A couple follow up questions.

  1. Why is there a big, qualitative difference between Bernoulli and binomial likelihoods here? I had intuited these would be mathematically equivalent, but it seems they’re not interchangeable when sampling.
    1. The binomial n varies by trial in my experiment - is that potentially an issue?
  2. I also found when computing the regression equation, it was actually more efficient to write out the dot product multiplication-sum pattern explicitly instead of using theano.Tensor.dot. The latter would give me memory errors when trying to run on multiple cores, and was slower on a single core. Anybody have an understanding of theano behavior here?

Maybe @JWarmenhoven could provide some comments here, but in general there should be no differences using a Bernoulli or Binomial likelihood, if you dont have additional trial level information.

Also, the slowness of Tensor.dot is likely GPU related, as in PyMC3 you mostly get better performance by using CPU only.

I think the issue is that \theta \not\perp n in my problem. When I introduce n as a regressor using Bernoulli likelihood, I find a huge coefficient.

If I have time, I may go back and confirm whether the difference persists with Binomial likelihood when including n as a regressor. But I think I have a grip on this now; thanks for your feedback!