Hierarchical logistic regression

Here is some code I’ve found myself using a lot recently. It’s a basic hierarchical logistic regression. It has sensible prior choices (seemingly sensible to me at least), accepts a patsy-style design string, and has some nice convenience features such as automatically naming coefficients and “compressing” the data from long form (one row per Bernoulli trial) into count/binomial format. Please feel free to use or critique it!

Reproduced below:

"""Simple fully contained script to fit a Bayesian hierarchical logistic
regression model using PyMC3.

import theano
import matplotlib
matplotlib.use('Agg')  # seems to be necessary on Unix
import numpy as np
import pandas as pd
import pymc3 as pm
import theano.tensor as tt
from patsy import dmatrix
import matplotlib.pyplot as plt

def compress(data, outcome, formula, submodel):
    """Compress the data frame from "Bernoulli" format (one row per trial) to
    "binomial" format (one row per condition). This can dramatically speed up

        data (pd.DataFrame): Data in Bernoulli format.
        outcome (str): Name of the outcome variable (i.e., dependent variable)
        formula (str): Patsy formula specifying the independent variables.
        submodel (str): Name of variable used to index submodels.

        pd.DataFrame: Data in binomial format.

    predictors = [t for t in data.columns if t in formula or t == submodel]
    entries = []

    for idx, df in data.groupby(predictors):

        entry = {k: v for k, v in zip(predictors, idx)}
        entry['n%ss' % outcome] = df[outcome].sum()
        entry['ntrials'] = len(df)

    return pd.DataFrame(entries)

def model(data, outcome, formula, submodel):
    """Constructs model and places it in context, ready for sampling.

        data (pd.DataFrame): Data in binomial format.
        outcome (str): Name of the outcome variable (i.e., dependent variable)
        formula (str): Patsy formula specifying the independent variables.
        submodel (str): Name of variable used to index submodels.

        None: Model placed in context.

    design_matrix = dmatrix(formula, data)
    submodel_names = data[submodel].unique()
    sub_ix = data[submodel].replace(
        {r: i for i, r in enumerate(submodel_names)}).values
    betas = []

    for n in design_matrix.design_info.column_names:

        n = n.replace('"', '').replace("'", '').replace(' ', '').replace(',','')
        μ = pm.Cauchy(name='μ_' + n, alpha=0, beta=5)
        σ = pm.HalfCauchy(name='σ_' + n, beta=5)
        δ = [pm.Normal(
            name='δ_(%s=%s)_(condition=%s)' % (submodel, r, n), mu=0., sd=1.
        ) for r in submodel_names]
        β = [pm.Deterministic(
            'β_(%s=%s)_(condition=%s)' % (submodel, r, n), μ + d * σ
        ) for d, r in zip(δ, submodel_names)]

    B = tt.stack(betas, axis=0).T
    p = pm.invlogit(tt.sum(np.asarray(design_matrix) * B[sub_ix], axis=1))

        name='n%ss' % outcome,
        observed=data['n%ss' % outcome].values

def run():
    """Run the model.


    with pm.Model():

        model_name = ''  # give your model a name
        data = pd.read_csv('')  # path to Bernoulli-formatted CSV
        outcome = ''  # DV column name
        formula = ''  # Patsy-style formula
        submodel = ''  # submodel index column
        data = compress(data, outcome, formula, submodel)
        model(data, outcome, formula, submodel)
        backend = pm.backends.Text(model_name)
        trace = pm.sample(10000, tune=2000, trace=backend)
        params = open('%s/chain-0.csv' % model_name).readline().split(',')
        params = [p for p in params if 'μ' in p]
        pm.traceplot(trace, params)
        plt.savefig('%s/traceplot.png' % model_name)
        pm.plot_posterior(trace, params)
        plt.savefig('%s/posteriors.png' % model_name)
        pm.df_summary(trace).to_csv('%s/summary.csv' % model_name)

if __name__ == '__main__':


Pretty neat! Do you have some example code and data?

No data to hand that I have permission to share, unfortunately. I’ve only used this on other people’s data thus far.

Very nice!
I’d maybe think about standardising the predictors, or adjusting the prior scales to the sd of the predictor. For the σ vars I think I’d maybe rather use an pm.StudentT(nu=3, sd=2.5)
You could have a look at how rstanarm chooses its defaults.


I don’t standardize the predictors simply because each time I’ve used this the predictors have been categorical. Personally, when dealing with continuous predictors in a logistic regression I prefer to scale and center them before building the model, and to keep them in units that make sense for the model I’m building. For example, I often use IQ as a predictor, which I transform to iq_ = (iq - 100) / 15 beforehand, so that it is centered on the putative population average and is in units of putative population standard deviation.

Changing the prior on σ is an interesting idea, that’s actually the decision I’m most unsure of.

Thanks for sharing!

I’ve only got one predictor variable but it’s continuous. From what I can tell the code isn’t really set up to handle a continuous variable. For example, it loops over data.groupby(predictors) which I’m not sure makes sense for a continuous predictor. Would you be able to offer advice on how I might be able to modify this for a continuous predictor? I saw you mentioned IQ in the comments but I’m new to all this and still don’t understand how I’d implement it based on that example.

I will work just fine with a continuous predictor.

Hi sammosummo,

Thank you for the code.
Can you please provide an example of how to use this code?

Thanks you.

The script is a self-running example. Just fill in the missing strings in top 5 lines of run().

Okay. Will it work if the submodel variable has large number of categories ( around 2500 in my case)?

You have 25,000 submodels? I don’t know of any software that will handle such a hugely complex model well.

Incidentally, recently I’ve been having trouble with the invlogit link function. I would switch this to sigmoid.

+1, really like the code organization :slight_smile: A few notes from my own recent experiences with logistic regression in PyMC3:

Using `Binomial` likelihood can sometimes give different results than `Bernoulli` likelihood.

I haven’t worked out exactly why this is the case for my problem, but it may have to do with n for Binomial varying by “trial” in my case. When I use data in Bernoulli format and introduce n as an additional regressor, I find a huge coefficient, so there’s likely some interaction there that can’t be picked up by the Binomial likelihood, which effectively treats theta as invariant with respect to n.

TLDR: YMMV if n varies by trial.

Sharing variance prior over levels of variables.

In my own work, I followed Kruschke’s approach of sharing variance prior over different levels of the same categorical variable. The motivating is to share scale for coefficients for different levels of the same variable.

However, if I’m reading your model function correctly, you’re going to have independent priors on variance for each variable regardless of level.

(Or is this what the submodel_names logic is doing here? I’m having a bit of trouble following it without seeing some example dataframe/design matrix, but seems like it has something to do with selecting a subset of the data instead…)

Using `Beta` prior over likelihood.

Again following Kruschke; he likes to use a Beta prior over the likelihood and use the regression equation to model the expectation of that Beta prior. I’m curious if you’ve tried this!

(For me, it slowed sampling down and didn’t have a noticeable impact on coefficient estimates, so I took it out. Probably less important as size of dataset increases.)

I’m not sure I understand this. I have found equivalent results when binning trials into conditions and using Binomial with varying n and not binning and using Bernoulli to model each trial individually, as expected.

That’s correct. My script does nothing special for categorical predictors, it just reads a patsy string to create a design matrix. If you use the C(var) constructor to create multiple predictors from the same categorical variable these predictors get treated separately. Sharing the variance prior sounds like a good idea, but slightly tricky to implement programmatically I suspect.

A Beta prior over the likelihood, does this mean instead of a Binomial or Bernoulli likelihood, or something else?

I have found equivalent results …

Yeah, I was surprised by this as well! Bernoulli and binomial should be equivalent as long as \theta \perp n.

However, I have a different situation, where \theta \not\perp n. In that case, if n is not a regressor in the model, the binomial likelihood is disproportionately affected by making use of n but treating it as independent of \theta (compared to the Bernoulli model, which ignores n entirely).

For example: Imagine an experiment where somebody has to remember the color of each ball in a bucket. As the number of balls in a bucketincreases, do you expect the probability of remembering each ball’s color remain flat across buckets?

… slightly tricky to implement …

Yep. I handled this with some func like create_coeff(name, shape, ...). But I don’t have the same convenient inference of model from formula that you developed :slight_smile:

Beta prior

\mu = logistic(\beta X)
\kappa \sim Gamma(.01, .01)
a = \mu \cdot \kappa
b = (1 - \mu) \cdot \kappa
\theta \sim Beta(a, b)
y \sim Binomial(\theta, n)

Instead of
\theta = logistic(\beta X)
y \sim Binomial(\theta, n)

I’m still not quite following this, but I think in all the times I’ve ran somethings like this n is more or less constant, so it isn’t an issue, I think.

What purpose does this serve?

I’d expect you wouldn’t see \theta \not\perp n often. In certain areas though (psychology, economics, neuroscience, maybe agent-based AI), you’ll find some resource-constrained problems (like recall from finite memory) that have this property.

Another way to look at this: if \theta \not\perp n, then outcomes are not really identically distributed unless \theta is conditioned on n. If IID assumption is not close to reality, you can get strange behavior when rolling up to binomial level.

If you start writing as though to derive MLE, you’ll spot the difference pretty quickly.

But, as you say, this is usually not the case - IID usually a fine assumption.

Using a beta prior will allow you to explicitly model the variance of \theta. The variance of Bernoulli depends only on \theta, so you can’t estimate variance/expectation separately for it. (Same for binomial, since n is usually fixed). Therefore a beta prior allows you to model noise across trials in \theta estimates.

(If you consider the linear regression case with a normal likelihood, you would usually introduce a prior on \sigma for the likelihood - no way to do this with bernoulli/binomial.)

Put another way, it makes the model more robust and helps handle potentially overdispersed outcomes.

(Additionally, if you are doing something ANOVA-like, it reflects the idea of homogeneity of variance structurally. Maybe not that important for Bayesians, but if you find yourself pitching to frequentist die-hards, can help.)

1 Like

Thanks, @galbiati!

This is a fascinating discussion that has simultaneously taught me a great deal and completely ruined my workday.


I like your code very much.
I’ve found that all the coefficients of independent variables are different in different sub-models.
Would you mind providing a verison that some coefficients are identical and some coefficients are different in sub-models. It can reduce the amount of calculation.
And to my knowledge, a model like this may be called mixed model which has random effect and fixed effect.
Thank you.