Unsure how to proceed with prior and posterior predictive checking for Bayesian multiple logistic regression

Hi everyone,

Apologies if this question has been answered previously - if so, I would appreciate any further reading and references! I did a thorough search both here and elsewhere, and information for beginners on end-to-end Bayesian multiple logistic regression workflows seems scarce.

Below is my code for model checking, along with the relevant dependencies, using both PyMC and ArviZ. The model itself runs perfectly well, judging by the trace plot, summary, etc. However, when attempting to plot the prior and posterior predictive checks, I think I am either A) totally misunderstanding how to interpret the plots or B) making a user error when generating the correct or intended plots.

If the former, could you please help me understand how to make sense of the plots? If the latter, is there an error with my code and / or what is the correct way to conduct these model checks for Bayesian multiple logistic regression? Additionally, what other follow-up procedures do you recommend for this type of analysis?

Thank you in advance for your time and support!


# Define the Bayesian multiple logistic regression model

    import arviz as az
    import pymc as pm

    # df is a pandas DataFrame holding the predictors x1-x6 and the binary outcome y

    with pm.Model() as model: 
        a = pm.Cauchy('intercept', alpha = 0, beta = 10)
        b1 = pm.Cauchy('x1', alpha = 0, beta = (2.50 / (2 * 0.50)))
        b2 = pm.Cauchy('x2', alpha = 0, beta = (2.50 / (2 * 0.50)))
        b3 = pm.Cauchy('x3', alpha = 0, beta = (2.50 / (2 * 0.50)))
        b4 = pm.Cauchy('x4', alpha = 0, beta = (2.50 / (2 * 0.50)))
        b5 = pm.Cauchy('x5', alpha = 0, beta = (2.50 / (2 * 0.50)))
        b6 = pm.Cauchy('x6', alpha = 0, beta = (2.50 / (2 * 0.50)))
        # Gelman et al. 2008 DOI: 10.1214/08-AOAS191

        mu = pm.math.invlogit(a + b1 * df['x1'] + b2 * df['x2'] + 
            b3 * df['x3'] + b4 * df['x4'] + 
            b5 * df['x5'] + b6 * df['x6'])

        pm.Bernoulli('logit', p = mu, observed = df['y'])

# Run the Hamiltonian Monte Carlo method

    with model: 
        trace = pm.sample(draws = 1000, tune = 1000)
        # Increase both 'draws' and 'tune' for true analysis

# Evaluate prior and posterior predictive checks

    with model: 
        prior_pc = pm.sample_prior_predictive(samples = 10)
        # Increase 'samples' for true analysis
        az.plot_ppc(prior_pc, observed = True, colors = ['coral', 
            'lightskyblue', 'slategrey'], group = 'prior')

        post_pc = pm.sample_posterior_predictive(trace = trace, 
            progressbar = True)
        az.plot_ppc(post_pc, observed = True, colors = ['coral', 
            'lightskyblue', 'slategrey'], group = 'posterior')

Welcome!

In these sorts of cases, I think it’s helpful to state what you expect to see from these plots and how these plots differ from your expectations.

That being said, the plots look ok to me. Roughly speaking, the plots show the proportion of zeros (to the left) and the proportion of ones (to the right). There are several pairs of proportions plotted simultaneously. One pair corresponds to your data (the blue line). The others are generated by the model (in orange), and each draw from the prior/posterior yields a new pair of proportions. The mean of these proportions is in black. Does that help?
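
If it helps to connect the plot to concrete numbers, here is a minimal sketch of computing those same proportions by hand (assuming post_pc is the InferenceData returned by pm.sample_posterior_predictive above, containing the observed_data and posterior_predictive groups, with the observed variable named 'logit' as in your model):

    import numpy as np

    # Observed proportion of ones -- the right-hand end of the blue line
    obs = post_pc.observed_data['logit'].values
    print('observed proportion of ones:', obs.mean())

    # Proportion of ones in each posterior predictive draw -- the orange lines
    pp = post_pc.posterior_predictive['logit'].values    # shape: (chains, draws, n_obs)
    draw_props = pp.mean(axis=-1).ravel()                 # one proportion per draw
    print('posterior predictive proportion of ones:',
          draw_props.mean(), np.quantile(draw_props, [0.05, 0.95]))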

Hi there!

Thanks for the quick reply - I am glad to hear these plots seem correct!

Being equal parts new to the Bayesian world and not the biggest fan of logistic regression, I may just need more foundational knowledge to accomplish my current goals. With these plots, I anticipated some S-shape, and I was not expecting the missing y-axis or the scale of the x-axis. Given your explanation, I follow that the y-axis roughly displays proportions or counts, etc. However, why is the x-axis scaled from zero to two? Is it a function of binning, where less than one represents zero and greater than or equal to one represents one?

That being said, it was mainly the prior predictive check that left me slightly confused. Below is my final version of the graph with 1,000 samples. How exactly would one describe this visual? I am assuming it would be ideal for the prior predictive mean to be equal across zeros and ones… Would you mind helping me finish that thought?

For the posterior predictive check, given the overlap, one would state that the model is satisfactory at retrodicting the data. Does that sound correct?

I am sure a lot of this information is PyMC 101, so I greatly appreciate your guidance!

If you were to obtain the plot you describe, what do you expect the x-axis to represent?

I too would probably prefer to plot the parameter p (or in your model, mu) that is fed into the logistic. However, as the above question implies, generating that plot requires some careful choices. For example, you can’t simply order the observations (e.g., the values on the x-axis of the plot you posted above) by their value/magnitude because you only have 2 values (0 and 1). You could probably craft such a plot if you had predictions about exactly how mu should vary across the observations in your data set. However, doing so would require a custom solution based on your data set and your understanding of that data set.
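
Just as one illustration of what such a custom plot could look like, here is a rough sketch. It assumes that trace is the InferenceData returned by pm.sample, that you wrap the inverse-logit in a Deterministic (e.g., mu = pm.Deterministic('mu', pm.math.invlogit(...))) so mu is recorded in the trace, and it makes the (arbitrary) choice of ordering observations by the posterior mean of mu:

    import matplotlib.pyplot as plt
    import numpy as np

    # Assumes the model defines mu as a pm.Deterministic named 'mu',
    # so trace.posterior['mu'] exists after sampling
    mu_post = trace.posterior['mu'].values               # shape: (chains, draws, n_obs)
    mu_post = mu_post.reshape(-1, mu_post.shape[-1])     # flatten chains and draws

    order = np.argsort(mu_post.mean(axis=0))             # sort observations by posterior mean of mu
    mean_mu = mu_post.mean(axis=0)[order]
    lo, hi = np.quantile(mu_post, [0.05, 0.95], axis=0)

    x = np.arange(mu_post.shape[-1])
    plt.fill_between(x, lo[order], hi[order], alpha=0.3, label='90% interval of mu')
    plt.plot(x, mean_mu, label='posterior mean of mu')
    plt.scatter(x, df['y'].values[order], s=8, alpha=0.3, label='observed y')
    plt.xlabel('observations, sorted by posterior mean of mu')
    plt.ylabel('probability')
    plt.legend()
    plt.show()

Whether sorting by the posterior mean of mu is meaningful depends on your data, so treat this only as a starting point.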

I would suggest checking out this notebook for a) some general tips on prior/posterior predictive checking workflow, b) some custom plots that could be used to visualize more useful patterns in your prior/posterior, and c) an example using logistic regression toward the end.

I believe so. The plotting doesn’t know that your data is dichotomous, so it just tries to handle the values it sees in your data set (and the values produced by your model).

I would summarize this plot as suggesting that the two values in your data are (approximately) equally probable. However, prior predictive plots should really be evaluated based on your understanding of your data and your model. For example, it’s not necessarily the case that one’s priors should imply equal probabilities of the two classes (e.g., you might know you are dealing with imbalanced data). Check out the notebook linked above for examples of alternative strategies for interrogating things.
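
To make that concrete, one simple numerical check on the prior is the class balance it implies, i.e., the proportion of ones in each prior predictive draw. Here is a sketch, assuming prior_pc is the InferenceData from pm.sample_prior_predictive above and the observed variable is named 'logit'; you can then ask whether that range is plausible for the data you actually have:

    import numpy as np

    prior_draws = prior_pc.prior_predictive['logit'].values   # shape: (chains, draws, n_obs)
    prop_ones = prior_draws.mean(axis=-1).ravel()              # proportion of ones per prior draw

    print('prior predictive proportion of ones:')
    print('  mean:', prop_ones.mean())
    print('  5%-95% interval:', np.quantile(prop_ones, [0.05, 0.95]))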

Thank you so much for taking the time to reply with such thoughtful responses! I appreciate your use of questions and brainstorming to help me better understand this workflow.

I will certainly review the notebook you linked; I actually had reviewed it previously, but I guess I never made it through that last example at the bottom in enough depth…

And the last bit you provided is especially insightful: my data is actually highly imbalanced (i.e., a ~90/10 split). I am going to take all of this back to the drawing board.

Have a great day!


Glad it helped. As you continue to work on your model, feel free to ask questions as you run into new problems/confusion/mysteries!

It might also be useful to consider arviz.plot_separation (ArviZ documentation), which should appear in the "See Also" section of plot_ppc instead of linking to itself - sorry about that.
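
For reference, a minimal call could look something like this (assuming post_pc holds both the observed_data and posterior_predictive groups, as returned by pm.sample_posterior_predictive above, with the variable named 'logit'):

    import arviz as az

    # Separation plot: observations are ordered by predicted probability,
    # with the observed ones highlighted, so you can see how well the model
    # separates the two classes
    az.plot_separation(idata=post_pc, y='logit', y_hat='logit', figsize=(10, 1))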