Multivariate-multinomial logistic regression

I’m new to PyMC3 and currently working on a multivariate-multinomial logistic regression. I came across the iris data set problem and using it as a template. I have the following questions

  1. I ran into convergence problems when I specified the shape for the alpha and beta (The rhat statistic is larger than 1.4 for some parameters. The sampler did not converge.). Does PyMC3 determine the shapes of the alpha and beta internally based on the shape of data in x?
  2. If I were to use different sampling distributions (like beta_1, beta_2, beta_3 and beta_4) for each of the four features in my data, how do I specify the linear equation in mu?
  3. Can anyone please share a snippet of code for performing posterior predictions? Basically I would want to specify values to the four features and get a probabilities of outcome for the three classes
    Here’s my code. Any help is appreciated and thanks for your time

data = pd.read_csv(‘data.csv’)

data[‘TYPE’]= label_encoder.fit_transform(data[‘TYPE’])
y_obs = data[‘TYPE’].values
x_n = data.columns[:-1]
x = data[x_n].values
ndata= x.shape[0]
nparam = x.shape[1]
nclass = len(data[‘TYPE’].unique())

with pm.Model() as hazmat_model:
alfa = pm.Normal(‘alfa’, mu=0, sd=10, shape=nclass)
beta = pm.Normal(‘beta’, mu=0, sd=100, shape=(nparam,nclass))
alfa = pm.Normal(‘alfa’, mu=0, sd=10)
beta = pm.Normal(‘beta’, mu=0, sd=100)
mu =,beta) + alfa
p = tt.nnet.softmax(mu)
yl = pm.Categorical(‘y_obs’, p=p, observed=y_obs)
trace = pm.sample(niter, step = pm.Metropolis())

Welcome WhineKing

Your model is setup pretty well in regard to the shapes. You need to specify the shapes of alpa and beta manually as you have done in your code. You can use more verbose code like beta_0, beta_1 but it’s rarely the best approach. It’s cleaner and a little bit quicker to use vectorised parameters.

Regarding your convergence, there are three issues:

  • The priors are far too wide, I would recommend Normal(0, 1)
  • Your data isn’t normalised. Bayesian methods work best if you have normalised inputs and outputs. In my code I used zero mean and unit standard deviation. For this example it may not matter (I didn’t test) but it’s good practice to standardise for all your projects. There can be reasons where this isn’t always the case but for simple models like this it’s a good first step.
  • You are using an outdated sampling technique. Metropolis has been superceded by NUTS which PyMC3 uses by default.

My code is below which also shows how to do the posterior prediction.

from sklearn.preprocessing import LabelEncoder, Normalizer, StandardScaler
import numpy as np
import pymc3 as pm
import theano.tensor as tt
import arviz as az
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('', header=None, names=[0, 1, 2, 3, 'TYPE'])
data['TYPE']= LabelEncoder().fit_transform(data['TYPE'])
y_obs = data['TYPE'].values
x_n = data.columns[:-1]
x = data[x_n].values
x = StandardScaler().fit_transform(x)
ndata = x.shape[0]
nparam = x.shape[1]
nclass = len(data['TYPE'].unique())

print( y_obs.shape, x.shape )

with pm.Model() as hazmat_model:
    X_data = pm.Data('X_data', x)
    y_obs_data = pm.Data('y_obs_data', y_obs)
    alfa = pm.Normal('alfa', mu=0, sd=1, shape=nclass)
    beta = pm.Normal('beta', mu=0, sd=1, shape=(nparam, nclass))
    mu =, beta) + alfa
    p = tt.nnet.softmax(mu)
    yl = pm.Categorical('obs', p=p, observed=y_obs_data)
    trace = pm.sample()
    idata = az.from_pymc3(trace)

with hazmat_model:
    pm.set_data({'X_data':np.random.normal(size=(8, 4))})
    pred = pm.fast_sample_posterior_predictive(trace)
1 Like

Thanks much @nkaimcaudle. Will incorporate your suggestions. Have one more question. Do I need to convert the y_obs form a vector of length 'ndata" (number of observations) to more like a one-hot code array with shape (ndata, nclass) OR does PyMC3 does this internally?

pm.Categorical expects data like you have already provided. 1-dim array of length N with integers {0, …, p-1}

1 Like

Hi guys,
As always, Nicholas already brilliantly answered your question, but I just wanted to complement it and say that I wrote a NB showing how to prior and posterior predictive checks exactly on this kind of model. It’ll illustrate Nicholas’ point about your priors being too wide and show you a principled way of choosing them.

It extends the example you’ll find in Osvaldo Martin’s very good book, Bayesian Analysis with Python.
Hope this helps :vulcan_salute:


Hey Alex,
Your have explained very neatly the various ways to do predictive posterior checks in your NB on multinomial regression. I implemented some aspects of it in my code and have the follow-up questions for you.

  1. I have 4 features in my model and I ran the counter_fact plots for each of the parameter in turn while holding others at mean=0. Below is the plot for parameter-1.

    For all the four cases I get an accuracy of 66%. I guess the lower accuracy is due to my features. Either there is some correlation in my features or I have inadequate number of features for undertaking the regression. I tried changing the std dev for the priors alfa and beta and that didn’t help as NUTS had no convergence problems with std dev = 1.0. Also, my data does not equally represent the three types (typ-1, typ-2 and typ-3) of observations that I’m trying to predict. I have 60% of data for type-1, 25% for type-2 and 15% for type-3. Could this also be contributing to the lower accuracy?

  2. Next I ran ppc on 1800 raw data points (with default 20,000 sampling from the posterior). The plot of the mean of the 20,000 predictions against the observations (1800 observations) is as below. As you can see the predictions are really bad - the mean of the predictions is between type-1 (0) and type-2 (1). What could be the reason for this?

Any help or suggestions are welcome

Glad I could help!
The PPC do look quite bad. It looks like your model only predicts type1 or type2, and even then it can’t really decide between the two. It’s hard to speak specifically without looking at the model, but I think what I’d do is a mix of:

  • Prior predictive plots on the outcome scale (what you did above, but just with the priors, without any data. This should help you see if there is something wrong in your parametrization, that impedes the model from updating when it gets data).
  • Test with simpler models first: intercept-only, one intercept, two intercepts, etc. and see whether it changes predictions. Maybe you just have too many predictors and this is confounding inference? Model comparison should help you with that.
  • Thinking generatively about your model to select predictors: how can the data happen? What is the process at play and which predictors are potentially relevant to this process? This is realted to confounding.
  • If you think class unbalance is a problem, balance your dataset and sample from your model with it. My bet would be it’s not, since your Multinomial likelihood encodes knowledge about the number of trials in each category, so the model will takes this into account and be more uncertain for type 3 than for type 1 for instance. Plus, there should be a big bias in favor of type 1 in your last plot if class imbalance were a big problem.
  • After all these checks, try to extend to a hierarchical model to pool information across categoris and shrink parameters as a result.

Hope this helps :vulcan_salute:


In the iris dataset example, you are using the mean of the predictions to illustrate the accuracy of the model (see the code snippet below).

y_pred = post_checks[“yl”].mean(0)

Since the predictions in post_checks["yl’] are the category numerical labels (either 0, 1, 2) and not probabilities, is it appropriate to use the average of the category numerical labels as a measure of the accuracy? In you approach, the category “0” will not have any contribution to the mean. Instead how about using the category with the highest frequency as below?

with model_sf:
pm.set_data({“X”: x_s})
post_checks2 = pm.sample_posterior_predictive(trace_sf, var_names=[“θ”], random_seed=RANDOM_SEED)[“θ”]
θ_mean2 = post_checks2.mean(0)
y_pred2 = np.argmax(θ_mean, axis=1)
jitter = np.random.normal(0, 0.03, len(y_s))

plt.figure(figsize=(12, 5))
plt.scatter(y_s + jitter, y_pred2, alpha=0.4)
plt.xticks(range(3), iris.species.unique())
plt.xlabel(“Observed category”)
plt.yticks(range(3), iris.species.unique())
plt.ylabel(“Predicted category”)
plt.title(“In-sample posterior predictive check”);

I’m not sure you should do that: you’re forcing a decision boundary onto the posterior proportions of your model and trying to simulate new observations, yl, with y_pred2 = np.argmax(θ_mean, axis=1). But if you want that, just take posterior predictive samples of yl and plot them raw. That way, your predictions will include the all the uncertainty present in your model – while your current approach doesn’t take into account the sampling uncertainty in the likelihood.

That’s what I do in my approach, but I go one step further and reverse-engineer the latent proportions associated with the new, predicted data – that’s what post_checks["yl"].mean(0) does. This takes into account the whole uncertainty and lets you understand how confident the model is for each category by seeing how the proportions split up by category.
Note that the average is computed on the axis 0, which is the axis of the number of samples, not the axis of the categories. So, we’re computing the frequencies within each category, not across them.

Hope this is clearer :slight_smile:

I gave it a second thought and you are rite in pointing out that I should not be forcing decision boundaries on the posterior predictions.
I have now reduced the number of predictors to two and the accuracy has improved from mid 50’s to high 70’s. However, one of the predictors is continuous and one is discrete ( six distinct values - 1, 2, 3, 4, 5, 6). Is there a different approach in terms of the selection of priors or sampling algorithm when the predictors are comprised of discrete and continuous values?
In addition to the multivariate multi-nomial regression, I have recast the problem to a multivariate-binomial regression. I want to visualize the occurrence probability as a function of the two predictors. Can you do surface plots of posterior predictions using arviz in PyMC3. If so, can you point me to snippet of the code?

I again, want to thank you for all your help and advise.


Glad that helped!

  • Having categorical predictors doesn’t change the choice of sampler (PyMC3 chooses the best one for you in most cases anyway), but you have at least two ways of specifying the priors for them: either with indicator variables or with index variables. The latter is usually easier to specify priors and is also more scalable to cases with many categories as well as hierarchical models. You’ll find examples of index variables for categorical predictors in chapter 11 of Rethinking 2 – here is the port to PyMC3.

  • You can definitely do PP plots with ArviZ. The best is that you sift through the API functions.

Hope that helps :vulcan_salute: