Why does my multinomial model with categorical predictors and response differ so much from the results in R?

Hi everybody.
I’m new to both PyMC and multinomial regression. I’m currently running a model where both the outcome and the (multiple) predictors are categorical, the latter including an interaction term.

I just tried to adapt the following code:

# Tutorial code (PyMC3-era: `sd=` and theano.tensor)
import numpy as np
import pymc3 as pm
import theano.tensor as tt

with pm.Model() as model_sf:
    α = pm.Normal("α", mu=0.0, sd=1.0, shape=2)
    α_f = tt.concatenate([[0], α])

    β = pm.Normal("β", mu=0.0, sd=0.5, shape=(4, 2))
    β_f = tt.concatenate([np.zeros((4, 1)), β], axis=1)

    X = pm.Data("X", x_s)
    μ = α_f + pm.math.dot(X, β_f)
    θ = pm.Deterministic("θ", tt.nnet.softmax(μ))

    yl = pm.Categorical("yl", p=θ, observed=y_s)

    trace_sf = pm.sample(1000, tune=2000, cores=2, random_seed=RANDOM_SEED)

that I found in this very good tutorial.

Given my data (example csv attached) I adapted it as follows:

import numpy as np
import pandas as pd

df = pd.read_csv('test.csv')
df['Y'] = pd.Categorical(df['phenotype'])
gen1 = df['genotype1'].to_numpy()
gen2 = df['genotype2'].to_numpy()
interaction = pd.Categorical(gen1 * gen2).codes
gen1 = pd.Categorical(gen1).codes
gen2 = pd.Categorical(gen2).codes
x_s = np.column_stack([gen1, gen2, interaction])
y_s = pd.Categorical(df['phenotype']).codes  # .codes is already 0-based, no need to subtract 1
n_categories = 3
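(A quick sanity check of the encoding, with toy 0/1 genotype codes — made-up values, not the attached data: the interaction column is nonzero only when both genotypes take the non-reference level.)

```python
import numpy as np

# Hypothetical 0/1 genotype codes, just to illustrate the design matrix
gen1 = np.array([0, 0, 1, 1])
gen2 = np.array([0, 1, 0, 1])
interaction = gen1 * gen2  # 1 only when both genotypes are 1
x_s = np.column_stack([gen1, gen2, interaction])
print(x_s)
```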

# Model
import pymc as pm
with pm.Model() as model_sf:
    α = pm.Normal("α", mu=0.0, sigma=10, shape=n_categories - 1)  # `sigma`, not `sd`, in PyMC v4+; -1 to leave a reference category
    α_f = pm.math.concatenate([[0], α])  # Adds 0 for the reference category

    # Predictors, except the reference
    β = pm.Normal("β", mu=0.0, sigma=10, shape=(3, n_categories - 1))  # 3 regressors, 2 non-reference categories
    β_f = pm.math.concatenate([np.zeros((3, 1)), β], axis=1)  # Adds a column of 0's for the reference category

    # Input data
    X = pm.Data("X", x_s)

    # Compute logits
    μ = α_f + pm.math.dot(X, β_f)

    # Convert logits to probabilities using softmax
    θ = pm.Deterministic("θ", pm.math.softmax(μ, axis=1))

    # Categorical observations
    yl = pm.Categorical("yl", p=θ, observed=y_s)
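(Side note on why the reference-category trick works: softmax is invariant to adding a constant to every logit, so pinning one category's logits at 0 just removes that redundancy without restricting the probabilities. A quick NumPy check of the construction, with made-up numbers:)

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))  # subtract max for numerical stability
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_obs, n_categories = 5, 3

# Free logits only for the non-reference categories
mu_free = rng.normal(size=(n_obs, n_categories - 1))
mu = np.column_stack([np.zeros(n_obs), mu_free])  # reference category logit pinned at 0

theta = softmax(mu, axis=1)
print(theta.sum(axis=1))  # each row is a valid probability vector
```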

I ran the same analysis in R, essentially using:
library(nnet)
modelo <- multinom(phenotype ~ genotype1 * genotype2, data=data)

and the results differ a lot, with R's results being closer to what I would expect given the data. The R model, for instance, gives me these results:

while the Bayesian model gives me

As you can see, neither the magnitude, the direction, nor the significance of the coefficients is similar.

I’ve found very few examples using categorical outcomes and predictors, and none with an interaction term. The most similar is this one, also with only one regressor (I’ve just asked him some questions to be able to replicate his model):

Could somebody kindly tell me what I’m doing wrong?

All the best,
Mauricio.

a-lobe_left.csv (408 Bytes)

What do prior predictive samples from your model look like?

Hi @Mauro, do you need/want to code this model in PyMC? Otherwise you can just use Bambi. See this example notebook Categorical Regression – Bambi.


As @tcapretto said, if you're a beginner at both PyMC and multinomial regression, I'd recommend starting with Bambi. These models are complex and have a lot of details you need to attend to.
Bambi will make a lot of those choices for you, which makes it a good first choice (one you can outgrow later if you want to :wink: )


Hi everyone,
thanks a lot for your suggestions. After being out of town for a few days, I did my homework.

The problem was that the model was misspecified. I had uploaded example data where I tried to predict a 4-level categorical outcome (phenotypes) from three categorical predictors (gen1, gen2, and their interaction), but my real data contains four categorical predictors: one control gene, two mutant genes, and the interaction between the two mutants. I was specifying the multinomial regression so that one outcome level was used as a reference category (fine), but I was using all four predictors (bad idea!).

Thus, I rewrote the same model this way:

# Categorical outcomes and the number of outcome categories:
y_s = pd.Categorical(df['phenotype']).codes  # integer codes 0..3, as pm.Categorical expects
n_categories = 4

# Categorical predictors. Notice that I left gen1 out of the x_s stack!
gen1 = df['gen1'].to_numpy()  # included here just for clarity about the specification
gen2 = df['gen2'].to_numpy()
gen3 = df['gen3'].to_numpy()
x_s = np.column_stack([gen2, gen3, gen2 * gen3])

# Model
with pm.Model() as model_sf:
    # Intercepts for non-reference categories (category 1 as reference)
    alpha = pm.Normal("alpha", mu=0.0, sigma=1.0, shape=n_categories - 1)
    alpha_ref = pm.math.concatenate([[0], alpha]) # Adds 0 for reference category

    # Regressors (3 regressors, as regressor 1 has been excluded)
    betas = pm.Normal("betas", mu=0.0, sigma=0.5, shape=(3, n_categories - 1))
    betas_ref = pm.math.concatenate([np.zeros((3, 1)), betas], axis=1) # Add columns of 0s for reference category
    
    X = pm.Data("X", x_s)
    mu = alpha_ref + pm.math.dot(X, betas_ref)
    theta = pm.Deterministic("theta", pm.math.softmax(mu, axis=1))
    yl = pm.Categorical("yl", p=theta, observed=y_s)

Now, the results of the model look way better than the R (frequentist) outcome. Qualitatively, they are the same regarding the direction and significance of the coefficients, but the latter yields some non-credible, very large or very small values (i.e., the odds ratios gave me some absurd probabilities). Another non-trivial advantage is that it is much easier to export the model summary to a csv, because PyMC gives it immediately as a dataframe.

@jessegrabowski the prior predictive samples look the same as those reported in this tutorial, which, as I mentioned, was the template for my code.

I played with the priors' sigmas and finally realised that those were the best values. Thanks for pointing that out. However, I have a (naive) question: in the lines

for θ_s in zip(logistic(a_s - 3 * b_s), logistic(a_s + 4 * b_s)):
    plt.plot([-3, 4], θ_s, color="k", alpha=0.2)

what is the logic behind passing the -3 and 4 values to the logistic arguments?

@tcapretto Thanks for your suggestion. I didn't know Bambi. I played with its models and tutorials for a while, and it's very nice and powerful.

@AlexAndorra Thanks also for your suggestion. I did look at its multinomial implementation, only to arrive at the same results, since my specification was wrong XD. I have some experience with pyjags (I arrived here after realising it hasn't been maintained for a couple of years), so I prefer the fight of writing my own models. PyMC gives a way better experience.

PS: Is there a way to edit the title of my question? It could be misleading, as in the end I got better results here than in R.

Best,
Mauricio.


Glad to hear you’re making progress!

Instead of removing the reference category, you could try pm.ZeroSumNormal for your priors. That builds in the n-1 degrees of freedom, so the marginal distribution of each component is normal, but jointly they always sum to zero, as the name promises.
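To see the gist, here's a tiny NumPy sketch — demeaning a normal draw is just one way to picture a zero-sum vector; it is not PyMC's actual ZeroSumNormal implementation, which uses a proper transform:

```python
import numpy as np

rng = np.random.default_rng(42)

# Sample a plain normal vector and project out its mean: the result
# always sums to zero, with one fewer effective degree of freedom.
n_categories = 4
raw = rng.normal(0.0, 1.0, size=n_categories)
alpha_zerosum = raw - raw.mean()

print(alpha_zerosum.sum())  # ~0 up to floating point
```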

If you want me to edit the title, you can propose a new one, but personally I think it's OK that it reflects the state you started in (where R was better, and then you tuned the model and got nice results).
