Softmax Regression with shared coefficients between classes

I want to build a softmax regression model in which the coefficients are shared across classes, and what varies between classes are the input features. I could only make it work (mostly; the intercept is still problematic) for two classes, using a logistic regression model. I am, however, failing to reproduce it with what I thought would be the equivalent softmax regression model. I would like to figure this out before attempting a model with more classes.
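
Concretely, with a single coefficient vector $\beta$ shared by all classes and class-specific feature vectors $x_{k,i}$, the model I have in mind is

$$P(y_i = k) = \frac{\exp(x_{k,i} \cdot \beta)}{\sum_j \exp(x_{j,i} \cdot \beta)}.$$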

Here is how I generate the data:

import numpy as np
import pymc3 as pm
import theano.tensor as tt
from scipy.special import softmax
from statsmodels.tools import add_constant

n = 100
betas = [0, 2, 0]  # true coefficients, intercept first

# class-specific features (constant column plus two predictors), shared coefficients
x1 = np.random.rand(n * (len(betas) - 1)) * 5 - 2.5
x1 = x1.reshape(n, len(betas) - 1)
x1 = add_constant(x1)
mu1 = np.dot(x1, betas)

x2 = np.random.rand(n * (len(betas) - 1)) * 5 - 2.5
x2 = x2.reshape(n, len(betas) - 1)
x2 = add_constant(x2)
mu2 = np.dot(x2, betas)

# class probabilities from the two linear predictors, then one draw per observation
softprobs = [softmax([t1, t2]) for t1, t2 in zip(mu1, mu2)]
y = [np.random.choice((0, 1), p=p) for p in softprobs]
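
A quick sanity check on the simulated data (not essential, just to confirm the setup: each row of softprobs is a probability vector, and by the symmetry of x1 and x2 the classes come out roughly balanced):

probs = np.asarray(softprobs)
assert np.allclose(probs.sum(axis=1), 1.0)  # each row sums to 1
print(np.mean(y))                           # roughly 0.5 by symmetry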

Here is my almost-working logistic regression model (the intercept is not identified):
with pm.Model() as m1:
    b = pm.Normal('b', 0, 20, shape=len(betas))
    mu1 = pm.math.dot(x1, b)
    mu2 = pm.math.dot(x2, b)
    theta = pm.math.sigmoid(mu2 - mu1)  # P(y = 1) as a contrast of the two predictors
    yl = pm.Bernoulli('y', p=theta, observed=y)
    trace_m1 = pm.sample(1000)

Here is what I expected to be the equivalent Softmax model:
with pm.Model() as m2:    
    b = pm.Normal('b', 0, 20, shape=len(betas))    
    mu1 = pm.math.dot(x1, b)
    mu2 = pm.math.dot(x2, b)
    mus = tt.transpose(tt.stack((mu1, mu2)))
    theta = tt.nnet.softmax(mus)
    yl = pm.Categorical('y', p=theta, observed=y)
    trace_m2 = pm.sample(1000)

I am, however, able to reproduce the results of the first model if I simply use one column of the softmax output with a Bernoulli likelihood for the data:
with pm.Model() as m3:    
    b = pm.Normal('b', 0, 20, shape=len(betas))    
    mu1 = pm.math.dot(x1, b)
    mu2 = pm.math.dot(x2, b)
    mus = tt.transpose(tt.stack((mu1, mu2)))
    theta = tt.nnet.softmax(mus)
    yl = pm.Bernoulli('y', p=theta[:, 1], observed=y)
    trace_m3 = pm.sample(1000)
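
Column 1 of the two-class softmax (theta[:, 1]) is exactly the logistic of the contrast, which is why this reproduces m1:

$$\mathrm{softmax}(\mu_1, \mu_2)_2 = \frac{e^{\mu_2}}{e^{\mu_1} + e^{\mu_2}} = \frac{1}{1 + e^{-(\mu_2 - \mu_1)}} = \sigma(\mu_2 - \mu_1).$$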

Any idea what is going on? How might I fix the softmax model (as well as the unidentified intercept in the logistic model)?

I am grateful for any input you can give.

I would strongly recommend trying to generate data with pymc3 first, since doing so catches niggling little shape errors in the model.

To be pedantic, this fails:

import pymc3 as pm
import theano
import theano.tensor as tt

x1 = [1., 2., 3., 4., 5.]
x2 = [5., 5., 0., 0., -3.]
x3 = [0., -2, 3., 6., 0.]

with pm.Model() as m2:    
    b = pm.Normal('b', 0, 20, shape=len(x1))    
    mu1 = pm.math.dot(x1, b)
    mu2 = pm.math.dot(x2, b)
    mu3 = pm.math.dot(x3, b)
    mus = tt.transpose(tt.stack((mu1, mu2, mu3)))
    theta = pm.Deterministic('theta', tt.nnet.softmax(mus))
    yl = pm.Categorical('y', p=theta)
    spp = pm.sample_prior_predictive(10)

but the error suggests that there is a size issue. Removing yl and rerunning shows that

spp['theta'].shape

(10, 1, 3)
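
The extra middle dimension is theano's doing: as far as I can tell, tt.nnet.softmax always returns a matrix, promoting a 1-d input to a 1 x n row. A minimal check:

import theano.tensor as tt

v = tt.vector('v')
print(tt.nnet.softmax(v).ndim)  # 2: the vector comes back as a 1 x n matrix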

So theta is becoming a (1,3) matrix instead of a vector. Reshaping it leads to:

with pm.Model() as m2:    
    b = pm.Normal('b', 0, 20, shape=len(x1))    
    mu1 = pm.math.dot(x1, b)
    mu2 = pm.math.dot(x2, b)
    mu3 = pm.math.dot(x3, b)
    mus = tt.transpose(tt.stack((mu1, mu2, mu3)))
    theta = pm.Deterministic('theta', tt.nnet.softmax(mus)).reshape((3,))
    yl = pm.Categorical('y', p=theta)
    spp = pm.sample_prior_predictive(10)

which works fine. The shape of theta under the prior predictive also tells you what to do when you lump your observations together:

x1 = np.random.uniform(size=(8,4))
x2 = np.random.uniform(-1, 1, size=(8, 4))
x3 = np.random.uniform(0,2,size=(8,4))

with pm.Model() as m2:    
    b = pm.Normal('b', 0, 20, shape=x1.shape[1])    
    mu1 = pm.math.dot(x1, b)
    mu2 = pm.math.dot(x2, b)
    mu3 = pm.math.dot(x3, b)
    mus = tt.transpose(tt.stack((mu1, mu2, mu3)))
    theta = pm.Deterministic('theta', tt.nnet.softmax(mus)).reshape((x1.shape[0], 3))
    yl = pm.Categorical('y', p=theta, shape=x1.shape[0])
    spp = pm.sample_prior_predictive(10)

I appreciate the warning about shape handling, but in my case that does not seem to be the issue: PyMC is able to infer the shapes correctly when I use my original, multidimensional X inputs.

In hindsight, it makes sense that the intercept would not be identified, since in this case it simply does not matter: one of the two options is always chosen, and adding the same constant to both linear predictors leaves the probabilities unchanged.
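
Explicitly, with the add_constant column in both design matrices, the shared intercept $\beta_0$ cancels in the contrast:

$$\sigma\big((\beta_0 + x_2 \beta) - (\beta_0 + x_1 \beta)\big) = \sigma(x_2 \beta - x_1 \beta),$$

so no amount of data can pin it down.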

My problem remains, though: I am still unable to replicate the results with the Categorical model (model 2) :frowning:

Apologies, I hit the shape error and assumed that was your problem :blush:

You can recover the appropriate behavior by using pm.Multinomial in place of pm.Categorical:

y_mn = np.array([[0, 1] if t == 0 else [1, 0] for t in y])  # one-hot encoding of y

with pm.Model() as m1:
    b = pm.Normal('b', 0, 20, shape=x1.shape[1])
    mu1 = pm.math.dot(x1, b)
    mu2 = pm.math.dot(x2, b)    
    theta = pm.Deterministic('theta_contrast', pm.math.sigmoid(mu2-mu1))
    yl = pm.Bernoulli('y_brn', p=theta, observed=y)

    trace1 = pm.sample(1000, tune=1000)
    
with pm.Model() as m2:
    b = pm.Normal('b', 0, 20, shape=x1.shape[1])
    mu1 = pm.math.dot(x1, b)
    mu2 = pm.math.dot(x2, b)
    
    mus = tt.transpose(tt.stack((mu1, mu2)))
    theta2 = pm.Deterministic('theta_sm', tt.nnet.softmax(mus))
    yl2 = pm.Categorical('y_cat', p=theta2, observed=y)
    trace2 = pm.sample(1000, tune=1000)
    
with pm.Model() as m3:
    b = pm.Normal('b', 0, 20, shape=x1.shape[1])
    mu1 = pm.math.dot(x1, b)
    mu2 = pm.math.dot(x2, b)

    mus = tt.transpose(tt.stack((mu1, mu2)))
    theta3 = pm.Deterministic('theta_sm', tt.nnet.softmax(mus))
    yl3 = pm.Multinomial('y_mn', n=1, p=theta3, observed=y_mn)
    trace3 = pm.sample(1000, tune=1000)

pm.traceplot(trace1, ['b']);
pm.traceplot(trace2, ['b']);
pm.traceplot(trace3, ['b']);

Here, b = np.array([20, 10, -15, -35]) is the true coefficient vector used to simulate the data.

[traceplots of b for the Bernoulli, Categorical, and Multinomial models]

Actually, it looks like my coding of y_mn is reversed, since the recovered b comes out sign-flipped; but you get the idea.

@lucianopaz
I think something is wrong with the way pm.Categorical infers the shape of the distribution versus the shape of the observed variable, or else the docs need to be more explicit about this case, where it clearly differs from Multinomial(n=1, p).


@chartl Thank you so much, that definitely fixed it!

Is there any theoretical reason why pm.Categorical should fail in this case?