Implementing softmax regression with a latent Gaussian variable

I’m trying to implement softmax regression with a Gaussian latent representation. The issue I’m facing is purely technical rather than theoretical, but here is a brief mathematical introduction (you might as well skip it and go straight to the MCVE implementation).


Consider a D-part probability vector p. Since vector p is constrained to an Aitchison simplex \mathcal{S}^{D}, we can map p onto an orthonormal basis in \mathbb{R}^{D-1}.

f: \mathcal{S}^{D} \rightarrow \mathbb{R}^{D-1} \\ f(p) = \log(p) \, {\Omega}^{T}

where \Omega is a contrast matrix with D-1 rows and D columns. This transformation is an isometry on \mathcal{S}^{D}, so we can parametrise p using a latent Gaussian model, e.g.

\alpha \sim \mathcal{N}(0, 1) \\ \beta \sim \mathcal{N}(0, 1) \\ \sigma \sim \left | \mathcal{N}(0, 1) \right | \\ l \sim \mathcal{N}(x \cdot \beta + \alpha, {\sigma}^{2}) \\ p = {f}^{-1}(l) \\

where \alpha, \beta and \sigma are (D-1)-dimensional vectors and x is a predictor variable.
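
For concreteness, here is a small NumPy sketch of the forward and inverse maps; the contrast matrix is the rounded one used in the MCVE below, so the round trip only recovers p approximately.

import numpy as np

def closure(x):
    # renormalise each row so it sums to one
    return x / x.sum(1)[:, None]

# rounded orthonormal contrast matrix for D = 5 (same as in the MCVE below)
omega = np.array(
    [[ 0.55,  0.55, -0.37, -0.37, -0.37],
     [ 0.  ,  0.  , -0.41, -0.41,  0.82],
     [ 0.  ,  0.  , -0.71,  0.71,  0.  ],
     [-0.71,  0.71,  0.  ,  0.  ,  0.  ]]
)

p = np.array([[0.18, 0.2, 0.21, 0.17, 0.23]])  # one point on the simplex
l = np.log(p) @ omega.T                         # forward map f(p), shape (1, 4)
p_back = closure(np.exp(l @ omega))             # inverse map, approximately recovers p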


Here is an MCVE implementation:

import numpy as np
import theano
import pymc3 as pm


def closure(x):
    return x / x.sum(1)[:, None]


samples = np.array(
    [[0.18, 0.2 , 0.21, 0.17, 0.23],
     [0.2 , 0.2 , 0.2 , 0.17, 0.23],
     [0.2 , 0.2 , 0.2 , 0.18, 0.21],
     [0.18, 0.19, 0.2 , 0.16, 0.27],
     [0.18, 0.21, 0.21, 0.19, 0.21],
     [0.19, 0.19, 0.19, 0.17, 0.26],
     [0.2 , 0.2 , 0.19, 0.18, 0.23],
     [0.2 , 0.2 , 0.2 , 0.17, 0.23],
     [0.17, 0.18, 0.17, 0.18, 0.29],
     [0.18, 0.18, 0.17, 0.17, 0.29]]
)
omega = np.array(
    [[ 0.55,  0.55, -0.37, -0.37, -0.37],
     [ 0.  ,  0.  , -0.41, -0.41,  0.82],
     [ 0.  ,  0.  , -0.71,  0.71,  0.  ],
     [-0.71,  0.71,  0.  ,  0.  ,  0.  ]]
)
x = theano.shared(np.arange(10))

with pm.Model() as model:
    alpha = pm.Normal('alpha', mu=0, sd=1, shape=omega.shape[0])
    beta = pm.Normal('beta', mu=0, sd=1, shape=omega.shape[0])
    sigma = pm.HalfNormal('sigma', sd=1, shape=omega.shape[0])
    mu = x[:, None] * beta + alpha
    latent = pm.Normal('latent', mu=mu, sd=sigma)
    probs = closure(pm.math.exp(pm.math.dot(latent, omega)))
    y = pm.Categorical('y', p=probs, observed=samples)
    trace = pm.sample()

Unfortunately, this raises the following error.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-0853e19afaf7> in <module>
     24     sigma = pm.HalfNormal('sigma', sd=1, shape=omega.shape[0])
     25     mu = x[:, None] * beta + alpha
---> 26     latent = pm.Normal('latent', mu=mu, sd=sigma)
     27     probs = pm.math.exp(latent) / pm.math.exp(latent).sum(1)
     28     y = pm.Categorical('y', p=probs, observed=samples)
...
TypeError: For compute_test_value, one input test value does not have the requested type.

The error when converting the test value to that variable type:
Wrong number of dimensions: expected 0, got 2 with shape (10, 4).

As far as I can tell, you can’t pass 2D arrays to parametrise stochastic variables unless they are observed. I can easily solve the problem by moving the softmax part out of the model:

with pm.Model() as model:
    alpha = pm.Normal('alpha', mu=0, sd=1, shape=omega.shape[0])
    beta = pm.Normal('beta', mu=0, sd=1, shape=omega.shape[0])
    sigma = pm.HalfNormal('sigma', sd=1, shape=omega.shape[0])
    mu = x[:, None] * beta + alpha
    latent = pm.Normal('latent', mu=mu, sd=sigma, observed=(np.log(samples) @ omega.T))
    trace = pm.sample()

posterior_samples = pm.sample_posterior_predictive(trace, samples=100, model=model)
latent_hat = np.mean(posterior_samples['latent'], axis=0)
y_hat = np.exp(latent_hat @ omega) / np.exp(latent_hat @ omega).sum(1)[:, None]

This works, but I’d like to have it in the model itself.

The only solution I’ve found so far is to implement a custom distribution encapsulating these steps:

latent = pm.Normal('latent', mu=mu, sd=sigma)
probs = closure(pm.math.exp(pm.math.dot(latent, omega)))
y = pm.Categorical('y', p=probs, observed=samples)

Is there an easier/cleaner way? Technically speaking, I only need latent = pm.Normal('latent', mu=mu, sd=sigma) to sample a Gaussian.

Specifying the shape should do the trick, but you do need to check the shapes carefully to make sure they are correct:

with pm.Model() as model:
    alpha = pm.Normal('alpha', mu=0, sd=1, shape=omega.shape[0])
    beta = pm.Normal('beta', mu=0, sd=1, shape=omega.shape[0])
    sigma = pm.HalfNormal('sigma', sd=1, shape=omega.shape[0])
    mu = x[:, None] * beta + alpha
    latent = pm.Normal('latent', mu=mu, sd=sigma, shape=(10, 4))
    exp_logits = pm.math.exp(pm.math.dot(latent, omega))
    probs = exp_logits / exp_logits.sum(1, keepdims=True)
    y = pm.Categorical('y', p=probs, observed=samples)
    trace = pm.sample()

However, this will not work as is, because the samples you defined above are a float array, while Categorical expects the observed values to be an integer array of category indices (in this case, a length-10 array).
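
Purely to illustrate the format pm.Categorical expects (argmax is used here only for demonstration; it is not a meaningful way to summarise the compositional data above):

import numpy as np

# pm.Categorical wants one integer label per observation, not a row of probabilities
labels = samples.argmax(axis=1)   # shape (10,), integer dtype
print(labels.shape, labels.dtype)
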
Also, normalizing random variables (i.e., the softmax in your model) makes your model unidentifiable - you might find this discussion helpful: Multinomial hierarchical regression with multiple observations per group ("Bad energy issue")
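
To see the kind of unidentifiability being referred to, note that a softmax is invariant to shifting all of its inputs by the same constant; this is a generic illustration, not specific to the ilr-based model above:

import numpy as np

def softmax(v):
    # numerically stable softmax
    e = np.exp(v - v.max())
    return e / e.sum()

z = np.array([0.3, -1.2, 0.5])
# shifting every logit by the same amount leaves the probabilities unchanged
print(np.allclose(softmax(z), softmax(z + 10.0)))  # True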

  1. I suppose specifying a shape for latent means that any future input must follow this shape. Is this the case?
  2. I had clearly misunderstood the distinction between pm.Multinomial and pm.Categorical, because I thought the latter was simply a normalised variant of the former. I’ve updated the maths in my question to make it clear. Since my original data are count vectors, I’ve reformulated the problem in terms of multinomial regression by adding a shared variable with trial sizes.
def closure(x):
    return x / x.sum(1)[:, None]


samples = np.array(
    [[287, 319, 335, 271, 367],
     [306, 306, 306, 260, 352],
     [295, 295, 295, 265, 309],
     [285, 301, 317, 253, 428],
     [184, 214, 214, 194, 214],
     [289, 289, 289, 258, 395],
     [320, 320, 304, 288, 368],
     [238, 238, 238, 203, 274],
     [341, 361, 341, 361, 581],
     [278, 278, 262, 262, 447]]
)
omega = np.array(
    [[ 0.55,  0.55, -0.37, -0.37, -0.37],
     [ 0.  ,  0.  , -0.41, -0.41,  0.82],
     [ 0.  ,  0.  , -0.71,  0.71,  0.  ],
     [-0.71,  0.71,  0.  ,  0.  ,  0.  ]]
)

c = theano.shared(samples.sum(1))
x = theano.shared(np.arange(10))


with pm.Model() as model:
    alpha = pm.Normal('alpha', mu=0, sd=1, shape=omega.shape[0])
    beta = pm.Normal('beta', mu=0, sd=1, shape=omega.shape[0])
    sigma = pm.HalfNormal('sigma', sd=1, shape=omega.shape[0])
    mu = x[:, None] * beta + alpha
    latent = pm.Normal('latent', mu=mu, sd=sigma, shape=(10, 4))
    probs = closure(pm.math.exp(pm.math.dot(latent, omega)))
    y = pm.Multinomial('y', n=c, p=probs, observed=samples)
    trace = pm.sample()

This model compiles, but I get the ParallelSamplingError (Bad initial energy) you’ve mentioned. My case is different from the simple normalisation in the thread you’ve linked: I use an isometry to map probability vectors onto an orthonormal basis (and model them as latent Gaussians) and back to the simplex, but I guess any type of closure triggers this issue.

Yes

As long as the mapping doesn’t introduce unidentifiability, your model should run. The FAQ has tips on how to deal with the bad energy error that you can take a look at.
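
For instance, one of the first checks suggested there is to look at the log-probability of each term at the model's starting point; a minimal sketch, assuming the model object defined above:

# per-variable log-probability at the test point;
# -inf or nan entries point at the term responsible for the bad initial energy
print(model.check_test_point())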