Implementing softmax regression with a latent Gaussian variable

grayfall · May 23, 2019, 10:50am

I’m trying to implement softmax regression with a Gaussian latent representation. The issue I’m facing is purely technical, rather than theoretical, but here is a brief mathematical introduction (you might as well skip it and go straight to the the MCVE implementation).

Consider a D-part probability vector p. Since vector p is constrained to an Aitchison simplex \mathcal{S}^{D}, we can map p onto an orthonormal basis in \mathbb{R}^{D-1}.

f(p): \mathcal{S}^{D} \rightarrow \mathbb{R}^{D-1} \\ f(p) = \log(p) \times {\Omega}^{T}

Where \Omega is a contrast matrix of D-1 rows and D columns. This transformation is an isometry on \mathcal{S}^{D}, therefore we can parametrise p using a latent Gaussian model, e.g.

\alpha \sim \mathcal{N}(0, 1) \\ \beta \sim \mathcal{N}(0, 1) \\ \sigma \sim \left | \mathcal{N}(0, 1) \right | \\ l \sim \mathcal{N}(x \cdot \beta + \alpha, {\sigma}^{2}) \\ p = {f}^{-1}(l) \\

where \alpha, \beta and \sigma are (D-1)-dimensional vectors and x is a predictor variable.

Here is an MCVE implementation

import numpy as np
import theano
import pymc3 as pm


def closure(x):
    return x / x.sum(1)[:, None]


samples = np.array(
    [[0.18, 0.2 , 0.21, 0.17, 0.23],
     [0.2 , 0.2 , 0.2 , 0.17, 0.23],
     [0.2 , 0.2 , 0.2 , 0.18, 0.21],
     [0.18, 0.19, 0.2 , 0.16, 0.27],
     [0.18, 0.21, 0.21, 0.19, 0.21],
     [0.19, 0.19, 0.19, 0.17, 0.26],
     [0.2 , 0.2 , 0.19, 0.18, 0.23],
     [0.2 , 0.2 , 0.2 , 0.17, 0.23],
     [0.17, 0.18, 0.17, 0.18, 0.29],
     [0.18, 0.18, 0.17, 0.17, 0.29]]
)
omega = np.array(
    [[ 0.55,  0.55, -0.37, -0.37, -0.37],
     [ 0.  ,  0.  , -0.41, -0.41,  0.82],
     [ 0.  ,  0.  , -0.71,  0.71,  0.  ],
     [-0.71,  0.71,  0.  ,  0.  ,  0.  ]]
)
x = theano.shared(np.arange(10))

def closure(x):
    return x / x.sum(1)[:, None]

with pm.Model() as model:
    alpha = pm.Normal('alpha', mu=0, sd=1, shape=omega.shape[0])
    beta = pm.Normal('beta', mu=0, sd=1, shape=omega.shape[0])
    sigma = pm.HalfNormal('sigma', sd=1, shape=omega.shape[0])
    mu = x * beta + alpha
    latent = pm.Normal('latent', mu=mu, sd=sigma)
    probs = closure(pm.math.exp(pm.math.dot(latent, omega)))
    y = pm.Categorical('y', p=probs, observed=samples)
    trace = pm.sample()

Unfortunately, I get raises the following error.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-0853e19afaf7> in <module>
     24     sigma = pm.HalfNormal('sigma', sd=1, shape=omega.shape[0])
     25     mu = x[:, None] * beta + alpha
---> 26     latent = pm.Normal('latent', mu=mu, sd=sigma)
     27     probs = pm.math.exp(latent) / pm.math.exp(latent).sum(1)
     28     y = pm.Categorical('y', p=probs, observed=samples)
...
TypeError: For compute_test_value, one input test value does not have the requested type.

The error when converting the test value to that variable type:
Wrong number of dimensions: expected 0, got 2 with shape (10, 4).

As far as I can tell, you can’t pass 2D arrays to parametrise stochastic variables, unless they are observable. I can easily solve the problem by moving the softmax part out of the model

with pm.Model() as model:
    alpha = pm.Normal('alpha', mu=0, sd=1, shape=omega.shape[0])
    beta = pm.Normal('beta', mu=0, sd=1, shape=omega.shape[0])
    sigma = pm.HalfNormal('sigma', sd=1, shape=omega.shape[0])
    mu = x[:, None] * beta + alpha
    latent = pm.Normal('latent', mu=mu, sd=sigma, observed=(np.log(samples) @ omega.T))
    trace = pm.sample()

posterior_samples = pm.sample_posterior_predictive(trace, samples=100, model=model)
latent_hat = np.mean(posterior_samples['latent'], axis=0)
y_hat = np.exp(latent_hat @ omega) / np.exp(latent_hat @ omega).sum(1)[:, None]

, but I’d like to have it in the model itself.

grayfall · May 24, 2019, 10:41pm

The only solution I’ve found so far is to implement a custom distribution encapsulating these steps:

latent = pm.Normal('latent', mu=mu, sd=sigma)
probs = pm.math.dot(omega, pm.math.exp(latent)) / pm.math.exp(latent).sum(1)
y = pm.Categorical('y', p=probs, observed=samples)

Is there an easier/cleaner way? Technically speaking, I only need latent = pm.Normal('latent', mu=mu, sd=sigma) to sample a Gaussian.

junpenglao · May 25, 2019, 5:40am

Specifying the shape should do the tricks, but you do need to be careful checking them to make sure they are correct:

with pm.Model() as model:
    alpha = pm.Normal('alpha', mu=0, sd=1, shape=omega.shape[0])
    beta = pm.Normal('beta', mu=0, sd=1, shape=omega.shape[0])
    sigma = pm.HalfNormal('sigma', sd=1, shape=omega.shape[0])
    mu = x[:, None] * beta + alpha
    latent = pm.Normal('latent', mu=mu, sd=sigma, shape=(10, 4))
    probs = pm.math.dot(pm.math.exp(latent), omega) / pm.math.exp(latent).sum(1, keepdims=True)
    y = pm.Categorical('y', p=probs, observed=samples)
    trace = pm.sample()

However, this will not work currently, as samples you defined above is float type array, and categorical accept observed being one-hot encoded int array (in this case, a length 10 array)
Also, normalizing random variables (i.e., the softmax in your model) makes your model unidentifiable - you might find this discussion here helpful: Multinomial hierarchical regression with multiple observations per group ("Bad energy issue")

grayfall · May 25, 2019, 2:09pm

I suppose, specifying a shape in latent means that any future input must follow this shape. Is this the case?
I had clearly misunderstood the distinction between pm.Multinomial and pm.Categorical, because I thought the latter was simply a normalised variant of the former. I’ve updated the maths in my question to make it clear. Since my original data are count vectors, I’ve reformulated the problem in terms of multinomial regression by adding a shared variable with trial sizes.

def closure(x):
    return x / x.sum(1)[:, None]


samples = np.array(
    [[287, 319, 335, 271, 367],
     [306, 306, 306, 260, 352],
     [295, 295, 295, 265, 309],
     [285, 301, 317, 253, 428],
     [184, 214, 214, 194, 214],
     [289, 289, 289, 258, 395],
     [320, 320, 304, 288, 368],
     [238, 238, 238, 203, 274],
     [341, 361, 341, 361, 581],
     [278, 278, 262, 262, 447]]
)
omega = np.array(
    [[ 0.55,  0.55, -0.37, -0.37, -0.37],
     [ 0.  ,  0.  , -0.41, -0.41,  0.82],
     [ 0.  ,  0.  , -0.71,  0.71,  0.  ],
     [-0.71,  0.71,  0.  ,  0.  ,  0.  ]]
)

c = theano.shared(samples.sum(1))
x = theano.shared(np.arange(10))


with pm.Model() as model:
    alpha = pm.Normal('alpha', mu=0, sd=1, shape=omega.shape[0])
    beta = pm.Normal('beta', mu=0, sd=1, shape=omega.shape[0])
    sigma = pm.HalfNormal('sigma', sd=1, shape=omega.shape[0])
    mu = x * beta + alpha
    latent = pm.Normal('latent', mu=mu, sd=sigma, shape=(10, 4))
    probs = closure(pm.math.dot(pm.math.exp(latent), omega))
    y = pm.Multinomial('y', n=c, p=probs, observed=samples)
    trace = pm.sample()

This model compiles, but I get the ParallelSamplingError: Bad initial energy error you’ve mentioned. My case is different from the simple normalisation you’ve linked. I use an isometry to map probability vectors onto an orthonormal basis (and model them as latent Gaussians) and back to the simplex, but I guess any type of closure triggers this issue.

junpenglao · May 26, 2019, 6:28am

Yes

As long as the mapping doesnt introduce unidentifiability, your model should run. In the FAQ there is tips on how to deal with bad energy error you can take a look at.

Topic		Replies	Views
Implementing Softmax Regresion Questions	3	2236	July 9, 2018
Multidimensional input using Gaussian Process Questions	6	3994	June 28, 2017
Multivariate linear regression with different prior distributions for coefficients v3 modeling	0	536	May 7, 2022
GP Latent implmenentation in PyMC version agnostic gaussian_process	4	68	February 23, 2025
Matrix operations involving observed and latent variables v3	9	922	March 14, 2022

Implementing softmax regression with a latent Gaussian variable

Related topics