Unique solution for probabilistic PCA

I’m tinkering around with probabilistic PCA and I’m having all kinds of convergence problems. I didn’t add any constraints to the model to guarantee a unique solution for the factor matrix, so I’m wondering whether that causes multimodality or something like that. Does anyone have any experience with this?
Example code is attached, thanks!
PPCA.py (1.4 KB)

Well yes, the joys of Bayesian PCA. This is a well-known problem: Bayesian PCA has no unique solution. Without constraints, the solutions are at best symmetric, at worst identical under any rotation, and in any case subject to label switching.
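For concreteness, the rotation issue comes from the fact that the marginal covariance in PPCA only involves the factor matrix W through W Wᵀ, so any orthogonal rotation of W fits the data equally well:

$$
C = W W^{\top} + \sigma^{2} I = (W R)(W R)^{\top} + \sigma^{2} I
\qquad \text{for any orthogonal } R, \; R R^{\top} = I .
$$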

If you plan to apply this to real-life data, one approach that I have found successful is to avoid the rotation problem by forcing the factor matrix to be very sparse. I have used ‘witch hat’ priors to that effect (a mixture of two Gaussians, one with very large sigma, one with near-zero sigma). This forces the model to find an ‘optimal’ rotation under which the factor matrix is mostly made up of zeros (a desirable property for most analyses).
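As a rough, untested sketch of what such a prior can look like in pymc3 (the shapes, sigmas and variable names here are made up purely for illustration):

import numpy as np
import pymc3 as pm

# Untested sketch of the 'witch hat' idea: every entry of the factor matrix W
# gets a zero-centred mixture of a very narrow normal (the "spike", pulling
# loadings towards 0) and a very wide one (the "brim").
n_obs, n_features, n_factors = 200, 10, 3
X = np.random.randn(n_obs, n_features)          # stand-in data

with pm.Model() as sparse_ppca:
    W = pm.NormalMixture('W',
                         w=np.array([0.5, 0.5]),
                         mu=np.zeros(2),
                         sd=np.array([0.01, 10.0]),   # near-0 and very large sigma
                         shape=(n_features, n_factors))
    z = pm.Normal('z', 0., 1., shape=(n_obs, n_factors))       # latent scores
    noise = pm.HalfCauchy('noise', 1.)
    pm.Normal('X_obs', mu=pm.math.dot(z, W.T), sd=noise, observed=X)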

This doesn’t solve the symmetry and component-identity (label switching) issues, however. The most elegant solution to those is to manually change the signs and the order of the columns after inference, so that the factor matrices (and scores) match across chains, and then re-run the convergence tests on the altered matrices.

Finally, if you’re interested in large-scale approximation, I have found some success in using ADVI with the method described in this paper.

Happy to elaborate on any of the above or point you towards literature (specific or general) if you have specific questions/cases.

To summarise: by design, your example will not (and should not) converge.


Thanks, that’s all very interesting. I had just been exploring Bayesian PCA and Bayesian factor analysis after reading this paper on causal inference. So I don’t have any urgent real-life application for this, but it’s fascinating to learn about.

I’m trying to wrap my mind around this. By manually, do you literally mean manually looking through the traces/chains and determining where label switching occurred, or is there an automated way to do this? Can you point me to where I can read more, or if you’re feeling generous, amend @twiecki’s example to illustrate what you mean? (When I run that example with multiple chains, I get label switching, as you might expect)

Thanks in advance for any guidance you can provide.

If the factors are sufficiently contrasted, label and sign switching will typically only occur across chains, not within a single trace. If they’re not, it will tend to just be one big mush. So yes, I do mean literally manually matching, reordering and sign-correcting the factors in the matrices across traces in order to apply convergence tests. Treating them this way implies there is some convergence to begin with, so the formal tests are really mostly a formality at that point. It does, however, allow you to pinpoint which specific factors (weights) might be more or less stable, which in turn lets you select those data points that produce more contrasted/telling factors.
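To illustrate, the manual step essentially amounts to something like this (a throwaway sketch written here from scratch, not something from a library):

import numpy as np

# Hypothetical helper: greedily match the columns of one chain's posterior-mean
# loadings to a reference chain's, by absolute correlation, and flip signs so
# the matched columns correlate positively.
def align_factors(W_ref, W_chain):
    """W_ref, W_chain: arrays of shape (n_features, n_factors)."""
    n_factors = W_ref.shape[1]
    order, signs = [], []
    available = list(range(n_factors))
    for k in range(n_factors):
        corrs = [np.corrcoef(W_ref[:, k], W_chain[:, j])[0, 1] for j in available]
        best = int(np.argmax(np.abs(corrs)))
        signs.append(1.0 if corrs[best] >= 0 else -1.0)
        order.append(available.pop(best))
    return W_chain[:, order] * np.array(signs)

# After aligning every chain to, say, chain 0, apply the same permutation and
# sign flips to the per-draw samples and re-run R-hat etc. on the aligned draws.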


got it, thanks!

Hi @gBokiau

Do you have any examples of how to implement the methods described in the paper in pymc3?
I can’t seem to reconcile what’s in the paper with a model in which I’ve encountered symmetry.

I’ll keep trying to figure it out in the meantime.

Thank you!

Hi @dycontri

To be clear, I presume you refer to the VI methods described in the article by Moore?

I never got around to implementing the orthogonal symmetrisation method; however, the sign-flip method requires but a small change to the MeanField approximation.

From memory (this should give you the gist; it might even just work as is):

import pymc3 as pm
from pymc3.variational.opvi import Group, node_property
from pymc3.variational.approximations import MeanFieldGroup


@Group.register
class SymmetricMeanFieldGroup(MeanFieldGroup):
    __param_spec__ = dict(smu=('d', ), rho=('d', ))
    short_name = 'sym_mean_field'
    alias_names = frozenset(['smf'])

    @node_property
    def mean(self):
        return self.params_dict['smu']

    @node_property
    def symbolic_logq_not_scaled(self):
        # q(z) is intended to be a 50/50 mixture of N(mu, sigma) and N(-mu, sigma)
        z = self.symbolic_random
        logq = pm.NormalMixture.dist([.5, .5],
                                     mu=[self.mean, -self.mean],
                                     sd=[self.std, self.std]).logp(z)
        return logq.sum(range(1, logq.ndim))

That looks really neat! So happy to see that it is extensible and easy to implement! The Groups API really helps here as well, since you don’t have sign-flip symmetry in general.

I think this is missing from the SMF implementation.

  • Variable creation is also not correctly implemented; I see it hardcoded in the mean-field implementation.

Yes, that’s exactly what I was looking for! I appreciate your help!

I have been using pymc3 for quite a while but hadn’t heard of the “Group” API.
I’ll take a stab at it with what you’ve provided.

@gBokiau, @ferrine

Ok, I built a new example without the pymc3 import *, and that fixed the previous issue.
I had to modify the NormalMixture class to get rid of the call to get_tau_sd(), as it was causing an error and shouldn’t be necessary since I’m providing sd, not tau.

Here’s the code:

import theano.tensor as tt
# note: get_variable_name (used in _repr_latex_) and update_start_vals (used in
# create_shared_params below) also need to be imported from pymc3's internals

class NormalMixture(pm.Mixture):
    def __init__(self, w, mu, comp_shape=(), *args, **kwargs):
        sd=kwargs.pop('sd',None)
        self.mu = mu = tt.as_tensor_variable(mu)
        self.sd = sd = tt.as_tensor_variable(sd)

        super(NormalMixture, self).__init__(w, pm.Normal.dist(mu, sd=sd, shape=comp_shape),
                                            *args, **kwargs)

    def _repr_latex_(self, name=None, dist=None):
        if dist is None:
            dist = self
        mu = dist.mu
        w = dist.w
        sd = dist.sd
        name = r'\text{%s}' % name
        return r'${} \sim \text{{NormalMixture}}(\mathit{{w}}={},~\mathit{{mu}}={},~\mathit{{sigma}}={})$'.format(name,
                                                get_variable_name(w),
                                                get_variable_name(mu),
                                                get_variable_name(sd))

@Group.register
class SymmetricMeanFieldGroup(MeanFieldGroup):
    """Symmetric MeanField Group for Symmetrized VI"""
    __param_spec__ = dict(smu=('d', ), rho=('d', ))
    short_name = 'sym_mean_field'
    alias_names = frozenset(['smf'])
    
    @node_property
    def mean(self):
        return self.params_dict['smu']

    def create_shared_params(self, start=None):
        if start is None:
            start = self.model.test_point
        else:
            start_ = start.copy()
            update_start_vals(start_, self.model.test_point, self.model)
            start = start_
        if self.batched:
            start = start[self.group[0].name][0]
        else:
            start = self.bij.map(start)
        rho = np.zeros((self.ddim,))
        if self.batched:
            start = np.tile(start, (self.bdim, 1))
            rho = np.tile(rho, (self.bdim, 1))
        return {'smu': theano.shared(
                    pm.floatX(start), 'smu'),
                'rho': theano.shared(
                    pm.floatX(rho), 'rho')}


    @node_property
    def symbolic_logq_not_scaled(self):

        z = self.symbolic_random
        logq = NormalMixture.dist([.5,.5], 
                                     mu=[self.mean, -self.mean], 
                                     sd=[self.std, self.std],
                                    ).logp(z)
        
        return logq.sum(range(1, logq.ndim))
    
class SADVI(ADVI):
    def __init__(self, *args, **kwargs):
        super(ADVI, self).__init__(SymmetricMeanField(*args, **kwargs))    


class SymmetricMeanField(SingleGroupApproximation):
    __doc__ = """**Single Group Mean Field Approximation**

    """ + str(SymmetricMeanFieldGroup.__doc__)
    _group_class = SymmetricMeanFieldGroup

Now .fit() fails on the first iteration with the error below.
It seems like the Join operation produces a (2,510), while the Composite produces a (1,510).

To me it seems like the Join operation joins a (1,510) with a copy multiplied by -1? I don’t really know how to debug this…

ValueError: Input dimension mis-match. (input[0].shape[0] = 1, input[1].shape[0] = 2)
Apply node that caused the error: Elemwise{sub,no_inplace}(Elemwise{add,no_inplace}.0, Join.0)
Toposort index: 157
Inputs types: [TensorType(float64, matrix), TensorType(float64, matrix)]
Inputs shapes: [(1, 510), (2, 510)]
Inputs strides: [(4080, 8), (4080, 8)]
Inputs values: ['not shown', 'not shown']
Inputs type_num: [12, 12]
Outputs clients: [[Elemwise{pow,no_inplace}(Elemwise{sub,no_inplace}.0, InplaceDimShuffle{x,x}.0), Elemwise{pow}(Elemwise{sub,no_inplace}.0, Elemwise{sub}.0)]]

Backtrace when the node is created (use Theano flag traceback.limit=N to make it longer):
File "C:\Users\dycontri\AppData\Local\conda\conda\envs\analytics\lib\site-packages\theano\configparser.py", line 117, in res
return f(*args, **kwargs)
File "C:\Users\dycontri\AppData\Local\conda\conda\envs\analytics\lib\site-packages\pymc3\variational\opvi.py", line 1174, in symbolic_logq
return self.symbolic_logq_not_scaled
File "C:\Users\dycontri\AppData\Local\conda\conda\envs\analytics\lib\site-packages\pymc3\memoize.py", line 31, in memoizer
cache[key] = obj(*args, **kwargs)
File "C:\Users\dycontri\AppData\Local\conda\conda\envs\analytics\lib\site-packages\theano\configparser.py", line 117, in res
return f(*args, **kwargs)
File "", line 42, in symbolic_logq_not_scaled
).logp(z)
File "C:\Users\dycontri\AppData\Local\conda\conda\envs\analytics\lib\site-packages\pymc3\distributions\mixture.py", line 146, in logp
return bound(logsumexp(tt.log(w) + self._comp_logp(value), axis=-1),
File "C:\Users\dycontri\AppData\Local\conda\conda\envs\analytics\lib\site-packages\pymc3\distributions\mixture.py", line 112, in _comp_logp
return comp_dists.logp(value_)
File "C:\Users\dycontri\AppData\Local\conda\conda\envs\analytics\lib\site-packages\pymc3\distributions\continuous.py", line 480, in logp
return bound((-tau * (value - mu)**2 + tt.log(tau / np.pi / 2.)) / 2.,

Hi,

Trying to run the example, and running PPCA for the very first time :) Could anyone please clarify which variables can exhibit sign switching? That is not obvious at first glance.


@ferrine
I generated some fake data

import numpy as np
import datetime
from pymc3.variational.inference import ADVI
from pymc3.variational.opvi import Group, node_property
from pymc3.variational.approximations import MeanFieldGroup, SingleGroupApproximation
import pymc3 as pm
import theano
N=500
d = np.random.normal(size=(3, N))
d1 = (d[0]>0).astype(int)
d2 = -1*d[1]+np.random.normal(size=(1,N))/2
d3 = 2*d1+np.random.normal(size=(1,N))
d4=.5*d2+np.random.normal(size=(1,N))
yd = 4.*d1 +7.*d2 +5*d3+.3*d4+d[2]
#yd=.5*d1+d[2]
#d1[[1,5,25,14,32,11,125,157, 13,46,81,99,168,101,145,111]]=-999
d1True=d1.copy()
#d1[:450]=-999.
d1

Here is the model code:

import theano.tensor as tt
obs=np.array([d1,d2[0],d3[0],d4[0], yd[0]]).T
with pm.Model() as model1:
    s = pm.HalfCauchy('s', 10., shape=5)
    latent_std=10#HalfCauchy('l_std',100.)
    w = pm.Normal('w',0.,10, shape=(5,1))
    
    latent_ = pm.Normal('latent_',0.,1., shape=(len(d1),1))
    latent = pm.Deterministic('latent', latent_*latent_std)
    p = pm.Deterministic('p',tt.dot(latent, w.T))
    cov = pm.Deterministic('cov', latent_std**2*tt.dot(w, w.T)+(s**2)*np.eye(5))
    sd=pm.Deterministic('sd', tt.sqrt(tt.diag(cov)))
    latent_like = pm.Normal('latent_like', mu=p, sd=s, observed=obs)

with model1:
    approx=pm.fit(method=SADVI())
    trace=approx.sample()

I know for sure that the parameter w exhibits sign switching, at least. Make sure to sample with NUTS using 4 chains to reliably see the sign switching occur, or run ADVI to see the parameters regularized to 0.

Just FYI, in this model, w is the matrix of factor loadings (5,1)
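For example, something like this (rough sketch; the number of chains and draws is arbitrary):

# Sample with 4 NUTS chains and look at w per chain; chains stuck in the
# mirrored mode have loadings of the opposite sign.
with model1:
    nuts_trace = pm.sample(1000, tune=1000, chains=4)

for c in range(4):
    print(c, nuts_trace.get_values('w', chains=[c]).mean(axis=0).ravel())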

@ferrine

I got the VI to run, but the loss is always -inf or nan…

To get it to run at all, I had to edit NormalMixture again to remove the pm.Normal(shape=) syntax and instead create a list of two pm.Normal.dist() instances using mu and -mu respectively. Otherwise, I got a “Normal object is not iterable” error. See below.

class NormalMixture(pm.Mixture):
    def __init__(self, w, mu, comp_shape=(), *args, **kwargs):
        sd=kwargs.pop('sd',None)
        self.mu = mu = tt.as_tensor_variable(mu)
        self.sd = sd = tt.as_tensor_variable(sd)

        super(NormalMixture, self).__init__(w, [pm.Normal.dist(mu[0], sd=sd[0]),pm.Normal.dist(mu[1], sd=sd[1])],
                                            *args, **kwargs)

    def _repr_latex_(self, name=None, dist=None):
        if dist is None:
            dist = self
        mu = dist.mu
        w = dist.w
        sd = dist.sd
        name = r'\text{%s}' % name
        return r'${} \sim \text{{NormalMixture}}(\mathit{{w}}={},~\mathit{{mu}}={},~\mathit{{sigma}}={})$'.format(name,
                                                get_variable_name(w),
                                                get_variable_name(mu),
                                                get_variable_name(sd))


@Group.register
class SymmetricMeanFieldGroup(MeanFieldGroup):
    """Symmetric MeanField Group for Symmetrized VI"""
    __param_spec__ = dict(smu=('d', ), rho=('d', ))
    short_name = 'sym_mean_field'
    alias_names = frozenset(['smf'])
    
    @node_property
    def mean(self):
        return self.params_dict['smu']

    def create_shared_params(self, start=None):
        if start is None:
            start = self.model.test_point
        else:
            start_ = start.copy()
            update_start_vals(start_, self.model.test_point, self.model)
            start = start_
        if self.batched:
            start = start[self.group[0].name][0]
        else:
            start = self.bij.map(start)
        rho = np.zeros((self.ddim,))
        if self.batched:
            start = np.tile(start, (self.bdim, 1))
            rho = np.tile(rho, (self.bdim, 1))
        return {'smu': theano.shared(
                    pm.floatX(start), 'smu'),
                'rho': theano.shared(
                    pm.floatX(rho), 'rho')}


    @node_property
    def symbolic_logq_not_scaled(self):

        z = self.symbolic_random
        logq = NormalMixture.dist(w=[np.ones((1,510))*.5,np.ones((1,510))*.5], 
                                     mu=[self.mean, -self.mean], 
                                     sd=[self.std, self.std]
                                    ).logp([z,-z])
        
        return logq.sum(range(1, logq.ndim))
    
class SADVI(ADVI):
    def __init__(self, *args, **kwargs):
        super(ADVI, self).__init__(SymmetricMeanField(*args, **kwargs))    


class SymmetricMeanField(SingleGroupApproximation):
    __doc__ = """**Single Group Mean Field Approximation**

    """ + str(SymmetricMeanFieldGroup.__doc__)
    _group_class = SymmetricMeanFieldGroup

Incidentally, I want to point out, @gBokiau, @ferrine, that while the average loss when running symmetrized ADVI is -inf, it looks like it is correctly fixing the sign switching, as the weight parameters are no longer regularized to 0.

Probably because you missed symbolic_random. Sampling from q(z) is still wrong:

    @node_property
    def symbolic_random(self):
        # Flip the sign of the mean element-wise with probability 1/2, so the
        # draws actually come from both modes of the symmetric q(z).
        initial = self.symbolic_initial
        sign = theano.tensor.as_tensor([-1, 1])[self._rng.binomial(initial.shape)]
        sigma = self.std
        mu = self.mean
        return sigma * initial + mu * sign

Moreover, you should specify which variables belong to a specific group (https://docs.pymc.io/api/inference.html#pymc3.variational.opvi.Group).
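Something along these lines (untested sketch, using the aliases registered above and the w variable from the example model; the exact signatures are in the linked docs):

from pymc3.variational.opvi import Group, Approximation
from pymc3.variational.inference import KLqp

# Put the sign-symmetric variable(s) in the symmetric group and let an
# ordinary mean-field group (group=None) catch everything else.
with model1:
    sym_group = Group([model1['w']], vfam='sym_mean_field')
    rest_group = Group(None, vfam='mean_field')
    approx = Approximation([sym_group, rest_group])
    inference = KLqp(approx)
    inference.fit(20000)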


Thank you very much for your help, @ferrine

I’m looking for how to specify which variables go with which group.
Do you mean that in the SADVI() class I need to add parameters to pass variable names to SymmetricMeanFieldGroup vs. MeanFieldGroup?

@ferrine - nvm. I found it!


@ferrine

I implemented symbolic_random as you suggested, but it actually reintroduced the sign-switching and regularizing behaviour seen with normal ADVI…

While before, the w parameter’s weights were correctly estimated, now they’re all set to 0:

Without the change to symbolic_random:
The weights match those found with NUTS, having a decent spread. I worry that the uncertainty of these parameters is underestimated, though…

After the change to symbolic_random:
The parameters are all located at 0.

Interesting, I think I need some time to dig in tomorrow (I haven’t had time yet).

Thing to try number one: smart starting values for std/mean in SMF. The gradients may be too noisy when a sample from the positive mode jumps over zero.

Thing to try number two (a bit advanced): https://arxiv.org/abs/1805.08498. This paper describes a better reparametrization gradient for mixture distributions.

Another hotfix for the model may be to simply give w positive support by using a half-normal prior.
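For example (untested sketch that reuses obs, d1 and tt from the example above and only changes the prior on w):

with pm.Model() as model_pos:
    s = pm.HalfCauchy('s', 10., shape=5)
    # half-normal prior: w can only be positive, so the sign symmetry is gone
    w = pm.HalfNormal('w', 10., shape=(5, 1))
    latent_ = pm.Normal('latent_', 0., 1., shape=(len(d1), 1))
    latent = pm.Deterministic('latent', latent_ * 10)
    p = pm.Deterministic('p', tt.dot(latent, w.T))
    latent_like = pm.Normal('latent_like', mu=p, sd=s, observed=obs)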

Good night
