Unique solution for probabilistic PCA

Another one to dig up this old thread. I'm a little late to the party, but this may be a germane discussion. My idea was to identify a unique factorization (I'm looking more generally at the factor-analysis model Y_i ~ N(B F_i, D); F_i ~ N(0, H)) with the following constraints:

  1. Set loadings to lower-triangular (no more rotation invariance)
  2. Set diagonal to positive (no more sign invariance)
  3. Order diagonal (no more permutation invariance)

This would look something like:

import numpy as np
import pymc3 as pm
import theano.tensor as tt


def expand_packed_block_triangular(n, k, packed, diag=None):
    # like expand_packed_triangular, but for a tall (n x k) block with n >= k
    assert n >= k
    out = tt.zeros((n, k), dtype=float)
    if diag is None:
        # packed contains all entries on and below the diagonal
        idxs = np.tril_indices(n, m=k)
        out = tt.set_subtensor(out[idxs], packed)
    else:
        # packed contains only the strictly-lower-triangular entries;
        # the diagonal is supplied separately
        idxs = np.tril_indices(n, k=-1, m=k)
        out = tt.set_subtensor(out[idxs], packed)
        idxs = (np.arange(k), np.arange(k))
        out = tt.set_subtensor(out[idxs], diag)
    return out

ZeroTruncNormal = pm.Bound(pm.Normal, lower=0.0)
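
As a quick sanity check on the packed lengths this helper expects (this check is my own addition; it uses only NumPy and the c=5, k=2 shapes from the model below):

c, k = 5, 2
# entries on/below the diagonal of a tall (c x k) lower-triangular block
n_lower = len(np.tril_indices(c, m=k)[0])     # c*k - k*(k-1)//2 = 9
# strictly-lower entries, i.e. the packed off-diagonal vector
n_od = len(np.tril_indices(c, k=-1, m=k)[0])  # n_lower - k = 7
assert n_od == c*k - k*(k-1)//2 - k

This matches the shape=(k*c - k*(k-1)//2 - k,) used for W_packed_od below.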

# using `obs` data from this thread

c = 5   # observed dimension
n = N   # number of observations (N comes from the `obs` data in this thread)
k = 2   # number of latent factors
with pm.Model() as bfa:
    # factors are i.i.d. standard normal
    F = pm.Normal('F', 0., 1., shape=(k, n))  # latent vars
    # loadings are lower triangular with a positive, ordered diagonal
    Wod = pm.Normal('W_packed_od', 0., 10., shape=(k*c - k*(k-1)//2 - k,))
    Wd = ZeroTruncNormal('W_packed_d', mu=1., sd=3., shape=(k,),
                         transform=pm.distributions.transforms.Ordered(),
                         testval=1. + np.arange(k))
    W = pm.Deterministic('W', expand_packed_block_triangular(c, k, Wod, Wd))
    L = pm.Deterministic('L', tt.dot(W, F))
    d = pm.HalfCauchy('err_d', 5.)
    Capprox = pm.Normal('Capprox', mu=L, sd=d, shape=(c, n))  # far faster

    traces = [pm.sample(200, chains=1, init='adapt_diag') for _ in range(6)]

I'm finding that it doesn't quite work right: NUTS produces over-broad loading posteriors that exhibit label switching:

import seaborn as sbn

colors = ['blue', 'red', 'green', 'yellow', 'black', 'magenta', 'orange']

for tr in traces:
    for j in range(2):
        sbn.kdeplot(tr['W_packed_d'][:, j], color=colors[j])
        

[image: KDE plots of the two diagonal loadings W_packed_d across the six chains, showing switching]

ADVI, meanwhile, routinely gives inf values for the loss and generates broad estimates extending well into the infeasible region of the parameter space:

[image: ADVI posterior estimates of the diagonal loadings, extending into the infeasible region]

So, are there better ways of modeling these constraints? Or is the symmetric approach far superior?

Did you compare with the symmetric approach above?

NUTS basically can't handle the constraints: even with the target acceptance rate tuned way up (0.99) and the maximum tree depth drastically increased (25), it is a very unhappy sampler. Possibly Gibbs sampling would do better, since it would give more granular control over the integration limits (in particular for the Gibbs pass over the ordered diagonal).
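
For reference, this is roughly how those settings were applied (a sketch only, assuming the bfa model above; target_accept and max_treedepth are the pm.NUTS arguments for the acceptance rate and tree depth):

with bfa:
    # aggressive tuning: very high target acceptance and a deep tree
    step = pm.NUTS(target_accept=0.99, max_treedepth=25)
    unhappy_trace = pm.sample(200, step=step, chains=1)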

It's also possible that the hard boundary at 0 is causing a significant problem, so a latent log-normal formulation may help ADVI (i.e. W_lat ~ N(mu, sigma), W_d = exp(W_lat)); see the sketch just below. Another option is an auxiliary exponential formulation with Y_1, ..., Y_k ~ Exp(b) and X_1 = Y_1, X_2 = Y_1 + Y_2, … But both of these really constrain the space of priors over the diagonal.
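
A minimal sketch of that log-normal reparametrization (my variable names W_lat and W_d are illustrative; the shapes are assumed to match the model above):

with pm.Model() as bfa_lognormal:
    # unconstrained latent diagonal; ordering it also orders exp(W_lat)
    W_lat = pm.Normal('W_lat', mu=0., sd=1., shape=(k,),
                      transform=pm.distributions.transforms.Ordered(),
                      testval=np.arange(k, dtype=float))
    # positive, ordered diagonal without a hard bound at 0
    W_d = pm.Deterministic('W_d', tt.exp(W_lat))

This keeps positivity and ordering while removing the hard boundary that the bounded Normal imposes, at the cost of restricting the diagonal prior to the log-normal family.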

Okay. At the risk of never being able to live this down…

I forgot to pass in the observed data (oops!). The fact that there were ugly divergences in NUTS without any actual data in the model strongly suggests that zero-bounded Normal diagonals are a bad way to go.

At the same time, modeling the differences between successive diagonal entries appears to work much better: the cumulative sum of positive increments is automatically positive and nondecreasing, so the ordering constraint comes for free. Something like

del_ = pm.HalfCauchy('deltas', 1., shape=(k,))                    # positive increments
lambda_ = pm.Deterministic('lambda', tt.extra_ops.cumsum(del_))   # ordered, positive diagonal

One lingering issue is that there's an obvious relationship between the scale of a diagonal term and the scales of the off-diagonal terms in the same column, which is a hindrance to ADVI (mean-field ADVI assumes posterior independence). This suggests a more ADVI-friendly model for W, in which the off-diagonals are defined relative to a unit diagonal and the column scales are applied afterwards:

c = 5   # observed dimension
n = N   # number of observations
k = 2   # number of latent factors
constrain = True   # toggle the identified (triangular) parametrization

# (c, n) ~ (c, k) x (k, n)
with pm.Model() as bfa:
    # factors are i.i.d. standard normal
    F = pm.Normal('F', 0., 1., shape=(k, n))  # latent vars
    if constrain:
        # loadings: unit-diagonal lower-triangular block times a positive, ordered column scale
        v = k*c - k*(k-1)//2 - k               # number of off-diagonal entries
        del_ = pm.HalfCauchy('del', 1., shape=(k,))
        lambda_ = pm.Deterministic('lambda', tt.extra_ops.cumsum(del_))
        W_od = pm.Normal('W_od', 0., 1., shape=(v,))
        ones = [1.] * k
        W1 = expand_packed_block_triangular(c, k, W_od, ones)
        W = pm.Deterministic('W', tt.dot(W1, tt.nlinalg.diag(lambda_)))
    else:
        W = pm.Laplace('W', mu=0, b=0.25, shape=(c, k))
    L = pm.Deterministic('L', tt.dot(W, F))
    d = pm.HalfCauchy('err_d', 5.)
    Capprox = pm.Normal('Capprox', mu=L, sd=d, shape=(c, n), observed=obs.T)  # far faster

    cpcc = pm.callbacks.CheckParametersConvergence
    advi_mods = [pm.ADVI() for _ in range(3)]
    advi_fits = [m.fit(10**5, callbacks=[cpcc(diff='absolute', tolerance=7.5e-3)])
                 for m in advi_mods]
    for j, m in enumerate(advi_mods):
        advi_fits[j] = m.fit(1000, obj_n_mc=100)  # refinement with more MC samples per step
    advi_sam = [f.sample(200) for f in advi_fits]
    nuts_sam = [pm.sample(200, chains=1, init='advi+adapt_diag', tune=1000) for _ in range(4)]

This appears to work quite well. I see no mode averaging across components, and no component switching across the ADVI fits:

import matplotlib.pyplot as plt

textures = ['-', '--', '-.', ':']   # one linestyle per ADVI fit (definition not shown above)

for j, tr in enumerate(advi_sam):
    ctr = 0
    for t in range(2):
        u = 5 if t == 0 else 4
        for i in range(u):
            sbn.kdeplot(tr['W'][:, i, t], color=colors[ctr], linestyle=textures[j])
            ctr += 1
        plt.ylim([0, 40])

plt.figure()

[image: KDE plots of the loading posteriors W across the three ADVI fits; components line up with no switching]

NUTS was quite happy with this parametrization as well, with no divergences during sampling. However (and strangely), the NUTS posteriors, while consistent across the different runs, are qualitatively different from the ADVI ones:

for j, tr in enumerate(nuts_sam):
    ctr = 0
    for t in range(2):
        u = 5 if t == 0 else 4
        for i in range(u):
            sbn.kdeplot(tr['W'][:, i, t], color=colors[ctr], linestyle=textures[j])
            ctr += 1

[image: KDE plots of the loading posteriors W from the four NUTS chains]

I can understand ADVI underestimating the parameter covariances; I could even (in this case) understand it potentially shuffling off-diagonal entries. But it appears that the point estimates don't even line up between ADVI and NUTS (with the possible exception of the diagonal entries). This isn't an artifact of using ADVI initialization; jitter+adapt_diag looks the same:

[image: the same NUTS posteriors with jitter+adapt_diag initialization, matching the ADVI-initialized runs]

I think it would be strange for the rank of the ADVI approximation to have such a large effect on the expectations, but I'll run with full-rank ADVI to see what happens.
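
For reference, a full-rank run would look something like this (a sketch only, reusing the bfa model and the cpcc callback from the code above; the iteration count is arbitrary):

with bfa:
    fr = pm.FullRankADVI()
    fr_fit = fr.fit(5 * 10**4, callbacks=[cpcc(diff='absolute', tolerance=7.5e-3)])
    fr_sam = fr_fit.sample(200)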

In sum: if you can live with this parametrization (and in the exploratory settings where FA/PCA is typical, I don't see what the major drawbacks would be), it seems both to be efficient and to parametrize a specific element of the equivalence class of solutions.**


A couple of minor things I noticed in the API that are a little unclear:

  1. It'd be useful to pass a specific ADVI fit to NUTS as the initialization; it's not clear how to do this.
  2. I'd like to fit full-rank ADVI starting from an initial (fitted) mean-field ADVI point.
  3. I've got several fitted ADVI inference objects in the code above, but I can't sample from those objects themselves; instead I have to do something like model.fit(1).sample(100) to get a result to sample from. There must be a cleaner way? (See the sketch after this list.)
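
On point 3, one workaround (hedged: this assumes the fitted PyMC3 Inference objects expose their underlying approximation as .approx, and it reuses advi_mods from the code above) is to sample from that attribute directly:

# sample from an already-fitted ADVI inference object without refitting
advi = advi_mods[0]
advi_sample = advi.approx.sample(200)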

** Actually cases of repeated eigenvalues will show instability. Luckily, in this parametrization, you should be able to see this from del_ values shrinking to near 0.


Thanks for the detailed write-up! This is very informative!
Are you interested in writing it up into a doc?

Crazy but true: allowing for covariances between the factors (full-rank ADVI) does alter the point estimates for the means. I haven't the foggiest idea why this should be the case.

Average Loss = 3,600.2: 100%|██████████| 125000/125000 [04:07<00:00, 504.57it/s]
Finished [100%]: Average Loss = 3,600.2
Average Loss = 3,600.1: 100%|██████████| 125000/125000 [03:03<00:00, 680.25it/s]
Finished [100%]: Average Loss = 3,600.1
Average Loss = 3,601.1: 100%|██████████| 125000/125000 [03:06<00:00, 671.65it/s]
Finished [100%]: Average Loss = 3,601.2
Average Loss = 3,580.2: 100%|██████████| 25000/25000 [07:23<00:00, 56.43it/s]
Finished [100%]: Average Loss = 3,580.2
Average Loss = 3,580.2: 100%|██████████| 25000/25000 [07:08<00:00, 58.37it/s]
Finished [100%]: Average Loss = 3,580.2
Average Loss = 3,580.5:  30%|██▉       | 7422/25000 [02:17<06:17, 46.57it/s]
Interrupted at 7,425 [29%]: Average Loss = 3,583.1
Average Loss = 3,937.5: 100%|██████████| 100000/100000 [38:53<00:00, 42.86it/s]
Finished [100%]: Average Loss = 3,937.5
Average Loss = 3,930.3: 100%|██████████| 100000/100000 [34:40<00:00, 48.06it/s]
Finished [100%]: Average Loss = 3,930.2
Average Loss = 3,583.3: 100%|██████████| 25000/25000 [14:05<00:00, 29.56it/s]
Finished [100%]: Average Loss = 3,583.3
Average Loss = 3,583.9: 100%|██████████| 25000/25000 [14:14<00:00, 29.24it/s]
Finished [100%]: Average Loss = 3,583.9
Only 250 samples in chain.
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Sequential sampling (1 chains in 1 job)
NUTS: [err_d, W_od, del, F]
100%|██████████| 3250/3250 [01:44<00:00, 31.11it/s]
Only one chain was sampled, this makes it impossible to run some convergence checks
Only 250 samples in chain.
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Sequential sampling (1 chains in 1 job)
NUTS: [err_d, W_od, del, F]
100%|██████████| 3250/3250 [02:43<00:00, 19.82it/s]
Only one chain was sampled, this makes it impossible to run some convergence checks
Only 250 samples in chain.
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Sequential sampling (1 chains in 1 job)
NUTS: [err_d, W_od, del, F]
100%|██████████| 3250/3250 [02:18<00:00, 23.39it/s]
Only one chain was sampled, this makes it impossible to run some convergence checks
Only 250 samples in chain.
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Sequential sampling (1 chains in 1 job)
NUTS: [err_d, W_od, del, F]
100%|██████████| 3250/3250 [02:12<00:00, 24.50it/s]
Only one chain was sampled, this makes it impossible to run some convergence checks

ADVI:

[image: mean-field ADVI posteriors of W]

FR-ADVI:

[image: full-rank ADVI posteriors of W]

NUTS:

[image: NUTS posteriors of W]

Are you interested in writing it up into a doc?

What would that entail?

Write it up as a notebook, and send a pull request to https://github.com/pymc-devs/pymc3/tree/master/docs/source/notebooks :wink:
