 # Unique solution for probabilistic PCA

#43

Another one to dig up this old thread. I’m a little late to the party, but this may be germane discussion. I thought to identify a unique factorization (I’m looking more generally at a factor analysis `Y_i ~ N(BF_i, D) ; F_i ~ N(0, H)`) by the following:

1. Set loadings to lower-triangular (no more rotation invariance)
2. Set diagonal to positive (no more sign invariance)
3. Order diagonal (no more permutation invariance)

this would look something like

``````def expand_packed_block_triangular(n, k, packed, diag=None):
# like expand_packed_triangular, but with n > k.
assert n >= k
out = tt.zeros((n, k), dtype=float)
if diag is None:
idxs = np.tril_indices(n, m=k)
out = tt.set_subtensor(out[idxs], packed)
else:
idxs = np.tril_indices(n, k=-1, m=k)
out = tt.set_subtensor(out[idxs], packed)
idxs = (np.arange(k), np.arange(k))
out = tt.set_subtensor(out[idxs], diag)
return out

ZeroTruncNormal = pm.Bound(pm.Normal, lower=0.0)

# using `obs` data from this thread

c=5
n=N
k=2
with pm.Model() as bfa:
# factors are normally i.i.d.
F = pm.Normal('F', 0., 1., shape=(k, n)) # latent vars
# loadings are right lower triangular with a diagonal
Wod = pm.Normal('W_packed_od', 0., 10., shape=(k*c - k*(k-1)/2 - k,))
Wd = ZeroTruncNormal('W_packed_d', mu=1., sd=3., shape=(k,),
transform=pm.distributions.transforms.Ordered(),
testval=1. + 1 * np.arange(k))
W = pm.Deterministic('W', expand_packed_block_triangular(c, k, Wod, Wd))
L = pm.Deterministic('L', tt.dot(W, F))
d = pm.HalfCauchy('err_d', 5.)
Capprox = pm.Normal('Capprox', mu=L, sd=d, shape=(c,n))  # far faster

traces = [pm.sample(200, chains=1, init='adapt_diag') for _ in xrange(6)]

``````

I’m finding that it doesn’t quite work right, with NUTS resulting in over-broad loadings which result in switches:

``````import seaborn as sbn
colors = ['blue', 'red', 'green', 'yellow', 'black', 'magenta', 'orange']

for tr in traces:
for j in xrange(2):
sbn.kdeplot(tr['W_packed_d'][:, j], color=colors[j])

`````` and ADVI routinely giving `inf` values for the loss, and generating broad estimates well into the infeasible model regions So, are there better ways of modeling these constraints? Or is the symmetric approach far superior?

#44

Did you compare with the symmetric approach above?

#45

NUTS basically can’t handle the constraints – even with tuning the accept rate way up (0.99) and drastically increasing the tree depth (25) it is a very unhappy sampler. Possibly Gibbs Sampling would be better since it would give more granular control over the integration limits (in particular the Gibbs pass over the ordered diagonal).

It’s also possible that the hard limit at 0 is causing a significant problem, so a latent formulation for a lognormal problem may help ADVI (i.e. `W_lat ~ N(), W_d = exp(W_lat)`). Or an auxiliary exponential formulation with `Y1, ..., Yn ~ exp(b)`, `X1 = Y1`, `X2 = Y1 + Y2`, … But these really constrain the space of priors over the diagonal.

#46

Okay. At the risk of never being able to live this down…

I forgot to pass in the observed data (oops!) The fact that there were ugly divergences in NUTS without any actual data in the model strongly suggests that HalfNormal diagonals is a bad way to go.

At the same time, modeling the differences between the diagonal entries appears to work much better; something like

``````del_ = pm.HalfCauchy('deltas', 1., size=(k,))
lambda = pm.Deterministic('lambda', tt.extra_ops.cumsum(del_))
``````

One lingering issue is that there’s an obvious relationship between the scales of the diagonal terms, and the scales of the off-diagonal terms in the same column, which is a hindrance to ADVI. This suggests that a more ADVI-aligned model for W would be

``````c=5
n=N
k=2

# (c, n) ~ (c, k) x (k, n)
with pm.Model() as bfa:
# factors are normally i.i.d.
F = pm.Normal('F', 0., 1., shape=(k, n)) # latent vars
# loadings are right lower triangular with a diagonal
if constrain:
v = k*c - k*(k-1)/2 - k
del_ = pm.HalfCauchy('del', 1., shape=(k,))
lambda_ = pm.Deterministic('lambda', tt.extra_ops.cumsum(del_))
W_od = pm.Normal('W_od', 0., 1., shape=(v,))
ones = [1.] * k
W1 = expand_packed_block_triangular(c, k, W_od, ones)
W = pm.Deterministic('W', tt.dot(W1, tt.nlinalg.diag(lambda_)))
else:
W = pm.Laplace('W', mu=0, b=0.25, shape=(c, k))
L = pm.Deterministic('L', tt.dot(W, F))
d = pm.HalfCauchy('err_d', 5.)
Capprox = pm.Normal('Capprox', mu=L, sd=d, shape=(c,n), observed=obs.T)  # far faster
cpcc = pm.callbacks.CheckParametersConvergence
advi_mods = [pm.ADVI() for _ in xrange(3)]
advi_fits = [m.fit(1 * 10**5, callbacks=[cpcc(diff='absolute', tolerance=7.5e-3)]) for m in advi_mods]
for j, f in enumerate(advi_mods):
advi_fits[j] = f.fit(1000, obj_n_mc=100) # refinement
advi_sam = [f.sample(200) for f in advi_fits]
nuts_sam = [pm.sample(200, chains=1, init='advi+adapt_diag', tune=1000) for _ in xrange(4)]

``````

This appears to work quite well. I see no poor averaging across components, and no component switching across the ADVI fits:

``````for j, tr in enumerate(advi_sam):
ctr = 0
for t in xrange(2):
u = 5 if t == 0 else 4
for i in xrange(u):
sbn.kdeplot(tr['W'][:, i, t], color=colors[ctr], linestyle=textures[j])
ctr += 1
plt.ylim([0, 40])

plt.figure()
`````` NUTS was quite happy with this parametrization as well, with no divergences during sampling. However (and strangely?) the NUTS posteriors, while consistent across the different samplings, are qualitatively different:

``````for j, tr in enumerate(nuts_sam):
ctr = 0
for t in xrange(2):
u = 5 if t == 0 else 4
for i in xrange(u):
sbn.kdeplot(tr['W'][:, i, t], color=colors[ctr], linestyle=textures[j])
ctr += 1
`````` I can understand underestimating the parameter covariances – I could even (in this case) understand potentially shuffling off-diagonal entries. But it appears that the point estimates don’t even line up between ADVI and NUTS (with the possible exception of the diagonal entries). This isn’t an artifact of using ADVI initialization, jitter+adapt looks the same I think it would be strange for the ADVI rank to have such a large effect on expectations; but I’ll run with rullrank ADVI to see what happens.

In sum – if you can live with this parametrization (and in an exploratory analysis where FA/PCA is typical, I don’t see what the major drawbacks are), it seems to be both efficient and parametrize a specific element in the equivalence class of solutions.**

A couple minor things I noticed in the API which are a little unclear:

1. It’d be useful to pass in a specific ADVI fit to NUTS, it’s not clear how to do this
2. I’d like to fit ADVI_fullrank starting at an initial (fitted) ADVI point
3. I’ve got several fitted ADVI models in the code above; but I can’t sample from the model objects themselves, instead I have to do something like `model.fit(1).sample(100)` to get a fit result. There must be a cleaner way…?

** Actually cases of repeated eigenvalues will show instability. Luckily, in this parametrization, you should be able to see this from `del_` values shrinking to near 0.

#47

Thanks for the detail write up! This is very informative!
Are you interested in writing it up into a doc?

#48

Crazy but true, allowing for covariances between the factors does alter the point-estimates for the means. I haven’t the foggiest why this should be the case.

``````Average Loss = 3,600.2: 100%|██████████| 125000/125000 [04:07<00:00, 504.57it/s]
Finished [100%]: Average Loss = 3,600.2
Average Loss = 3,600.1: 100%|██████████| 125000/125000 [03:03<00:00, 680.25it/s]
Finished [100%]: Average Loss = 3,600.1
Average Loss = 3,601.1: 100%|██████████| 125000/125000 [03:06<00:00, 671.65it/s]
Finished [100%]: Average Loss = 3,601.2
Average Loss = 3,580.2: 100%|██████████| 25000/25000 [07:23<00:00, 56.43it/s]
Finished [100%]: Average Loss = 3,580.2
Average Loss = 3,580.2: 100%|██████████| 25000/25000 [07:08<00:00, 58.37it/s]
Finished [100%]: Average Loss = 3,580.2
Average Loss = 3,580.5:  30%|██▉       | 7422/25000 [02:17<06:17, 46.57it/s]
Interrupted at 7,425 [29%]: Average Loss = 3,583.1
Average Loss = 3,937.5: 100%|██████████| 100000/100000 [38:53<00:00, 42.86it/s]
Finished [100%]: Average Loss = 3,937.5
Average Loss = 3,930.3: 100%|██████████| 100000/100000 [34:40<00:00, 48.06it/s]
Finished [100%]: Average Loss = 3,930.2
Average Loss = 3,583.3: 100%|██████████| 25000/25000 [14:05<00:00, 29.56it/s]
Finished [100%]: Average Loss = 3,583.3
Average Loss = 3,583.9: 100%|██████████| 25000/25000 [14:14<00:00, 29.24it/s]
Finished [100%]: Average Loss = 3,583.9
Only 250 samples in chain.
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Sequential sampling (1 chains in 1 job)
NUTS: [err_d, W_od, del, F]
100%|██████████| 3250/3250 [01:44<00:00, 31.11it/s]
Only one chain was sampled, this makes it impossible to run some convergence checks
Only 250 samples in chain.
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Sequential sampling (1 chains in 1 job)
NUTS: [err_d, W_od, del, F]
100%|██████████| 3250/3250 [02:43<00:00, 19.82it/s]
Only one chain was sampled, this makes it impossible to run some convergence checks
Only 250 samples in chain.
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Sequential sampling (1 chains in 1 job)
NUTS: [err_d, W_od, del, F]
100%|██████████| 3250/3250 [02:18<00:00, 23.39it/s]
Only one chain was sampled, this makes it impossible to run some convergence checks
Only 250 samples in chain.
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Sequential sampling (1 chains in 1 job)
NUTS: [err_d, W_od, del, F]
100%|██████████| 3250/3250 [02:12<00:00, 24.50it/s]
Only one chain was sampled, this makes it impossible to run some convergence checks
``````  NUTS: Are you interested in writing it up into a doc?

What would that entail?

#49

Write it up as a notebook, and send a pull request to https://github.com/pymc-devs/pymc3/tree/master/docs/source/notebooks #50