Okay. At the risk of never being able to live this down…
I forgot to pass in the observed data (oops!). The fact that there were ugly divergences in NUTS without any actual data in the model strongly suggests that HalfNormal priors on the diagonal are a bad way to go.
At the same time, modeling the differences between the diagonal entries appears to work much better; something like
del_ = pm.HalfCauchy('deltas', 1., shape=(k,))
lambda_ = pm.Deterministic('lambda', tt.extra_ops.cumsum(del_))
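The point of the cumsum is that summing positive increments yields a diagonal that is positive and non-decreasing, which pins down both the signs and the ordering of the factors. A quick numpy check of the idea (the increment values are made up):

```python
import numpy as np

# hypothetical positive increments (stand-ins for HalfCauchy draws)
deltas = np.array([0.7, 1.3, 0.2])

# cumulative sums give a diagonal that is positive and non-decreasing,
# ruling out both sign flips and reordering of the factors
lambdas = np.cumsum(deltas)

print(lambdas)  # [0.7 2.  2.2]
```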
One lingering issue is that there’s an obvious relationship between the scales of the diagonal terms and the scales of the off-diagonal terms in the same column; mean-field ADVI assumes independent marginals, so this coupling is a hindrance. This suggests that a more ADVI-aligned model for W would be
c = 5
n = N  # number of observations
k = 2
constrain = True  # toggle the identified (triangular) parametrization

# (c, n) ~ (c, k) x (k, n)
with pm.Model() as bfa:
    # factors are i.i.d. standard normal
    F = pm.Normal('F', 0., 1., shape=(k, n))  # latent vars
    # loadings are lower triangular with a positive, ordered diagonal
    if constrain:
        v = k*c - k*(k-1)//2 - k  # number of free off-diagonal entries
        del_ = pm.HalfCauchy('del', 1., shape=(k,))
        lambda_ = pm.Deterministic('lambda', tt.extra_ops.cumsum(del_))
        W_od = pm.Normal('W_od', 0., 1., shape=(v,))
        ones = [1.] * k
        W1 = expand_packed_block_triangular(c, k, W_od, ones)
        W = pm.Deterministic('W', tt.dot(W1, tt.nlinalg.diag(lambda_)))
    else:
        W = pm.Laplace('W', mu=0., b=0.25, shape=(c, k))
    L = pm.Deterministic('L', tt.dot(W, F))
    d = pm.HalfCauchy('err_d', 5.)
    Capprox = pm.Normal('Capprox', mu=L, sd=d, shape=(c, n), observed=obs.T)  # far faster

    cpcc = pm.callbacks.CheckParametersConvergence
    advi_mods = [pm.ADVI() for _ in range(3)]
    advi_fits = [m.fit(1 * 10**5, callbacks=[cpcc(diff='absolute', tolerance=7.5e-3)])
                 for m in advi_mods]
    for j, f in enumerate(advi_mods):
        advi_fits[j] = f.fit(1000, obj_n_mc=100)  # refinement with more MC samples
    advi_sam = [f.sample(200) for f in advi_fits]
    nuts_sam = [pm.sample(200, chains=1, init='advi+adapt_diag', tune=1000)
                for _ in range(4)]
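For reference, here is a numpy sketch of what expand_packed_triangular-style unpacking is assumed to do in the model above (the actual Theano helper is defined elsewhere); note the packed-entry count matches v = k*c - k*(k-1)//2 - k:

```python
import numpy as np

def expand_packed_block_triangular_np(c, k, packed, diag):
    # fill a (c, k) matrix column by column: `diag` on the diagonal,
    # `packed` entries below it, zeros above it
    W = np.zeros((c, k))
    idx = 0
    for j in range(k):
        W[j, j] = diag[j]
        for i in range(j + 1, c):
            W[i, j] = packed[idx]
            idx += 1
    return W

# with c = 5, k = 2 there are k*c - k*(k-1)//2 - k = 7 packed entries
W = expand_packed_block_triangular_np(5, 2, np.arange(1., 8.), [1., 1.])
```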
This appears to work quite well. I see no poor averaging across components, and no component switching across the ADVI fits:
for j, tr in enumerate(advi_sam):
    ctr = 0
    for t in range(2):
        u = 5 if t == 0 else 4  # column 0 has 5 free entries, column 1 has 4
        for i in range(u):
            sbn.kdeplot(tr['W'][:, i, t], color=colors[ctr], linestyle=textures[j])
            ctr += 1
plt.ylim([0, 40])
plt.figure()

NUTS was quite happy with this parametrization as well, with no divergences during sampling. However (and strangely?) the NUTS posteriors, while consistent across the different samplings, are qualitatively different from the ADVI ones:
for j, tr in enumerate(nuts_sam):
    ctr = 0
    for t in range(2):
        u = 5 if t == 0 else 4
        for i in range(u):
            sbn.kdeplot(tr['W'][:, i, t], color=colors[ctr], linestyle=textures[j])
            ctr += 1

I can understand underestimating the parameter covariances – I could even (in this case) understand potentially shuffling off-diagonal entries. But it appears that the point estimates don’t even line up between ADVI and NUTS (with the possible exception of the diagonal entries). This isn’t an artifact of using ADVI initialization; jitter+adapt_diag looks the same.
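One cheap way to quantify this mismatch is to compare posterior means entrywise across traces. max_mean_gap below is a hypothetical helper, and the two trace dicts are synthetic stand-ins for the advi_sam / nuts_sam traces above:

```python
import numpy as np

def max_mean_gap(tr_a, tr_b, var='W'):
    # largest absolute difference between the posterior means of `var`
    # across two traces (each assumed to hold a (draws, c, k) array)
    return np.max(np.abs(tr_a[var].mean(axis=0) - tr_b[var].mean(axis=0)))

# toy check on synthetic draws with deliberately shifted means
rng = np.random.default_rng(0)
tr_advi = {'W': rng.normal(0.0, 0.1, size=(200, 5, 2))}
tr_nuts = {'W': rng.normal(1.0, 0.1, size=(200, 5, 2))}
```

A gap well above the Monte Carlo error of the means is the signal that the samplers disagree on location, not just on spread.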

I think it would be strange for the ADVI rank to have such a large effect on expectations, but I’ll run full-rank ADVI to see what happens.
In sum – if you can live with this parametrization (and in an exploratory analysis where FA/PCA is typical, I don’t see what the major drawbacks are), it seems to be both efficient and to pick out a specific element of the equivalence class of solutions.**
A couple minor things I noticed in the API which are a little unclear:
- It’d be useful to pass a specific ADVI fit to NUTS as the initialization; it’s not clear how to do this
- I’d like to fit full-rank ADVI starting at an initial (fitted) mean-field ADVI point
- I’ve got several fitted ADVI models in the code above, but I can’t sample from the model objects themselves; instead I have to do something like model.fit(1).sample(100) to get a fit result. There must be a cleaner way…?
** Actually, cases of repeated eigenvalues will show instability. Luckily, in this parametrization, you should be able to see this from del_ values shrinking to near 0.
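To illustrate the footnote: under the cumsum parametrization a repeated scale can only arise from a zero increment, so near-zero del_ draws are a direct diagnostic (the scale values below are made up):

```python
import numpy as np

# diagonal scales with a repeated value
lambdas = np.array([0.8, 1.5, 1.5])

# the increments that would have generated them under the cumsum parametrization
deltas = np.diff(np.concatenate([[0.0], lambdas]))

# the repeated scale shows up as an increment of (near) zero
print(deltas)  # [0.8 0.7 0. ]
```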