Probabilistic PCA and identification

I’ve been playing around with Bayesian Probabilistic PCA. Conceptually, the model is very simple, but the main problem with it is that while priors take care of location and scale invariance, by default the solutions are only unique up to an orthonormal rotation.

There has been a fair bit of discussion on this topic in this forum before, most notably here, which even resulted in an example in the library.

The conclusion reached there came from Chartl, who proposed:

Set loadings to lower-triangular (no more rotation invariance)
Set diagonal to positive (no more sign invariance)
Order diagonal (no more permutation invariance)

Thing is, this approach no longer works for k>3 (which is easy to verify by setting k=5 in the example notebook).

I’ve been racking my brain and trying to find sources for this for the past few days and I think I figured it out, but I would like someone else to verify my reasoning. Hence the post.

The idea of forcing the matrix to be lower triangular with positive diagonal elements seems to trace back to Geweke and Zhou 1996, “Measuring the Pricing Error of the Arbitrage Pricing Theory”. Reading the paper itself, however, there is an important yet easy-to-overlook caveat to their reasoning: they assume that the weight matrix (f in their notation) is semi-orthonormal, i.e. f @ f.T = I, and that assumption is required in the theorem they use to show rotational invariance (end of pg 10).

Now, from what I remember of my university linear algebra, to build an orthonormal matrix you can start with any vector, then pick the second from the orthogonal complement of the first, the third from the orthogonal complement of the first two, and so on, i.e. you have one degree of freedom less on each pick, which removes k(k-1)/2 degrees of freedom in total. Forcing the matrix constructed that way to also be lower triangular removes an additional k(k-1)/2 degrees. So it seems to me that instead of fixing k(k-1)/2 values for identification, one should actually fix twice that, i.e. k(k-1).
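Just to make the invariance concrete (my own toy check, not from the paper): post-multiplying the loadings by any orthonormal Q leaves W @ W.T, and hence the implied covariance, unchanged:

import numpy as np

rng = np.random.default_rng(42)
d, k = 10, 4

W = rng.normal(size=(d, k))                    # arbitrary loading matrix
Q, _ = np.linalg.qr(rng.normal(size=(k, k)))   # random orthonormal k x k matrix

# The likelihood only sees W through W @ W.T (+ noise), so W and W @ Q
# are indistinguishable:
print(np.allclose(W @ W.T, (W @ Q) @ (W @ Q).T))   # True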

I tested the library example with the following W construction:

def makeW2(d, k, dim_names):
    # Positive entries for the diagonal of the top k x k block
    z = pm.HalfCauchy("W_z", 1.0, dims="latent_columns")
    # Unconstrained (d - k) x k block for the remaining rows
    L = pm.Cauchy("W_b", 0.0, 1.0, shape=(d - k, k), dims=("packed_dim", dim_names[1]))
    # Stack the diagonal block on top of the free block to get the d x k loadings
    W = pm.Deterministic("W", pt.concatenate([pt.diag(z), L], axis=0), dims=dim_names)
    print(W.shape.eval(), L.shape.eval())
    return W

and that seems to work for k=4 and k=5, meaning the chains converge and the average rhat is 1.01. So the logic seems sound.
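For reference, this is roughly how I wire it into the model (a minimal sketch with fake data; apart from the names hard-coded in makeW2, the coords and dimensions here are just placeholders, not the exact notebook code):

import numpy as np
import pymc as pm
import pytensor.tensor as pt

d, k, n = 10, 4, 300                                # observed dims, latent dims, samples
Y = np.random.default_rng(0).normal(size=(n, d))    # placeholder data

coords = {
    "rows": np.arange(d),
    "latent_columns": np.arange(k),
    "packed_dim": np.arange(d - k),
    "obs": np.arange(n),
}

with pm.Model(coords=coords) as model:
    W = makeW2(d, k, ("rows", "latent_columns"))
    F = pm.Normal("F", 0.0, 1.0, dims=("latent_columns", "obs"))
    sigma = pm.HalfNormal("sigma", 1.0)
    pm.Normal("X", mu=(W @ F).T, sigma=sigma, observed=Y, dims=("obs", "rows"))
    idata = pm.sample()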

Granted, starting with a diagonal block in the weights, while identifying the model, probably also skews the probability space by forcing the first 3 questions onto separate factors, so it is definitely not ideal. The question I have is: GZ96 seem to assume you can just sample orthonormal matrices, without (at least seemingly) elaborating on how. Is there something obvious I am missing?


CC @aseyboldt maybe he has some ideas

From the footnote on page 10 of the linked paper, it looks like they generate weight matrices as follows:

    # This initial distribution is dealer's choice I think
    beta = pm.Normal('beta', dims=['latent_columns', 'latent_columns'])
    # beta @ beta.T is symmetric positive definite, so it has a Cholesky factor L
    beta_posdef = beta @ beta.T
    L = pt.linalg.cholesky(beta_posdef)
    # P = L^-1 @ beta is orthonormal: P @ P.T = L^-1 @ (beta @ beta.T) @ L^-T = I
    P = pt.linalg.solve(L, beta)
    # beta @ P.T = L @ P @ P.T = L, i.e. lower triangular with positive diagonal
    beta_K = beta @ P.T

beta_K will always be lower triangular with a positive diagonal. I didn’t see them mention orthonormality in their paper though. Either way, their setup differs somewhat because their observations are a vector of asset returns, and the factor matrix is observed data. The second point is not really important, but the first is. To adapt this method to the probabilistic PCA case, you’ll have to think about how to handle a non-square beta matrix. Something with SVD probably? That would get you two orthonormal square matrices from the weights, which seems to be a step in the right direction.
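Roughly what I have in mind with the SVD (just a quick numpy sketch, not wired into the model):

    import numpy as np

    rng = np.random.default_rng(0)
    d, k = 10, 4
    W = rng.normal(size=(d, k))   # stand-in for the non-square weight matrix

    # Thin SVD: U is d x k with orthonormal columns, Vt is k x k orthonormal
    # (full_matrices=True would give a square d x d U instead)
    U, s, Vt = np.linalg.svd(W, full_matrices=False)

    print(np.allclose(U.T @ U, np.eye(k)))        # True
    print(np.allclose(Vt @ Vt.T, np.eye(k)))      # True
    print(np.allclose(U @ np.diag(s) @ Vt, W))    # True, reconstructs W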


Thank you for the reply @jessegrabowski. But I think you skimmed the paper and missed that the pg 10 footnote is about the uniqueness of the orthonormal transformation P, not about generating W.

Which is not to say I read the paper properly either. On re-reading it, I realised the only assumptions it makes on W (f) are E[W] = 0 and E[W @ W.T] = I, and the reason he claims identification needs to be done w.r.t. orthonormal rotations is that orthonormal matrices are exactly the class of matrices that preserve E[W @ W.T] under W' = P @ W: E[(P @ W) @ (P @ W).T] = P @ E[W @ W.T] @ P.T = P @ P.T = I. So it checks out.

I re-checked my code and realised that the non-identification happened when k_true < k, i.e. when the model was forced to find a factor in the noise, in which case it makes sense that it found multiple equally good options. That is not something you are likely to see in real data, where even the lower-order components tend to have distinct enough eigenvalues up to pretty large values.

One thing I still don’t get is:

Order diagonal (no more permutation invariance)

Because permutation matrices are also orthonormal, so this seems completely unnecessary. Am I correct in this?

There are two approaches to this that can be easily automated with linear algebra packages:

  1. Take a standard normal matrix, do a QR decomposition, and take the Q matrix, which will be uniformly distributed over orthonormal matrices. Because the space of orthonormal matrices is constrained, this will be a proper uniform distribution (see the sketch after this list).

  2. Take a skew-symmetric matrix and apply the matrix exponential. There I’m not sure what prior to put on the skew-symmetric matrix so that the resulting orthonormal matrix is uniform.

Either way, you have to take care of the determinant (-1 gives you reflections, 1 gives you rotations, if I’m remembering correctly).
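For example, in numpy/scipy (fixing the signs using the diagonal of R is the usual convention that makes the QR result exactly uniform):

    import numpy as np
    from scipy.linalg import expm

    rng = np.random.default_rng(123)
    k = 4

    # 1. QR of a standard normal matrix, with a column-wise sign fix from diag(R)
    A = rng.normal(size=(k, k))
    Q, R = np.linalg.qr(A)
    Q = Q * np.sign(np.diag(R))

    # 2. Matrix exponential of a skew-symmetric matrix gives a rotation
    #    (determinant +1); the induced distribution depends on the prior on S
    X = rng.normal(size=(k, k))
    S = X - X.T                  # skew-symmetric
    Q2 = expm(S)

    for M in (Q, Q2):
        print(np.allclose(M @ M.T, np.eye(k)), np.linalg.det(M))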


That first point is a really cool property, @bob-carpenter! Any chance you can point me to a reference for it?

See, e.g.,

We have QR decomposition in pytensor but gradients aren’t implemented. This paper gives the procedure for LQ, which is just the transpose of QR. @Velochy, if it’s something you want, could you open an issue about it?

I don’t think so, at least not at this time. As you can see, the problem got solved without the need to actually create orthogonal matrices, so right now I don’t need the QR decomposition.

Thank you again for your help though!