Using LKJCorr together with MvNormal

We are talking about a linear transformation here between an LKJ corr matrix and an LKJ cov matrix (multiplying both rows and columns by the sd vector). Therefore, if we took _LKJCholeskyCovRV_logp and just removed the logp_sd from the final sum, shouldn't it give us the logp for the Cholesky factorization of the pure correlation matrix?
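
For reference, the linear relationship I mean, and how the Cholesky factors relate (a small numpy sketch; the numbers are arbitrary):

import numpy as np

sd = np.array([1.0, 2.0, 0.5])                      # standard deviations
corr = np.array([[1.0, 0.3, 0.1],
                 [0.3, 1.0, -0.2],
                 [0.1, -0.2, 1.0]])                  # a valid correlation matrix
L_corr = np.linalg.cholesky(corr)

# Scaling the rows of the correlation Cholesky factor by sd gives the covariance
# Cholesky factor, i.e. cov = diag(sd) @ corr @ diag(sd)
L_cov = sd[:, None] * L_corr
cov = np.diag(sd) @ corr @ np.diag(sd)
assert np.allclose(L_cov @ L_cov.T, cov)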

In which case I would structure things so that LKJCorrRV returns the Cholesky factorization and LKJCorr lets you choose whether you want the matrix or its Cholesky factorization.

Is the LKJ cov matrix the LKJCholeskyCov one? If that is the case, you should be able to invert the linear transformation deterministically after you get draws from LKJCholeskyCov. My guess is that you cannot.

If you were referring to another object, ignore this message (and let me know which one it is).

This is in effect the workaround I’m currently using, actually.

But I was thinking about ENH: LKJCorr should return matrix rather than vector · Issue #7074 · pymc-devs/pymc · GitHub and how to get LKJCorr to actually output either the Cholesky factorization or the correlation matrix, without needing to sample sd_dist in the first place.

Anyways, I unfortunately think I'm underqualified to actually do this, as I can only very, very barely keep up with what you are saying. So I think I'm tapping out.

Alternative suggestion then: a thin wrapper around LKJCorr that behaves like the wrapper around LKJCholeskyCov and basically applies @Velochy 's original helper function to the outputs of LKJCorr if chol=True. I think just handling the packing/unpacking, and doing the "unnecessary" Cholesky factorization for users who want a non-centered parameterization, would already be quite nice.

This would avoid having to worry about any logp stuff.

Problem is, my original helper does not provide the cholesky factorization, just the full matrix.

My personal workaround is to sample from LKJCholeskyCov and then scale it back by its own sd vector, i.e.

chol, corr, sd = LKJCholeskyCov(...., compute_corr=True)
corr_chol = (chol.T/sd).T

But this still requires specifying a distribution for sd_dist and sampling from it spuriously.

Something like this doesn’t work?

corr_flat = pm.LKJCorr(...)
corr = unpacking_helper(corr_flat)
chol_corr = pt.linalg.cholesky(corr)

No, it does :smiley: I was actually not aware of pt.linalg, as Google did not return any matches for “pytensor cholesky”. Thank you @jessegrabowski

In this case, you are right, the thin wrapper seems like a good plan, and one I can probably manage to implement and do a PR for. I’ll at least give it a try :slight_smile:

Yeah, Google doesn't index our docs very well, sadly. But things in PyTensor are organized to follow numpy/scipy as closely as possible, so you can always try to look in the corresponding place. The API docs are here, but they need some love :frowning:

Tag me on the PR and I’ll be happy to give it a review

Turns out there might still be some issues though.

I tried running

    c_arr = pm.LKJCorr(name, eta=eta, n=n)
    tri = pt.zeros((n, n))
    tri = pt.subtensor.set_subtensor(tri[np.triu_indices(n, 1)], c_arr)
    cmat = tri + tri.T + pt.diag(np.ones(n))
    corr_chol = pt.linalg.cholesky(cmat)

in the context of a slightly more complex model and got the following error:

LinAlgError: 7-th leading minor of the array is not positive definite
Apply node that caused the error: Cholesky{lower=True, destructive=False, on_error='raise'}(Add.0)
Toposort index: 226
Inputs types: [TensorType(float64, shape=(None, None))]
Inputs shapes: [(12, 12)]
Inputs strides: [(96, 8)]
Inputs values: ['not shown']
Outputs clients: [[Dot22(Cholesky{lower=True, destructive=False, on_error='raise'}.0, Composite{...}.0), Dot22Scalar(Cholesky{lower=True, destructive=False, on_error='raise'}.0, Composite{...}.0, 3.0), Dot22Scalar(Cholesky{lower=True, destructive=False, on_error='raise'}.0, Composite{...}.0, 3.0)]]

Do you guys have any thoughts on how this might have come about? Because looking at the matrix with .eval(), it looks like a perfectly fine correlation matrix.

Also, in the same situation

chol, corr, sd = LKJCholeskyCov(...., compute_corr=True)
corr_chol = (chol.T/sd).T

works just fine. Reverted to that for the time being.

Can you post a minimal example that reproduces the bug? It samples fine if I am just trying to estimate the correlations, like this:

import pymc as pm
import pytensor.tensor as pt
import numpy as np
import arviz as az

# Generate Data
n = 5
eta = 1

corr_values = pm.LKJCorr.dist(n=n, eta=eta)
corr = pt.zeros((n, n))
corr = pt.set_subtensor(corr[np.triu_indices(n, k=1)], corr_values)
corr = corr + corr.T + pt.identity_like(corr)

chol_corr = pt.linalg.cholesky(corr)

chol = pm.draw(chol_corr, 1)
y = pm.draw(pm.MvNormal.dist(mu=0, chol=chol), 100)

# PyMC Model
with pm.Model() as m:
    
    corr_values = pm.LKJCorr('corr_values', n=n, eta=eta)
    corr = pt.zeros((n, n))
    corr = pt.set_subtensor(corr[np.triu_indices(n, k=1)], corr_values)
    corr = corr + corr.T + pt.identity_like(corr)

    chol_corr = pt.linalg.cholesky(corr)
    y_hat = pm.MvNormal('y_hat', mu=0, chol=chol_corr, observed=y)
    idata = pm.sample()

Without seeing a model, my hypothesis is that the sampler is proposing illegal values that should just be discarded because they have -np.inf logp, but are causing a computation problem for you before that can happen (see here for a similar situation in a totally unrelated context). If I’m right, the fix is just to pass on_error='nan' to pt.linalg.cholesky. That will prevent it from erroring out if a proposal is not PSD, and the proposal will be rejected anyway so it’s no harm no foul.
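
Concretely, the change would just be the on_error keyword; a standalone sketch with the same construction as above (not your full model):

import numpy as np
import pymc as pm
import pytensor.tensor as pt

n = 10
corr_values = pm.LKJCorr.dist(n=n, eta=1)
corr = pt.zeros((n, n))
corr = pt.set_subtensor(corr[np.triu_indices(n, k=1)], corr_values)
corr = corr + corr.T + pt.identity_like(corr)

# on_error="nan" returns NaN instead of raising when the matrix is not positive
# definite, so the proposal can be rejected via -inf logp instead of crashing.
chol_corr = pt.linalg.cholesky(corr, on_error="nan")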

It seems the minimal reproducible example is just the code you sent me, but with a higher n (say 10 or 20). I actually got the error the first time I ran it as-is (with n=5); with smaller n it works most of the time, but the error rate clearly increases as n does.

I also tried setting on_error='nan', and while sampling then runs, it is clear from looking at the trace that it fails miserably, so it is not really a fix; it just sweeps the error under the rug.

Do you have any more ideas as to what it might be? I'm pretty stumped.

Having thought about this for a bit:

Isn't the problem that, in backwards sampling, the logp for LKJCorr in no way selects for positive semi-definiteness? The interval transform only maps all the values into the [-1, 1] range, and the LKJCorr logp then applies the determinant-based density, which already assumes a positive semi-definite matrix; neither of these really nudges the numbers towards forming a positive semi-definite matrix.

This is consistent with failures increasing as n gets larger, as PSD matrices become increasingly unlikely. As for why the probability does not drop off a cliff after n=2 or 3, my guess is that the interval transform steers away from values close to -1 and 1 in favor of more middling values, which lets the ones on the diagonal dominate for a while.
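
A rough illustration of that drop-off (a throwaway helper, drawing the packed entries uniformly from [-1, 1] as a crude stand-in for what the transform actually proposes):

import numpy as np

rng = np.random.default_rng(0)

def psd_rate(n, trials=2000):
    # Fraction of symmetric unit-diagonal matrices with uniform [-1, 1]
    # off-diagonal entries that turn out to be positive definite
    hits = 0
    for _ in range(trials):
        vec = rng.uniform(-1, 1, size=n * (n - 1) // 2)
        corr = np.zeros((n, n))
        corr[np.triu_indices(n, k=1)] = vec
        corr = corr + corr.T + np.eye(n)
        hits += np.linalg.eigvalsh(corr).min() > 0
    return hits / trials

for n in (3, 5, 8, 12):
    print(n, psd_rate(n))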

This is also consistent with sampling failing if, instead of chol=chol_corr, you use cov=corr in the MvNormal parametrization. Same reason: it ends up proposing a non-PSD matrix. In this case the error message contains the initial value, though, so it is easy to check:

vec = [-0.39718353,  0.94680841, -0.41832789,  0.67318716,  0.74000446,
       -0.91103117,  0.17496013,  0.05499186, -0.60507651, -0.37989516,
        0.00281214, -0.83842367,  0.04520434,  0.51635372, -0.26586628,
        0.96277075,  0.66443148, -0.46217671,  0.77692162, -0.20452818,
        0.99873618, -0.23998811,  0.65304258,  0.98352337,  0.64847332,
       -0.08343583,  0.52405396,  0.15585091,  0.31938468,  0.90564176,
       -0.35090953, -0.43284892, -0.21282992, -0.48646797, -0.50581992,
       -0.06065079, -0.57660361,  0.90999947,  0.07498183, -0.01768012,
       -0.54075488,  0.8193952 , -0.13612187,  0.29677601,  0.07794663]
n=10
corr = np.zeros((n, n))
corr[np.triu_indices(n, k=1)] = vec
corr = corr + corr.T + np.eye(n)
print(np.linalg.eig(corr)[1].min())

indeed returns -0.7785569784890056, i.e. a negative eigenvalue => not PSD.

So it looks like LKJCorr as it currently stands is fundamentally broken, i.e. it does not sample from the LKJ distribution as it does not respect the PSD requirement.

I think the way to fix this would be to move to sampling the Cholesky decomposition (as in LKJCholeskyCorr), since its existence already implies the matrix is PSD, hence selecting for it.

Although - at that point - maybe it makes sense to just rewrite LKJCorr as a thin wrapper around LKJCholeskyCov that just returns the ‘corr’ part?

Thoughts @ricardoV94 @jessegrabowski ?

I’m not sure about this part. In the end you will need the logp, and for sampling efficiency, a transform that unconstrains LKJ Corr matrices. It’s unclear to me how much logic would be shared between these methods and the ones LKJCholeskyCov uses.

Also, LKJCholeskyCov has more parameters than its Corr counterpart.

We may need something like this: tfp.bijectors.CorrelationCholesky  |  TensorFlow Probability
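
Roughly, that bijector maps an unconstrained vector straight to the Cholesky factor of a valid correlation matrix; a small sketch (assuming TFP with TF2 eager mode):

import numpy as np
import tensorflow_probability as tfp

# Maps an unconstrained vector of length n*(n-1)/2 to the lower-triangular
# Cholesky factor of a correlation matrix (rows have unit norm)
bij = tfp.bijectors.CorrelationCholesky()
chol = bij.forward([2.0, -5.0, 1.0]).numpy()  # 3x3 lower triangular
corr = chol @ chol.T
print(np.diag(corr))  # ones on the diagonal: a valid correlation matrix by construction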

I was quite surprised that we were using just an IntervalTransform, so much so that I missed the regression described in this issue when we forbade univariate transforms on multivariate RVs.

Apparently it was indeed not sufficient?

Edit: the Stan pages are pretty informative, as usual.

We do have a check in the logp that should reject non-PSD values. The problem is that the logp of MvNormal will raise before we get the chance to reject that proposal, unless you set that on_error flag.

Actually, our check is just to try a Cholesky and see if it raises :D. Maybe we should use the eigenvalues as a check? I guess that's slower, or maybe not sufficient?
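
For reference, the Cholesky-based check amounts to something like this (a numpy sketch with a throwaway name, not the actual PyMC internals):

import numpy as np

def looks_psd(a):
    # Positive definite iff the Cholesky factorization succeeds
    try:
        np.linalg.cholesky(a)
        return True
    except np.linalg.LinAlgError:
        return False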

Even if we reject non-PSD values, we should try to have a transform that guarantees invalid values are never proposed, or sampling will suck, like you noticed.

The first return value (index 0) of np.linalg.eig is the eigenvalues; you are checking the eigenvectors here. Better to just use np.linalg.eigvals to avoid this problem. But your point is still correct, and there are negative eigenvalues.
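
Reusing corr from your snippet above, the check could look like:

import numpy as np

# np.linalg.eig returns (eigenvalues, eigenvectors); index [0] is the eigenvalues.
# eigvalsh is simpler still for a symmetric matrix like corr.
print(np.linalg.eigvalsh(corr).min())  # negative, so corr is indeed not PSD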

I believe the problem is in the transformation, because when you just forward sample from the model (e.g. with pm.sample_prior_predictive), all the correlation matrices will be PSD.
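
A quick sketch to confirm that, reusing the packing logic from earlier in the thread (to_matrix is a throwaway helper):

import numpy as np
import pymc as pm

n = 10
draws = pm.draw(pm.LKJCorr.dist(n=n, eta=1), draws=200)  # packed upper triangles

def to_matrix(vec, n):
    corr = np.zeros((n, n))
    corr[np.triu_indices(n, k=1)] = vec
    return corr + corr.T + np.eye(n)

# Every forward draw should correspond to a PSD correlation matrix
psd = [np.linalg.eigvalsh(to_matrix(v, n)).min() > -1e-8 for v in draws]
print(sum(psd), "of", len(psd), "forward draws are PSD")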

For now I think it's fine how we do it. In the future, if we have special rewrites for eigenvalues on different types of matrices, it might be faster to check the eigenvalues. Eigenvalues can also be imaginary, and right now the imaginary component of the return value gets set to 0 during casting to float64, so we'd miss negative eigenvalues of the form 0-xj.

I’m taking an engineering / product management approach here.

LKJCorr seems to have been broken for a while without anyone noticing, presumably because people using it have only used small matrices (n=2 or 3 is all I could find by searching the forums).

Now. We could spend time implementing the logp. Probably at least a day of my time, probably less for either of you, but I’m still guessing hours.

Or - we could just rewrite LKJCorr as a thin wrapper around LKJCholeskyCov by providing it with sd_dist=pm.LogNormal.dist() (or any other well-behaved distribution, really) and just marginalize it out by returning corr_chol = (chol.T/sd).T (or the returned corr directly).
We would get correct behavior. We would not significantly affect existing users (as the n=2 and n=3 cases are so small that adding extra variables to sample over has a very marginal overhead). And most notably - this would take a fraction of the time to accomplish.

Anyways - you guys are core team, so you are better positioned to make such decisions. From my perspective, though, it seems like an easy win as opposed to a hard but ideal solution to a problem no one except me is having.

To maybe better explain what I mean, I took the 15 minutes required to write that thin wrapper.
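
A minimal sketch along the lines described above (the helper name, the Deterministic wrapping, and defaulting sd_dist to pm.LogNormal.dist() are placeholder choices):

import pymc as pm

def lkj_corr_chol(name, n, eta):
    # Sample the covariance Cholesky factor with a throwaway sd_dist...
    chol, corr, sd = pm.LKJCholeskyCov(
        f"{name}_cov", n=n, eta=eta, sd_dist=pm.LogNormal.dist(), compute_corr=True
    )
    # ...then rescale each row by 1/sd to recover the correlation Cholesky factor
    corr_chol = (chol.T / sd).T
    return pm.Deterministic(name, corr_chol)

Inside a model, something like corr_chol = lkj_corr_chol("corr", n=10, eta=2) could then be passed directly as chol= to MvNormal.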

I’m not saying it is the ideal solution. But it might be the pragmatic one.

You are discarding the variances, right? That's still inefficient because you are sampling more parameters than you actually need. I don't know about this specific case, but that can sometimes introduce bad indeterminacies in the posterior (unless the extra parameters are completely orthogonal).

It may be less inefficient than the not-completely-constrained LKJCorr we have right now, but it's not great either.

To be pedantic, LKJCorr is not broken; everything it does seems correct AFAICT. We just don't have a transformation that allows NUTS to sample in a completely unconstrained space, which means that for some models you will get divergences.

That’s something we definitely should address.

I believe the TFP implementation has everything we need; I'm looking at it now. @Velochy could you open an issue on GitHub explaining that LKJCorr breaks during sampling?