Getting random variables in my logp function while using pymc_extras.statespace

Hi all,

I’m new to PyMC (thanks to the community and the contributors for this amazing library!). I’m developing a BVAR model that should, in principle, work with the state-space implementation. While I’ve managed to set up most of the model correctly, I’ve hit a roadblock that I can’t seem to resolve.

When I use the following code:

state_chol, _, _ = pm.LKJCholeskyCov(
    "state_chol", eta=2, n=m,  # Only for the stationary part (2x2)
    sd_dist=sd_dist
)
state_cov_stationary = pm.Deterministic("state_cov_stationary", state_chol @ state_chol.T)

The model runs smoothly. However, I want to use the inverse Wishart distribution, which is invariant to the order of the variables. For this, I implemented:

nu = pm.ConstantData("nu", m + 4)
V = pm.ConstantData("V", np.eye(len(bvar_trends.stationary_var_names)))
Sigma_draws = inverse_wishart_dist(nu, V)

Here’s my implementation of inverse_wishart_dist:

def inverse_wishart_dist(nu: pt.TensorVariable, V: pt.TensorVariable):
    def random(nu, V, size=None, rng=None):
        if rng is None:
            rng = np.random.default_rng()
        nu_val = float(np.asarray(nu).item())
        X = stats.wishart.rvs(df=nu_val, scale=np.linalg.inv(V), random_state=rng)
        Sigma = np.linalg.inv(X)
        return (Sigma + Sigma.T) / 2 + np.eye(V.shape[0]) * 1e-6

    def logp(value, nu, V):
        sign, logdet = pt.nlinalg.slogdet(value)
        if sign <= 0:
            return -np.inf

        log_det_v = pt.nlinalg.slogdet(V)[1]
        n_stationary = V.shape[0]

        log_det_terms = (nu / 2) * log_det_v - ((nu + n_stationary + 1) / 2) * logdet
        trace_term = -0.5 * pt.nlinalg.trace(V @ pt.linalg.inv(value))

        return log_det_terms + trace_term

    return pm.CustomDist(
        "InverseWishart",
        nu,
        V,
        random=random,
        logp=logp,
        ndim_supp=2,
        ndims_params=[0, 2],
        signature="(),(m,m)->(m,m)",
        dtype="float64"
    )

However, when I try to use this custom distribution, I encounter an error:
“logp function depends on a random variable.”

I’m unsure how to resolve this issue.

Has anyone encountered a similar problem or have suggestions on how to address this? Any help would be greatly appreciated!

Thanks in advance!

Andres

There are a couple of things going on here, I think. For the proximate problem, I suspect you have a bug in your model; I'd need to see the whole thing to know. But before you post that, a couple of other things to consider:

  1. Your random method is numerically unstable and generates non-PSD samples from time to time. I suggest working with triangular decompositions:
import numpy as np
from scipy import linalg


def iw_random(nu, V, size=None, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    nu = int(nu)  # nu may arrive as a 0-d array; the normal draw needs an int
    n = V.shape[0]
    V_chol_T = linalg.cholesky(V, lower=False)
    # Build the Wishart part implicitly via the R factor of a QR decomposition,
    # then map it through the scale matrix using triangular solves only
    x = rng.normal(size=(nu, n))
    R = np.linalg.qr(x, 'r')
    T = linalg.solve_triangular(R.T, V_chol_T, lower=True).T
    return T @ T.T
  2. If you provide a signature to pm.CustomDist, you don't need to provide ndim_supp or ndims_params; these are inferred from the signature. Your call can be trimmed as sketched below.
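For instance, reusing the random and logp callables from your snippet, the reduced call would look something like this (just a sketch):

Sigma_draws = pm.CustomDist(
    "InverseWishart",
    nu,
    V,
    random=random,
    logp=logp,
    signature="(),(m,m)->(m,m)",  # ndim_supp and ndims_params are inferred from this
)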
  3. In the logp term, you can't directly use if/else control flow. sign is a symbolic tensor with no actual value, so Python doesn't know how to evaluate if sign <= 0. Instead, use pt.switch, which is the differentiable control flow operator:
def iw_logp(value, nu, V):
    sign_det_value, logdet_value = pt.linalg.slogdet(value)
    n_stationary = V.shape[0]
    _, logdet_V = pt.linalg.slogdet(V)

    log_det_terms = (nu / 2) * logdet_V - ((nu + n_stationary + 1) / 2) * logdet_value
    trace_term = -0.5 * pt.trace(pt.linalg.solve(value.T, V.T, assume_a='pos', check_finite=False).T)

    return pt.switch(pt.lt(sign_det_value, 0),
                     -np.inf,
                     log_det_terms + trace_term)
  4. I also replaced V @ inv(value) with solve(value.T, V.T).T, which is more numerically stable. One should avoid computing an inverse and then multiplying; use solve instead. Even better if you can use a specialized solver via the assume_a keyword (in this case assume_a="pos", because we require value to be PSD, since it's a draw from the IW). As an aside, if it's not PSD, solve will silently return the wrong result (that is, no checking is done). That's fine in our case because of the switch at the end. A quick check of the solve-versus-inverse identity is sketched below.
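Here is a small standalone NumPy/SciPy check (illustrative only, not part of the model code) that solve(A.T, V.T, assume_a='pos').T reproduces V @ inv(A) when A is symmetric positive definite:

import numpy as np
from scipy import linalg

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
A = A @ A.T + 4 * np.eye(4)      # symmetric positive definite test matrix
V = np.diag(rng.uniform(0.5, 2.0, size=4))

direct = V @ np.linalg.inv(A)                          # inverse, then multiply
via_solve = linalg.solve(A.T, V.T, assume_a='pos').T   # single solve, no explicit inverse
print(np.allclose(direct, via_solve))                  # True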

  5. If you don't like all the boilerplate of pm.CustomDist polluting your model, I suggest you write a thin wrapper function like this:

def InverseWishart(name, nu, V, *args, **kwargs):
    return pm.CustomDist(
        name,
        nu,
        V,
        random=iw_random,
        logp=iw_logp,
        signature="(),(m,m)->(m,m)",
        *args,
        **kwargs
    )

Then a model will look like this:

with pm.Model() as m:
    mu = pm.Normal('mu', size=(10,))
    V_diag = pm.Exponential('V_diag', 1, size=(10,))
    tau = InverseWishart('tau', nu=10, V=pt.diag(V_diag))
    eig = pm.Deterministic('tau_eig', pt.linalg.eig(tau)[0])
    obs = pm.MvNormal('obs', mu=mu, tau=tau)
    samples = pm.sample_prior_predictive()

So now on to the real problem: you can't sample from this model. HMC knows nothing about the domains of the random variables it samples from. This is why PyMC defines bijective transformations for all variables not defined on $\mathbb{R}$. In the background, we do a change of variables (with a Jacobian correction) to ensure that no matter what point the sampler picks, we end up with a valid draw from any distribution. We know no such transformation for the IW distribution, so things will go as follows:

  1. You initialize with a draw from the prior, which is valid. Call it $x_0$.
  2. The sampler takes this point and begins constructing a sample trajectory. It does this by evaluating the gradient of the logp and taking a step in that direction, roughly something like $x_1 = x_0 + \eta \nabla \mathcal{L}_x$.
  3. The gradients know nothing about what it means to be PSD, so $x_1$ is all but guaranteed not to be PSD, since the probability of randomly landing on a PSD matrix in $\mathbb{R}^n$ is zero.
  4. Your sampler will die with an error because your covariance matrix is not invertible.

This is why LKJ is the recommended prior for covariance matrices.
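As a small illustration of that transform machinery (a minimal sketch, assuming a recent PyMC version; not from the original discussion), a constrained variable like a HalfNormal is sampled on the log scale behind the scenes, which is exactly the step that has no counterpart for an inverse Wishart draw:

import pymc as pm

with pm.Model() as demo:
    sigma = pm.HalfNormal("sigma")  # constrained to (0, inf)

# NUTS actually proposes the unconstrained value variable sigma_log__;
# PyMC maps it back with exp() and adds the Jacobian term to the logp.
print(demo.value_vars)  # [sigma_log__]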

As an aside, @ricardoV94 and friends were tinkering with conjugate step samplers over in pymc-extras, see here and here. If you were hell-bent on using an IW prior, you would need something like this. A sampler could in principle exploit the conjugacy between the IW and the MvNormal observation distribution to ensure valid draws, but we don’t have anything like that currently implemented.
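For reference (not from the thread, just the standard conjugacy result): with $\Sigma \sim IW(\nu, \Psi)$ and observations $y_i \sim N(\mu, \Sigma)$, the conditional posterior is $IW(\nu + N, \Psi + \sum_i (y_i - \mu)(y_i - \mu)^\top)$, so a conjugate Gibbs step could draw valid covariance matrices directly. A minimal NumPy sketch of one such draw (the function name and setup are hypothetical, not the pymc-extras implementation):

import numpy as np
from scipy.stats import invwishart

def gibbs_draw_sigma(y, mu, nu, Psi, rng=None):
    """One conjugate Gibbs draw of Sigma | y, mu under an IW(nu, Psi) prior."""
    rng = np.random.default_rng() if rng is None else rng
    resid = y - mu                 # (N, k) residuals
    S = resid.T @ resid            # scatter matrix of the residuals
    return invwishart.rvs(df=nu + y.shape[0], scale=Psi + S, random_state=rng)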

Hi Jesse,

Thanks for your email and for pointing out the details regarding the Inverse Wishart Cholesky implementation. I appreciate you taking the time to clarify things and sharing your corrections.

I tried to implement the functions as they are available in Stan, to potentially integrate them with PyMC. However, as you rightly pointed out, there are nuances in the algorithm Stan may be using that need careful consideration.

Just in case it is of any use, I'm copying the code here. The random function works well, but sampling with NUTS does not, as you pointed out.

Regarding the project: I'm using your pymc statespace module to estimate a model similar to the one in Global Trends in Interest Rates by Giannone. My implementation works well with Minnesota priors and other priors, but I have to use the LKJ prior for the Cholesky decomposition of the covariance matrix. I wanted to use the IW for comparison.

Thanks for your help,

Andres

import numpy as np
import pytensor
import pytensor.tensor as pt
from scipy.linalg import solve_triangular
from scipy.stats import invwishart

def inv_wishart_sampler_cholesky(df, scale_mat, size=None, rng=None):
    """
    Sample from an Inverse Wishart distribution using Cholesky decomposition.
    Matches Stan's implementation.
    """
    if rng is None:
        rng = np.random.default_rng()

    # Handle scalar df passed as array
    if hasattr(df, 'shape'):
        df = float(df.flat[0])

    k = scale_mat.shape[0]
    L_S = np.linalg.cholesky(scale_mat)

    def _single_sample():
        B = np.zeros((k, k))
        for i in range(k):
            for j in range(i):
                B[i, j] = rng.normal(0, 1)
            # Match Stan's degrees of freedom calculation
            B[i, i] = np.sqrt(rng.chisquare(df - (k - 1) + i))

        L_Y = solve_triangular(B.T, L_S.T, lower=False).T
        return L_Y @ L_Y.T

    if size is None:
        return _single_sample()
    else:
        # Draw `size` independent samples
        return np.array([_single_sample() for _ in range(size)])

def inv_wishart_cholesky_logp(L_Y, nu, scale_mat):
    """
    Compute the log probability density of the Inverse Wishart distribution
    parameterized by its Cholesky decomposition.

    The scale matrix is provided directly and its Cholesky factor is computed
    internally. The change of variables from Y, a positive-definite matrix, to
    L_Y, the lower triangular Cholesky factor, follows Theorem 2.1.9 in
    Muirhead, R. J. (2005), Aspects of Multivariate Statistical Theory,
    Wiley-Interscience.

    Args:
        L_Y: Lower triangular Cholesky factor of a covariance matrix.
        nu: Scalar degrees of freedom (must be greater than k-1, where k is the
            matrix dimension).
        scale_mat: Scale matrix of the inverse Wishart distribution.

    Returns:
        The log probability density at L_Y given nu and scale_mat.

    References:
        Muirhead, R. J. (2005). Aspects of Multivariate Statistical Theory.
        Wiley-Interscience.

    Translated from the original Stan implementation.
    """
    # Compute Cholesky of scale matrix
    L_S = pt.slinalg.cholesky(scale_mat)
    k = L_S.shape[0]

    # Matches Stan's multigamma implementation
    def multigammaln(a, k=k):
        i = pt.arange(k)
        pi_term = pt.sum(pt.log(np.pi) * ((k - i - 1) / 2))
        gamma_term = pt.sum(pt.gammaln(a + (1 - i - 1) / 2))
        return pi_term + gamma_term

    # Following Stan's implementation order:

    # 1. Compute L_Y^-1 * L_S
    L_YinvL_S = pt.slinalg.solve_triangular(L_Y, L_S, lower=True)

    # 2. Compute trace term (Frobenius norm squared)
    dot_LYinvLS = pt.sum(L_YinvL_S * L_YinvL_S)

    # 3. Compute log determinant terms
    log_det_Y = 2 * pt.sum(pt.log(pt.diag(L_Y)))
    log_det_S = 2 * pt.sum(pt.log(pt.diag(L_S)))

    # 4. Compute multivariate gamma term
    gamma_term = -multigammaln(0.5 * nu, k)

    # 5. Combine all terms following Stan's order
    logp = (gamma_term                              # multivariate gamma term
            + (0.5 * nu) * log_det_S                # scale matrix term
            - 0.5 * (nu + k + 1.0) * log_det_Y      # random variable term
            - 0.5 * dot_LYinvLS                     # trace term
            - (nu * k * 0.5) * pt.log(2))           # log 2 term

    return logp

def create_compiled_logp(k=3):
    """
    Create a compiled version of the inverse Wishart log probability function.

    Parameters:

I’m not sure what Stan does; they might have a transformation that allows them to sample the IW. We definitely don’t, though. Maybe @bob-carpenter can weigh in.

Yes, Stan lets you declare parameters to be positive definite, and we have an inverse Wishart distribution. For efficiency and numerical stability, we recommend parameterizing directly with Cholesky factors as follows.

data {
  cholesky_factor_cov[10] L_S;
  real<lower=0> nu;
}
parameters {
  // ensures L_Sigma is lower triangular with a strictly positive diagonal
  cholesky_factor_cov[10] L_Sigma;
}
model {
  L_Sigma ~ inv_wishart_cholesky(nu, L_S);
}

For inspection or output, you can recover the latent positive definite matrix Sigma as follows.

generated quantities {
  cov_matrix[10] Sigma = L_Sigma * L_Sigma';
}

For details on the transforms and Jacobian adjustments, see:
