Tractable sampling of a large-n multivariate distribution?


I’m trying to fit a multivariate normal distribution to a high-dimensional dataset (~5000 dimensions). The default PyMC sampler can’t handle even a small fraction of the data: once I get to about 50 dimensions, it stalls. Manual shrinkage of the sample covariance is working pretty well for me, but I was hoping for a Bayesian upgrade. Any advice on how to make this problem tractable?
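
For context, linear shrinkage of a sample covariance can be sketched roughly as below. This is only a minimal illustration, assuming shrinkage toward a scaled identity target with a hand-picked weight; the function name `shrink_covariance`, the value of `alpha`, and the random stand-in data are all hypothetical and not taken from the actual pipeline.

```python
import numpy as np


def shrink_covariance(X, alpha=0.5):
    """Blend the sample covariance with a scaled identity (hypothetical helper).

    alpha=0 returns the raw sample covariance; alpha=1 returns the target.
    """
    X = np.asarray(X, dtype=float)
    n_features = X.shape[1]
    # Sample covariance; it is rank-deficient when n_samples < n_features.
    S = np.cov(X, rowvar=False)
    # Shrinkage target: identity scaled by the average variance.
    target = np.eye(n_features) * (np.trace(S) / n_features)
    return (1.0 - alpha) * S + alpha * target


# Random stand-in data with the same shape as the subsample in the post (23 x 50).
rng = np.random.default_rng(0)
X = rng.normal(size=(23, 50))
sigma_hat = shrink_covariance(X, alpha=0.5)
```

With fewer samples than features the raw sample covariance is singular, and mixing in the identity target is what makes the estimate invertible. The weight here is fixed by hand, so unlike a Bayesian model it carries no uncertainty about the covariance.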


import pymc as pm
import arviz as az
import pandas as pd

data = pd.read_parquet('./data/dataset.parquet').iloc[::100,::100]

n_samples, n_features = data.shape
n_samples, n_features # (23, 50)

with pm.Model() as m:
    mu = pm.Normal('mu',shape=n_features,sigma=.5)
    sd_dist = pm.TruncatedNormal.dist(lower=0., mu=5., sigma=5., shape=n_features)
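    # NOTE: this call was cut off in the post. The name 'cov' matches the sampler log below;
    # eta=2. is an assumed placeholder, not necessarily the value originally used.
    chol, _, _ = pm.LKJCholeskyCov('cov', n=n_features, eta=2., sd_dist=sd_dist)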
    vals = pm.MvNormal('vals', mu=mu, chol=chol, observed=data)
    idata = pm.sample()


Auto-assigning NUTS sampler...
2023-03-28 14:54:58,901 INFO    mcmc      : Auto-assigning NUTS sampler...                                                  
Initializing NUTS using jitter+adapt_diag...
2023-03-28 14:54:58,902 INFO    mcmc      : Initializing NUTS using jitter+adapt_diag...                                    
Multiprocess sampling (4 chains in 4 jobs)
2023-03-28 14:55:09,478 INFO    mcmc      : Multiprocess sampling (4 chains in 4 jobs)                                      
NUTS: [mu, cov]
2023-03-28 14:55:09,479 INFO    mcmc      : NUTS: [mu, cov]                                                                 
 0.00% [0/8000 00:00<? Sampling 4 chains, 0 divergences]

The sampler hangs at 0/8000. I’m running on a large machine (an AWS EC2 r5.16xlarge), but none of the cores appear to be doing any work, so the sampler seems to be completely hung.

Thanks for your help!