Hi,
I’m trying to fit a multivariate normal distribution to a high-dimensional dataset (~5000 dimensions). The default PyMC sampler can’t even handle a small fraction of the data: beyond about 50 dimensions it stalls. A manual sample-covariance shrinkage estimate works reasonably well for me, but I was hoping for a Bayesian upgrade. Any advice on how to make this problem tractable?
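For context, the shrinkage baseline I’m comparing against is roughly equivalent to this sketch (using scikit-learn’s LedoitWolf here for brevity; my actual implementation is hand-rolled):

import pandas as pd
from sklearn.covariance import LedoitWolf

# Ledoit-Wolf blends the sample covariance with a scaled identity,
# so the estimate stays well-conditioned even when n_samples < n_features.
X = pd.read_parquet('./data/dataset.parquet').to_numpy()
lw = LedoitWolf().fit(X)
mu_hat = lw.location_      # estimated mean vector
cov_hat = lw.covariance_   # shrunk covariance estimate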
Code:
import pymc as pm
import arviz as az
import pandas as pd
# Take every 100th row and column to get a small test case
data = pd.read_parquet('./data/dataset.parquet').iloc[::100, ::100]
n_samples, n_features = data.shape
n_samples, n_features  # (23, 50)
with pm.Model() as m:
    mu = pm.Normal('mu', sigma=0.5, shape=n_features)
    # Prior over the per-dimension standard deviations
    sd_dist = pm.TruncatedNormal.dist(lower=0.0, mu=5.0, sigma=5.0, shape=n_features)
    # Cholesky factor of the covariance, with an LKJ prior on the correlations
    chol, _, _ = pm.LKJCholeskyCov(
        'cov',
        n=n_features,
        eta=1.0,
        sd_dist=sd_dist,
        compute_corr=True,
    )
    vals = pm.MvNormal('vals', mu=mu, chol=chol, observed=data)
    idata = pm.sample()
Output:
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [mu, cov]
0.00% [0/8000 00:00<? Sampling 4 chains, 0 divergences]
The sampler hangs at 0/8000. I’m running on a large machine (an AWS EC2 r5.16xlarge), but no cores appear to be doing any work; the sampler seems completely hung.
Thanks for your help!