Following the discussions here:
I have realized that on Ubuntu, sampling a model that uses multivariate Gaussians is incredibly slow with pymc 5.9.1. I am aware that there is already an open issue on GitHub, but I have not seen a discussion of the effect of the PyMC version on this. For me the slowdown only starts with 5.9: upgrading from 5.8.0 to 5.9.0 roughly doubles the sampling time, and going from 5.9.0 to 5.9.1 makes it unreasonable, and this happens only for MvNormal. I will dump some information about this below. I am on Ubuntu 20.04.
1- The code I am using is due to @jessegrabowski in the link above:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import pymc as pm
import numpy as np
import matplotlib.pyplot as plt
import pytensor.tensor as pt
import arviz as az
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

n_clusters = 5
data, labels = make_blobs(n_samples=1000, centers=n_clusters, random_state=10)
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
plt.scatter(*scaled_data.T, c=labels)

coords = {"cluster": np.arange(n_clusters),
          "obs_id": np.arange(data.shape[0]),
          "coord": ['x', 'y']}

# When you use the ordered transform, the initial values need to be
# monotonically increasing
sorted_initvals = np.linspace(-2, 2, 5)

with pm.Model(coords=coords) as model:
    # Use alpha > 1 to prevent the model from finding sparse solutions -- we know
    # all 5 clusters should be represented in the posterior
    w = pm.Dirichlet("w", np.full(n_clusters, 10), dims=['cluster'])

    # Mean component
    x_coord = pm.Normal("x_coord", sigma=1, dims=["cluster"],
                        transform=pm.distributions.transforms.ordered,
                        initval=sorted_initvals)
    y_coord = pm.Normal('y_coord', sigma=1, dims=['cluster'])
    centroids = pm.Deterministic('centroids',
                                 pt.concatenate([x_coord[None], y_coord[None]]).T,
                                 dims=['cluster', 'coord'])

    # Diagonal covariances. Could also model the full covariance matrix, but I didn't try.
    sigma = pm.HalfNormal('sigma', sigma=1, dims=['cluster', 'coord'])
    covs = [pt.diag(sigma[i]) for i in range(n_clusters)]

    # Define the mixture
    components = [pm.MvNormal.dist(mu=centroids[i], cov=covs[i]) for i in range(n_clusters)]
    y_hat = pm.Mixture("y_hat",
                       w,
                       components,
                       observed=scaled_data,
                       dims=["obs_id", 'coord'])

    idata = pm.sample(init='advi+adapt_diag')
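As an extra data point, it may help to time the compiled log-probability and its gradient directly, outside the sampler, to see whether the regression is in the compiled graph itself. This is only a minimal sketch of my own, assuming the model defined above is in scope; compile_logp/compile_dlogp should be available across the versions tested here:

import time

with model:
    point = model.initial_point()     # dict of initial values for all free variables
    logp_fn = model.compile_logp()    # compiled joint log-probability
    dlogp_fn = model.compile_dlogp()  # compiled gradient of the log-probability

for name, fn in [("logp", logp_fn), ("dlogp", dlogp_fn)]:
    start = time.perf_counter()
    for _ in range(100):
        fn(point)
    elapsed = (time.perf_counter() - start) / 100
    print(f"{name}: {elapsed * 1e3:.2f} ms per evaluation")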
2- In all my tests, I have installed a fresh environment with the following command:
conda create -n pymc_env_X_Y_Z pymc=X.Y.Z scikit-learn spyder-kernels=2.4
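For completeness, the versions each environment actually resolved to (including the pinned PyTensor) can be double-checked from inside it; a small sketch:

import pymc
import pytensor
import numpy

print("pymc    :", pymc.__version__)
print("pytensor:", pytensor.__version__)
print("numpy   :", numpy.__version__)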
3- I have tested various versions, from 5.2 up to 5.9.1. In 5.9.1 both the advi initialization and the sampling are unreasonably slow, so for that version I had to test sampling with just pm.sample() (see the sketch below) rather than the call above.
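Concretely, the 5.9.1 fallback is just the default call, reusing the model defined above (with no init argument, PyMC falls back to jitter+adapt_diag, as the 5.9.1 log below also shows):

with model:
    # default initialization (jitter+adapt_diag), no advi step
    idata = pm.sample()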
pymc 5.2
Auto-assigning NUTS sampler...
Initializing NUTS using advi+adapt_diag...
|█████---------------| 26.98% [53965/200000 02:20<06:21 Average Loss = 1,062.9]
Convergence achieved at 54000
Interrupted at 53,999 [26%]: Average Loss = 1,915.4
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [w, x_coord, y_coord, sigma]
|████████████████████| 100.00% [8000/8000 04:37<00:00 Sampling 4 chains, 0 divergences]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 277 seconds.
pymc 5.5
Auto-assigning NUTS sampler...
Initializing NUTS using advi+adapt_diag...
|███-----------------| 16.59% [33176/200000 01:24<07:05 Average Loss = 1,494.8]
Convergence achieved at 33200
Interrupted at 33,199 [16%]: Average Loss = 2,372.7
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [w, x_coord, y_coord, sigma]
|████████████████████| 100.00% [8000/8000 05:40<00:00 Sampling 4 chains, 0 divergences]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 340 seconds.
pymc 5.6
Auto-assigning NUTS sampler...
Initializing NUTS using advi+adapt_diag...
|█████---------------| 26.33% [52663/200000 02:41<07:31 Average Loss = 1,062.3]
Convergence achieved at 52700
Interrupted at 52,699 [26%]: Average Loss = 1,928
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [w, x_coord, y_coord, sigma]
|████████████████████| 100.00% [8000/8000 04:56<00:00 Sampling 4 chains, 0 divergences]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 297 seconds.
pymc 5.7
Auto-assigning NUTS sampler...
Initializing NUTS using advi+adapt_diag...
|█████---------------| 25.73% [51468/200000 02:39<07:39 Average Loss = 1,062.4]
Convergence achieved at 51500
Interrupted at 51,499 [25%]: Average Loss = 1,947.9
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [w, x_coord, y_coord, sigma]
|████████████████████| 100.00% [8000/8000 05:03<00:00 Sampling 4 chains, 0 divergences]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 303 seconds.
pymc 5.8
Auto-assigning NUTS sampler...
Initializing NUTS using advi+adapt_diag...
|█████---------------| 25.64% [51285/200000 02:03<05:57 Average Loss = 1,062.6]
Convergence achieved at 51300
Interrupted at 51,299 [25%]: Average Loss = 1,948.4
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [w, x_coord, y_coord, sigma]
|████████████████████| 100.00% [8000/8000 05:10<00:00 Sampling 4 chains, 0 divergences]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 310 seconds.
pymc 5.9
Auto-assigning NUTS sampler...
Initializing NUTS using advi+adapt_diag...
|█████---------------| 29.93% [59855/200000 02:57<06:54 Average Loss = 1,038.5]
Convergence achieved at 59900
Interrupted at 59,899 [29%]: Average Loss = 1,821.7
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [w, x_coord, y_coord, sigma]
|████████████████████| 100.00% [8000/8000 11:06<00:00 Sampling 4 chains, 0 divergences]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 666 seconds.
pymc 5.9.1
Auto-assigning NUTS sampler...
Initializing NUTS using advi+adapt_diag...
|--------------------| 0.04% [80/200000 00:16<11:45:19 Average Loss = 3,777.7]
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [w, x_coord, y_coord, sigma]
|--------------| 0.42% [34/8000 00:31<2:01:45 Sampling 4 chains, 0 divergences]
4- Here are the BLAS results (from python -m pytensor.misc.check_blas) for the 5_9_1 and 5_8 environments:
5_9_1:
Some results that you can compare against. They were 10 executions
of gemm in float64 with matrices of shape 2000x2000 (M=N=K=2000).
All memory layout was in C order.
CPU tested: Xeon E5345(2.33Ghz, 8M L2 cache, 1333Mhz FSB),
Xeon E5430(2.66Ghz, 12M L2 cache, 1333Mhz FSB),
Xeon E5450(3Ghz, 12M L2 cache, 1333Mhz FSB),
Xeon X5560(2.8Ghz, 12M L2 cache, hyper-threads?)
Core 2 E8500, Core i7 930(2.8Ghz, hyper-threads enabled),
Core i7 950(3.07GHz, hyper-threads enabled)
Xeon X5550(2.67GHz, 8M l2 cache?, hyper-threads enabled)
Libraries tested:
* numpy with ATLAS from distribution (FC9) package (1 thread)
* manually compiled numpy and ATLAS with 2 threads
* goto 1.26 with 1, 2, 4 and 8 threads
* goto2 1.13 compiled with multiple threads enabled
Xeon Xeon Xeon Core2 i7 i7 Xeon Xeon
lib/nb threads E5345 E5430 E5450 E8500 930 950 X5560 X5550
numpy 1.3.0 blas 775.92s
numpy_FC9_atlas/1 39.2s 35.0s 30.7s 29.6s 21.5s 19.60s
goto/1 18.7s 16.1s 14.2s 13.7s 16.1s 14.67s
numpy_MAN_atlas/2 12.0s 11.6s 10.2s 9.2s 9.0s
goto/2 9.5s 8.1s 7.1s 7.3s 8.1s 7.4s
goto/4 4.9s 4.4s 3.7s - 4.1s 3.8s
goto/8 2.7s 2.4s 2.0s - 4.1s 3.8s
openblas/1 14.04s
openblas/2 7.16s
openblas/4 3.71s
openblas/8 3.70s
mkl 11.0.083/1 7.97s
mkl 10.2.2.025/1 13.7s
mkl 10.2.2.025/2 7.6s
mkl 10.2.2.025/4 4.0s
mkl 10.2.2.025/8 2.0s
goto2 1.13/1 14.37s
goto2 1.13/2 7.26s
goto2 1.13/4 3.70s
goto2 1.13/8 1.94s
goto2 1.13/16 3.16s
Test time in float32. There were 10 executions of gemm in
float32 with matrices of shape 5000x5000 (M=N=K=5000)
All memory layout was in C order.
cuda version 8.0 7.5 7.0
gpu
M40 0.45s 0.47s
k80 0.92s 0.96s
K6000/NOECC 0.71s 0.69s
P6000/NOECC 0.25s
Titan X (Pascal) 0.28s
GTX Titan X 0.45s 0.45s 0.47s
GTX Titan Black 0.66s 0.64s 0.64s
GTX 1080 0.35s
GTX 980 Ti 0.41s
GTX 970 0.66s
GTX 680 1.57s
GTX 750 Ti 2.01s 2.01s
GTX 750 2.46s 2.37s
GTX 660 2.32s 2.32s
GTX 580 2.42s
GTX 480 2.87s
TX1 7.6s (float32 storage and computation)
GT 610 33.5s
Some PyTensor flags:
blas__ldflags= -L/home/avicenna/miniconda3/envs/pymc_env_5_9_1/lib -lcblas -lblas -lcblas -lblas -Wl,-rpath,/home/avicenna/miniconda3/envs/pymc_env_5_9_1/lib
compiledir= /home/avicenna/.pytensor/compiledir_Linux-5.15--generic-x86_64-with-glibc2.31-x86_64-3.11.6-64
floatX= float64
device= cpu
Some OS information:
sys.platform= linux
sys.version= 3.11.6 | packaged by conda-forge | (main, Oct 3 2023, 10:40:35) [GCC 12.3.0]
sys.prefix= /home/avicenna/miniconda3/envs/pymc_env_5_9_1
Some environment variables:
MKL_NUM_THREADS= None
OMP_NUM_THREADS= None
GOTO_NUM_THREADS= None
Numpy config: (used when the PyTensor flag "blas__ldflags" is empty)
blas_info:
libraries = ['cblas', 'blas', 'cblas', 'blas']
library_dirs = ['/home/avicenna/miniconda3/envs/pymc_env_5_9_1/lib']
include_dirs = ['/home/avicenna/miniconda3/envs/pymc_env_5_9_1/include']
language = c
define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
libraries = ['cblas', 'blas', 'cblas', 'blas']
library_dirs = ['/home/avicenna/miniconda3/envs/pymc_env_5_9_1/lib']
include_dirs = ['/home/avicenna/miniconda3/envs/pymc_env_5_9_1/include']
language = c
lapack_info:
libraries = ['lapack', 'blas', 'lapack', 'blas']
library_dirs = ['/home/avicenna/miniconda3/envs/pymc_env_5_9_1/lib']
language = f77
lapack_opt_info:
libraries = ['lapack', 'blas', 'lapack', 'blas', 'cblas', 'blas', 'cblas', 'blas']
library_dirs = ['/home/avicenna/miniconda3/envs/pymc_env_5_9_1/lib']
language = c
define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
include_dirs = ['/home/avicenna/miniconda3/envs/pymc_env_5_9_1/include']
Supported SIMD extensions in this NumPy install:
baseline = SSE,SSE2,SSE3
found = SSSE3,SSE41,POPCNT,SSE42,AVX,F16C,FMA3,AVX2,AVX512F,AVX512CD,AVX512_SKX,AVX512_CLX,AVX512_CNL,AVX512_ICL
not found = AVX512_SPR
Numpy dot module: numpy
Numpy location: /home/avicenna/miniconda3/envs/pymc_env_5_9_1/lib/python3.11/site-packages/numpy/__init__.py
Numpy version: 1.25.2
We executed 10 calls to gemm with a and b matrices of shapes (5000, 5000) and (5000, 5000).
Total execution time: 9.80s on CPU (with direct PyTensor binding to blas).
Try to run this script a few times. Experience shows that the first time is not as fast as following calls. The difference is not big, but consistent.
5_8:
python -m pytensor.misc.check_blas
(The reference timing tables printed by check_blas are identical to those shown for 5_9_1 above and are omitted here.)
Some PyTensor flags:
blas__ldflags= -L/home/avicenna/miniconda3/envs/pymc_env_5_8/lib -lcblas -lblas -lcblas -lblas
compiledir= /home/avicenna/.pytensor/compiledir_Linux-5.15--generic-x86_64-with-glibc2.31-x86_64-3.11.6-64
floatX= float64
device= cpu
Some OS information:
sys.platform= linux
sys.version= 3.11.6 | packaged by conda-forge | (main, Oct 3 2023, 10:40:35) [GCC 12.3.0]
sys.prefix= /home/avicenna/miniconda3/envs/pymc_env_5_8
Some environment variables:
MKL_NUM_THREADS= None
OMP_NUM_THREADS= None
GOTO_NUM_THREADS= None
Numpy config: (used when the PyTensor flag "blas__ldflags" is empty)
blas_info:
libraries = ['cblas', 'blas', 'cblas', 'blas']
library_dirs = ['/home/avicenna/miniconda3/envs/pymc_env_5_8/lib']
include_dirs = ['/home/avicenna/miniconda3/envs/pymc_env_5_8/include']
language = c
define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
libraries = ['cblas', 'blas', 'cblas', 'blas']
library_dirs = ['/home/avicenna/miniconda3/envs/pymc_env_5_8/lib']
include_dirs = ['/home/avicenna/miniconda3/envs/pymc_env_5_8/include']
language = c
lapack_info:
libraries = ['lapack', 'blas', 'lapack', 'blas']
library_dirs = ['/home/avicenna/miniconda3/envs/pymc_env_5_8/lib']
language = f77
lapack_opt_info:
libraries = ['lapack', 'blas', 'lapack', 'blas', 'cblas', 'blas', 'cblas', 'blas']
library_dirs = ['/home/avicenna/miniconda3/envs/pymc_env_5_8/lib']
language = c
define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
include_dirs = ['/home/avicenna/miniconda3/envs/pymc_env_5_8/include']
Supported SIMD extensions in this NumPy install:
baseline = SSE,SSE2,SSE3
found = SSSE3,SSE41,POPCNT,SSE42,AVX,F16C,FMA3,AVX2,AVX512F,AVX512CD,AVX512_SKX,AVX512_CLX,AVX512_CNL,AVX512_ICL
not found = AVX512_SPR
Numpy dot module: numpy
Numpy location: /home/avicenna/miniconda3/envs/pymc_env_5_8/lib/python3.11/site-packages/numpy/__init__.py
Numpy version: 1.25.2
We executed 10 calls to gemm with a and b matrices of shapes (5000, 5000) and (5000, 5000).
Total execution time: 8.78s on CPU (with direct PyTensor binding to blas).
Try to run this script a few times. Experience shows that the first time is not as fast as following calls. The difference is not big, but consistent.
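Finally, in case it is useful for comparing setups, the linked BLAS can also be inspected directly from Python. A minimal sketch, which just reports the same blas__ldflags value shown in the output above along with NumPy's own build configuration:

import numpy as np
import pytensor

# The link flags PyTensor uses for its direct BLAS binding
print(pytensor.config.blas__ldflags)

# The BLAS/LAPACK configuration NumPy itself was built with
np.show_config()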