Speed up the model: how large does a model need to be to benefit from using a GPU?

Dear all,
I currently have a hierarchical model (I can post it if necessary) with data of dimension (20, x, 1000), where x can be up to several thousand. With a subset of data of dimension (20, 20, 5) it took 20 min, and with (20, 500, 1000) it ran for more than 1 day, so I stopped it (it could be that my model is structured inefficiently). From my limited knowledge, I am considering a GPU (I would need to apply for resources), minibatching, and VI. I have read the dev notes and understand that a GPU is not recommended when the model is small.
So my two questions are: 1) Is my model large enough to have any chance of benefiting from a GPU, or should I rather consider minibatching, VI, or improving my model? 2) Are other PPLs such as TFP fundamentally different from PyMC (in terms of memory copying or the like), so that a GPU would be worth trying in my case?
Thanks in advance.

Very difficult to say without knowing more about your model. What makes it take 20 minutes with 2k datapoints? Does the sampler complain about divergences or non-convergence?

Here is the model. Dimensions are given as trailing comments on each line. Basically it’s a Gamma GLM followed by a Poisson. The three dimensions of the observed data are tied together through two levels of pooling. An identity link function is used for now.

TD = pm.MutableData('TD', dataDict['TD'])  # (nSample, nPos, nComo)
AD = pm.MutableData('AD', dataDict['AD'])  # (nSample, nPos, nComo)

coeff_cov1 = pm.HalfNormal('coeff_cov1', sigma=1, shape=(nCov1,))  # (nCov1,)
coeff_cov2 = pm.HalfNormal('coeff_cov2', sigma=1, shape=(nCov2,))  # (nCov2,)
coeff_cov3 = pm.HalfNormal('coeff_cov3', sigma=1, shape=(nCov3,))  # (nCov3,)
coeff_cov4 = pm.HalfNormal('coeff_cov4', sigma=1, shape=(nCov4,))  # (nCov4,)
intercept = pm.HalfNormal('intercept', sigma=1, shape=1)

# Below builds the nComo combinations of the above covariates, some of which are categorical, others discrete.
# Each observed datum corresponds to one combo.
# Each element of dataDict['comb_indices'] is a list of indices, one per covariate.
# I chose this implementation because the 4 covariates have different dimensions.
gammaMu_c = pm.Deterministic("gammaMu_c", at.stack([at.math.sum(
            intercept
            + coeff_cov1[indices[0]] * cov1
            + coeff_cov2[indices[1]] * cov2
            + coeff_cov3[indices[2]] * cov3
            + coeff_cov4[indices[3]] * cov4)
            for indices in dataDict['comb_indices']
             ]))  # (nComo,)

# pooling level 1: each combo-level Mu populates to position-level Mus
gammaMu_p = pm.Normal("gammaMu_p", mu=gammaMu_c, sigma=1, shape=(nPos, nComo))  # (nPos, nComo)
# pooling level 2: each position-level Mu populates to sample-level Mus
gammaMu_s = pm.Normal("gammaMu_s", mu=gammaMu_p, sigma=1, shape=(nSample, nPos, nComo))  # (nSample, nPos, nComo)
gammaStd = pm.HalfNormal("gammaStd", sigma=5, shape=1)  # (1,)
gammaShape = pm.Deterministic('gammaShape', gammaMu_s ** 2 / gammaStd ** 2)  # (nSample, nPos, nComo)
gammaBeta = pm.Deterministic('gammaBeta', (gammaShape / gammaMu_s))  # (nSample, nPos, nComo)

AF = pm.Gamma('AF', alpha=gammaShape, beta=gammaBeta)  # (nSample, nPos, nComo)

lambda_p = pm.Deterministic('lambda_p', TD * AF)  # (nSample, nPos, nComo)
# mask out missing data using a binary mask
obs = pm.Potential('obs', pm.logp(pm.Poisson.dist(mu=lambda_p), AD) * dataDict['AD_mask'])

I use the following code to generate toy data ignoring missing data:
import numpy as np
from scipy.stats import halfnorm
rng = np.random.default_rng(123)
nSample, nPos, nCombo = 20, 20, 10
gammaMu_c_true = halfnorm.rvs(loc=2, scale=.5, size=nCombo)
gammaMu_p_true = rng.normal(loc=gammaMu_c_true, scale=.05, size=(nPos, nCombo))
gammaMu_s_true = rng.normal(loc=gammaMu_p_true, scale=.1, size=(nSample, nPos, nCombo))

# identity link: use gammaMu_s_true directly
gammaStd_true = halfnorm.rvs(loc=0, scale=2, size=(nSample, nPos, nCombo))
gammaShape_true = gammaMu_s_true ** 2 / gammaStd_true ** 2
gammaBeta_true = gammaShape_true / gammaMu_s_true
AF_true = rng.gamma(shape=gammaShape_true, scale=1 / gammaBeta_true)
TD_true = rng.integers(2000, 4000, size=(nSample, nPos, nCombo))
lambda_p_true = TD_true * AF_true
AD_true = rng.poisson(lambda_p_true)

With 2 chains and 1000 draws there were around 20 divergences.
Using nSample, nPos, nCombo = 20, 20, 10, it roughly converged OK judging from the posterior plots.

By the way, you don’t need to wrap everything in Deterministics; only do so for values you need afterwards. Skipping the rest saves memory and computation.
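
For instance, here is a minimal sketch of the tail of your model with the wrappers dropped (assuming AF is the only intermediate you want stored in the trace; everything else stays a plain tensor expression):

gammaShape = gammaMu_s ** 2 / gammaStd ** 2             # plain tensor, not stored in the trace
gammaBeta = gammaShape / gammaMu_s                      # plain tensor, not stored in the trace
AF = pm.Gamma('AF', alpha=gammaShape, beta=gammaBeta)   # (nSample, nPos, nComo)
lambda_p = TD * AF                                      # plain tensor, not stored in the trace
obs = pm.Potential('obs', pm.logp(pm.Poisson.dist(mu=lambda_p), AD) * dataDict['AD_mask'])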

You might benefit from changing how you compute gammaMu_c, but as you noted in your comments, that might not be trivial.
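
Purely as a hedged sketch, assuming each entry of dataDict['comb_indices'] holds exactly one integer index per covariate and cov1..cov4 broadcast over the combos, the Python loop could become a single vectorised gather:

idx = np.asarray(dataDict['comb_indices'])   # (nComo, 4), assumed integer indices
gammaMu_c = (
    intercept
    + coeff_cov1[idx[:, 0]] * cov1
    + coeff_cov2[idx[:, 1]] * cov2
    + coeff_cov3[idx[:, 2]] * cov3
    + coeff_cov4[idx[:, 3]] * cov4
)  # (nComo,)

That keeps the size of the computational graph independent of nComo, which usually matters more for compile and sampling time than the arithmetic itself.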

Otherwise it looks fine, but the fact that you get divergences could mean bad priors or a challenging model. I would focus on fixing that before worrying about speedups; otherwise you would just be getting invalid samples faster.
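
If the divergences are just a handful, one cheap thing to try first (a sketch, not a guaranteed fix) is tightening the step-size adaptation before reaching for reparameterisations:

idata = pm.sample(draws=1000, chains=2, target_accept=0.95)  # default is 0.8; higher means smaller steps, fewer divergences, slower sampling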