Sampling from multilevel multivariable regression model is very slow

mishooax · May 14, 2020, 9:09pm

Hi there,

I’ve built a multilevel, multi-variable linear regression model inspired by the example in section 1.13 of the Stan manual. We have N samples organised into J groups. K is the number of predictors. The model looks as follows:

y_n \sim {\rm Normal} (x_n \cdot \beta_{jj[n]}, \sigma) \\ \sigma \sim {\rm HalfCauchy}(2.5) \\ \beta_j \sim {\rm MvNormal}(\mu_j, \Sigma) \\ \mu_j \sim {\rm Normal}(0, 5.) \\ \Sigma = {\rm diag}(\tau) \cdot L_\Omega L_\Omega^T \cdot {\rm diag}(\tau) \\ \tau \sim {\rm Cauchy}(2.5) \\ L_\Omega \sim {\rm LKJCholesky}(\eta = 3)

# X_train = (centered) training data
# y_train = observed regression targets

# QR reparametrization
Q_train_, R_train_ = np.linalg.qr(X_train, mode='reduced')
Q_train = Q_train_ * np.sqrt(n_samples - 1)
R_train = R_train_ / np.sqrt(n_samples - 1)
R_train_inv = np.linalg.inv(R_train)  # i know...

# (...)

with pm.Model() as linear_model:
    # std for likelihood
    s = pm.HalfCauchy('sigma', 2.5)
    # covariance matrix
    packed_L_Omega = pm.LKJCholeskyCov('packed_L_Omega', n=n_predictors, eta=3, sd_dist=pm.HalfCauchy.dist(2.5))
    L_Omega = pm.expand_packed_triangular(n_predictors, packed_L_Omega)
    # group mean
    mu = pm.Normal('mu', 0., 5., shape=(n_predictors, n_groups))
    # group slopes
    beta_mat = [pm.MvNormal(f'_beta_{i_grp:04d}', mu[:, i_grp], chol=L_Omega, shape=n_predictors) for i_grp in range(n_groups)]
    beta = pm.Deterministic('beta', tt.stack(beta_mat, axis=1))
    # group intercepts
    alpha = pm.Normal('alpha', 0., 5., shape=(n_samples, n_groups))
    # expected value
    y_hat = alpha[:, group_idx] + tt.dot(Q_train, beta[:, group_idx])
    # data likelihood
    y = pm.Normal('y', mu=y_hat, sd=s, observed=y_train)
    # sample
    trace = pm.sample(draws=1000, chains=2, cores=1, tune=1000)
    # PPC
    posterior_predictive = pm.sample_posterior_predictive(trace)

Two questions:

1/ the sampler starts up, but it is very slow. N \approx 3500, pymc3 estimates one chain will take approx 2 days to run
Is there an obvious way to speed this up? I was hoping to scale this model up to 350K samples and 200 predictors, is there any real hope of doing that? CPU power is not an issue, as I am planning to run this on a supercomputer.

2/ I get some weird Theano warnings

python3.7/site-packages/theano/tensor/basic.py:6611: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  result[diagonal_slice] = x
python3.7/site-packages/theano/tensor/subtensor.py:2339: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  out[0][inputs[2:]] = inputs[1]
python3.7/site-packages/theano/tensor/subtensor.py:2197: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  rval = inputs[0].__getitem__(inputs[1:])
python3.7/site-packages/theano/tensor/subtensor.py:2339: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  out[0][inputs[2:]] = inputs[1]
python3.7/site-packages/theano/tensor/basic.py:6611: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  result[diagonal_slice] = x
python3.7/site-packages/theano/tensor/subtensor.py:2197: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  rval = inputs[0].__getitem__(inputs[1:])

Any idea what part of the code is generating this?

Thank you!

AlexAndorra · May 15, 2020, 9:20am

Hi there,

I gather your predictors and target are standardized, right?
Have you tried with more regularizing priors: Exponential or HalfNormal instead of HalfCauchy. Also, Normal(0, 5) is huge with standardized data (it expects 95% of the data to be within 10 stds!), so here also a more regularizing prior could help. Here are general guidelines for prior choice.
Some parameters may also be close to boundaries, or some clusters don’t vary a lot, so a non-centered parametrization could help. You can look at this NB to see how to do it with MvNormal models.
Finally, the theano warning is harmless and appear when you do indexing, as in hierarchical models. You can get rid of it with:

import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)

Hope this helps

mishooax · May 15, 2020, 9:52am

Hi Alex! Thanks a bunch for your interest.

I gather your predictors and target are standardized, right?
That is correct.

Have you tried with more regularizing priors: Exponential or HalfNormal instead of HalfCauchy. Also, Normal(0,5) is huge with standardized data (it expects 95% of the data to be within 10 stds!), so here also a more regularizing prior could help. Here are general guidelines for prior choice.

Good observation, thanks. I’ve narrowed down the priors (e.g. Normal(0, 0.5)) - but it didn’t make a real difference in the sampler runtime. With 800 data samples, 10 groups and 18 predictors, the model needs several hours to generate 2k samples.

Some parameters may also be close to boundaries, or some clusters don’t vary a lot, so a non-centered parametrization could help. You can look at this NB to see how to do it with MvNormal models.

Isn’t my QR re-parametrization taking care of that? As in

        # X_train is my standardised training data
        Q_train_, R_train_ = np.linalg.qr(X_train, mode='reduced')
        Q_train = Q_train_ * np.sqrt(n_samples - 1)
        # (...)
        with pm.Model() as linear_model:
           # (...)
           y_hat = alpha[:, group_idx] + tt.dot(Q_train, beta[:, group_idx])
           y = pm.Normal('y', mu=y_hat, sd=s, observed=y_train)

Or am I misunderstanding something here?

Finally, the theano warning is harmless and appear when you do indexing, as in hierarchical models. You can get rid of it with (…)

That did the trick - thanks!

I’m suspecting this line may be slowing down the sampler:

    beta_mat = [pm.MvNormal(f'_beta_{i_grp:04d}', mu[:, i_grp], chol=L_Omega, shape=n_predictors) for i_grp in range(n_groups)]

any idea how I can vectorise this with Theano? My beta should be of size (n_predictors, n_groups).

junpenglao · May 15, 2020, 9:57am

Yeah the for loop is likely slow, however our batch support for MvNormal is pretty limited so there is no other way to do it yet (the MatrixNormal might help if you can make it work). How many groups you have?

mishooax · May 15, 2020, 10:10am

That’s an interesting suggestion, I’ll try to get MatrixNormal to work, thanks.

How many groups you have?

10 in this example. BTW, the full data set has about 50 groups, ca. 200 predictors and O(10^5) samples). However, I could simplify (“sparsify”) the prior correlation patterns between the predictors. Is it realistic to expect pymc3 to scale to that data size?

junpenglao · May 15, 2020, 10:47am

Depends, I think there will need some effort to formulate the problem and try a few different parameterization.

mishooax · May 15, 2020, 12:38pm

@junpenglao @AlexAndorra

just to confirm my understanding, the (single-predictor) re-parameterization example you linked to is equivalent to the (more general) QR reparametrization, as described e.g. in section 1.2 of the Stan manual? Cheers!

AlexAndorra · May 15, 2020, 4:57pm

Mmmh, I can’t answer this categorically – I don’t know the QR reparametrization well enough TBH

junpenglao · May 16, 2020, 7:15am

It is different - the QR decomposition is very helpful for speeding up the model, but what @AlexAndorra meant is you should also use a non-centered parameterization on top of that:

Below part is slow, since you use a for loop to build beta_mat

beta_mat = [pm.MvNormal(f'_beta_{i_grp:04d}', mu[:, i_grp], chol=L_Omega, shape=n_predictors) for i_grp in range(n_groups)]
beta = pm.Deterministic('beta', tt.stack(beta_mat, axis=1))

So use non-centered parameterization instead, and also get rid of the for-loop:

    # group slopes
    beta_base = pm.Normal('beta_base', 0., 1., shape=(n_predictors, n_groups))
    beta = pm.Deterministic('beta', mu + pm.math.dot(L_Omega, beta_base))

mishooax · May 18, 2020, 10:56am

junpenglao:

So use non-centered parameterization instead, and also get rid of the for-loop:
    # group slopes
    beta_base = pm.Normal('beta_base', 0., 1., shape=(n_predictors, n_groups))
    beta = pm.Deterministic('beta', mu + pm.math.dot(L_Omega, beta_base))

@junpenglao Gotcha - thanks!! However, the sampling is still slow even with the vectorised + QR + non-centered param version. One chain with 2000 samples takes about 10 hrs to draw. I reckon this has to do with the complicated shape of the posterior - any ideas on how to speed this code up further? I’ve tried to increase the no. of cores when calling pm.sample() but then I run into another issue.

mishooax · May 18, 2020, 1:47pm

hmmm… I suspect that my likelihood formulation

y_hat = alpha[:, group_idx] + pm.math.dot(Q_train, beta[:, group_idx])
y = pm.Normal('y', mu=y_hat, sd=s, observed=y_train)

is incorrect. For one, I should have one alpha intercept per group, not a 2D matrix of alpha's.
Also, that call to pm.math.dot generates a (huge) n_samples x n_samples matrix.

junpenglao · May 18, 2020, 1:47pm

Did you also switching from HalfCauchy to HalfNormal like @AlexAndorra suggested? That usually help a bit (and more informative prior in general)

junpenglao · May 18, 2020, 1:47pm

In that case it sounds like a error in shape handling

mishooax · May 18, 2020, 7:32pm

very much so, yes. what I’m looking for is

y_n \sim {\rm Normal}(\hat{y}_n := \alpha_{jj[n]} + x_n \cdot \beta_{jj[n]}, \sigma)

for n = 1 \ldots N_{\rm samples} and jj[n] is the group index of sample n. I’ve defined the group slopes as

 # group slopes
 beta_base = pm.Normal('beta_base', 0., 1., shape=(n_predictors, n_groups))
 beta = pm.Deterministic('beta', mu + pm.math.dot(L_Omega, beta_base))

Clearly my likelihood definition

alpha = pm.Normal('alpha', 0., 1., shape=(n_groups))
y_hat = alpha[group_idx] + pm.math.dot(X_train, beta[:, group_idx])
y = pm.Normal('y', mu=y_hat, sd=s, observed=y_train)

is incorrect (here group_idx is the jj array). The pm.math.dot tries to create a huge matrix - but I only need its diagonal elements. Long story short, I’m looking for a vectorised pm.math function that can calculate (in one line) the dot products

< X_train[n, 1:K], beta[1:K, jj[n]] >

for all sample indices n. Here the no of predictors K \sim {\cal O}(10^2), but N_{\rm samples} is large - {\cal O}(10^5) or more. Any ideas?

junpenglao · May 19, 2020, 8:35am

I see, I would suggest the following:

(n_obs, n_predictor) (n_predictors, n_groups) ==> (n_obs, n_groups)
linear_effect = pm.math.dot(X_train, beta)
y_hat = alpha[group_idx] + linear_effect[np.arange(n_group), group_idx]

I have not tested it so you might need to try a few different way of indexing from linear_effect

mishooax · May 19, 2020, 8:44am

@junpenglao I’ve managed to get it to work with

    # expected value: shape (n_samples,)
    y_hat = alpha[group_idx] + (X_train * beta[:, group_idx].T).sum(axis=-1)
    # data likelihood (i.i.d. data)
    y = pm.Normal('y', mu=y_hat, sd=s, observed=y_train[:,0])

Your solution may be faster, as it does not involve the element wise multiplication of two (n_samples, n_predictors) matrices. I’ll report back - thanks!

Topic		Replies	Views
Slow sampling in hierarchical logistic regression Questions	3	908	February 22, 2022
Hierarchical Model - Slow Sampling ( 1.8 sec per draw ) - Primary suspect is how the multi-level hierarchy is coded Questions	5	840	November 2, 2021
Multivariate Time Series very slow with sampling Questions	6	1122	December 14, 2019
Prior Predictive Sampling in a Multilevel Linear Model v5 prior , hierarchical , pymc-marketing	2	269	May 7, 2024
Improve Sampling of Multilevel Regressions Model Questions	13	1218	March 9, 2020

Sampling from multilevel multivariable regression model is very slow

Related topics