Sampling from multilevel multivariable regression model is very slow

Hi there,

I’ve built a multilevel, multi-variable linear regression model inspired by the example in section 1.13 of the Stan manual. We have N samples organised into J groups. K is the number of predictors. The model looks as follows:

y_n \sim {\rm Normal} (x_n \cdot \beta_{jj[n]}, \sigma) \\ \sigma \sim {\rm HalfCauchy}(2.5) \\ \beta_j \sim {\rm MvNormal}(\mu_j, \Sigma) \\ \mu_j \sim {\rm Normal}(0, 5.) \\ \Sigma = {\rm diag}(\tau) \cdot L_\Omega L_\Omega^T \cdot {\rm diag}(\tau) \\ \tau \sim {\rm Cauchy}(2.5) \\ L_\Omega \sim {\rm LKJCholesky}(\eta = 3)
# X_train = (centered) training data
# y_train = observed regression targets

# QR reparametrization
Q_train_, R_train_ = np.linalg.qr(X_train, mode='reduced')
Q_train = Q_train_ * np.sqrt(n_samples - 1)
R_train = R_train_ / np.sqrt(n_samples - 1)
R_train_inv = np.linalg.inv(R_train)  # i know...

# (...)

with pm.Model() as linear_model:
    # std for likelihood
    s = pm.HalfCauchy('sigma', 2.5)
    # covariance matrix
    packed_L_Omega = pm.LKJCholeskyCov('packed_L_Omega', n=n_predictors, eta=3, sd_dist=pm.HalfCauchy.dist(2.5))
    L_Omega = pm.expand_packed_triangular(n_predictors, packed_L_Omega)
    # group mean
    mu = pm.Normal('mu', 0., 5., shape=(n_predictors, n_groups))
    # group slopes
    beta_mat = [pm.MvNormal(f'_beta_{i_grp:04d}', mu[:, i_grp], chol=L_Omega, shape=n_predictors) for i_grp in range(n_groups)]
    beta = pm.Deterministic('beta', tt.stack(beta_mat, axis=1))
    # group intercepts
    alpha = pm.Normal('alpha', 0., 5., shape=(n_samples, n_groups))
    # expected value
    y_hat = alpha[:, group_idx] + tt.dot(Q_train, beta[:, group_idx])
    # data likelihood
    y = pm.Normal('y', mu=y_hat, sd=s, observed=y_train)
    # sample
    trace = pm.sample(draws=1000, chains=2, cores=1, tune=1000)
    # PPC
    posterior_predictive = pm.sample_posterior_predictive(trace)

Two questions:

1/ the sampler starts up, but it is very slow. N \approx 3500, pymc3 estimates one chain will take approx 2 days to run :astonished:
Is there an obvious way to speed this up? I was hoping to scale this model up to 350K samples and 200 predictors, is there any real hope of doing that? CPU power is not an issue, as I am planning to run this on a supercomputer.

2/ I get some weird Theano warnings

python3.7/site-packages/theano/tensor/basic.py:6611: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  result[diagonal_slice] = x
python3.7/site-packages/theano/tensor/subtensor.py:2339: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  out[0][inputs[2:]] = inputs[1]
python3.7/site-packages/theano/tensor/subtensor.py:2197: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  rval = inputs[0].__getitem__(inputs[1:])
python3.7/site-packages/theano/tensor/subtensor.py:2339: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  out[0][inputs[2:]] = inputs[1]
python3.7/site-packages/theano/tensor/basic.py:6611: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  result[diagonal_slice] = x
python3.7/site-packages/theano/tensor/subtensor.py:2197: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  rval = inputs[0].__getitem__(inputs[1:])

Any idea what part of the code is generating this?

Thank you!

Hi there,

  • I gather your predictors and target are standardized, right?
  • Have you tried with more regularizing priors: Exponential or HalfNormal instead of HalfCauchy. Also, Normal(0, 5) is huge with standardized data (it expects 95% of the data to be within 10 stds!), so here also a more regularizing prior could help. Here are general guidelines for prior choice.
  • Some parameters may also be close to boundaries, or some clusters don’t vary a lot, so a non-centered parametrization could help. You can look at this NB to see how to do it with MvNormal models.
  • Finally, the theano warning is harmless and appear when you do indexing, as in hierarchical models. You can get rid of it with:
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)

Hope this helps :vulcan_salute:

Hi Alex! Thanks a bunch for your interest.

I gather your predictors and target are standardized, right?
That is correct.

Have you tried with more regularizing priors: Exponential or HalfNormal instead of HalfCauchy. Also, Normal(0,5) is huge with standardized data (it expects 95% of the data to be within 10 stds!), so here also a more regularizing prior could help. Here are general guidelines for prior choice.

Good observation, thanks. I’ve narrowed down the priors (e.g. Normal(0, 0.5)) - but it didn’t make a real difference in the sampler runtime. With 800 data samples, 10 groups and 18 predictors, the model needs several hours to generate 2k samples.

Some parameters may also be close to boundaries, or some clusters don’t vary a lot, so a non-centered parametrization could help. You can look at this NB to see how to do it with MvNormal models.

Isn’t my QR re-parametrization taking care of that? As in

        # X_train is my standardised training data
        Q_train_, R_train_ = np.linalg.qr(X_train, mode='reduced')
        Q_train = Q_train_ * np.sqrt(n_samples - 1)
        # (...)
        with pm.Model() as linear_model:
           # (...)
           y_hat = alpha[:, group_idx] + tt.dot(Q_train, beta[:, group_idx])
           y = pm.Normal('y', mu=y_hat, sd=s, observed=y_train)

Or am I misunderstanding something here?

Finally, the theano warning is harmless and appear when you do indexing, as in hierarchical models. You can get rid of it with (…)

That did the trick - thanks!

I’m suspecting this line may be slowing down the sampler:

    beta_mat = [pm.MvNormal(f'_beta_{i_grp:04d}', mu[:, i_grp], chol=L_Omega, shape=n_predictors) for i_grp in range(n_groups)]

any idea how I can vectorise this with Theano? My beta should be of size (n_predictors, n_groups).

Yeah the for loop is likely slow, however our batch support for MvNormal is pretty limited so there is no other way to do it yet (the MatrixNormal might help if you can make it work). How many groups you have?

1 Like

That’s an interesting suggestion, I’ll try to get MatrixNormal to work, thanks.

How many groups you have?

10 in this example. BTW, the full data set has about 50 groups, ca. 200 predictors and O(10^5) samples). However, I could simplify (“sparsify”) the prior correlation patterns between the predictors. Is it realistic to expect pymc3 to scale to that data size?

Depends, I think there will need some effort to formulate the problem and try a few different parameterization.

@junpenglao @AlexAndorra

just to confirm my understanding, the (single-predictor) re-parameterization example you linked to is equivalent to the (more general) QR reparametrization, as described e.g. in section 1.2 of the Stan manual? Cheers!

Mmmh, I can’t answer this categorically – I don’t know the QR reparametrization well enough TBH

It is different - the QR decomposition is very helpful for speeding up the model, but what @AlexAndorra meant is you should also use a non-centered parameterization on top of that:

Below part is slow, since you use a for loop to build beta_mat

beta_mat = [pm.MvNormal(f'_beta_{i_grp:04d}', mu[:, i_grp], chol=L_Omega, shape=n_predictors) for i_grp in range(n_groups)]
beta = pm.Deterministic('beta', tt.stack(beta_mat, axis=1))

So use non-centered parameterization instead, and also get rid of the for-loop:

    # group slopes
    beta_base = pm.Normal('beta_base', 0., 1., shape=(n_predictors, n_groups))
    beta = pm.Deterministic('beta', mu + pm.math.dot(L_Omega, beta_base))
3 Likes

@junpenglao Gotcha - thanks!! However, the sampling is still slow even with the vectorised + QR + non-centered param version. One chain with 2000 samples takes about 10 hrs to draw. I reckon this has to do with the complicated shape of the posterior - any ideas on how to speed this code up further? I’ve tried to increase the no. of cores when calling pm.sample() but then I run into another issue.

hmmm… I suspect that my likelihood formulation

y_hat = alpha[:, group_idx] + pm.math.dot(Q_train, beta[:, group_idx])
y = pm.Normal('y', mu=y_hat, sd=s, observed=y_train)

is incorrect. For one, I should have one alpha intercept per group, not a 2D matrix of alpha's.
Also, that call to pm.math.dot generates a (huge) n_samples x n_samples matrix.

Did you also switching from HalfCauchy to HalfNormal like @AlexAndorra suggested? That usually help a bit (and more informative prior in general)

In that case it sounds like a error in shape handling :sweat_smile:

very much so, yes. what I’m looking for is

y_n \sim {\rm Normal}(\hat{y}_n := \alpha_{jj[n]} + x_n \cdot \beta_{jj[n]}, \sigma)

for n = 1 \ldots N_{\rm samples} and jj[n] is the group index of sample n. I’ve defined the group slopes as

 # group slopes
 beta_base = pm.Normal('beta_base', 0., 1., shape=(n_predictors, n_groups))
 beta = pm.Deterministic('beta', mu + pm.math.dot(L_Omega, beta_base))

Clearly my likelihood definition

alpha = pm.Normal('alpha', 0., 1., shape=(n_groups))
y_hat = alpha[group_idx] + pm.math.dot(X_train, beta[:, group_idx])
y = pm.Normal('y', mu=y_hat, sd=s, observed=y_train)

is incorrect (here group_idx is the jj array). The pm.math.dot tries to create a huge matrix - but I only need its diagonal elements. Long story short, I’m looking for a vectorised pm.math function that can calculate (in one line) the dot products

< X_train[n, 1:K], beta[1:K, jj[n]] >

for all sample indices n. Here the no of predictors K \sim {\cal O}(10^2), but N_{\rm samples} is large - {\cal O}(10^5) or more. Any ideas?

I see, I would suggest the following:

(n_obs, n_predictor) (n_predictors, n_groups) ==> (n_obs, n_groups)
linear_effect = pm.math.dot(X_train, beta)
y_hat = alpha[group_idx] + linear_effect[np.arange(n_group), group_idx]

I have not tested it so you might need to try a few different way of indexing from linear_effect

1 Like

@junpenglao I’ve managed to get it to work with

    # expected value: shape (n_samples,)
    y_hat = alpha[group_idx] + (X_train * beta[:, group_idx].T).sum(axis=-1)
    # data likelihood (i.i.d. data)
    y = pm.Normal('y', mu=y_hat, sd=s, observed=y_train[:,0])

Your solution may be faster, as it does not involve the element wise multiplication of two (n_samples, n_predictors) matrices. I’ll report back - thanks!

1 Like