Fit many regressions simultaneously

jdossgollin · November 3, 2017, 5:02pm

I’m doing a simulated data study in which I am seeking to explore whether my regression model is able to represent a (known) data-generating model with limited data. I first generate fake data, then fit the fake data using HMC (NUTS).

For the sake of simplicity, let’s assume that the true model is y ~ normal(mu + beta*X, sigma) and that I’m fitting this (known) model with PyMC3 – the real model is a little more complicated but that’s not relevant here.

First I generate fake data:

import numpy as np
import pymc3 as pm
J = 100
N = 50
X = np.arange(0, N)
mu_true = 1
beta_true = 0.1
sigma_true = 2
y_obs = np.ones(shape=(N,J))
for j in np.arange(J):
    y_obs[:,j] = mu_true beta_true * X + np.random.normal(sigma_true, N)

Next I’d like to fit this data. It’s relatively straightforward to build for loop in which each simulation’s parameters are recovered with something such as:

with pm.Model() as basic_model:
    # PRIORS
    mu = pm.Normal('mu', 10, 10)
    beta = pm.Normal(beta', 0, 0.5)
    sigma = pm.HalfNormal('sigma', 1)
    # FOR CONCISION
    mu = mu + beta * X
    # DATA MODEL
    y = pm.Normal('y' mu=mu, sd=std, observed=y_obs[:,j])

However, looping through this model is pretty slow because the for loop is in python, and so data has to be sent from python to C (I’m not a developer – this is a very heurestic understanding) J times and NUTS has to be initialized J times as well.

The hierarchical GLM tutorial on the docs website demonstrates an approach for working with data from many categories in an unpooled setting by defining an index and then looping through each row of a data frame (or similar structure) – that looks like this (I’ve used analagous approaches in stan):

radon_est = a[county_idx] + b[county_idx]*data.floor.values

However, this is also pretty slow – mainly because the matrix structure of the data (all my y are the same dimension) is neglected. Is there a way to fit this? In stan I would declare y as a matrix (or vector of vectors possibly) and then loop through each variable

for (j in 1:J)
    y[,j] ~ normal(mu[j] + beta[j]*X, sigma[j]);

Is there a way to do something analogous in PyMC3? This would be a huge speedup for my code.

sammosummo · November 3, 2017, 5:40pm

You could certainly do this without looping. You just need to create J-length arrays of stochastic nodes using the shape= kwarg rather than creating one node per stochastic per model, then map them to the correct data in y using appropriately designed design matrices.

Personally I would reshape your data matrix y to be a single column of length J*N, then create a new design matrix (call it Z) which designates each data point to the correct regression and matrix-multiply Z by your stochastic arrays to map them. A benefit of this approach would be that you wouldn’t need the same amount of data in each regression like you would in your example.

jdossgollin · November 3, 2017, 6:21pm

Got it, thanks. I had naively assumed that doing this would cause the model to assume it’s a multivariate normal (which is a different class in pymc3…).

The following seems to work:

import numpy as np
import pymc3 as pm

J = 13
N = 500
mu_true = 10
beta_true = 1.5/N
sigma_true = 1.5

X = np.ones(shape=(N,J)) * np.arange(0, N)[:,np.newaxis]
y_obs = np.ones(shape=(N,J))
for j in np.arange(J):
    y_obs[:,j] = mu_true + beta_true * X[:,j] + np.random.normal(loc=0, scale=sigma_true, size=N)
y_obs

with pm.Model() as basic_model:
    # PRIORS
    mu = pm.Normal('mu', 10, 10, shape=J)
    beta = pm.Normal('beta', 0, 2, shape=J)
    sigma = pm.HalfNormal('sigma', 2, shape=J)
    # FOR CONCISION
    mu = mu + beta*X
    # DATA MODEL
    y = pm.Normal('y', mu=mu, sd=sigma, observed=y_obs)

with basic_model:
    trace = pm.sample(draws=1000)

Is there a way to speed up by changing the definition of mu above?

Thanks! (This isn’t Stack Overflow so hope it’s OK to say that…)

sammosummo · November 3, 2017, 6:40pm

Its a bit weird that you name both the stochastic node representing the intercept and intercept plus effect of the covariate mu, but I don’t think you will get this to run much faster.

jdossgollin · November 3, 2017, 6:47pm

OK cool.
Agreed that the naming’s off – happened when I was translating my example into something closer to a MWE

Topic		Replies	Views
Modeling of multiple regression model with array form v5 development , linear_model , shape_issue , modeling	12	1223	December 17, 2024
Fitting multiple independent hierarchical models at once version agnostic modeling	2	486	August 5, 2022
Multiple regression caveats Questions	6	825	July 23, 2018
Understanding indexing and dimensions in hierarchical models v5 time_series	9	1887	January 26, 2023
Managing indices/data shape in a multi-level linear regression for incremental model updating Questions	0	374	September 10, 2020

Fit many regressions simultaneously

Related topics