Fit many regressions simultaneously

I’m doing a simulated data study in which I am seeking to explore whether my regression model is able to represent a (known) data-generating model with limited data. I first generate fake data, then fit the fake data using HMC (NUTS).

For the sake of simplicity, let’s assume that the true model is y ~ normal(mu + beta*X, sigma) and that I’m fitting this (known) model with PyMC3 – the real model is a little more complicated but that’s not relevant here.

First I generate fake data:

import numpy as np
import pymc3 as pm
J = 100
N = 50
X = np.arange(0, N)
mu_true = 1
beta_true = 0.1
sigma_true = 2
y_obs = np.ones(shape=(N,J))
for j in np.arange(J):
    y_obs[:,j] = mu_true + beta_true * X + np.random.normal(loc=0, scale=sigma_true, size=N)

Next I’d like to fit this data. It’s relatively straightforward to build a for loop in which each simulation’s parameters are recovered with something like:

for j in np.arange(J):
    with pm.Model() as basic_model:
        # PRIORS
        mu = pm.Normal('mu', 10, 10)
        beta = pm.Normal('beta', 0, 0.5)
        sigma = pm.HalfNormal('sigma', 1)
        mu = mu + beta * X
        y = pm.Normal('y', mu=mu, sd=sigma, observed=y_obs[:,j])
        trace = pm.sample()

However, looping through this model is pretty slow: because the for loop is in Python, data has to be sent from Python to C (I’m not a developer – this is a very heuristic understanding) J times, and NUTS has to be initialized J times as well.

The hierarchical GLM tutorial on the docs website demonstrates an approach for working with data from many categories in an unpooled setting, by defining an index and then looping through each row of a data frame (or similar structure). It looks like this (I’ve used analogous approaches in Stan):

radon_est = a[county_idx] + b[county_idx]*data.floor.values
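Translated to my setup, that index-based mapping would look something like the following NumPy sketch (the names `j_idx` and `X_long` are my own, and the arrays `a` and `b` are stand-ins for length-J stochastic nodes):

```python
import numpy as np

J, N = 3, 4                                # toy sizes: 3 regressions, 4 points each
x = np.arange(N, dtype=float)
X_long = np.tile(x, J)                     # stacked covariate, length J*N
j_idx = np.repeat(np.arange(J), N)         # which regression each stacked row belongs to

a = np.array([1.0, 2.0, 3.0])              # stand-in intercepts, one per regression
b = np.array([0.1, 0.2, 0.3])              # stand-in slopes
y_est = a[j_idx] + b[j_idx] * X_long       # same fancy-indexing trick as radon_est
```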

However, this is also pretty slow – mainly because the matrix structure of the data (all my y are the same dimension) is neglected. Is there a way to exploit this? In Stan I would declare y as a matrix (or possibly a vector of vectors) and then loop over each column:

for (j in 1:J)
    y[:, j] ~ normal(mu[j] + beta[j] * X, sigma[j]);

Is there a way to do something analogous in PyMC3? This would be a huge speedup for my code.

You could certainly do this without looping. You just need to create J-length arrays of stochastic nodes using the shape= kwarg, rather than creating one set of nodes per model, and then map them to the correct data in y using appropriately designed design matrices.

Personally, I would reshape your data matrix y into a single column of length J*N, then create a new design matrix (call it Z) that assigns each data point to the correct regression, and matrix-multiply Z by your stochastic arrays to map them. A benefit of this approach is that you wouldn’t need the same amount of data in each regression, as you would in your example.

Got it, thanks. I had naively assumed that doing this would cause the model to assume it’s a multivariate normal (which is a different class in pymc3…).

The following seems to work:

import numpy as np
import pymc3 as pm

J = 13
N = 500
mu_true = 10
beta_true = 1.5/N
sigma_true = 1.5

X = np.ones(shape=(N,J)) * np.arange(0, N)[:,np.newaxis]
y_obs = np.ones(shape=(N,J))
for j in np.arange(J):
    y_obs[:,j] = mu_true + beta_true * X[:,j] + np.random.normal(loc=0, scale=sigma_true, size=N)

with pm.Model() as basic_model:
    # PRIORS
    mu = pm.Normal('mu', 10, 10, shape=J)
    beta = pm.Normal('beta', 0, 2, shape=J)
    sigma = pm.HalfNormal('sigma', 2, shape=J)
    mu = mu + beta*X
    y = pm.Normal('y', mu=mu, sd=sigma, observed=y_obs)

with basic_model:
    trace = pm.sample(draws=1000)

Is there a way to speed up by changing the definition of mu above?

Thanks! (This isn’t Stack Overflow so hope it’s OK to say that…)

It’s a bit weird that you name both the stochastic node representing the intercept and the intercept plus the effect of the covariate mu, but I don’t think you’ll get this to run much faster.

OK cool.
Agreed that the naming’s off – it happened when I was translating my example into something closer to an MWE.