I’m doing a simulated data study in which I am seeking to explore whether my regression model is able to represent a (known) data-generating model with limited data. I first generate fake data, then fit the fake data using HMC (NUTS).
For the sake of simplicity, let’s assume that the true model is y ~ normal(mu + beta*X, sigma) and that I’m fitting this (known) model with PyMC3 – the real model is a little more complicated but that’s not relevant here.
First I generate fake data:
import numpy as np
import pymc3 as pm
J = 100
N = 50
X = np.arange(0, N)
mu_true = 1
beta_true = 0.1
sigma_true = 2
y_obs = np.ones(shape=(N,J))
for j in np.arange(J):
y_obs[:,j] = mu_true beta_true * X + np.random.normal(sigma_true, N)
Next I’d like to fit this data. It’s relatively straightforward to build for loop in which each simulation’s parameters are recovered with something such as:
with pm.Model() as basic_model:
# PRIORS
mu = pm.Normal('mu', 10, 10)
beta = pm.Normal(beta', 0, 0.5)
sigma = pm.HalfNormal('sigma', 1)
# FOR CONCISION
mu = mu + beta * X
# DATA MODEL
y = pm.Normal('y' mu=mu, sd=std, observed=y_obs[:,j])
However, looping through this model is pretty slow because the for loop is in python, and so data has to be sent from python to C (I’m not a developer – this is a very heurestic understanding) J times and NUTS has to be initialized J times as well.
The hierarchical GLM tutorial on the docs website demonstrates an approach for working with data from many categories in an unpooled setting by defining an index and then looping through each row of a data frame (or similar structure) – that looks like this (I’ve used analagous approaches in stan
):
radon_est = a[county_idx] + b[county_idx]*data.floor.values
However, this is also pretty slow – mainly because the matrix structure of the data (all my y are the same dimension) is neglected. Is there a way to fit this? In stan
I would declare y
as a matrix
(or vector of vectors possibly) and then loop through each variable
for (j in 1:J)
y[,j] ~ normal(mu[j] + beta[j]*X, sigma[j]);
Is there a way to do something analogous in PyMC3
? This would be a huge speedup for my code.