Sampling rate when running independent regressions in vectorized form

I am trying to run several independent regressions together in vectorized form. The data looks something like this:

In [8]: firing_data.head()
Out[8]: 
     Firing  Laser  Palatability  Time
0 -1.339181    0.0             3     0
1  1.152318    1.0             3     0
2  0.321819    0.0             3     0
3  1.152318    0.0             3     0
4  1.982818    1.0             3     0

I am trying to regress Firing (dependent) on Palatability (independent). Palatability is discrete and takes values in {1, 2, 3, 4}. I want to run such a regression independently for every combination of Laser (discrete, 0 or 1) and Time (discrete, 0 to 70). So in the end I run 2 * 71 = 142 regressions and want to estimate a coefficient for Palatability independently in each case.
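In other words, for every time point t and laser condition l, each observation i in that (t, l) cell is modelled as Firing_i ~ Normal(coeff_pal[t, l] * Palatability_i, sd[t, l]), with the slope coeff_pal[t, l] estimated independently per cell.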

I am trying to merge all 142 regressions into a single model by writing them in vectorized form. Here’s the model:

import pymc3 as pm

# Integer indices that pick out each observation's (Time, Laser) cell,
# computed once up front instead of casting the same columns repeatedly
time_idx = firing_data["Time"].values.astype("int32")    # 0 to 70
laser_idx = firing_data["Laser"].values.astype("int32")  # 0 or 1

with pm.Model() as model:
    # Palatability slopes, one for each time point (one set for each laser condition)
    coeff_pal = pm.Normal("coeff_pal", mu=0, sd=1.0, shape=(71, 2))
    # Observation standard deviation, also one per (time, laser) cell
    sd = pm.HalfCauchy("sd", 0.5, shape=(71, 2))
    # Regression equation for the mean of the observations
    regression = coeff_pal[time_idx, laser_idx] * firing_data["Palatability"].values
    # Actual observations
    obs = pm.Normal("obs", mu=regression, sd=sd[time_idx, laser_idx],
                    observed=firing_data["Firing"].values)
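
For reference, I then sample with the defaults (the draw count matches the 6000 iterations I mention below; everything else is left at its default):

with model:
    trace = pm.sample(6000)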

NUTS sampling is very slow, something like 10 s per iteration after the first few iterations. What confuses me, however, is that sampling runs at ~1000 it/s (about 6-8 s for 6000 iterations) if I run any one of these 142 regressions on its own. Of course, I understand that the vectorized model has to evaluate gradients for 142 sets of parameters compared to just 1 for the single regression, but even taking that into account, the vectorized model is about two orders of magnitude slower than running 142 regressions serially. I guess I am missing something in how the model is set up. Is running a single model 142 times serially the best option for me? Is there some way to parallelize this? Any suggestions are really appreciated 🙂
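
For comparison, the serial alternative I have in mind looks roughly like this sketch (one small model per (Time, Laser) cell; the draw count and dict bookkeeping are just illustrative):

serial_traces = {}
for (t, l), df in firing_data.groupby(["Time", "Laser"]):
    with pm.Model():
        # Scalar slope and noise scale for this single (Time, Laser) cell
        coeff = pm.Normal("coeff_pal", mu=0, sd=1.0)
        sd = pm.HalfCauchy("sd", 0.5)
        pm.Normal("obs", mu=coeff * df["Palatability"].values,
                  sd=sd, observed=df["Firing"].values)
        serial_traces[(t, l)] = pm.sample(6000, progressbar=False)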

The full dataset has ~600k rows. I am happy to share it if that would help the discussion, let me know!

Update: I have been using Metropolis on the full model (with all 142 sets of parameters) for now, but as expected, I get very correlated samples (only a few hundred effective samples out of 50k draws). Any advice on getting NUTS to work on this problem would be very much appreciated, thanks in advance for the help 🙂
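
Concretely, the Metropolis run looks like this (a sketch; pm.effective_n is what I used to count effective samples):

with model:
    mh_trace = pm.sample(50000, step=pm.Metropolis())

# Only a few hundred effective samples out of the 50k draws
print(pm.effective_n(mh_trace))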

Maybe try putting a hyperprior on coeff_pal and initialising the sampler at the MLE?
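
For illustration, a hierarchical version along those lines might look like the sketch below. The hyperparameter names and priors (mu_pal, sigma_pal) are just placeholders, it reuses the time_idx/laser_idx arrays from the model above, and pm.find_MAP is used as a rough stand-in for an MLE starting point:

with pm.Model() as hier_model:
    # Shared hyperprior partially pooling the 142 palatability slopes
    mu_pal = pm.Normal("mu_pal", mu=0, sd=1.0)
    sigma_pal = pm.HalfNormal("sigma_pal", sd=1.0)
    coeff_pal = pm.Normal("coeff_pal", mu=mu_pal, sd=sigma_pal, shape=(71, 2))
    sd = pm.HalfCauchy("sd", 0.5, shape=(71, 2))

    regression = coeff_pal[time_idx, laser_idx] * firing_data["Palatability"].values
    pm.Normal("obs", mu=regression, sd=sd[time_idx, laser_idx],
              observed=firing_data["Firing"].values)

    # Initialise the sampler at the MAP estimate (as a stand-in for the MLE)
    start = pm.find_MAP()
    trace = pm.sample(start=start)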