Harnessing multiple cores to speed up fits with small number of chains

I think you’re on the right track. When you call pm.sample() with new_step, I think it tunes by default. So maybe try

pm.sample(1000,step=new_step, tune=0)

and see if the step size is intact. Then I think you may want to extract a point from tune_trace and feed it in as a start point to new_step. That should (I think?) prevent sample() from trying to initialize. And then I would try to make sure that whatever other sampler parameters are optimized during tuning are also being copied over from tune_trace to new_step. I suspect that the list is longer than step size (tree size? max energy?).