Looking at the code, I wonder whether this might be due to early_max_tree_depth. I used a tree depth of 25, so the gap between early_max_tree_depth (8) and 25 is even larger than with the default value.

The NUTS sampler uses step.iter_count to decide whether early_max_tree_depth or the normal max_tree_depth applies. In parallel sampling, the NUTS step is pickled, so in practice each chain gets its own step object. In sequential sampling, however, the step object is initialized once and shared between chains. This is "hacked" around for step.tune, so that subsequent chains restart tuning, but iter_count and any other state held by the step object still carry over from the previous chains.

As a result, with, say, tune=500 and draws=500, the first chain starts at iter_count = 0, the second at 1000, the third at 2000, and so on. Each chain therefore proceeds from a different starting point, but only the first ever uses early_max_tree_depth, because only its iter_count < 200. (Also, why is this magic number 200 not configurable as well?) The same problem applies to any other state maintained by the step method.
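To illustrate what I mean, here is a minimal sketch (not the actual PyMC implementation; the class, method names, and the sample_sequential helper are hypothetical) of how a shared counter on the step object makes only the first chain see early_max_tree_depth under sequential sampling:

```python
EARLY_ITER_LIMIT = 200  # stand-in for the hard-coded threshold mentioned above


class NutsStepSketch:
    """Hypothetical stand-in for a NUTS step object with shared state."""

    def __init__(self, early_max_tree_depth=8, max_tree_depth=25):
        self.early_max_tree_depth = early_max_tree_depth
        self.max_tree_depth = max_tree_depth
        self.iter_count = 0  # shared state: never reset between chains

    def current_tree_depth(self):
        # Mirrors the iter_count < 200 check described above.
        if self.iter_count < EARLY_ITER_LIMIT:
            return self.early_max_tree_depth
        return self.max_tree_depth

    def astep(self):
        depth = self.current_tree_depth()
        self.iter_count += 1
        return depth


def sample_sequential(step, chains=3, tune=500, draws=500):
    # The real sampler resets step.tune between chains, but (per the
    # behavior described above) it does not reset iter_count.
    first_depths = []
    for _ in range(chains):
        first_depths.append(step.astep())  # depth used by this chain's first draw
        for _ in range(tune + draws - 1):
            step.astep()
    return first_depths


print(sample_sequential(NutsStepSketch()))  # → [8, 25, 25]
```

Only the first chain begins below the 200-iteration threshold (so it uses tree depth 8); the second and third start at iter_count 1000 and 2000 and jump straight to the full max_tree_depth of 25.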