@cluhmann yeah, as far as I can see, storing the results from appropriately-sized tuning (given your model complexity) and then re-using the tuned sampler settings across a large number of small parallel chains is the only way to satisfy the trade-off between tuning requirements and redundancy.
Perhaps easier said than done, but it would be great if this could become a pm.sample option. Something like a “tune_sharing” parameter: by default the sampler tunes once per specified chain, but this parameter would let you share each tuning run across more than one chain.
Something like:
pm.sample(1000, tune=50000, tune_sharing=4, cores=400)
would create 400 chains of 1000 samples each, but only 4 samplers tuned for 50000 steps, whose tuned states are then shared evenly across the 400 chains, i.e., 100 chains per tuned state.
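Just to make the intended semantics concrete, here's a rough sketch of how the chain-to-tuned-state assignment could work. This is purely illustrative: `assign_tuned_samplers` is a hypothetical helper, not an existing PyMC function.

```python
def assign_tuned_samplers(n_chains: int, tune_sharing: int) -> list[int]:
    """Hypothetical helper: map each of `n_chains` chains to one of
    `tune_sharing` tuned sampler states, spreading chains evenly."""
    if n_chains % tune_sharing != 0:
        raise ValueError("n_chains must be divisible by tune_sharing")
    chains_per_state = n_chains // tune_sharing
    # Chain i uses tuned state i // chains_per_state
    return [chain // chains_per_state for chain in range(n_chains)]

# 400 chains sharing 4 tuned states -> 100 chains per state
assignment = assign_tuned_samplers(400, 4)
```

So with `tune_sharing=4` and `cores=400`, each of the 4 tuned states (step size, mass matrix, etc.) would be copied into 100 independent sampling chains.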