Optimization suggestion for Hierarchical Model using NUTS on CPU/GPU

My model runs with both chains=1 and chains=4. On some Linux machines, however, it hangs when chains=4. From what I have read on GitHub, this is related to the joblib package. One last question: will chains=4, cores=4 be faster?