Thank you @cluhmann for taking a look at this.
Sorry for taking so long to get back to you. I started running it on a GPU system and had to install Ubuntu Linux and set up conda and a pymc3 environment on it to get better speed.
So my latest modification to this hierarchical model, based on your recommendation, is reducing target_accept and increasing tune to 8000, with draws=10000:
trace_hr_nc = pm.sample(chains=2,
                        draws=10000,
                        tune=8000,
                        # max_treedepth=15,
                        return_inferencedata=True)
As a result, I get tons of divergences and the following:
