Parallelization is only a small part of my original proposal.
I can understand Matt’s response: he does not want you replacing working code unless there is a clear advantage. If you are just doing embarrassingly parallel tasks, then joblib is fine. If you need to iteratively update a distributed logical array, with cross-communication between the nodes holding different chunks, then you will need dask.
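To make "embarrassingly parallel" concrete, here is a minimal sketch of the kind of job where joblib is enough: each task depends only on its own seed and never talks to the others. The names `simulate` and `run_parallel` are made up for illustration and are not part of PyMC or joblib.

```python
import random

from joblib import Parallel, delayed


def simulate(seed, n_draws=1000):
    """Hypothetical independent task: mean of n_draws standard normal draws."""
    rng = random.Random(seed)  # private generator, so tasks share no state
    return sum(rng.gauss(0.0, 1.0) for _ in range(n_draws)) / n_draws


def run_parallel(seeds, n_jobs=2):
    """Run one simulate() call per seed on n_jobs workers, keeping input order."""
    return Parallel(n_jobs=n_jobs)(delayed(simulate)(s) for s in seeds)


if __name__ == "__main__":
    print(run_parallel([0, 1, 2, 3]))
```

Nothing here needs one worker to see another worker's intermediate result. Once the tasks have to exchange pieces of a shared array between iterations, this pattern stops being sufficient, which is the case where dask comes in.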
The biggest opportunity is for PyMC4 to help improve the foundations of differentiable array programming in Python. Perhaps what PyMC4 needs to do is build an interface that can use multiple backends (i.e. define an API, or use/improve something like Keras), so that users of PyMC4 do not struggle if the chosen backend becomes less maintained.
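As a rough illustration of what I mean by an interface over multiple backends, here is a toy sketch. Everything in it (`Backend`, `ListBackend`, `NumpyBackend`, `normal_logp`) is hypothetical and is not an existing PyMC4 or Keras API. It also leaves out differentiation entirely and only shows model code written against an interface instead of against one library.

```python
import math
from typing import Protocol, Sequence

try:
    import numpy as np  # optional: the sketch still runs without it
except ImportError:
    np = None


class Backend(Protocol):
    """Hypothetical minimal set of array operations the model code may use."""
    def array(self, values: Sequence[float]): ...
    def sub(self, a, c: float): ...
    def square(self, a): ...
    def sum(self, a) -> float: ...


class ListBackend:
    """Pure-Python backend built on lists."""
    def array(self, values): return [float(v) for v in values]
    def sub(self, a, c): return [v - c for v in a]
    def square(self, a): return [v * v for v in a]
    def sum(self, a): return float(sum(a))


class NumpyBackend:
    """NumPy backend exposing the same four operations."""
    def array(self, values): return np.asarray(values, dtype=float)
    def sub(self, a, c): return a - c
    def square(self, a): return a * a
    def sum(self, a): return float(a.sum())


BACKENDS = {"list": ListBackend()}
if np is not None:
    BACKENDS["numpy"] = NumpyBackend()


def normal_logp(backend: Backend, values, mu, sigma):
    """Log-density of i.i.d. Normal(mu, sigma) data, using only Backend ops."""
    x = backend.array(values)
    n = len(values)
    ss = backend.sum(backend.square(backend.sub(x, mu)))
    return -0.5 * ss / sigma**2 - n * math.log(sigma) - 0.5 * n * math.log(2 * math.pi)


if __name__ == "__main__":
    for name, backend in BACKENDS.items():
        print(name, normal_logp(backend, [0.1, -0.3, 1.2], mu=0.0, sigma=1.0))
```

The point is that `normal_logp` never imports a specific array library, so swapping or dropping a backend does not touch user-facing model code. A real version would also need the backend to supply gradients, which is the hard part.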