Regarding the optimal number of threads: I have seen in other (non-PyMC) applications that using fewer threads can boost performance. This happens in particular when the tasks are memory-intensive; the data transfer overhead then clogs the processor, so the extra threads mostly end up waiting on memory instead of computing.
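One way to experiment with the thread count is to cap the threads used by the numerical backend before it is loaded. The snippet below is only a rough sketch: `limit_threads` is a hypothetical helper name, and it assumes your backend honours the usual `OMP_NUM_THREADS`, `OPENBLAS_NUM_THREADS` and `MKL_NUM_THREADS` environment variables, which depends on how your NumPy was built.

```python
import os

# Environment variables that common threaded backends (OpenMP, OpenBLAS, MKL)
# read when they are loaded. Which ones matter depends on your installation.
THREAD_ENV_VARS = ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS")


def limit_threads(n_threads):
    """Hypothetical helper: request at most `n_threads` threads per backend.

    Must be called before NumPy (or anything that imports it) is imported,
    because the backends typically read these variables only at load time.
    """
    if n_threads < 1:
        raise ValueError("n_threads must be at least 1")
    for name in THREAD_ENV_VARS:
        os.environ[name] = str(n_threads)
    # Return what was set, so it can be logged next to the timing results.
    return {name: os.environ[name] for name in THREAD_ENV_VARS}


settings = limit_threads(2)
print(settings)

# Import NumPy / PyMC only after this point, then time the model as usual.
```

Re-running the same model with a few different values (1, 2, 4, ...) and comparing wall-clock times should show whether fewer threads actually help on your machine.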
It can also help to try different linear algebra libraries (BLAS/LAPACK backends such as OpenBLAS or MKL).
But np.show_config() is known to not always be accurate: