I don't think the other threads come from multiprocessing. More likely they come either from OpenMP through Theano, if you have big datasets somewhere, or from BLAS, if you are using e.g. matrix-vector products.
You can configure the Theano parallelization as described here:
http://deeplearning.net/software/theano/library/config.html
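As a rough sketch of what that configuration could look like from Python: Theano reads its settings from the THEANO_FLAGS environment variable at import time, and the config page documents the `openmp` and `openmp_elemwise_minsize` options. The values below are only illustrative assumptions, not recommendations for your model.

```python
import os

# Illustrative values only (assumptions): enable OpenMP in Theano and keep
# the documented default size threshold for parallel elementwise ops.
theano_flags = {
    "openmp": "True",
    "openmp_elemwise_minsize": "200000",
}

# Theano reads THEANO_FLAGS when it is first imported, so set it before that.
# Note: this overwrites any THEANO_FLAGS already present in the environment.
os.environ["THEANO_FLAGS"] = ",".join(
    f"{key}={value}" for key, value in theano_flags.items()
)

# import theano  # import only after the flags are set
```

Setting `openmp=False` instead should turn the OpenMP parallelization off, if the goal is to get rid of the extra threads.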
For BLAS it depends a bit on which implementation you are using (probably one of MKL, OpenBLAS, or BLIS). On an Intel CPU, MKL is usually the go-to implementation. You should get it automatically if your setup uses conda internally; I'm not sure how that works on your base image.
You should be able to control the number of MKL threads with the environment variable OMP_NUM_THREADS.
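A minimal sketch of setting this from inside the Python process, assuming you want to cap everything at a single thread (the value "1" is just a placeholder). MKL_NUM_THREADS and OPENBLAS_NUM_THREADS are not mentioned above; they are the implementation-specific variables for MKL and OpenBLAS, added here in case your image does not use MKL.

```python
import os

# Hypothetical thread cap; choose whatever fits your machine or container.
N_THREADS = "1"

# These must be set before NumPy or Theano are imported, because the BLAS
# and OpenMP runtimes read them when the libraries are loaded.
for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
    os.environ[var] = N_THREADS

# import numpy as np  # import only after the variables are set
```

Alternatively you can export the variable in the shell or in the image definition before starting Python, which avoids any import-order issues.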
Whether it is worth the trouble really depends on your model. If you do large matrix-vector products, it might very well be.