I am trying to push a trainer to production. I need to sample on some 75K datasets, so I want to parallelize the training.
I noticed a significant drop in speed when switching from local (MacBook Pro Retina 2015, i7) to the cloud. From some 450 iter/sec while sampling with Metropolis, it dropped to 80 iter/sec on a 2.4 GHz Xeon Skylake.
I then tested locally again, this time using Docker on the same MacBook Pro, and performance was halved. I allocated 2 CPUs and 4 GB of memory to Docker. It is the only container running, and I am training with 1 job and 1 chain.
cc @AustinRochford who might have more experience on this.
I got some more info about it. When I run the sampling in Docker on smaller machines, it is faster.
In Google Cloud:
- 1 container on 1 CPU, 3.75 GB: around 300 iter/sec
- 4 containers on 4 CPUs, 15 GB: around 80 iter/sec for each container
This is pretty weird. How many chains are you running?
Could you please try running multiple chains, with the number of chains equal to the number of CPUs?
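Something like the following is what I have in mind. It is only a minimal sketch: `sampler_kwargs` is a hypothetical helper, and it assumes a PyMC3 version where `pm.sample` accepts `chains` and `cores` (older releases used `njobs` instead). Note that `os.cpu_count()` may report the host's CPUs rather than the container's limit, so passing the number explicitly is probably safer inside Docker.

```python
import os


def sampler_kwargs(n_cpus=None):
    """Return keyword arguments that run one chain per CPU.

    Hypothetical helper for illustration; n_cpus falls back to
    os.cpu_count(), which may not reflect a container's CPU limit.
    """
    if n_cpus is None:
        n_cpus = os.cpu_count() or 1
    return {"chains": n_cpus, "cores": n_cpus}


kwargs = sampler_kwargs()

# Inside a model context this would be passed on as, for example
# (assumed call, not tested against this particular setup):
#     trace = pm.sample(draws=1000, step=pm.Metropolis(), **kwargs)
```

If the iter/sec per chain stays close to the single-chain number, the slowdown is more likely to come from threads competing for the same cores than from the sampler itself.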
My guess is that Theano is trying to distribute the computation across multiple CPUs, and the overhead ends up being pretty bad.
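One way to test that guess is to cap the number of threads the numerical libraries can use before anything heavy is imported. This is a sketch under assumptions: `limit_threads` is a hypothetical helper, and which of these environment variables actually matters depends on the BLAS/OpenMP backend your NumPy and Theano builds are linked against. I have not verified that this is the cause here.

```python
import os

# Environment variables commonly read by OpenMP, MKL and OpenBLAS.
THREAD_VARS = ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS")


def limit_threads(n_threads=1):
    """Cap library thread counts; call before importing numpy/theano/pymc3."""
    for var in THREAD_VARS:
        os.environ[var] = str(n_threads)
    # Return the values that were set so they can be logged or checked.
    return {var: os.environ[var] for var in THREAD_VARS}


limit_threads(1)

# Only import the numerical stack after the variables are set, e.g.:
#     import pymc3 as pm
```

The same variables can also be set on the container itself (for example through its environment settings) instead of in Python, which avoids any import-order issues.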
I have the same feeling, but I don’t get why the overhead would be that large.
I’ll post some new info as soon as I have run it.