I have a small model which I need to be able to run on many separate datasets concurrently. The model takes about 5s to compile and roughly 20s to perform ADVI, per dataset. The goal here is to find a way to engineer the model so that it can be easily scaled out to as many CPUs as required, preferably without having to use containers.
The model is bottlenecked by sequential computations that cannot be optimized out (they're a core part of the compute), so each 'run' should be limited to 1 thread. However, given that compilation of the model comprises a substantial portion of the total CPU time, I want to re-use a compiled graph as much as possible. This suggests the following approaches:
Initialise a model and compile the graph once, then use theano.shared tensors to swap the observed data between runs. It's not clear how to 're-initialise' the state of the model between pm.fit calls - should I delete and recreate the ADVI object each time?
Explicitly instruct theano to use at most one CPU, then launch separate instances of the whole python program on the same machine, up to the number of CPU cores available. This might work, but experimentation suggests I'll run into problems with the size of the tmp files from graph-compilation byproducts, as well as theano compile-lock conflicts.
Does anybody know of a good way to deal with this? There's always the approach of simply having a large number of very small (1 CPU, low RAM) container instances, each running a separate instance of the python program and cycling through datasets. However, I'm hoping to avoid that if possible, as I have little experience with docker/kubernetes.