Leverage Apache Spark During PyMC Sampling

Is it possible to efficiently incorporate PyMC sampling within a Spark cluster? Curious to know if anyone has any experience with this.


+1 on this question. I'm working with a large dataset and wondering if Spark can be used to speed up sampling.

From experience with PyMC and Apache Spark, I don't think it's possible, and even if it were, it'd probably be more trouble than it's worth. Spark does its computation in Scala, and I don't think we have a Scala sampler. It might be possible to use Rainier, a Scala PPL designed for scalability, but I'm not sure it's still maintained and I've never seen it done.

I have seen PyMC sampling sped up across a Fargate cluster, though it took some trickery to split the model across workers and then recombine the results into a valid InferenceData.
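
The recombination step was essentially ArviZ's `concat` along the chain dimension. A minimal sketch of that part, assuming each worker ran `pm.sample(chains=1, ...)` on the identical model and wrote its single-chain result to shared storage as netCDF (the file names here are hypothetical):

```python
import arviz as az

# One file per worker, each holding a single-chain InferenceData.
paths = ["chain-0.nc", "chain-1.nc", "chain-2.nc", "chain-3.nc"]
idatas = [az.from_netcdf(p) for p in paths]

# Concatenate along the chain dimension; reset_dim (the default)
# renumbers the chains so the combined object is a valid
# multi-chain InferenceData for az.summary, az.plot_trace, etc.
idata = az.concat(idatas, dim="chain", reset_dim=True)
print(az.summary(idata))
```

The catch is that diagnostics like r_hat are only meaningful if every worker really did sample the identical model, just with different seeds.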

Are you having trouble with sampling speed?

What about within PySpark, which talks to the JVM directly via Py4J?

The rest of my project runs within PySpark, and due to the size of my data it has become inefficient to run `pm.sample` outside of this PySpark context.
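
Conceptually, I was imagining something like the rough sketch below: one chain per Spark task, recombined on the driver. The model is just a placeholder, and it assumes PyMC and ArviZ are installed on every executor and that `InferenceData` objects pickle cleanly back to the driver:

```python
import arviz as az
from pyspark.sql import SparkSession

def run_chain(seed):
    # Import inside the task so the import happens on the executor.
    import pymc as pm
    # Placeholder model; the real one would be rebuilt identically
    # in every task.
    with pm.Model():
        mu = pm.Normal("mu", 0, 1)
        pm.Normal("obs", mu, 1, observed=[0.1, -0.3, 0.5])
        return pm.sample(chains=1, random_seed=seed, progressbar=False)

spark = SparkSession.builder.getOrCreate()
seeds = [1, 2, 3, 4]  # one chain per task
idatas = (
    spark.sparkContext
    .parallelize(seeds, numSlices=len(seeds))
    .map(run_chain)
    .collect()
)

# Stitch the single-chain results back into one InferenceData.
idata = az.concat(idatas, dim="chain")
```

No idea yet whether this actually beats `pm.sample(chains=4, cores=4)` on a single large node, since every task pays the model compilation cost again.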

I've come across an experiment with Stan and Spark; it's not exactly the solution I'm looking for, but perhaps it's helpful.