Leverage Apache Spark During PyMC Sampling

Is it possible to efficiently incorporate PyMC sampling within a Spark cluster? Curious to know if anyone has any experience with this.


+1 on this question. I'm working with a large dataset and wondering if Spark can be used to speed up sampling.

From experience with PyMC and Apache Spark, I don't think it's possible, and even if it were, it'd probably be more trouble than it's worth. Spark does its computation in Scala, and I don't think we have a Scala sampler. It might be possible to use Rainier, a Scala PPL designed for scalability, but I'm not sure it's still maintained and I've never seen it done.

I have seen PyMC sampling sped up across a Fargate cluster, though it took some trickery to split the model across workers and then recombine the results into a valid InferenceData.
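
The recombination step was essentially ArviZ's `concat` along the chain dimension. A minimal sketch of that part, assuming each worker ran `pm.sample(chains=1, ...)` on the identical model and wrote its single-chain result to shared storage as netCDF (the file names here are hypothetical):

```python
import arviz as az

# One file per worker, each holding a single-chain InferenceData.
paths = ["chain-0.nc", "chain-1.nc", "chain-2.nc", "chain-3.nc"]
idatas = [az.from_netcdf(p) for p in paths]

# Concatenate along the chain dimension; reset_dim (the default)
# renumbers the chains so the combined object is a valid
# multi-chain InferenceData for az.summary, az.plot_trace, etc.
idata = az.concat(idatas, dim="chain", reset_dim=True)
print(az.summary(idata))
```

The catch is that diagnostics like r_hat are only meaningful if every worker really did sample the identical model, just with different seeds.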

Are you having trouble with sampling speed?

What about within PySpark, which talks to the JVM directly via Py4J?

The rest of my project runs within PySpark, and due to the size of my data it has become inefficient to run `pm.sample` outside of this PySpark context.
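
Conceptually, I was imagining something like the rough sketch below: one chain per Spark task, recombined on the driver. The model is just a placeholder, and it assumes PyMC and ArviZ are installed on every executor and that `InferenceData` objects pickle cleanly back to the driver:

```python
import arviz as az
from pyspark.sql import SparkSession

def run_chain(seed):
    # Import inside the task so the import happens on the executor.
    import pymc as pm
    # Placeholder model; the real one would be rebuilt identically
    # in every task.
    with pm.Model():
        mu = pm.Normal("mu", 0, 1)
        pm.Normal("obs", mu, 1, observed=[0.1, -0.3, 0.5])
        return pm.sample(chains=1, random_seed=seed, progressbar=False)

spark = SparkSession.builder.getOrCreate()
seeds = [1, 2, 3, 4]  # one chain per task
idatas = (
    spark.sparkContext
    .parallelize(seeds, numSlices=len(seeds))
    .map(run_chain)
    .collect()
)

# Stitch the single-chain results back into one InferenceData.
idata = az.concat(idatas, dim="chain")
```

No idea yet whether this actually beats `pm.sample(chains=4, cores=4)` on a single large node, since every task pays the model compilation cost again.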

I've come across an experiment with Stan and Spark; it's not exactly the solution I'm looking for, but perhaps it's helpful.