Slower Multiprocess Sampling vs Sequential PyMC (v5.10.4)

Hello, I’m new to PyMC and Bayesian statistics, so forgive me if this is a dumb question.

I’m running a script that does MCMC sampling using the NUTS sampler. The dataset is split into 2 groups (treatment and baseline), and each row represents the yield (%) of one batch. The rows are also labelled by year-week; here is the number of batches per week:
[table: number of batches per week]
Currently, my script runs iteratively per week using a while loop where the data accumulates. So, for example, the 1st loop iteration would run on 2023W35, then the next would run on 2023W35 + 2023W36 (in this case, 2120 rows of data), and so on. The issue I’m running into is that when sampling on 4 chains, setting cores = 1 runs a lot faster than setting cores = 4. When multiprocessing, each loop iteration gets stuck at this stage for around 80 seconds:
[screenshot: sampler output where multiprocess sampling stalls]
and then multiprocess sampling runs as expected, taking around 10 seconds per iteration. However, when I run it sequentially (cores = 1), it doesn’t get stuck at that stage at all and sequential sampling runs normally, taking around 7 seconds per iteration. I eventually want to run this script on much larger datasets (a longer date range and more data per week), and I don’t want to rely on sequential sampling because I tried it once and it took over 10 hours to run. I would expect multiprocessing to be much faster, but that doesn’t seem to be the case here. Here are my computer specs:
[screenshot: computer specs]
I’m running this in VS Code but also ran into this issue when running in Spyder. Does anyone know why I might be having this issue? I can provide more info about my code if needed, but here is a simplified sketch of the loop structure (the file and column names are illustrative):
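import pandas as pd

df = pd.read_csv("batches.csv")  # hypothetical input file
weeks = sorted(df["week"].unique())

i = 0
while i < len(weeks):
    # Accumulate all rows up to and including the current week
    cumulative = df[df["week"].isin(weeks[: i + 1])]
    dataBaseline = cumulative.loc[cumulative["group"] == "baseline", "yield"].to_numpy()
    dataTreatment = cumulative.loc[cumulative["group"] == "treatment", "yield"].to_numpy()
    # ... build the model on the accumulated data and call
    # pm.sample(chains=4, cores=nCores) here
    i += 1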

Can you try the latest version, v5.15.1?

Hi, thanks for the suggestion, but I’m still running into the same issue. It still gets stuck at this stage for around 80 seconds:
[screenshot: same sampling stage]
and then runs multiprocess sampling in around 7-8 seconds, as expected. Tested in both Spyder and VS Code.

Do you have any custom operations in your model that may not behave well in multi-processing?

Otherwise, we may need to know more details about the model or get a reproducible script to investigate.

Here is what my model looks like:

import numpy as np
import pymc as pm

with pm.Model() as model:
    # Priors for the mean of treatment and baseline groups
    muBaseline = pm.Normal('muBaseline', mu=np.mean(dataBaseline), sigma=10)
    muTreatment = pm.Normal('muTreatment', mu=np.mean(dataBaseline), sigma=10)

    # Priors for the standard deviation of treatment and baseline groups
    sigmaTreatment = pm.HalfNormal('sigmaTreatment', sigma=10)
    sigmaBaseline = pm.HalfNormal('sigmaBaseline', sigma=10)

    # Likelihoods for the two groups
    Treatment_likelihood = pm.Normal('Treatment_likelihood', mu=muTreatment, sigma=sigmaTreatment, observed=dataTreatment)
    Baseline_likelihood = pm.Normal('Baseline_likelihood', mu=muBaseline, sigma=sigmaBaseline, observed=dataBaseline)

    # Quantity of interest: difference between the group means
    diff_means = pm.Deterministic('diff_means', muTreatment - muBaseline)

    trace = pm.sample(500, chains=4, cores=nCores)

posterior = trace.posterior.stack(sample=['chain', 'draw'])
vals = posterior['diff_means']

I have one dataframe that I split into 2 based on the group (baseline/treatment) and use those as the inputs. I sample and then take the difference of means as my output.

Strange, looks like a very simple model. What’s the size of your dataset?

It’s in the original post, but the total number of batches/rows is 8517 (if you sum the number of batches per week). However, each iteration of the loop runs on the total number of rows up to that week: loop 1 (2023W34) runs on 741 rows, then loop 2 (2023W35) runs on 741 + 1114 = 1855 rows, and so on. Also, if what you’re asking for is the file size, it’s 331 KB.

I think you might be running into compilation time issues.
You can avoid all recompilation if you use nutpie to sample:

import pymc as pm
import numpy as np
import pytensor.tensor as pt
import nutpie

dataBaseline = np.random.randn(8000)
dataTreatment = np.random.randn(8000)

with pm.Model() as model:
    dataBaseline = pm.Data("dataBaseline", dataBaseline)
    dataTreatment = pm.Data("dataTreatment", dataTreatment)

    # Priors for the mean of treatment and baseline groups
    muBaseline = pm.Normal('muBaseline', mu=pt.mean(dataBaseline), sigma=10)
    muTreatment = pm.Normal('muTreatment', mu=pt.mean(dataBaseline), sigma=10)

    # Priors for the standard deviation of treatment and baseline groups
    sigmaTreatment = pm.HalfNormal('sigmaTreatment', sigma=10)
    sigmaBaseline = pm.HalfNormal('sigmaBaseline', sigma=10)

    Treatment_likelihood = pm.Normal('Treatment_likelihood', mu=muTreatment, sigma=sigmaTreatment, observed=dataTreatment)
    Baseline_likelihood = pm.Normal('Baseline_likelihood', mu=muBaseline, sigma=sigmaBaseline, observed=dataBaseline)

    diff_means = pm.Deterministic('diff_means', muTreatment - muBaseline)
# Compile once (takes about 8s on my machine)
compiled = nutpie.compile_pymc_model(model)

# Run with some new dataset (takes about 80ms)
compiled_data1 = compiled.with_data(dataBaseline=np.random.randn(1000), dataTreatment=np.random.randn(1000))
tr = nutpie.sample(compiled_data1)

You can then change the dataset in the loop and resample without having to recompile.
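In the accumulating loop, that could look something like this (a sketch; weekly_baseline and weekly_treatment are assumed to hold the accumulated arrays for each week):

for baseline_arr, treatment_arr in zip(weekly_baseline, weekly_treatment):
    # Swap in the new data; nothing is recompiled here
    compiled_week = compiled.with_data(
        dataBaseline=baseline_arr,
        dataTreatment=treatment_arr,
    )
    tr = nutpie.sample(compiled_week, chains=4)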

Ignoring the technical side of how to do this, maybe it is worth thinking a bit about why you want to do this in the first place. I don’t think this is wrong in any way, but it is a bit unusual.
If you are interested in how the treatment effect changes over time, for instance, would it not be better to fit the whole dataset but allow the treatment effect to vary?

So for instance you could do something like

import pymc as pm
import numpy as np
import pytensor.tensor as pt
import pandas as pd

data = pd.DataFrame({
    "week": ["week1", "week1", "week1", "week2", "week2"],
    "batch": list(range(5)),
    "group": ["treatment", "baseline", "baseline", "treatment", "baseline"],
    "yield_value": np.random.rand(5) * 100,
})

week_idx, weeks = pd.factorize(data["week"])
is_treatment = (data["group"] == "treatment").values

coords = {
    "week": weeks
}

with pm.Model(coords=coords) as model:
    # What is the mean over time of the baseline value
    intercept = pm.Normal("intercept", sigma=10)

    # How much does the baseline differ between weeks?
    week_effect_sd = pm.HalfNormal("week_effect_sd", sigma=3)

    # How much does each individual week's baseline differ from the mean?
    # This uses a non-centered parametrization
    week_effect_centered = pm.ZeroSumNormal("week_effect_centered", dims="week")
    week_effect = pm.Deterministic(
        "week_effect",
        week_effect_sd * week_effect_centered,
        dims="week",
    )

    # This would be a centered parametrization. Sometimes one of these
    # samples faster and more stably than the other.
    #week_effect = pm.ZeroSumNormal("week_effect", sigma=week_effect_sd, dims="week")

    # How much bigger is the treatment value on average?
    treatment_effect = pm.Normal("treatment_effect", sigma=5)

    # How does the per-week value differ from that average?
    week_treatment_effect_sd = pm.HalfNormal("week_treatment_effect_sd", sigma=3)
    week_treatment_effect_centered = pm.ZeroSumNormal("week_treatment_effect_centered", dims="week")
    week_treatment_effect = pm.Deterministic(
        "week_treatment_effect",
        week_treatment_effect_sd * week_treatment_effect_centered,
        dims="week",
    )

    mu = (
        intercept
        + week_effect[week_idx]
        + is_treatment * treatment_effect
        + is_treatment * week_treatment_effect[week_idx]
    )

    sigma = pm.HalfNormal("sigma")
    pm.Normal("yield_value", mu=mu, sigma=sigma, observed=data["yield_value"])

If your observations are percentages, maybe you also want to change the likelihood?
Assuming the yield is always strictly between 0 and 100, you could for instance use

sigma = pm.HalfNormal("sigma")
pm.LogitNormal("yield_ratio", mu=mu, sigma=sigma, observed=data["yield_value"] / 100)

Hello! Thank you so much for your suggestion. It took some refactoring, but using nutpie to compile and then sample the model worked perfectly. The runtime went from 115 seconds on average to just 18 seconds (for 11 weeks’ worth of data).

Also, to answer your question about fitting the whole dataset but allowing the treatment effect to vary: right now I am running the script this way to perform retrospective analysis on some known cases, to see how effective this method is at showing roughly when the treatment was better or worse. In terms of my end goal, I don’t really care about the weekly effect of the treatment; splitting it by week and running it iteratively with accumulation is more for experimental purposes. However, I will take your suggestions into consideration as I continue working on this, so thank you very much for your support.