Slower Multiprocess Sampling vs Sequential PyMC (v5.10.4)

Hello, I’m new to PyMC and Bayesian statistics, so forgive me if this is a dumb question.

I’m running a script that does MCMC sampling using the NUTS sampler. The dataset is split into 2 groups (treatment and baseline), and each row represents the yield (%) of one batch. The rows are also labelled by year-week; here is the number of batches per week:
[table: number of batches per week]
Currently, my script runs iteratively per week using a while loop where the data accumulates. So, for example, the 1st loop iteration would run on 2023W35, then the next would run on 2023W35 + 2023W36 (in this case, 2120 rows of data), and so on. The issue I’m running into is that when sampling on 4 chains, setting cores = 1 runs a lot faster than setting cores = 4. When multiprocessing, each loop iteration gets stuck at this stage for around 80 seconds:
[screenshot: sampler output where multiprocess sampling stalls]
and then multiprocess sampling runs as expected, taking around 10 seconds per iteration. However, when I run it sequentially (cores = 1), it doesn’t get stuck at that stage at all and sequential sampling runs normally, taking around 7 seconds per iteration. I eventually want to run this script on much larger datasets (a longer date range and more data per week), and I don’t want to rely on sequential sampling because I tried it once and it took over 10 hours to run. I would expect multiprocessing to be much faster, but that doesn’t seem to be the case here. Here are my computer specs:
[screenshot: computer specs]
I’m running this in VS Code but also ran into this issue when running in Spyder. Does anyone know why I might be having this issue? I can provide more info about my code if needed, but here is a simplified sketch of the loop structure (the file and column names are illustrative):
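import pandas as pd

df = pd.read_csv("batches.csv")  # hypothetical input file
weeks = sorted(df["week"].unique())

i = 0
while i < len(weeks):
    # Accumulate all rows up to and including the current week
    cumulative = df[df["week"].isin(weeks[: i + 1])]
    dataBaseline = cumulative.loc[cumulative["group"] == "baseline", "yield"].to_numpy()
    dataTreatment = cumulative.loc[cumulative["group"] == "treatment", "yield"].to_numpy()
    # ... build the model on the accumulated data and call
    # pm.sample(chains=4, cores=nCores) here
    i += 1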

Can you try the latest version, v5.15.1?

Hi, thanks for the suggestion, but I’m still running into the same issue. It still gets stuck at this stage for around 80 seconds:
[screenshot: same sampling stage]
and then runs multiprocess sampling in around 7-8 seconds, as expected. Tested in both Spyder and VS Code.

Do you have any custom operations in your model that may not behave well in multi-processing?

Otherwise, we may need to know more details about the model or get a reproducible script to investigate.

Here is what my model looks like:

import numpy as np
import pymc as pm

with pm.Model() as model:
    # Priors for the mean of treatment and baseline groups
    muBaseline = pm.Normal('muBaseline', mu=np.mean(dataBaseline), sigma=10)
    muTreatment = pm.Normal('muTreatment', mu=np.mean(dataBaseline), sigma=10)

    # Priors for the standard deviation of treatment and baseline groups
    sigmaTreatment = pm.HalfNormal('sigmaTreatment', sigma=10)
    sigmaBaseline = pm.HalfNormal('sigmaBaseline', sigma=10)

    # Likelihoods for the two groups
    Treatment_likelihood = pm.Normal('Treatment_likelihood', mu=muTreatment, sigma=sigmaTreatment, observed=dataTreatment)
    Baseline_likelihood = pm.Normal('Baseline_likelihood', mu=muBaseline, sigma=sigmaBaseline, observed=dataBaseline)

    # Quantity of interest: difference between the group means
    diff_means = pm.Deterministic('diff_means', muTreatment - muBaseline)

    trace = pm.sample(500, chains=4, cores=nCores)

posterior = trace.posterior.stack(sample=['chain', 'draw'])
vals = posterior['diff_means']

I have one dataframe that I split into 2 based on the group (baseline/treatment) and use those as the inputs. I sample and then take the difference of means as my output.

Strange, looks like a very simple model. What’s the size of your dataset?

It’s in the original post, but the total number of batches/rows is 8517 (if you sum the number of batches per week). However, each iteration of the loop runs on the total number of rows up to that week: loop 1 (2023W34) runs on 741 rows, then loop 2 (2023W35) runs on 741 + 1114 = 1855 rows, and so on. Also, if what you’re asking for is the file size, it’s 331 KB.

I think you might be running into compilation time issues.
You can avoid all recompilation if you use nutpie to sample:

import pymc as pm
import numpy as np
import pytensor.tensor as pt
import nutpie

dataBaseline = np.random.randn(8000)
dataTreatment = np.random.randn(8000)

with pm.Model() as model:
    dataBaseline = pm.Data("dataBaseline", dataBaseline)
    dataTreatment = pm.Data("dataTreatment", dataTreatment)

    # Priors for the mean of treatment and baseline groups
    muBaseline = pm.Normal('muBaseline', mu=pt.mean(dataBaseline), sigma=10)
    muTreatment = pm.Normal('muTreatment', mu=pt.mean(dataBaseline), sigma=10)

    # Priors for the standard deviation of treatment and baseline groups
    sigmaTreatment = pm.HalfNormal('sigmaTreatment', sigma=10)
    sigmaBaseline = pm.HalfNormal('sigmaBaseline', sigma=10)

    Treatment_likelihood = pm.Normal('Treatment_likelihood', mu=muTreatment, sigma=sigmaTreatment, observed=dataTreatment)
    Baseline_likelihood = pm.Normal('Baseline_likelihood', mu=muBaseline, sigma=sigmaBaseline, observed=dataBaseline)

    diff_means = pm.Deterministic('diff_means', muTreatment - muBaseline)
# Compile once (takes about 8s on my machine)
compiled = nutpie.compile_pymc_model(model)

# Run with some new dataset (takes about 80ms)
compiled_data1 = compiled.with_data(dataBaseline=np.random.randn(1000), dataTreatment=np.random.randn(1000))
tr = nutpie.sample(compiled_data1)

You can then change the dataset in the loop and resample without having to recompile.
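In the accumulating loop, that could look something like this (a sketch; weekly_baseline and weekly_treatment are assumed to hold the accumulated arrays for each week):

for baseline_arr, treatment_arr in zip(weekly_baseline, weekly_treatment):
    # Swap in the new data; nothing is recompiled here
    compiled_week = compiled.with_data(
        dataBaseline=baseline_arr,
        dataTreatment=treatment_arr,
    )
    tr = nutpie.sample(compiled_week, chains=4)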

Ignoring the technical side of how to do this, maybe it is worth thinking a bit about why you want to do this in the first place. I don’t think this is wrong in any way, but it is a bit unusual.
If you are interested in how the treatment effect changes over time, for instance, would it not be better to fit the whole dataset but allow the treatment effect to vary?

So for instance you could do something like

import pymc as pm
import numpy as np
import pytensor.tensor as pt
import pandas as pd

data = pd.DataFrame({
    "week": ["week1", "week1", "week1", "week2", "week2"],
    "batch": list(range(5)),
    "group": ["treatment", "baseline", "baseline", "treatment", "baseline"],
    "yield_value": np.random.rand(5) * 100,
})

week_idx, weeks = pd.factorize(data["week"])
is_treatment = (data["group"] == "treatment").values

coords = {
    "week": weeks
}

with pm.Model(coords=coords) as model:
    # What is the mean over time of the baseline value
    intercept = pm.Normal("intercept", sigma=10)

    # How much does the baseline differ between weeks?
    week_effect_sd = pm.HalfNormal("week_effect_sd", sigma=3)

    # How much does each individual week's baseline differ from the mean?
    # This uses a non-centered parametrization
    week_effect_centered = pm.ZeroSumNormal("week_effect_centered", dims="week")
    week_effect = pm.Deterministic(
        "week_effect",
        week_effect_sd * week_effect_centered,
        dims="week",
    )

    # This would be a centered parametrization. Sometimes one of these
    # samples faster and more stably than the other.
    #week_effect = pm.ZeroSumNormal("week_effect", sigma=week_effect_sd, dims="week")

    # How much bigger is the treatment value on average?
    treatment_effect = pm.Normal("treatment_effect", sigma=5)

    # How does the per-week value differ from that average?
    week_treatment_effect_sd = pm.HalfNormal("week_treatment_effect_sd", sigma=3)
    week_treatment_effect_centered = pm.ZeroSumNormal("week_treatment_effect_centered", dims="week")
    week_treatment_effect = pm.Deterministic(
        "week_treatment_effect",
        week_treatment_effect_sd * week_treatment_effect_centered,
        dims="week",
    )

    mu = (
        intercept
        + week_effect[week_idx]
        + is_treatment * treatment_effect
        + is_treatment * week_treatment_effect[week_idx]
    )

    sigma = pm.HalfNormal("sigma")
    pm.Normal("yield_value", mu=mu, sigma=sigma, observed=data["yield_value"])

If your observations are percentages, maybe you also want to change the likelihood?
Assuming the yield is always strictly between 0 and 100, you could for instance use

sigma = pm.HalfNormal("sigma")
pm.LogitNormal("yield_ratio", mu=mu, sigma=sigma, observed=data["yield_value"] / 100)

Hello! Thank you so much for your suggestion. It took some refactoring, but using nutpie to compile and then sample the model worked perfectly. The runtime went from 115 seconds on average to just 18 seconds (for 11 weeks’ worth of data).

Also, to answer your question about fitting the whole dataset but allowing the treatment effect to vary: right now I am running the script this way to perform retrospective analysis on some known cases, to see how effective this method is at showing roughly when the treatment was better or worse. In terms of my end goal, I don’t really care about the weekly effect of the treatment; splitting it by week and running it iteratively with accumulation is more for experimental purposes. However, I will take your suggestions into consideration as I continue working on this, so thank you very much for your support.