Sampling draw time increases massively near "finish line" for 1M observed rows

I have a model that is functionally identical to Bayesian Estimation Supersedes the T-Test — PyMC3 3.10.0 documentation

When I feed it ~1M observed rows sampling proceeds swiftly, at a rate of something like 2s/draw. Then, as we approach ~95%, it slows to something like 75s/draw. I’ve got buckets of available ram, CPU capacity, and disk I/O to spare. It appears that this stage proceeds with only a single core, based on utilization. The “fast” stage fully utilizes my processors, as specified. I tried specifying cores=actual-1 to give it some room but it didn’t change the result.

Are there any “gotchas” I should be looking out for that would explain this behaviour, or perhaps is this just a visual/reporting issue?

The only weird thing I’m doing is specifying return_inferencedata=True.

x86_64 ubuntu in a Oracle Virtualbox virtual machine running on Windows 10 w/ PAE/NX and Nested VT-x/AMD-V enabled.
jupyter lab

I am not completely sure it’s your case but I think it’s worth noting. By default, after sampling, pymc3 computes convergence diagnostics rhat and ess which (independently of return_inferencedata being True or False) calls ArviZ under the hood to do so. The default for pymc-arviz conversion is to compute and store pointwise log likelihood values, which are used for loo/waic that are useful for model comparison as well as some level of model criticism (especially when combined with posterior predictive checks).

This has two important drawbacks when the number of observations is large. The first is that the memory used can be quite large as it is # of observations times # of samples, with a default of 1000 draws and 4 chains and in your 1M case it would mean 4e9 floats to store. The second is that due to limitations in Theano and Aesara, this operation is not vectorized, so the converter loops over each variable, then chain and then draw to compute these pointwise log likelihoods, which is generally fast, but with the 4e9 multiplier can end up being noticeable or even annoying.

To prevent this computation and storage of pointwise log likelihood you need to pass idata_kwargs={"log_likelihood": False} to pm.sample. Having said that however, this should not happen at ~95% but at 100%. Storing deterministics can also be challenging memory wise if they are large arrays, but again I don’t see the relation to the ~95%.

Thank you for your reply! I will try that. When I can run sampling again :joy:

I woke up to an output complaining of divergences so something like that eventually happened. There were none mentioned at ~99%, so what you mention may have been what found them. Arviz also complained about my pymc3 version (freshly installed, though maybe not via conda-forge) so I upgraded that to 3.11.2 and now I have new gotchas to poke around with.

Now I’m getting failed chains due to the “The derivative of RV … is zero”, which goes away if I specify init=“map”, which is discouraged (and much slower, it seems?). I’ll dig into my data and model and try to figure out what’s going on. This post seems helpful. Cookbook — Bayesian Modelling with PyMC3 | Eigenfoo

update: scaling my observed data (via sklearn.processing.scale) before giving it to pymc3 seems to have addressed the “the derivative of RV …” problem. Maybe it should be noted that my data consists of small floating point numbers with a bulk stdev of something like 1e-2 and a mean value of ~zero.

Wanted to update that I had the RV issue again for a simplified model, but was able to solve it by swapping out a uniform distribution prior for a standard deviation parameter with a half cauchy. I had previously tried changing the low and high ranges to no avail. Not sure if this is a repeatable solution for other people or just a sampling from a chaotic distribution of problems and solutions, but maybe this note will help someone else.