Sampling draw time increases massively near "finish line" for 1M observed rows

I am not completely sure it’s your case, but I think it’s worth noting. By default, after sampling, PyMC3 computes the convergence diagnostics `rhat` and `ess`, which (independently of whether `return_inferencedata` is True or False) calls ArviZ under the hood. The default for the PyMC3-to-ArviZ conversion is to compute and store pointwise log-likelihood values; these are used for loo/waic, which are useful for model comparison as well as some level of model criticism (especially when combined with posterior predictive checks).
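For context, here is a minimal sketch of where those stored pointwise values end up being used; it assumes `idata` is an InferenceData object returned by `pm.sample` with the default conversion:

```python
import arviz as az

# both estimators read the pointwise log-likelihood stored in idata
az.loo(idata)   # Pareto-smoothed importance-sampling LOO-CV
az.waic(idata)  # widely applicable information criterion
```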

This has two important drawbacks when the number of observations is large. The first is that the memory used can be quite large: the stored array is # of observations times # of samples, so with the defaults of 1000 draws and 4 chains, your 1M case would mean 4e9 floats to store (roughly 30 GiB at 64-bit precision). The second is that, due to limitations in Theano and Aesara, this operation is not vectorized: the converter loops over each variable, then each chain, then each draw to compute the pointwise log-likelihoods. That is generally fast, but with the 4e9 multiplier it can end up being noticeable or even annoying.
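To make the memory cost concrete, here is the back-of-the-envelope arithmetic for the numbers above (the shapes are the defaults mentioned; the 8 bytes assume float64):

```python
n_obs, n_draws, n_chains = 1_000_000, 1000, 4

n_values = n_obs * n_draws * n_chains  # 4e9 pointwise log-likelihood values
n_bytes = n_values * 8                 # float64 takes 8 bytes each
print(f"{n_values:.1e} values, ~{n_bytes / 2**30:.0f} GiB")  # ~30 GiB
```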

To prevent this computation and storage of the pointwise log-likelihood, pass `idata_kwargs={"log_likelihood": False}` to `pm.sample`. Having said that, this computation only starts once sampling has finished, so it should stall the progress bar at 100%, not at ~95%. Storing deterministics can also be challenging memory-wise if they are large arrays, but again I don’t see the relation to the ~95%.
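Here is a self-contained sketch of the call; the toy model and data are just placeholders for your actual 1M-row model:

```python
import numpy as np
import pymc3 as pm

# hypothetical toy data standing in for the 1M observed rows
y = np.random.normal(loc=1.0, scale=2.0, size=1_000)

with pm.Model():
    mu = pm.Normal("mu", mu=0.0, sigma=10.0)
    sigma = pm.HalfNormal("sigma", sigma=5.0)
    pm.Normal("obs", mu=mu, sigma=sigma, observed=y)

    idata = pm.sample(
        draws=1000,
        chains=4,
        return_inferencedata=True,
        # skip computing/storing the pointwise log-likelihood
        idata_kwargs={"log_likelihood": False},
    )
```

Keep in mind that with the log-likelihood disabled, `az.loo`/`az.waic` will no longer work on the resulting InferenceData, so only do this if you don’t need model comparison.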