MCMC is really slow even with C++ compiler

Hi!

So I installed Anaconda 3 and am trying to use MCMC to fit a Bayesian volatility model for returns (I am following the tutorial from Advanced Algorithmic Trading). The problem is that the sampling is taking a lot of time.

At first I didn’t have a C++ compiler and thought that was the reason for the slow sampling. However, even after I installed one with conda install m2w64-toolchain (which made the “No C++ compiler” warning disappear), it’s equally slow. I’m getting a few samples (I assume they’re called samples?) per second at most. The author of the book wrote that the same code took 15-20 minutes on his desktop PC. I am also running the code on a reasonably good desktop computer, yet it seems to take hours for me.

Does anyone have any idea what the problem could be? The model I am using is:

# model, log_returns and samples are defined earlier in the script
with model:
    sigma = pm.Exponential('sigma', 50.0, testval=0.1)
    nu = pm.Exponential('nu', 0.1)
    s = pm.GaussianRandomWalk('s', sigma**-2, shape=len(log_returns))
    logrets = pm.StudentT('logrets', nu, lam=pm.math.exp(-2.0*s), observed=log_returns)
    trace = pm.sample(samples, progressbar=True)

I am using Python 3.7 with Anaconda 3 and using the Anaconda Prompt to execute my Python script. I’m using Windows 10 and could provide further specifications if needed.

Any help is greatly appreciated!

Hi Max,
Thanks for your question.

  • Have you done some prior predictive checks? Exponential and StudentT are fat-tailed distributions and exp(50) can give super huge values, so I would check if the slowness isn’t coming from this. Depending on your domain and domain expertise, these priors could be justified, or maybe they are too wide :man_shrugging:
  • What’s the scale of the data? If they are on a huge scale, that can get very hard to sample. In these cases, standardizing the data usually helps.
  • What’s the size of your data? If it’s very big, then the model taking a long time to sample is less surprising.
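
A prior predictive check along these lines can be sketched without PyMC3 at all, using only the standard library. This is a rough sketch under stated assumptions: the second argument of pm.Exponential is read as a rate (so the prior mean is 1/rate), the random walk's sigma**-2 is read as a precision, and the walk is started at 0, which is a choice made here for illustration and not part of the posted model. The function name prior_predictive_returns is hypothetical.

```python
import math
import random


def prior_predictive_returns(n_steps, rng, sigma_rate=50.0, nu_rate=0.1):
    """Draw one simulated return series from the priors alone (no data)."""
    sigma = rng.expovariate(sigma_rate)  # Exponential with a rate: mean is 1/rate
    nu = rng.expovariate(nu_rate)        # degrees of freedom of the Student-t
    s = 0.0                              # assumption: log-volatility starts at 0
    returns = []
    for _ in range(n_steps):
        s += rng.gauss(0.0, sigma)       # precision sigma**-2 means step sd = sigma
        # Student-t draw: normal / sqrt(chi2(nu) / nu), chi2(nu) = Gamma(nu/2, scale 2)
        z = rng.gauss(0.0, 1.0)
        chi2 = max(rng.gammavariate(nu / 2.0, 2.0), 1e-300)  # guard against underflow to 0
        t = z / math.sqrt(chi2 / nu)
        returns.append(math.exp(s) * t)  # lam = exp(-2s) is a precision, so scale = exp(s)
    return sigma, nu, returns


rng = random.Random(0)
draws = [prior_predictive_returns(250, rng) for _ in range(200)]

mean_sigma = sum(d[0] for d in draws) / len(draws)
abs_returns = sorted(abs(r) for d in draws for r in d[2])
median_abs_return = abs_returns[len(abs_returns) // 2]

print("mean prior sigma:", mean_sigma)  # theory says close to 1/50
print("median |simulated return|:", median_abs_return)
```

Comparing the simulated magnitudes with the range of the real log returns gives a first impression of whether the priors put mass on plausible values.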

Also, don’t forget to update your PyMC to the brand new 3.9.1 if you can :wink: There are lots of new features in there!
Hope this helps :vulcan_salute:

Hi Alex!

Thanks for your response!

> Have you done some prior predictive checks? Exponential and StudentT are fat-tailed distributions and exp(50) can give super huge values, so I would check if the slowness isn’t coming from this. Depending on your domain and domain expertise, these priors could be justified, or maybe they are too wide :man_shrugging:

Essentially all of the code is from a tutorial, but 50 is supposed to reflect the large uncertainty in the degrees-of-freedom parameter. Honestly, I don’t know whether this is justified. I’m very new to this model, so I don’t know exactly which shapes are reasonable to assume for the prior. I could try a prior predictive check to see which values I get though.

> What’s the scale of the data? If they are on a huge scale, that can get very hard to sample. In these cases, standardizing the data usually helps.

I’m not sure what the difference between data scale and data size is. If by scale you mean the range, it varies between approximately -2.5 and 2.5. Tell me if it is something else!

> What’s the size of your data? If it’s very big, then the model taking a long time to sample is less surprising.

log_returns is of size 2516, and I’m drawing 2000 samples. This is exactly the same as in the book. I guess my main concern is that the author said the same code took him 15-20 minutes, which makes me wonder whether my setup is flawed. Sure, he could have a better GPU, etc., but I don’t know whether that is enough to explain the performance difference. In my experience, differences this significant, with my simulation taking hours and his a couple of minutes, are usually due to optimization problems. Of course, as you’re stating, it could be the model setup itself (initial guesses, etc.) that causes the problem, but since we both use the same code, wouldn’t such a difference probably stem from either computer differences or changes in the packages over time?

> Also, don’t forget to update your PyMC to the brand new 3.9.1 if you can :wink: There are lots of new features in there!

That’s the one I’m using! :slight_smile:

Do you think this sort of performance is normal given the setup described, or does something seem unoptimized besides the model setup?

Thanks for the clear and detailed answer, Max!

I’d do that, especially because, with your current parametrization, sigma is transformed and then fed to the random walk, which is then retransformed and fed to the StudentT, so it’s really hard to see what Exponential(50) implies on the outcome scale and whether it’s justified – my guess is 50 is really gigantic, especially if your data go from -2.5 to 2.5.
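
Written out, the chain of transformations looks roughly like this. It is a reading of the posted code, assuming PyMC3's conventions that the second argument of Exponential is a rate, that the value sigma**-2 passed to GaussianRandomWalk is a precision, and that lam in StudentT is also a precision:

```latex
\begin{aligned}
\sigma &\sim \mathrm{Exponential}(\lambda = 50), & \mathbb{E}[\sigma] &= 1/50 = 0.02,\\
\nu &\sim \mathrm{Exponential}(\lambda = 0.1), & \mathbb{E}[\nu] &= 1/0.1 = 10,\\
s_t \mid s_{t-1} &\sim \mathcal{N}\!\left(s_{t-1},\ \sigma^{2}\right), & \tau &= \sigma^{-2},\\
r_t \mid s_t &\sim \mathrm{StudentT}\!\left(\nu,\ \lambda_t = e^{-2 s_t}\right), & \text{scale} &= \lambda_t^{-1/2} = e^{s_t}.
\end{aligned}
```

Under that reading the 50 acts as a rate, so a larger value pulls the random-walk step size toward zero instead of widening it; a prior predictive check is still the way to confirm what it means on the outcome scale.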

The range is exactly what I meant by scale :slight_smile: You could standardize your data and see whether that helps, but the difference shouldn’t be that big, as the scale is already reasonable.
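
Standardizing could look something like the sketch below. The values in example_log_returns are made up for illustration and are not the AMZN data, and standardize is a hypothetical helper. Note that dividing by the standard deviation shifts the log-volatility s by a constant, so the mean and sd are returned in case the results need to be mapped back.

```python
import statistics


def standardize(values):
    """Return (z_scores, mean, sd) so the rescaling can be undone later."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    if sd == 0:
        raise ValueError("cannot standardize a constant series")
    return [(v - mean) / sd for v in values], mean, sd


example_log_returns = [-2.5, -0.4, 0.1, 0.3, 2.5]  # made-up values
z_scores, mean, sd = standardize(example_log_returns)
print("mean:", mean, "sd:", sd)
```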

Yeah, it could be the hardware if the difference is that big – although I’d first check that my model makes sense before investigating my hardware; at least I’d be more sure of what the problem could be.

Not necessarily: maybe you’re using the same model, but to model a different data generating process – are you using the same data and are interested in the same phenomenon as in the tutorial?

Hi Alex!

Thank you for the response! I’ll keep up the blockquote trend!

> I’d do that, especially because, with your current parametrization, sigma is transformed and then fed to the random walk, which is then retransformed and fed to the StudentT, so it’s really hard to see what Exponential(50) implies on the outcome scale and whether it’s justified – my guess is 50 is really gigantic, especially if your data go from -2.5 to 2.5.

That is very true! I tried decreasing it to 10 and then to 1 (still without much justification, just to see what happens). With 10 it is still considerably slow. With 1, it reached around 10% in 15 minutes. However, what happens in all these cases is that the sampling speed drops significantly over time. At around 10% it is considerably slower than at the start, where an increment of 1 percentage point only took a couple of seconds. The sampling is run with 4 chains.

I have yet to check whether the values are justified though. I just figured that I would try varying the lambda parameter a bit before I did that. (I’m fairly new to the model, so I will have to do some research to see which values are reasonable.)

> Not necessarily: maybe you’re using the same model, but to model a different data generating process – are you using the same data and are interested in the same phenomenon as in the tutorial?

That is a good point! The tutorial investigates the same phenomenon. I did use slightly more data than the tutorial (AMZN from 2006-01-01 to recent) but even when changing it to the exact same data as that of the tutorial (AMZN from 2006-01-01 to 2015-12-31) it is pretty much equally slow. The data is also plotted before the MCMC-algorithm is started, and I’m able to compare that plot to the one in the book, and it seems identical.

I guess my biggest concern at this point is that my code is not using the C++ compiler correctly. If it is my computer specifications that are not up to par, then I will accept that fate, but it would really bother me if my code is using the Python implementation instead of the C++ compiler, making everything a lot slower without me even knowing. Is there any way of making sure that the C++ compiler is used other than noticing that the warning has disappeared?

Just for reference, I will post the full code below along with my computer specifications (lambda happens to be 10.0 in this version), in case I have left out something important. The code is almost identical to that of the tutorial, apart from minor adjustments. The pandas library is giving me a warning that I understand and will fix, but I figured at this stage it won’t have any impact as far as debugging is concerned.

Code: Pastebin.com paste beginning with: import datetime, import pprint, import matplotlib.pyplot as plt, import numpy
Specifications: Windows 64-bit, Intel Core i7-3770K (3.50GHz, 3.90GHz), 16 GB RAM

Thanks for the detailed answers, Max, they really help!

If refining the priors and the model doesn’t work and you have weird warnings coming and going, then exploring the hardware side might become worthwhile :thinking:
Unfortunately, this is out of my league :confused: I’m thinking that @fonnesbeck or @aseyboldt could help you here.
