Excessive memory usage in PyMC3? (Solved - AWS Linux platform issue. Works on AWS Windows)

Hello,

As a prelude to beginning development with PyMC3, I wanted to make sure I had a machine and environment that was capable of running some basic demos.

I am using an Amazon AWS instance running Ubuntu, with 32 cores and 244GB memory (so quite a hefty machine).

On a basic demo such as the following notebook, I find that the machine bogs down completely at less than 20% progress on the cell containing:

import pymc3 as pm
# `neural_network` is the model defined earlier in the notebook
with neural_network:
    trace = pm.sample(1000, tune=200)

(notebook available at: https://github.com/twiecki/WhileMyMCMCGentlySamples/blob/master/content/downloads/notebooks/random_walk_deep_net.ipynb)

The machine has >99% of its memory (out of 244GB) allocated and is spending nearly all its time swapping to disk. I have a hard time believing that a basic demo is so memory-intensive, and am wondering if there may be a memory leak in the recent release of PyMC3.

Thanks for any advice…

Are you running on a GPU? Also, which Theano and PyMC3 versions are you running?

Hello,

It’s Theano 0.9.0, PyMC3 3.2.

It's running CPU only (as the AWS instance I am currently using does not have a GPU). I can move to a GPU instance if that is a problem. (But the issue isn't execution speed; it's memory usage.)

Thanks!

Yeah, it is likely unrelated to GPU/CPU. Any idea, @twiecki?

This could be due to the pre-initialization of the trace, as the model is fairly high-dimensional. Can you try with the HDF5 backend?
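Something along these lines (a sketch, assuming the HDF5 backend is importable from pymc3.backends.hdf5; the filename 'trace.h5' is arbitrary):

import pymc3 as pm
from pymc3.backends.hdf5 import HDF5

with neural_network:
    # stream samples to an on-disk HDF5 file instead of holding
    # the full trace in memory
    trace = pm.sample(1000, tune=200, trace=HDF5('trace.h5'))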

Thanks for the suggestion. I’ll give it a try.

The behavior is that memory allocation grows steadily until the run is about 20% complete, at which point it reaches >99% of memory and starts to swap. The allocation doesn't appear to happen all up front, but rather is continuous during execution.

I will let you know the results…

@twiecki & @junpenglao,

I can't get it to work under any circumstances. The behavior is always the same: it allocates approximately 100MB of additional memory per second, and allocated memory keeps growing until it exceeds the machine's physical capacity, at which point the process degrades completely due to swapping. I have tried 61GB, 122GB and 244GB machines; all produce the same outcome, except that the larger-memory instances last longer before they start to thrash.

It dies during the initialization process, using ADVI (v3.1) or jitter+adapt_diag (v3.2), usually at about 20% complete depending on machine specs.
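
For reference, the init method can also be selected explicitly via pm.sample's init argument in PyMC3 3.x (a sketch):

with neural_network:
    # v3.1 default initialization
    trace = pm.sample(1000, tune=200, init='advi')
    # v3.2 default initialization
    # trace = pm.sample(1000, tune=200, init='jitter+adapt_diag')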

Ubuntu 16.04, Python 3.5.2, Theano 0.9.0

I have tried all combinations of the following:

  • PyMC3 v3.2 and v3.1
  • Default backend
  • HDF5 backend

I would love to use PyMC3 for a computing project, but I can't proceed with any confidence unless I can find a way to complete runs within a reasonable amount of memory.

I would appreciate any insight you might have. Is there a known-good configuration available within AWS (e.g. a combination of operating system, Python version, PyMC3 version, etc.)?

Thanks again for your assistance!

Wondering if this issue may be relevant? It looks like another user had memory issues running on Linux in AWS that did not appear on their personal machine.

Thanks!

@twiecki & @junpenglao,

Thought you might be interested: it is definitely a platform issue with Linux on AWS. (I only tried Ubuntu, so I don't know about other flavors of Linux.)

See other reports, e.g. the issue linked earlier in the thread.

On AWS Linux, running the notebook linked earlier in the thread allocates well over 244GB and kills the machine.

On AWS Windows, the process is stable at 160MB. Not 160GB, but 160MB.

Thanks for your attention to this. I will file an issue on GitHub if you want me to; please let me know.

Thanks for reporting the solution. It's really puzzling that memory is managed so differently on AWS Linux. This probably makes it a Theano issue, but definitely open an issue (or comment on the one you dug up) with your findings.

Hi,

I am having the same memory-leak problem. My input dataset is barely 200MB, with shape 1.2M x 19.

I ran sampling on my local machine. It works fine on a smaller sample of the dataset, but when I plug in the full dataset it consumes more and more memory until it starts swapping to disk and eventually kills the session.

I tried it on a heavy AWS SageMaker instance, but it does the same thing there.

Using PyMC v5, Python 3.9.
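
For context, my run looks roughly like the sketch below (a placeholder model with hypothetical variable names; only the data shape matches my real setup, assuming PyMC v5's API):

import numpy as np
import pymc as pm

# placeholder stand-in for the real ~200MB dataset (1.2M rows x 19 columns)
X = np.random.randn(1_200_000, 19)
y = np.random.randn(1_200_000)

with pm.Model():
    beta = pm.Normal("beta", 0, 1, shape=19)
    sigma = pm.HalfNormal("sigma", 1)
    mu = pm.math.dot(X, beta)
    pm.Normal("obs", mu=mu, sigma=sigma, observed=y)
    # memory grows steadily during this call until the session is killed
    idata = pm.sample()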

Can anyone please help me with this?