I am trying to estimate the probabilities of a categorical distribution. My data is a NumPy array with 3 categories, encoded as 0, 1, 2:
array([1, 1, 1, ..., 0, 0, 1], dtype=int32)
The shape of the array is:
(233310,)
My model definition is as follows (I have also tried a Dirichlet prior, which made no difference):
with pm.Model() as categ_model:
    theta = pm.Uniform('theta', 0, 1, shape=3)
    obs = pm.Categorical(name='obs', p=theta, observed=X)
I get out-of-memory errors during sampling if I use more than 1000 tuning steps. I have also looked at this other post, but the solution mentioned there (reshaping X so that it becomes (233310, 1)) does not work for me.
What am I doing wrong?
What version of PyMC are you using? With the latest version (from the repository) I can sample just fine; each process seems to consume around 300 MB of RAM:
import numpy as np
import pymc as pm

data = np.random.randint(3, size=233_310, dtype="int32")

with pm.Model() as m:
    theta = pm.Uniform("theta", 0, 1, shape=3)
    obs = pm.Categorical("obs", p=theta, observed=data)

with m:
    trace = pm.sample()
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [theta]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 89 seconds.
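To answer the version question, a quick check like the one below should be enough. It is only a sketch: the helper name installed_pymc_versions is made up for illustration, and it assumes the package was installed under the distribution name pymc or the older pymc3.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_pymc_versions():
    """Return {distribution name: version} for the PyMC packages that are installed."""
    found = {}
    # "pymc" is the current distribution name; "pymc3" is the older one.
    for dist in ("pymc", "pymc3"):
        try:
            found[dist] = version(dist)
        except PackageNotFoundError:
            pass  # this distribution is not installed in the environment
    return found


print(installed_pymc_versions())
```

An empty dict means neither distribution is visible to the interpreter that ran the check, which usually points to a different environment than the one used for sampling.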