Model with a lot of categories is very slow to sample

Hi, I’m trying to use Bayesian analysis for Kaggle’s uncertainty competition, where the goal is to predict Walmart sales of various items at different percentiles.

Using a subset of the data, with the shape parameter equal to 3049 (the number of unique items) and 715,402 rows in train, this model slows to a crawl (~3 it/s):

import pymc3 as pm

with pm.Model() as model:
    lam = pm.Exponential('lam', lam=1/train_mu, shape=ca_1.item_index.nunique())
    pm.Poisson('obs', mu=lam[train['item_index'].values], observed=train['value'])
    traces = pm.sample(1000, cores=1)

According to the FAQ, models are slow either because the gradient takes a long time to compute, or because the sampler has to compute it many times. I think in my case it’s the latter. The recommended diagnostic is to profile the log-probability/gradient function:

import numpy as np

func = model.logp_dlogp_function(profile=True)
func.set_extra_values({})
x = np.random.randn(func.size)
%timeit func(x)

func.profile.summary()

This prints out a bunch of diagnostics I don’t really understand, along with a Theano suggestion to set th.config.floatX = 'float32'. I tried that and it didn’t seem to make any difference.

What are some strategies to handle this situation?

For that many observations, you probably want to use ADVI with mini-batches:

https://docs.pymc.io/notebooks/variational_api_quickstart.html
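
Something like this might work for your model (an untested sketch: the batch size and number of iterations are placeholders, and train, train_mu, and ca_1 are the objects from your post):

import pymc3 as pm

# batch size is a placeholder; tune it to your data
batch_size = 500

# Minibatch views of the data; the shared default random seed keeps the
# item indices and observed values aligned on every draw
item_idx_mb = pm.Minibatch(train['item_index'].values, batch_size=batch_size)
value_mb = pm.Minibatch(train['value'].values, batch_size=batch_size)

with pm.Model() as model:
    lam = pm.Exponential('lam', lam=1/train_mu, shape=ca_1.item_index.nunique())
    # total_size rescales the minibatch log-likelihood to the full data set
    pm.Poisson('obs', mu=lam[item_idx_mb], observed=value_mb,
               total_size=len(train))
    approx = pm.fit(n=30000, method='advi')
    trace = approx.sample(1000)

pm.fit returns the fitted approximation, and approx.sample(1000) gives you a trace you can use much like the output of pm.sample.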
