Big Data with HierarchicalRegression

I’m trying to train a hierarchical regression model with ADVI on a big data set (10 million+ rows, 100+ categories, 50+ features).

I’m also using pymc3_models.

I’m able to load all the data in memory, and when I finally fit the model I use a minibatch_size of 100. Since the whole array fits in memory, I’d expect training to be fast, but I’m still getting 13 seconds per iteration. Sampling from the NumPy array should be fast, so I’m not sure what the issue is.

Does anyone have any suggestions on how I can improve this?

The same question came up recently: ADVI Minibatch slows down with increasing size of data

According to the profiling, what slows down the sampling is the step where the data is indexed (for the minibatch). I wonder if there is any way to optimize that… @ferrine?

It’s hard to say for sure. I believe indexing is the bottleneck. In other frameworks this is usually done in a separate thread, which keeps it fast.
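For illustration, the "separate thread" idea mentioned above could look roughly like the sketch below. It is not PyMC3 code: `prefetch_batches`, `depth`, and the toy `data` array are hypothetical names made up for this example. Whether it actually helps depends on how much of the indexing work can overlap with the training step.

```python
import queue
import threading

import numpy as np


def prefetch_batches(data, batch_size, seed=0, depth=4):
    """Yield random minibatches of `data`, prepared by a background thread.

    Hypothetical helper for illustration only; not part of PyMC3.
    """
    rng = np.random.RandomState(seed)
    # Bounded queue, so the worker never runs more than `depth` batches ahead.
    batches = queue.Queue(maxsize=depth)

    def worker():
        while True:
            rows = np.sort(rng.randint(0, len(data), size=batch_size))
            batches.put(data[rows])  # blocks while the queue is full

    # Daemon thread, so it does not keep the process alive at exit.
    threading.Thread(target=worker, daemon=True).start()

    while True:
        yield batches.get()


# Toy data standing in for the real (much larger) array.
data = np.arange(1000 * 5, dtype="float64").reshape(1000, 5)
stream = prefetch_batches(data, batch_size=100)
first = next(stream)
second = next(stream)
print(first.shape, second.shape)
```

The consumer only ever waits on `batches.get()`, so as long as the worker keeps up, the next batch is already indexed by the time it is requested.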

I managed to get around it by using Minibatch.update_shared_f:

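import numpy as np

def choice_iterator(size, nrows, seed):
    # Seeded generator, so the sequence of batches is reproducible.
    state = np.random.RandomState(seed)

    while True:
        # randint excludes the upper bound, so indices stay within 0..nrows-1.
        yield np.sort(state.randint(0, nrows, size=size))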

def update_shared_f_in_memory(obj, size, seed):
    # obj is either a NumPy array or a pandas object (DataFrame or Series).
    nrows     = len(obj) if isinstance(obj, np.ndarray) else len(obj.index)
    generator = choice_iterator(size=size, nrows=nrows, seed=seed)

    def f():
        # Draw the next set of row positions and return the matching rows.
        where = next(generator)
        return obj[where] if isinstance(obj, np.ndarray) else obj.iloc[where]

    return f
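A quick way to sanity-check a sampler like this before handing it to Minibatch is to call it directly and look at what comes back. The sketch below uses a hypothetical stand-in, `make_batch_sampler`, that follows the same closure pattern but handles NumPy arrays only; the toy array `X` is made up as well. How the function is wired into Minibatch is not shown in the thread, so that part is left out here.

```python
import numpy as np


def make_batch_sampler(obj, size, seed):
    """Hypothetical stand-in for the closure above (NumPy arrays only)."""
    state = np.random.RandomState(seed)
    nrows = len(obj)

    def f():
        # Upper bound is exclusive, so every index is a valid row position.
        where = np.sort(state.randint(0, nrows, size=size))
        return obj[where]

    return f


# Toy array standing in for the real data set.
X = np.arange(10_000 * 3, dtype="float64").reshape(10_000, 3)

draw_a = make_batch_sampler(X, size=100, seed=42)
draw_b = make_batch_sampler(X, size=100, seed=42)
batch_a, batch_b = draw_a(), draw_b()

print(batch_a.shape)                     # (100, 3)
print(np.array_equal(batch_a, batch_b))  # True: same seed, same first batch
```

Two samplers built with the same seed return the same first batch, which is only true if the seed actually reaches the random generator.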