Big Data with HierarchicalRegression

alphamaximus · August 10, 2018, 12:44am

I’m trying to train a hierarchical regression model with ADVI on a big data set (10 million+ rows, 100+ categories, 50+ features).

I’m also using pymc3_models.

I’m able to load all the data in memory, and when I finally fit the model I use a minibatch_size of 100. Since the array fits entirely in memory, I’d expect training to be fast, but I’m still getting 13 seconds per iteration. Sampling from the numpy array should be fast, so I’m not sure what the issue is.

Does anyone have any suggestions on how I can improve this?

junpenglao · August 10, 2018, 4:36am

The same question came up recently: ADVI Minibatch slows down with increasing size of data

According to the profiling, what slows down the sampling is the step where the data is indexed (for the minibatch). I wonder if there is any way to optimize that… @ferrine?

ferrine · August 10, 2018, 6:11am

It’s hard to say for sure, but I believe indexing is the bottleneck. In other frameworks this is usually done in a separate thread and is therefore fast.
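
To illustrate the separate-thread idea, here is a minimal sketch that prepares minibatches in a background thread while the main loop consumes them. It assumes the data is a NumPy array held in memory. The names batch_producer and start_prefetcher are hypothetical and not part of PyMC3 or any other framework; this only shows the general pattern, not how any particular library implements it.

```python
import queue
import threading

import numpy as np


def batch_producer(data, batch_size, out_queue, stop_event, seed=0):
    """Draw random minibatches and push them onto a bounded queue."""
    rng = np.random.RandomState(seed)
    nrows = len(data)
    while not stop_event.is_set():
        # randint's upper bound is exclusive, so indices stay in [0, nrows).
        idx = np.sort(rng.randint(0, nrows, size=batch_size))
        batch = data[idx]
        # Retry with a timeout so the thread notices stop_event
        # even while the queue is full.
        while not stop_event.is_set():
            try:
                out_queue.put(batch, timeout=0.1)
                break
            except queue.Full:
                continue


def start_prefetcher(data, batch_size, depth=4, seed=0):
    """Start a daemon thread that keeps up to `depth` batches ready."""
    out_queue = queue.Queue(maxsize=depth)
    stop_event = threading.Event()
    worker = threading.Thread(
        target=batch_producer,
        args=(data, batch_size, out_queue, stop_event, seed),
        daemon=True,
    )
    worker.start()
    return out_queue, stop_event, worker


# Small stand-in for the real data set (hypothetical shape).
data = np.arange(1000 * 5, dtype=np.float64).reshape(1000, 5)
batches, stop, worker = start_prefetcher(data, batch_size=100)

first = batches.get(timeout=5)  # a training loop would call this once per step
stop.set()
worker.join(timeout=5)
```

The bounded queue keeps memory use fixed: the worker blocks once `depth` batches are waiting, so indexing overlaps with the consumer's work instead of running ahead of it.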

alphamaximus · August 10, 2018, 8:14pm

I managed to get around it by using Minibatch.update_shared_f:

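import numpy as np

def choice_iterator(size, nrows, seed):
    # Seeded generator so the sequence of batches is reproducible.
    state = np.random.RandomState(seed)

    while True:
        # Draw from `state` (not the global np.random) so the seed takes effect.
        # randint's upper bound is exclusive, so indices fall in 0..nrows-1.
        yield np.sort(state.randint(0, nrows, size=size))

def update_shared_f_in_memory(obj, size, seed):
    # obj is either a NumPy array or a pandas object that supports .iloc.
    is_array  = isinstance(obj, np.ndarray)
    nrows     = len(obj) if is_array else len(obj.index)
    generator = choice_iterator(size=size, nrows=nrows, seed=seed)

    def f():
        # Return the rows of the next random batch.
        where = next(generator)
        return obj[where] if is_array else obj.iloc[where]

    return f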
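
As a quick sanity check of this kind of callback, the sketch below builds a compact, NumPy-only variant and confirms that each call returns a batch of the requested size and that two callbacks created with the same seed produce the same first batch. The name make_batch_fn and the toy array are hypothetical, chosen only for illustration; the sketch does not call PyMC3 itself.

```python
import numpy as np


def make_batch_fn(data, size, seed):
    """Return a zero-argument callable that yields a fresh random batch per call."""
    rng = np.random.RandomState(seed)
    nrows = len(data)

    def f():
        # Upper bound is exclusive, so every index is a valid row.
        idx = np.sort(rng.randint(0, nrows, size=size))
        return data[idx]

    return f


# Toy data: 50 rows, 3 columns (hypothetical stand-in for the real array).
data = np.arange(50 * 3).reshape(50, 3)

f_a = make_batch_fn(data, size=10, seed=42)
f_b = make_batch_fn(data, size=10, seed=42)

batch_a, batch_b = f_a(), f_b()
same_seed_matches = np.array_equal(batch_a, batch_b)
next_batch_shape = f_a().shape  # later calls keep the same batch shape
```

Sorting the indices keeps row access in ascending order, matching the sorted indices used in the post above.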
