How to make Minibatch for multi-dimensional data?

Thanks for making pymc3! I’m a beginner in Probabilistic Programming but I am already very impressed with what I can do with pymc3!

I am having problems getting the Minibatch to work. The docs have broken formatting that makes them very hard to read: https://docs.pymc.io/api/data.html There is a brief tutorial, but it doesn’t solve my problem: https://docs.pymc.io/notebooks/variational_api_quickstart.html#Minibatches I have also searched GitHub, StackOverflow and Discourse, but I’m still not sure how to do this.

Perhaps you could update your docs and tutorial with this kind of case, as it is probably quite common.

Simplified Example:

I have data with X and Y where the distribution of Y depends on X. For example:

import numpy as np
import pymc3 as pm

# Number of data-points.
n = 1000000

# Random data for X.
X = np.random.uniform(0, 10, size=n)

# Random data for Y which depends on X.
noise = np.random.normal(size=n)
Y = 7.25 + 2.5 * X + 5.3 * noise

# Find the parameters for the relation between X and Y.
with pm.Model() as model:
    # Prior parameters.
    a = pm.Normal('a', mu=0, sigma=10)
    b = pm.Normal('b', mu=0, sigma=10)
    sigma = pm.HalfNormal('sigma', sigma=10)
    
    # Relation between X and the mean of Y.
    mu = a + b * X
    
    # Observed output Y.
    Y_obs = pm.Normal('Y_obs', mu=mu, sigma=sigma, observed=Y)
    
    # Find the posterior parameters.
    approx = pm.fit()

# Sample from the posterior distributions for the parameters.
trace = approx.sample(draws=500)

# Plot the distributions for the posterior parameters.
pm.traceplot(trace)

This runs quite slowly because there are so many data-points, so I want to use mini-batches. But it is unclear to me how I can use pm.Minibatch to draw from both X and Y simultaneously.

Thanks!

There’s a tutorial doc showing how to set up minibatches for multiple variables here. You essentially just have to make sure the variables have the same total size and then pass each of them into pm.Minibatch. Also, be sure to use the total_size keyword argument on the observed variable when doing this, because the loss function needs to be rescaled depending on how big your entire dataset is relative to a single minibatch.
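To make the rescaling point a bit more concrete, here is a rough numpy-only sketch of the idea (just the concept, not PyMC3’s actual internals):

import numpy as np

# Toy illustration of the total_size rescaling: the sum of per-point
# log-likelihoods in a minibatch is multiplied by total_size / batch_size
# so that it matches the full-data sum in expectation.
rng = np.random.RandomState(0)
logp_all = rng.normal(loc=-1.5, scale=0.1, size=1000000)  # pretend per-point log-likelihoods
batch = rng.choice(logp_all, size=128)                     # one random minibatch
scaled_sum = (logp_all.size / batch.size) * batch.sum()
print(scaled_sum, logp_all.sum())                          # roughly the same magnitude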

Thanks for the quick reply!

I had actually seen that tutorial, but I found it quite confusing. One thing that puzzled me was that multiple pm.Minibatch objects are created, because they would then have to keep their batch selection synchronized across the objects “behind the scenes”. That is an unusual design for minibatch generators, so it would be very helpful if that were made clear in the docs and tutorial.

Please confirm that I have done it correctly in the following code example:

# Number of data-points.
n = 1000000

# Random data for X.
X = np.random.uniform(0, 10, size=n)

# Random data for Y which depends on X.
noise = np.random.normal(size=n)
Y = 7.25 + 2.5 * X + 5.3 * noise

# Turn data into mini-batches.
batch_size = 128
X_batch = pm.Minibatch(data=X, batch_size=batch_size)
Y_batch = pm.Minibatch(data=Y, batch_size=batch_size)

# Find the parameters for the relation between X and Y.
with pm.Model() as model:
    # Prior parameters.
    a = pm.Normal('a', mu=0, sigma=10)
    b = pm.Normal('b', mu=0, sigma=10)
    sigma = pm.HalfNormal('sigma', sigma=10)
    
    # Relation between X and the mean of Y.
    mu = a + b * X_batch
    
    # Observed output Y.
    Y_obs = pm.Normal('Y_obs', mu=mu, sigma=sigma,
                      observed=Y_batch, total_size=n)
    
    # Find the posterior parameters.
    approx = pm.fit()

# Sample from the posterior distributions for the parameters.
trace = approx.sample(draws=500)

# Plot the distributions for the posterior parameters.
pm.traceplot(trace)

Note that creating the pm.Minibatch objects generates a Python warning when using pymc3 v. 3.8 (latest). Is this something to worry about?

/home/magnus/anaconda3/envs/bayes/lib/python3.6/site-packages/pymc3/data.py:246: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  self.shared = theano.shared(data[in_memory_slc])

Also, would it be better to use pm.Minibatch with a multi-dimensional numpy array? I suppose the batch-generator should be initialized as follows. But how do I get that data into the pymc3 model in the example above?

# Combine X and Y into an array with shape (1000000, 2)
data = np.array([X, Y]).T

# We want to take mini-batches of shape (128, 2)
batch = pm.Minibatch(data=data, batch_size=[(128, 2)])

Thanks again!

One thing that puzzled me was that multiple pm.Minibatch objects are created, because they would then have to keep their batch selection synchronized across the objects “behind the scenes”.

Your point about some of the behavior of multiple Minibatch streams is well taken; PyMC3 affords the user less granular control than Theano and some details are not as obvious.
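If it helps, my understanding (I haven’t checked the source in detail) is that the alignment between separate minibatch variables comes from the random seed: both objects use the same default random_seed, so they draw the same row indices on every update. Roughly:

# Both streams use the same seed (42 is the default), so the random row
# indices they draw stay aligned between X and Y.
X_batch = pm.Minibatch(data=X, batch_size=128, random_seed=42)
Y_batch = pm.Minibatch(data=Y, batch_size=128, random_seed=42)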

Also, would it be better to use pm.Minibatch with a multi-dimensional numpy array? I suppose the batch-generator should be initialized as follows. But how do I get that data into the pymc3 model in the example above?

With regard to the multidimensional minibatch, you can create the minibatch variable inside the model and then index into it like so:

with pm.Model() as model:
    
    batch = pm.Minibatch(data=data, batch_size=[(128, 2)])
    X_batch = batch[:,0]
    Y_batch = batch[:,1]

I’m not sure if there is a performance difference between using pm.Minibatch twice and creating it once and then indexing later, but it may be something worth testing.
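To connect that back to the original example, the indexed slices can then be used just like the two separate minibatch variables. Roughly like this (an untested sketch, using the batch_size spec from the question above):

with pm.Model() as model:
    # One minibatch variable over the combined (n, 2) array, split into columns.
    batch = pm.Minibatch(data=data, batch_size=[(128, 2)])
    X_batch = batch[:, 0]
    Y_batch = batch[:, 1]

    a = pm.Normal('a', mu=0, sigma=10)
    b = pm.Normal('b', mu=0, sigma=10)
    sigma = pm.HalfNormal('sigma', sigma=10)

    mu = a + b * X_batch
    # total_size is still the number of rows in the full dataset.
    Y_obs = pm.Normal('Y_obs', mu=mu, sigma=sigma,
                      observed=Y_batch, total_size=n)

    approx = pm.fit()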

Note that creating the pm.Minibatch objects generates a Python warning when using pymc3 v. 3.8 (latest). Is this something to worry about?

I’m not completely sure about this. You can read some discussion other users have had about it here, but it appears to be a harmless Theano warning for now.

Thanks again for the quick and detailed reply!

I have tested both methods, and it is much faster to have multiple pm.Minibatch objects: the model fitting takes only 35 seconds that way, while it takes over 9 minutes with a single pm.Minibatch object! I used the same code in both cases, except for the mini-batch creation, and I used 100,000 iterations in pm.fit.

This seems very strange to me, as I would have expected the two methods to be about equally fast.

The posterior distributions look very similar for the two methods when using pm.traceplot to plot them. If I pull out a random sample like this: X_batch.eval().shape, then it also has the correct shape of (128,).
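For reference, the kind of check I mean is roughly:

# Interactive sanity checks on the drawn batch shapes.
print(X_batch.eval().shape)   # (128,) for the separate Minibatch objects
print(batch.eval().shape)     # should be (128, 2) for the single 2-D Minibatch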

Do you have any idea why one method of mini-batching is 15x slower than the other?

Wow, that’s really good to know. I’ve always assumed that instantiating multiple minibatched variables was the best way to do it since it’s arguably a little more readable. Thanks for running the comparison! I’m not an expert on the Theano-level computations being done behind the scenes, but my guess is that since pm.Minibatch is a wrapper for a shared Theano variable, there is some significant overhead associated with working with two rather than one.

I would have thought a 15x difference in run-time between two such similar ways of mini-batching was a bug, but perhaps there is more to it than meets the eye.

Please consider updating the doc-string and tutorials on how to use pm.Minibatch with multiple variables, because it is quite an unusual way of doing mini-batching. Clearer doc-strings and tutorials will likely save everyone a lot of time in the future, so people don’t have to dig out this thread.

Please note that the doc-string formatting also needs fixing because the examples are really hard to read:
https://docs.pymc.io/api/data.html#pymc3.data.Minibatch

The main tutorial could also give a simple example:
https://docs.pymc.io/notebooks/variational_api_quickstart.html#Minibatches

I’m a beginner with pymc3, so I think it’s better if I don’t make this update and you do it yourself, to make sure it is correct.

Thanks again for your help!

(How annoying: the forum blocks me when I try to include the final link.)

The tutorial named GLM-hierarchical-advi-minibatch.html should also make clear that the multiple pm.Minibatch objects are actually synced “behind the scenes”, so that the batches for all the variables are pulled consistently from the data.

Thanks for running the comparison. @ferrine, do you have some idea why? Also, what is your recommendation?

Sorry for the late reply. I think the (128, 2) usage is not correct; it should be just 128. Otherwise, 2 random generators are used and you get advanced indexing over 2 dimensions, which causes performance overhead.
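In code, something like this (a sketch):

# Slice only the first axis; each draw keeps whole (X, Y) rows together.
batch = pm.Minibatch(data=data, batch_size=128)
X_batch = batch[:, 0]
Y_batch = batch[:, 1]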


There is something I can’t understand here.

If Y = 7.25 + 2.5 * X + 5.3 * noise is changed to Y = 100 + 20 * X, the code will still return a ~ 2.5, b ~ 2.5.

Also, if the batch_size is changed to n*10, the code still runs.

(Quoting the minibatch code example from post 3 above.)