Performance speedup for updating posterior with new data


#1

Hi All

I’ve been developing a Poisson regression model with the capability to incrementally update the posterior as new data becomes available (online learning).

I’ve been using @davidbrochart’s Updating Priors notebook as a guide for how to do this in pymc3. I’ve attached an example to this post.

The problem I’m having is that although the initial sampling takes 1 to 2 seconds (on 100 data points), when new data is fed in this jumps to about 15 seconds per update (I’m feeding in just one new point per update of the posterior). I’ve run this on a more powerful virtual machine (non-GPU) and the minimum I’ve got it down to is 6-10 seconds.

I’m wondering if there’s something I’m missing that could help speed up the updates. It seems like a lot of effort is spent initialising a new model every time - is there a way to bypass this, given that we’ve just run a similar model?
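For reference, my update loop follows the notebook’s pattern: each step rebuilds the model with pm.Interpolated priors fitted to the previous trace. A rough sketch of that pattern (not my exact script - from_posterior is the notebook’s KDE helper, and x1_new / output_new stand in for the incoming data point):

import numpy as np
import pymc3 as pm
import theano.tensor as tt
from scipy import stats

def from_posterior(param, samples):
    # KDE of the previous posterior samples, wrapped as an Interpolated prior
    smin, smax = np.min(samples), np.max(samples)
    width = smax - smin
    x = np.linspace(smin, smax, 100)
    y = stats.gaussian_kde(samples)(x)
    # extend the support so values just outside the old samples keep non-zero density
    x = np.concatenate([[x[0] - 3 * width], x, [x[-1] + 3 * width]])
    y = np.concatenate([[0], y, [0]])
    return pm.Interpolated(param, x, y)

# one update step: a brand-new model is built around priors from the last trace
with pm.Model():
    alpha_0 = from_posterior('alpha_0', trace['alpha_0'])
    alpha_1 = from_posterior('alpha_1', trace['alpha_1'])
    theta = alpha_0 + alpha_1 * x1_new
    pm.Poisson('output', mu=tt.exp(theta), observed=output_new)
    trace = pm.sample(1000)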

Super appreciate any help on this.

Thanks.
poisson_regression_online_learning.py (3.7 KB)


#2

I had a look at your code. Unfortunately, I don’t see much more room to improve the speed: the Interpolated RV always needs to be reinitialized, and the bottleneck in evaluating the scipy spline smoothing function is also difficult to remove without rewriting everything in Theano.

Alternatively, if you are happy to use another distribution as an approximation, it simplifies the code a lot:

import pandas as pd
import pymc3 as pm
import theano
import theano.tensor as tt

# assumes `generate_data`, the initial `trace` and the `traces` list from the
# attached script are already defined
new_data = pd.DataFrame(data=[generate_data() for _ in range(1)], columns=generate_data().keys())
output_shared = theano.shared(new_data['output'].values)
x1_shared = theano.shared(new_data['x1'].values)

# the hyper-parameters live in shared variables, so the model is built (and
# the graph compiled) only once; just the values change between updates
mu0 = theano.shared(trace['alpha_0'].mean(), name='hyper_mu0')
sd0 = theano.shared(trace['alpha_0'].std(), name='hyper_sd0')
mu1 = theano.shared(trace['alpha_1'].mean(), name='hyper_mu1')
sd1 = theano.shared(trace['alpha_1'].std(), name='hyper_sd1')

with pm.Model() as update_model:
    alpha_0 = pm.StudentT('alpha_0', mu=mu0, sd=sd0, nu=1, testval=0.)
    alpha_1 = pm.StudentT('alpha_1', mu=mu1, sd=sd1, nu=1, testval=0.)
    theta = alpha_0 + alpha_1 * x1_shared

    likelihood = pm.Poisson(
        'output',
        mu=tt.exp(theta),  # tt.exp, not np.exp, so the log-link stays in the theano graph
        observed=output_shared,
    )
    trace = pm.sample(1000)
    traces.append(trace)

for _ in range(1, 10):
    # swap in the new data point and the updated hyper-parameter values,
    # then resample the same, already-built model
    new_data = pd.DataFrame(data=[generate_data() for _ in range(1)], columns=generate_data().keys())
    output_shared.set_value(new_data['output'].values)
    x1_shared.set_value(new_data['x1'].values)
    mu0.set_value(trace['alpha_0'].mean())
    sd0.set_value(trace['alpha_0'].std())
    mu1.set_value(trace['alpha_1'].mean())
    sd1.set_value(trace['alpha_1'].std())

    with update_model:
        trace = pm.sample(1000)
        traces.append(trace)
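Note that StudentT with nu=1 is just a heavy-tailed (Cauchy-like) stand-in for the posterior; pm.Normal would work too if your marginals look roughly Gaussian, though its lighter tails make the updates less robust to surprising data points. If you want to sanity-check that the moment-matched approximation behaves sensibly across updates, a quick plot like this works (assumes matplotlib is available; `traces` is the list built above):

import matplotlib.pyplot as plt

# posterior mean +/- sd of alpha_1 after each update
means = [t['alpha_1'].mean() for t in traces]
sds = [t['alpha_1'].std() for t in traces]
plt.errorbar(range(len(means)), means, yerr=sds, fmt='o-')
plt.xlabel('update step')
plt.ylabel('alpha_1 (mean ± sd)')
plt.show()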

#3

Thanks for looking at the code. To be honest, your code changes have dropped the runtime per update from 15s down to a few seconds - so a pretty awesome speedup in any case! Shame about the other bottlenecks, though - figured not much can be done if it comes from elsewhere.