Speeding up inference on large datasets

I started learning about Bayesian methods because I am trying to solve a problem where I would like to modify a time series’ values to correct for a known bias factor that artificially inflates or deflates them.

I am using a model that is inspired by HMMs but has continuous latent states.

[Model diagram: observed values in green, latent variables in red.]

I have specified the variables as follows:
$True\_Value_0 \sim \mathcal{N}(0, \sigma_t)$
$True\_Value_t \sim \mathcal{N}(True\_Value_{t-1}, \sigma_t)$
$Bias_t \sim \mathcal{N}(0, \sigma_b)$
$Bias\_Coeff \sim \mathcal{N}(0, 1)$
$Measurement_t \sim \mathcal{N}(True\_Value_t + Bias\_Coeff \cdot Bias_t, \sigma_m)$

Using this model I can infer the most likely sequence of True\_Value with find_MAP, ‘correcting’ the series of measurements for the bias.

This works well, but I want to perform this analysis on a large dataset and use the result of the inference for a downstream ML task.

I have 4,000 of these sequences with an average length of 20, meaning I will be running inference roughly 80,000 times!

As I want to use the smoothed series for a downstream ML task, I will need to recompute the smoothed series at every step, using only the measurements available up to that step, to prevent data leakage into my downstream model (see the expanded loop sketch below).

import numpy as np
import pymc3 as pm

NpVector = np.ndarray

def create_model(measurements: NpVector, bias_values: NpVector):
    # Must be called inside a pm.Model() context.
    n = len(measurements)
    true_value = pm.GaussianRandomWalk('true_value', sigma=1, shape=n)  # latent series
    bias = pm.Normal('bias', mu=0, sigma=1, shape=n, observed=bias_values)
    bias_coeff = pm.Normal('bias_coeff', mu=0, sigma=1)
    measurement = pm.Normal('measurement', mu=true_value + bias_coeff * bias,
                            sigma=1, shape=n, observed=measurements)

At the moment I am doing inference in a for loop:

for measurements, biases in data:
    with pm.Model() as model:
        create_model(measurements, biases)
        MAP = pm.find_MAP()
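
Strictly, to avoid leakage this loop has to refit on every growing prefix of each sequence, which is where the ~80,000 figure comes from. A sketch of what I mean:

for measurements, biases in data:
    # Refit on each prefix; the MAP value at the newest step is leakage-free.
    for t in range(2, len(measurements) + 1):
        with pm.Model():
            create_model(measurements[:t], biases[:t])
            MAP = pm.find_MAP()
        # MAP['true_value'][-1] is the smoothed value available at step t.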

I have tried to use multiprocessing’s Pool to parallelise this for loop, but I am not sure how this will play with pymc3’s implementation of find_MAP.
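
What I tried looks roughly like this (a sketch: the model is built inside the worker so that only plain numpy arrays, which pickle cleanly, cross the process boundary):

from multiprocessing import Pool

def fit_one(args):
    measurements, biases = args
    # Build and fit the model entirely inside the worker process.
    with pm.Model():
        create_model(measurements, biases)
        return pm.find_MAP(progressbar=False)

if __name__ == '__main__':
    with Pool(processes=8) as pool:  # worker count chosen arbitrarily
        maps = pool.map(fit_one, list(data))

This at least runs the optimisations independently per process, but I don’t know whether it is safe or efficient with pymc3/theano under the hood.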

I was wondering if there is some way for me to run inference on chains of the same length, each fitted to a different, independent set of observations?
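
For example, if I group sequences of the same length T, could I stack them as columns of a (T, n_seq) array and fit them all in one model? Something like this sketch (assuming GaussianRandomWalk steps along the first axis, so each column is an independent walk):

def create_batched_model(measurements: NpVector, bias_values: NpVector):
    # measurements and bias_values have shape (T, n_seq):
    # one column per independent sequence of length T.
    T, n_seq = measurements.shape
    true_value = pm.GaussianRandomWalk('true_value', sigma=1, shape=(T, n_seq))
    bias = pm.Normal('bias', mu=0, sigma=1, shape=(T, n_seq), observed=bias_values)
    bias_coeff = pm.Normal('bias_coeff', mu=0, sigma=1, shape=n_seq)  # one per sequence
    measurement = pm.Normal('measurement', mu=true_value + bias_coeff * bias,
                            sigma=1, shape=(T, n_seq), observed=measurements)

Since the sequences are independent, a single find_MAP on this joint model should decompose into the per-sequence optima.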

Or whether there is a way to do some sort of online learning as the length of the sequence grows?
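
The cheapest version of this I can think of is warm-starting each refit from the previous step’s MAP, e.g. (a sketch; the previous solution is padded with its last value as the initial guess for the new time step):

prev = None
for t in range(2, len(measurements) + 1):
    with pm.Model():
        create_model(measurements[:t], biases[:t])
        start = None
        if prev is not None:
            # Reuse the previous MAP, repeating its last latent value
            # as the initial guess for the newly added step.
            start = {'true_value': np.append(prev['true_value'],
                                             prev['true_value'][-1]),
                     'bias_coeff': prev['bias_coeff']}
        prev = pm.find_MAP(start=start, progressbar=False)

Would that be a reasonable substitute for true online updating, or is there something better built in?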
