Time series break point detection using PyMC3 - optimization

I am trying to solve a time series break point detection problem using PyMC3, and I just wanted to check the rationale of the selected approach with the community, since this is the first real problem I am tackling with PyMC3.

I have several time series of ~400 points (one every 15 days, corresponding to about 13 years) which present clear annual periodicity, some zero-mean additive Gaussian noise (I expect), and sudden reductions in the series values (breakpoints). The data are constrained to the interval [0, 1]. I have a starting period (one year) in which no breakpoints are present.

The approach I selected was to model the observations as a vector of 23 uniformly distributed random variables, one for each 16-day period in the year. This vector repeats every year, like

X = [X(0), .., X(22), X(0), ..]

in order to match the input data size. Finally, the observation process is modeled as Gaussian, with mean mu = X and an exponentially-distributed variance s.

In order to test the model, I have ~100 validation time series in which the breakpoint time is indicated. To test whether this approach works, I initialize X using the first year (training period) and then increase the input vector one step at a time to simulate the arrival of new data. For every new datum d_(t+1):

  1. I compute the posterior predictive distribution for t+1, P(d_(t+1) | d_(t), d_(t-1), .., d_(0))
  2. Use some metric to test whether the new value corresponds to a breakpoint (e.g. evaluate whether the p-value of d_(t+1), computed under the full posterior, is < 0.05)
  3. If it is a breakpoint, the algorithm stops. Otherwise, the value of d_(t+1) is “accepted” and is used to compute the next posterior in the following iteration.
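In code, the accept/flag step looks something like this (pure NumPy sketch; `post_samples` stands in for real posterior-predictive draws, e.g. from pm.sample_ppc on the current trace, and the Gaussian here is only a stand-in):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for posterior-predictive draws of d_(t+1); in practice these would
# come from the fitted model, not from a hand-picked Gaussian
post_samples = rng.normal(loc=0.6, scale=0.05, size=10000)

def is_breakpoint(d_new, samples, alpha=0.05):
    """Flag d_new as a breakpoint when it sits in the lower tail of the
    predictive distribution (the breakpoints here are sudden *reductions*)."""
    tail_prob = np.mean(samples <= d_new)   # empirical one-sided p-value
    return tail_prob < alpha
```

A value far below the predictive mass (e.g. 0.2 here) is flagged, while a value near the predictive mean is accepted and appended to the data for the next fit.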

The scheme works, but I have some concerns:

  1. Modeling approach: how does this model compare with an AR(n) model? In this model, I have whole distributions as free parameters, and the posteriors are trimmed at every time step to fit the data.

  2. Temporal correlation: the model has no explicit temporal correlation, but the fitted joint distribution of X shows reasonable correlations (e.g. between X[6] and X[7]). Should I include some explicit correlation between time steps?

  3. Computation time: with Theano on the GPU (an NVIDIA 660 Ti), it needs ~7 sec (1200 it/s) for every new point (nSamples = 10000, nBurn = 1000), so I need about 46 min per time series, which is too slow for my application.

Any feedback is welcome.

Your approach looks fine to me, so here are a few thoughts:

1. With an AR(n) model you cannot easily identify the changepoint, as it is not a parameter of the model. You can probably also get a pretty good fit of the data if you have enough data before and after the break point.

2. I would say it is a good idea to include a temporal correlation; it will reduce the variance at each time point.

3. Around 1000 samples with multiple chains in NUTS is enough. If you are using Metropolis because of the discrete break point, you can change the break point to a continuous RV and use NUTS instead.

Thank you @junpenglao! I am happy I am on the right track.

  1. About speed: I implemented NUTS as:

    import pymc3 as pm
    from scipy import optimize

    with basic_model:
        start = pm.find_MAP(fmin=optimize.fmin_powell)
        step = pm.NUTS()
        trace = pm.sample(step=step, start=start)

but it runs very slowly (~10 it/s). I read here that in most cases this is a problem of initialization (you recommend elsewhere not to use find_MAP). ADVI initialization is suggested, but I couldn’t find documentation on how to do it. Can you please point me in the right direction?

  2. About correlation: as far as I know, it is not possible to explicitly include correlation information while using Uniform distributions. Do you recommend changing to Gaussian priors?

Thank you again!

  1. Yes, using find_MAP to initialize the sampler is rarely a good idea in high-dimensional problems. You should first try the default trace = pm.sample(). And since you are sequentially updating as data come in, you should find a way to store the mass matrix used by the NUTS kernel in the previous step for the next batch of data.
  2. Priors usually can’t help in this case, as correlation in the posterior is influenced more by your model parameterization (which is usually where preprocessing comes in, such as making the predictor matrix orthogonal).
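To illustrate the orthogonalization point (a toy NumPy example, nothing to do with your seasonal model): two nearly collinear predictors become uncorrelated after a QR decomposition of the centered design matrix:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two strongly collinear predictors -> strong posterior correlation
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)
X = np.column_stack([x1, x2])
X = X - X.mean(axis=0)      # center, so orthogonal also means uncorrelated

Q, _ = np.linalg.qr(X)      # orthogonal basis for the same column space

corr_before = abs(np.corrcoef(X, rowvar=False)[0, 1])
corr_after = abs(np.corrcoef(Q, rowvar=False)[0, 1])
```

Regressing on Q instead of X keeps the same fitted values but removes the correlation that the sampler would otherwise have to navigate.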

Thank you again @junpenglao for all the feedback!

I checked using trace = pm.sample(), but NUTS is still very slow. The NUTS initialization documentation does not mention ADVI at all, unlike what I saw in some examples. Is there a way to force the initialization to be carried out by ADVI? I think this problem is related. There they find that MAP and ADVI initial values are very different.

About posterior correlation, I thought about what you posted and I think you are completely right.

In summary, when I call trace = pm.sample() alone, NUTS is very slow, probably because ADVI is not called.

Thank you again!

PS:

I don’t follow here; please provide some reading material so I can understand how and where the NUTS mass matrix is stored.

Yes, you can do trace = pm.sample(1000, tune=1000, init='advi') and it will use ADVI initialization. More options can be found in the docs: http://docs.pymc.io/api/inference.html#module-pymc3.sampling

Regarding the mass matrix, it is the same idea as using init='nuts':
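The idea, sketched with NumPy arrays standing in for the trace (the shapes and the pm.NUTS(scaling=cov, is_cov=True) usage are from memory, so double-check them against your PyMC3 version):

```python
import numpy as np

# Stand-in for the samples extracted from the previous batch's trace:
# shape (n_draws, n_params) = (1000, 24) here (23 seasonal X values plus
# the noise scale s)
init_trace_samples = np.random.default_rng(2).normal(size=(1000, 24))

# Posterior covariance of the previous batch; this is what
# pm.trace_cov(init_trace) computes in PyMC3, and it can seed the next
# run's kernel via pm.NUTS(scaling=cov, is_cov=True), so mass-matrix
# tuning does not have to restart from scratch.
cov = np.cov(init_trace_samples, rowvar=False)
```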

where init_trace is the trace from the last batch of training.

Thank you again @junpenglao.

I tested trace = pm.sample(1000, tune=1000, init='advi'), but the performance is still poor.

If I use NUTS as init, it takes forever:

[screenshot: NUTS sampling progress]

As a benchmark, for the same task Metropolis needs only 7 sec.

[screenshot: Metropolis sampling progress]

I read the docs and tested many configurations, but I couldn’t find anything faster than Metropolis for my case. Any other suggestion is welcome. Thank you for the feedback!

You shouldn’t use NUTS as init directly; that’s not a good option and it will be removed soon. If anything, you should use the default jitter+adapt_diag.
Metropolis is fast, but in many cases it is an illusion (e.g., see Don’t Use Metropolis), as what you care about more is the effective sample size.
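To see why raw it/s is misleading, here is a crude effective-sample-size estimate (a toy sketch: the AR(1) chain with coefficient 0.95 mimics a sticky Metropolis trace, and in PyMC3 you would use pm.effective_n on a real trace instead of this hand-rolled estimator):

```python
import numpy as np

def ess(x, max_lag=200):
    """Crude effective sample size: n / (1 + 2 * sum of autocorrelations),
    truncated once the autocorrelation dies out."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    x = x - x.mean()
    # Autocorrelation at lags 0..n-1
    acf = np.correlate(x, x, mode='full')[n - 1:] / (np.arange(n, 0, -1) * x.var())
    s = 0.0
    for rho in acf[1:max_lag]:
        if rho < 0.05:
            break
        s += rho
    return n / (1.0 + 2.0 * s)

rng = np.random.default_rng(3)
iid = rng.normal(size=5000)      # NUTS-like: nearly independent draws
ar = np.empty(5000)              # Metropolis-like: highly autocorrelated chain
ar[0] = 0.0
for i in range(1, 5000):
    ar[i] = 0.95 * ar[i - 1] + rng.normal() * np.sqrt(1 - 0.95 ** 2)
```

The sticky chain keeps only a small fraction of its nominal 5000 draws as effective samples, while the nearly independent draws keep almost all of them, so a "fast" sampler can still be slow per effective sample.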

Thank you one last time @junpenglao.

OK. With jitter+adapt_diag, NUTS cannot go beyond 15 it/s for this problem. I read the “Don’t Use Metropolis” post, and indeed I get different histograms for every job with Metropolis! I need to find a way to make NUTS faster. Any suggestion is welcome.