Are there any tutorials about time series modeling of a binomial process in PyMC3? For example, modeling something like a conversion rate over time with regressors while accounting for serial autocorrelation in the errors.
Here’s a minimal example of a binomial model with predictors and autocorrelated probability of success. It’s basically a binomial GLM with an extra latent variable representing an autocorrelated adjustment to the probability of success. Feel free to ask for more clarification on any of the details below.
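Written out, the data generating process used below is (with $x_t$ the row of predictors at time $t$, and $\varepsilon_0 = 0$):

$$\varepsilon_t = \varepsilon_{t-1} + \eta_t, \quad \eta_t \sim \mathcal{N}(0, 0.25^2)$$
$$y_t \sim \text{Binomial}\!\left(n,\ \text{logit}^{-1}(x_t^\top \beta + \varepsilon_t)\right)$$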
import pymc3 as pm
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import expit

# Create simulated data
n = 100  # Number of binomial trials per observation
t = 40   # Number of time series observations
p = 4    # Number of predictors

# Draw the latent error series from an autocorrelated Gaussian
# random walk, i.e. the cumulative sum of iid normal jumps
eps_jumps = np.random.randn(t) * 0.25
eps = np.cumsum(eps_jumps)

# Fixed regression coefficients and design matrix
beta = np.random.randn(p)
X = np.random.randn(t, p)

# Linear predictor on the logit scale, mapped to probabilities
eta = X @ beta + eps
prob = expit(eta)
y = np.random.binomial(n, prob)
# Declare the PyMC3 model and sample
with pm.Model() as model:
    coefs = pm.Normal('coefs', sigma=10, shape=p)
    latent_sigma = pm.HalfNormal('latent_sigma')
    # The random walk prior on the latent series is what captures
    # the serial correlation in the errors
    latent = pm.GaussianRandomWalk('latent', sigma=latent_sigma, shape=t)
    mu = X @ coefs + latent
    prob_estimated = pm.math.sigmoid(mu)
    response = pm.Binomial('response', n, prob_estimated, observed=y)
    trace = pm.sample()
# Compare the estimated coefficients against the true values
print(pm.summary(trace, varnames=['coefs']))
print(beta)
# Compute a 90% credible interval for the autocorrelated part
# of the success probability (on the logit scale)
credible_intervals = np.percentile(trace['latent'], axis=0, q=[5, 95])
plt.plot(trace['latent'].mean(axis=0), label='Posterior mean estimate')
plt.fill_between(np.arange(t), credible_intervals[0], credible_intervals[1],
                 alpha=0.25, label='90% credible interval')
plt.plot(eps, label='True value', color='k')
plt.legend()
plt.show()
This is great, and including the data generating model with regressors made it really clear, thank you!
The only follow-up question I have: any recommended learning resources or books that would help me start coming up with my own solutions like this in the Bayesian / time series / GLM space? I’ve read Statistical Rethinking, but it doesn’t have much on time series (and I’m a little weak with time series anyway, tbh).
I would say the biggest gap in my theoretical understanding is how using a latent variable that is a random walk accounts for the autocorrelation.
It’s challenging to find good resources because much of the Bayesian time series literature focuses on more efficient inference techniques like filtering. Those methods scale well to much longer time series (e.g. T >> 100) but typically require deriving a new set of inference equations for each model.
An additional problem is that some of the classical time series / econometrics literature focuses on very specific time series models such as GARCH and ARMA without explaining how they fit into a larger class of dynamic models. This review paper moves quickly and assumes a fair amount of background, but it presents a very general framework for thinking about dynamic linear models (DLMs) for time series. DLMs include autoregressions and dynamic regressions as special cases. I have not read this paper in detail, but it also appears to cover the same material. It’s important to keep in mind that we can be much more flexible with regard to parametric assumptions than these DLM papers, since NUTS is a general-purpose inference algorithm.
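For reference, the state-space form those papers work with looks roughly like this in the usual notation, where $\theta_t$ is the latent state, $F_t$ maps it to the observation, and $G_t$ governs its evolution:

$$y_t = F_t^\top \theta_t + v_t, \quad v_t \sim \mathcal{N}(0, V_t)$$
$$\theta_t = G_t \theta_{t-1} + w_t, \quad w_t \sim \mathcal{N}(0, W_t)$$

The binomial model above has the same latent structure; we’ve just swapped the Gaussian observation equation for a binomial likelihood through a logit link, which is exactly the kind of flexibility NUTS makes easy.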
Finally, since PyMC3 and Stan share many similarities, I often use the Stan manual for reference. It has a section on time series data as well.
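On your theory question: a Gaussian random walk is just the cumulative sum of independent normal jumps, so each value equals the previous value plus a small increment, and neighboring values end up highly correlated. Putting that prior on the latent series is what lets the model soak up serially correlated errors that iid noise can’t. A quick numpy sanity check (just a sketch, not a formal argument):

import numpy as np

np.random.seed(42)
jumps = np.random.randn(10000)  # iid noise: lag-1 correlation near 0
walk = np.cumsum(jumps)         # random walk: lag-1 correlation near 1
print(np.corrcoef(jumps[:-1], jumps[1:])[0, 1])
print(np.corrcoef(walk[:-1], walk[1:])[0, 1])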
Thanks for this, I really appreciate it! You make a great point: the most frustrating part of looking for time series resources is that there’s so much GARCH/ARMA literature that I can’t find much else (without knowing what to look for). Looking forward to reading that review paper and getting a broader sense of the solution space.
I ended up having a follow-up question if you have the time (wasn’t sure if I should make a new post): what would the data generating process and model look like if there were monthly seasonality? I’m having trouble creating a seasonal random walk, and I’m not quite sure how that would change the model either. I’d like to add a trend to the data as well, but I think that should be a little easier.
Sure, this is a good place to have that discussion. The simplest way to incorporate seasonality is to include a sine or cosine wave as a predictor. You may have to adjust the phase of the angle to line up with your prior assumptions, if you have any. This has the downside that it forces adjacent months’ effects to be similar to each other. If you have more data, you might want to include a dummy variable for each month as a predictor instead. The same strategy works for a trend: just include it as another predictor. In both of these cases, the data generating process doesn’t really change in structure; we just add extra columns to the design matrix.
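For concreteness, here’s a minimal sketch of building those extra columns (the period of 12 assumes monthly data; including both sine and cosine terms lets the model pick the phase for you, so you don’t have to tune it by hand):

import numpy as np

t = 40
time_idx = np.arange(t)
period = 12  # monthly seasonality

season_sin = np.sin(2 * np.pi * time_idx / period)
season_cos = np.cos(2 * np.pi * time_idx / period)
trend = time_idx / t  # scaled linear trend

X = np.random.randn(t, 4)  # stand-in for the original predictors
X_full = np.column_stack([X, season_sin, season_cos, trend])

Nothing in the model itself needs to change except p and the contents of the design matrix.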
Since you mentioned ‘seasonal random walk’, I suspect that you want to let the effect of the seasonal component of your model vary over time. That’s the crux of the dynamic linear model framework. In this case, you can modify the above code to include extra GaussianRandomWalk components which are not directly added (as in the code I showed above) but which are multiplied by your covariates. Since the GRW is a relatively simple process, I’ve implemented it below in terms of simpler normally distributed increments. Here, each coefficient becomes an entire vector with t autocorrelated elements. We then take the elementwise product with the predictors and sum along the covariate axis, which is why we have axis=1 in the computation of mu. If you include a seasonal or monthly predictor in the covariates, this implements a model which accounts for seasonality, while letting the influence of the seasonality vary over time thanks to the random walk prior over coefs.
import theano.tensor as tt

with pm.Model() as model:
    latent_sigma = pm.HalfNormal('latent_sigma', shape=[1, p])
    coef_means = pm.Normal('coef_means', sigma=10, shape=p)
    # Normal increments whose cumulative sum gives each coefficient
    # its own Gaussian random walk over time
    coef_jumps = pm.Normal('coef_jumps', sigma=latent_sigma, shape=[t, p])
    coefs = pm.Deterministic('coefs', coef_means + tt.cumsum(coef_jumps, axis=0))
    # Elementwise product with the predictors, summed over the covariate axis
    mu = pm.math.sum(X * coefs, axis=1)
    prob_estimated = pm.math.sigmoid(mu)
    response = pm.Binomial('response', n, prob_estimated, observed=y)
    trace = pm.sample()
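To simulate data from this model (useful for checking that you can recover the coefficients), the data generating side would look like this sketch; the 0.1 jump scale is arbitrary, with smaller values giving more slowly drifting coefficients:

import numpy as np
from scipy.special import expit

n, t, p = 100, 40, 4
X = np.random.randn(t, p)
coef_means = np.random.randn(p)
coef_jumps = np.random.randn(t, p) * 0.1
coefs = coef_means + np.cumsum(coef_jumps, axis=0)  # shape (t, p): one random walk per coefficient
mu = np.sum(X * coefs, axis=1)
y = np.random.binomial(n, expit(mu))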
Ah, I didn’t realize a seasonal random walk meant that the seasonal component varies over time (I still have a lot to learn). I decided to implement a sine wave as a predictor like you suggested and was able to recover the coefficients from data I simulated, so thank you! This is a great solution; I’ll have to simulate data with varying seasonality over time using your code next.