Combining AR/Negative Binomial with Gaussian Random Walk

Hello,

First of all, thank you to everyone who spends their time developing and maintaining this fantastic library! I’ve been trying to implement my first Bayesian model and have become stuck. The original idea was to implement the paper cited in this post: Vectorized autoregressive model

However, my data is non-stationary and highly time-dependent, and it is also divided into categories, so I thought to use the GaussianRandomWalk as shown here: https://twiecki.io/blog/2017/03/14/random-walk-deep-net/

Here is my best attempt, drawing heavily from those two sources:

import numpy as np
import pymc3 as pm
import theano.tensor as tt

with pm.Model() as stochastic_model:
    # global hyperprior for the alpha parameter of the Negative Binomial
    alpha = pm.Uniform("alpha", lower=0.001, upper=0.1)
    # attempting to make coefficients that depend on time
    step_size = pm.HalfNormal('step_size', sd=np.ones(df.shape[1]), shape=df.shape[1])
    w = pm.GaussianRandomWalk('w', sd=step_size, shape=(df.shape[0], df.shape[1]))
    # hyperpriors for the precision, tau
    kappa_tau = pm.Uniform("kappa_tau", lower=5, upper=10)
    beta_tau = pm.Uniform("beta_tau", lower=2, upper=25)
    tau_l = pm.Gamma("tau_l", kappa_tau, beta_tau, shape=df.shape[1])
    alpha_l = pm.Exponential("alpha_l", alpha, shape=df.shape[1])
    # AR2d comes from the first link above; it lets pm.AR accept a multidimensional rho
    eta_l = AR2d("eta_l", rho=w, tau=tau_l, constant=False, shape=(df.shape[0], df.shape[1]))
    y_t_l = pm.NegativeBinomial("y_t_l", tt.exp(eta_l), alpha_l, observed=df)

    trace = pm.sample(5000, cores=4)

My dataframe is 101 rows and 5 columns (a time series with 101 observations of count data for 5 different categories), and I hope to bring in more categories eventually. The sampling did not converge (“The chain contains only diverging samples. The model is probably misspecified”) and the acceptance probabilities were within rounding error of 0. I’m trying to understand whether the problem is in the ranges I chose for the hyperpriors, whether the priors should be different distributions, whether I’m getting the shape wrong in some instances, or whether the model is conceptually flawed.

Thanks to anyone who can lend advice! If there’s more information I can provide, please let me know.

I realized that the way the AR2d class takes rho and uses len(rho) as the number of lags to regress on isn’t compatible with how I’m defining the coefficients. I’m going to try the simple AR1 and see how it works!
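
For reference, this is roughly the per-category AR1 I have in mind (just a sketch; the prior choices are placeholders, and I haven’t yet checked that AR1 broadcasts over a 2D shape and a vector k the way I expect):

import pymc3 as pm

with pm.Model() as ar1_sketch:
    # AR(1) coefficient per category, kept in (-1, 1) for stationarity
    k = pm.Uniform('k', lower=-1, upper=1, shape=df.shape[1])
    # innovation precision per category
    tau_e = pm.Gamma('tau_e', 2., 1., shape=df.shape[1])
    # one AR(1) latent series per category (rows are time points)
    eta = pm.AR1('eta', k=k, tau_e=tau_e, shape=(df.shape[0], df.shape[1]))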

Hi Samuel, how did trying AR1 work out? Interested in this, cheers

Still trying to figure it out! I’ll post when I have something interesting to report

At this point, because my model is hierarchical with several levels of priors and hyperpriors, I’m using the GaussianRandomWalk at the lowest level to model an AR1 coefficient and ‘inheriting’ from higher levels by adding a value drawn from a normal distribution to the GRW. If someone knows a better way for a GRW to inherit from a higher level, I would be grateful to hear it!
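
To be concrete, this is roughly what I mean by ‘inheriting’ (a rough sketch with made-up shapes and prior choices; I’m not certain the per-column sd broadcasts exactly like this, but it mirrors what I did above):

import pymc3 as pm

n_t, n_cat = 101, 5  # time points and categories (placeholders)
with pm.Model() as grw_sketch:
    # higher-level mean of the AR1 coefficient for each category
    mu_top = pm.Normal('mu_top', mu=0., sd=1., shape=n_cat)
    # per-category random walk around zero (rows are time points)
    step = pm.HalfNormal('step', sd=0.1, shape=n_cat)
    grw = pm.GaussianRandomWalk('grw', sd=step, shape=(n_t, n_cat))
    # the lowest level 'inherits' by adding the higher-level draw to the walk
    coef = pm.Deterministic('coef', mu_top + grw)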

Hello,

I’ve tried some different things, and here is where my model is at. It’s a lot to read, but it is basically a hierarchical model that starts with an MvGaussianRandomWalk (technically one for the intercept and one for the coefficient) on the first level (across 100 time points), and then has levels for the 4 categorical variables. The 4th variable is the year (I have 4 years of data); the idea is that I will use the coefficients from the level above for inference, to avoid overfitting. I am almost certain there’s a better way to do this, as my trace dataframe has 450,000 rows. I’m also concerned because I saw in this post: Using Multivariate distributions for price optimization that I possibly shouldn’t use covariance matrices, which I’m doing. Any advice would be greatly welcome!

Thanks

import theano
import pymc3 as pm

N=100 #length of time series
lagged=theano.shared(df['lagged'].values) #previous value in time series
vals=theano.shared(df['vals'].values) #observed values
with pm.Model() as model:

    global_beta_mu=pm.Normal('global_beta_mu',mu=0,sd=100,shape=(N-1,var1_count))
    #var1_count is 5 (the number of unique categories in my first variable)
    global_alpha_mu=pm.Normal('global_alpha_mu',mu=0,sd=100,shape=(N-1,var1_count))
    packed_L_β = pm.LKJCholeskyCov('packed_L_beta', n=var1_count, eta=2., sd_dist=pm.HalfCauchy.dist(2.5))
    L_β = pm.expand_packed_triangular(var1_count, packed_L_β)
    packed_L_α = pm.LKJCholeskyCov('packed_L_alpha', n=var1_count, eta=2., sd_dist=pm.HalfCauchy.dist(2.5))
    L_α = pm.expand_packed_triangular(var1_count, packed_L_α)
    alpha = pm.MvGaussianRandomWalk('alpha', mu=global_alpha_mu[:,var1_indexes], shape=(N, var1_count), chol=L_α)
    beta = pm.MvGaussianRandomWalk('beta', mu=global_beta_mu[:,var1_indexes], shape=(N, var1_count), chol=L_β)
    #vars12_count is 10 (the unique combinations of categories from first 2 variables)
    vars12_alpha_sd=pm.HalfCauchy('vars12_alpha_sd',beta=4,shape=(N,vars12_count))
    vars12_beta_sd=pm.HalfCauchy('vars12_beta_sd',beta=4,shape=(N,vars12_count))
    vars12_alpha_offset=pm.Normal('vars12_alpha_offset',mu=0,sd=1,shape=(N,vars12_count)) 
    vars12_beta_offset=pm.Normal('vars12_beta_offset',mu=0,sd=1,shape=(N,vars12_count))
    vars12_alpha_m=pm.Deterministic('vars12_alpha_m',alpha[:,vars12_indexes] + 
                        vars12_alpha_offset * vars12_alpha_sd[:,vars12_indexes])
    vars12_beta_m=pm.Deterministic('vars12_beta_m',beta[:,vars12_indexes] + 
                        vars12_beta_offset*vars12_beta_sd[:,vars12_indexes])
    #vars123_count is 130 (unique combinations of first three variables)
    vars123_alpha_sd = pm.HalfCauchy('vars123_alpha_sd', beta=4, shape=(N,vars123_count)) 
    vars123_beta_sd = pm.HalfCauchy('vars123_beta_sd', beta=4, shape=(N,vars123_count))
    vars123_alpha_offset=pm.Normal('vars123_alpha_offset',mu=0,sd=1,shape=(N,vars123_count)) 
    vars123_beta_offset=pm.Normal('vars123_beta_offset',mu=0,sd=1,shape=(N,vars123_count))
    vars123_alpha_m=pm.Deterministic('vars123_alpha_m',vars12_alpha_m[:,vars123_indexes] +
                        vars123_alpha_offset*vars123_alpha_sd[:,vars123_indexes])
    vars123_beta_m=pm.Deterministic('vars123_beta_m',vars12_beta_m[:,vars123_indexes]+
                        vars123_beta_offset*vars123_beta_sd[:,vars123_indexes])    
    #vars1234_count is 520 (unique combinations of all four variables)
    vars1234_alpha_sd = pm.HalfCauchy('vars1234_alpha_sd', beta=4, shape=(N,vars1234_count))
    vars1234_beta_sd = pm.HalfCauchy('vars1234_beta_sd', beta=4, shape=(N,vars1234_count))
    vars1234_alpha_offset=pm.Normal('vars1234_alpha_offset',mu=0,sd=1,shape=(N,vars1234_count))
    vars1234_beta_offset=pm.Normal('vars1234_beta_offset',mu=0,sd=1,shape=(N,vars1234_count))
    vars1234_alpha_m=pm.Deterministic('vars1234_alpha_m',vars123_alpha_m[:,vars1234_indexes]+
                        vars1234_alpha_offset*vars1234_alpha_sd[:,vars1234_indexes])
    vars1234_beta_m=pm.Deterministic('vars1234_beta_m',vars123_beta_m[:,vars1234_indexes] +
                        vars1234_beta_offset*vars1234_beta_sd[:,vars1234_indexes])

    regression=pm.Deterministic('regression',
                vars1234_alpha_m[df['dates'].values,df['vars1234_index'].values]+
                vars1234_beta_m[df['dates'].values,
                df['vars1234_index'].values]*lagged)

    obs_error=pm.Uniform('obs_error',lower=0,upper=100)

    likelihood=pm.Normal('likelihood',mu=regression,sd=obs_error,observed=vals)    

    trace=pm.sample(1000)

Also, sampling is incredibly slow, around 10 seconds per draw (not draws per second!).

If I’m reading this correctly, you have a cohort of N objects observed over T time steps, each belonging to one group in each of 4 categories, with (5, 2, 13, 4) groups in those categories.

You then pre-compute the (N x 520) indicator matrix, reshape it into a (Nx1) matrix, with external indices representing the group information, and replicate it to a (TxN) matrix. Thus all([x in vars12_indexes for x in vars123_indexes]) should be true. So your hierarchical model looks like

\beta_{1a\gamma} \sim N(\beta_{1a}, \sigma_\gamma^2)

with

[\beta_1, \dots ,\beta_5](t) \sim \mathrm{RW}(\Sigma)(t)

If I got this correct, then the model looks well-specified. In the absence of other covariates on which to condition e.g. \beta_2(t), you can’t do much better than a random walk (though you might look into letting \beta_2(t) be a function of global principal components, which would reduce the number of parameters from 10+5T to 5 n_{PC}).
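
To make the principal-components idea concrete, a rough sketch (assuming `pcs` is a precomputed (T x n_pc) basis, e.g. from an SVD of the de-trended series; the names and the number of components are placeholders):

import pymc3 as pm
import theano.tensor as tt

n_pc = 3  # number of retained components (placeholder)
with pm.Model() as pc_sketch:
    # one loading per component and per category: 5 * n_pc parameters in total
    loadings = pm.Normal('loadings', mu=0., sd=1., shape=(n_pc, 5))
    # beta(t) = fixed (T x n_pc) basis times learned loadings -> (T x 5)
    beta = pm.Deterministic('beta', tt.dot(pcs, loadings))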

At any rate, this is a very complicated model with well over 1000 parameters, well into the territory where practical sampling requires a specifically-tailored MCMC. General samplers will certainly improve in the future, so I suspect this will not always be the case, but for now NUTS can’t work magic.

I can see two approaches for tractability:

  1. Use ADVI or low-rank ADVI (Householder flow); a minimal sketch of the ADVI call follows this list.
  2. Marginalize out the pieces you don’t want: use N(0, t\Sigma) in place of the random walk, or choose conjugate priors for the categories you’re not actively analyzing and marginalize them out.
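
For approach 1, the mechanics would look roughly like this (a minimal sketch; the iteration count is arbitrary, and `model` is your model object from above):

with model:
    # mean-field ADVI; 'fullrank_advi' is the heavier full-rank variant
    approx = pm.fit(n=50000, method='advi')
# draw approximate posterior samples from the fitted approximation
trace = approx.sample(2000)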

Good luck!

Thanks so much for taking the time to respond! You’re spot on in your understanding of the model. It’s very reassuring to hear at least my model is well-specified - it’s my first pymc3 project so I thought for sure I was doing something wrong! I’ll use your suggestions to try to improve it.

Hello again,

When I use ADVI, I get OK results, but they’re still pretty far off the mark. The ELBO seems to stop improving pretty quickly:
[ELBO convergence plot]

When I try the Householder flow, my kernel dies every time. Do you have any idea why that might be or how to improve it?

Thanks again!

Also, might it be better to replace the Normal distributions I use in my model with other distributions?

I’m not sure that the convergence rate of the ELBO necessarily indicates anything.

When you say “OK results but pretty far off the mark,” what exactly do you mean? You have some aggressively broad priors (N(0, 100) / HalfCauchy(4)) on a huge number of parameters. It would require quite a lot of data to overcome these uncertainties.

I would try a couple of things to simplify the model further. More informative priors would help (do you think beta values of 200 are reasonable? If not, why N(0, 100)?). You can also share variances within and/or across the levels. For instance, it looks like each of your standard deviation variables is point-specific (shape (N, K)). You might pool this like:

  vars12_alpha_sd_pooled = pm.HalfCauchy('vars12_alpha_sd_pooled', beta=4, shape=(1, vars12_count))
  vars12_alpha_sd = pm.Deterministic('vars12_alpha_sd', tt.dot(tt.ones((N, 1)), vars12_alpha_sd_pooled))

or even make it global for the level

  vars12_alpha_sd_collapsed = pm.HalfCauchy('vars12_alpha_sd_collapsed', beta=4)
  vars12_alpha_sd = pm.Deterministic('vars12_alpha_sd', vars12_alpha_sd_collapsed * tt.ones((N, vars12_count)))

In general I have found it is much easier to start with a very simple (or even simplistic) model. There is a great lure to write down a complicated model that takes care of all the different effects you can think of, but at first I never manage to come up with something that is actually reasonable (I make that mistake a lot :wink: ). There are just so many different small things that you never know why the model doesn’t work. Usually the sampler just doesn’t like it (and often for good reasons; the model might just be terribly misspecified). If you build up the model piece by piece, then at least you know which new part introduced the problems. That also makes it easier to get help. As it is, it is a lot of work just to understand what your model is doing in the first place.


Thanks to both of you for your advice! This is my first assignment as a Data Scientist, so I wanted to produce something really great, but I think you’re right that I went all out for an overly sophisticated model. I will try to make a simple model that works and go from there!

We’ve all been there. :slight_smile:
And feel free to continue asking questions if something comes up.

Hello again! I think I’ve realized something I was doing wrong :slight_smile:

I think that, for my priors, I was trying to fit a single coefficient to several time series with different behavior, so quite logically it wasn’t finding a workable result. What I meant to do was fit the coefficient to the aggregated-level time series and use that as the prior for the next hierarchical level!

So I was wondering how best to do this. Can I run a separate model at the most aggregated level and use the resulting coefficients as priors? Or aggregate the data within the model itself? Or make separate data sets for the aggregated series and include them in the same model as observed? I was looking at the Potential function, but I can’t tell whether this would be a good use for it.
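
To make the first option concrete, I imagine something like this (a rough sketch; `agg_trace` would be the trace from a model already fit to the aggregated series, and the variable names are placeholders):

import pymc3 as pm

# summarise the aggregated-level posterior for the coefficient
beta_mu = agg_trace['beta'].mean(axis=0)
beta_sd = agg_trace['beta'].std(axis=0)

with pm.Model() as lower_level_model:
    # informative prior centred on the aggregated-level estimate
    beta = pm.Normal('beta', mu=beta_mu, sd=beta_sd, shape=len(beta_mu))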

Thanks again as always!