Time Varying Overdispersed Poisson Process

I’m getting the following error in my code:

TypeError: For compute_test_value, one input test value does not have the requested type.

The error when converting the test value to that variable type:
Wrong number of dimensions: expected 0, got 1 with shape (38994,).

Any ideas on how to get around this? Any feedback on the approach would also be appreciated! I'm borrowing a lot of the code here from Ritchie Vink's blog.

# Trend

import numpy as np
import pymc3 as pm

def det_dot(a, b):
    """
    Dot product via broadcasting, since theano's dot product and the NUTS
    sampler don't work well together on large matrices.

    :param a: (np matrix)
    :param b: (theano vector)
    """
    return (a * b[None, :]).sum(axis=-1)
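As a quick sanity check (my own example, not from the blog), det_dot reproduces a standard matrix-vector product when given plain numpy inputs:

A = np.random.rand(6, 4)
b = np.random.rand(4)
assert np.allclose(det_dot(A, b), A @ b)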

def trend_model(m, t, n_changepoints=25, changepoints_prior_scale=0.05,
                growth_prior_scale=5, changepoint_range=0.8):
    """
    The piecewise linear trend with changepoint implementation in PyMC3.
    :param m: (pm.Model)
    :param t: (np.array) MinMax scaled time.
    :param n_changepoints: (int) The number of changepoints to model.
    :param changepoints_prior_scale: (flt/None) The scale of the Laplace prior on the delta vector.
                                     If None, a hierarchical prior is set.
    :param growth_prior_scale: (flt) The standard deviation of the prior on the growth.
    :param changepoint_range: (flt) Proportion of history in which trend changepoints will be estimated.
    :return g, A, s: (tt.vector, np.array, tt.vector)
    """
    s = np.linspace(0, changepoint_range * np.max(t), n_changepoints + 1)[1:]
    
    # * 1 casts the boolean to integers
    A = (t[:, None] > s) * 1

    with m:
        # initial growth
        k = pm.Normal('k', 0, growth_prior_scale)
        
        if changepoints_prior_scale is None:
            changepoints_prior_scale = pm.Exponential('tau', 1.5)
        
        # rate of change
        delta = pm.Laplace('delta', 0, changepoints_prior_scale, shape=n_changepoints)
        # offset (rebinding m shadows the model object, which is fine since
        # we are already inside its context)
        m = pm.Normal('m', 0, 5)
        gamma = -s * delta
        
        g = (k + det_dot(A, delta)) * t + (m + det_dot(A, gamma))
    return g, A, s


def det_trend(k, m, delta, t, s, A):
    return (k + np.dot(A, delta)) * t + (m + np.dot(A, (-s * delta)))
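For reference, the trend these two functions implement is the Prophet-style piecewise linear function

$$g(t) = (k + \mathbf{A}\boldsymbol{\delta})\,t + \big(m + \mathbf{A}(-\mathbf{s} \odot \boldsymbol{\delta})\big),$$

where $\mathbf{A}_{ij} = \mathbb{1}[t_i > s_j]$ flags which changepoints $s_j$ are active at time $t_i$, $\boldsymbol{\delta}$ holds the rate changes at the changepoints, and $k$ and $m$ are the base growth rate and offset.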

# Seasonality

def fourier_series(t, p=365.25, n=10):
    # 2 pi n / p
    x = 2 * np.pi * np.arange(1, n + 1) / p
    # 2 pi n / p * t
    x = x * t[:, None]
    x = np.concatenate((np.cos(x), np.sin(x)), axis=1)
    return x
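As a quick shape check (my own example): with n harmonics the design matrix gets 2 * n columns, the cosine terms followed by the sine terms:

t = np.linspace(0, 1, 5)
x = fourier_series(t, p=1., n=3)
assert x.shape == (5, 6)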


def seasonality_model(m, df, period='yearly', seasonality_prior_scale=10):
    if period == 'yearly':
        n = 10
        # rescale the period, as t is also scaled
        p = 365.25 / (df['ds'].max() - df['ds'].min()).days
    elif period == 'weekly':
        n = 3
        # rescale the period, as t is also scaled
        p = 7 / (df['ds'].max() - df['ds'].min()).days
    else:  # daily
        n = 4
        # rescale the period, as t is also scaled; a daily period is 1 day
        p = 1 / (df['ds'].max() - df['ds'].min()).days
    
    x = fourier_series(df['t'], p, n)
    
    with m:
        beta = pm.Normal(f'beta_{period}', mu=0, sd=seasonality_prior_scale, shape=2 * n)
    
    return x, beta


import numpy as np
import pandas as pd
import scipy.stats as stats
import pymc3 as pm


def sim_data():
    ds = pd.date_range('2018-06-01 00:00:00', '2020-08-21 08:30:00',freq='1800S')
    monthly = dict(zip(range(12), [1.2, 0.4, 0.9, 1.25, 0.75, -0.5, 0.4, 0.1,-1.5, -1, -0.5, 1.25]))
    daily = dict(zip(range(7), [0.7, 1, 0.85, 0.1, -0.5, -1.35, -1]))
    hourly = dict(zip(range(24), [-2,-3.35,-4,-4.3, -5,-4.75,-4,-3.15,-1.3,0.35,2.2,3.4,3.65,3.8,3.9,3.85,3.2,2.2,1.6,1.45,1.55,1.65,0.75,-0.4]))

    df = pd.DataFrame({'ds':ds, 'month':ds.month - 1, 'day':ds.dayofweek, 'hour':ds.hour })

    seasonality = (df.month.map(monthly) + df.day.map(daily) + df.hour.map(hourly))
    lam = 10
    p = 1 / 6  # negative binomial success probability
    df['y'] = stats.nbinom(lam + seasonality, p).rvs(len(seasonality))
    return df
    
df1 = sim_data()


df1['t'] = (df1['ds'] - df1['ds'].min()) / (df1['ds'].max() - df1['ds'].min())
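As a quick check (my own addition), the half-hourly grid matches the shape reported in the error above:

assert len(df1) == 38994  # the shape (38994,) from the TypeError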


with pm.Model() as m2:

    a = pm.Normal('a', 5, 1.5)
    b = pm.Beta('b', 16, 84)
    
    g, A, s = trend_model(m2, df1['t'], changepoints_prior_scale=None)
    x_monthly, beta_monthly = seasonality_model(m2, df1, 'yearly')
    x_daily, beta_daily = seasonality_model(m2, df1, 'weekly')
    x_hourly, beta_hourly = seasonality_model(m2, df1, 'daily')

    y = pm.Deterministic('y', g + det_dot(x_monthly, beta_monthly)
                              + det_dot(x_daily, beta_daily)
                              + det_dot(x_hourly, beta_hourly))
    
    
    sigma = pm.HalfCauchy('sigma', 0.5)
    seasonality = pm.Normal('seasonality', mu=y, sd=sigma)
    
    lam = pm.Gamma('lam', a, b)
    obs = pm.Poisson('obs', lam + seasonality, observed=df1.y)
    
    trace = pm.sample(1000)

Hi Kyle

This should work:

seasonality = pm.Normal('seasonality', mu=y, sigma=sigma, shape=len(df1))

though I'm not sure why it is required, as usually the shapes broadcast fine.
Also, I changed sd to sigma, since sd will be deprecated soon.

After that I noticed that the next lines fail: k is not defined, and neither is df1.user_id. In addition, a and b are not used at all?

Ah whoops, I just edited my code: k=a, theta=b, and observed was meant to be df1.y. Just ran the code and it works, thank you!

I have a follow-up question if you're up for it: I just started running it and it's incredibly slow (3-4+ hours for sampling). Do you have any recommendations, after seeing this code, on how to improve it? I had originally figured this method would be a little faster to sample than a linear model with indexed categorical variables, since it leverages Fourier series and therefore fewer degrees of freedom, but maybe not.

Indeed it’s very slow on my machine as well.

I noticed a few things:

  • Both lam and m inside trend_model model the same thing, the global mean; you only need one of them. I set m to zero.
  • For lam, the global mean, do you need the hyper-priors a and b? I'm not sure what information they add; I would change lam to something like HalfNormal(10.). If you do want to keep the hyper-priors, then you need to ensure a and b are strictly positive.
  • The seasonality prior scale is quite wide; try lowering it to 5 or 2.
  • Use theano functions where possible instead of numpy, e.g.:
import theano.tensor as tt
def fourier_series(t, p=365.25, n=10):
    # 2 pi n / p
    x = 2 * np.pi * tt.arange(1, n + 1) / p
    # 2 pi n / p * t
    x = x * t[:, None]
    x = tt.concatenate((tt.cos(x), tt.sin(x)), axis=1)
    return x
  • You have 25 changepoints, but your (fake) data doesn't have any trend; perhaps lower the number of changepoints.
  • Do you need such high orders on the Fourier series? You have order 10 for yearly, 3 for weekly, and 4 for daily.

Even after all of these changes, the sampler still estimates a run time of a couple of hours for me.

My eventual goal is to build a threshold-based anomaly detection model for count data with a time-varying Poisson process. Since the data is overdispersed, I made it a negative binomial by mixing the Poisson with a gamma distribution, so I don't want to drop that part without knowing it's unnecessary. Right now I'm trying to simulate fake data to make sure I recover the right parameter values; then I'll add in fake anomalies, and then I'll try to apply this to a few real datasets.

It turns out the shape=len(df1) argument is contributing quite a bit to the run time, as it's fitting a parameter for every single observation. I'm trying to find another way around it, and I have a feeling an alternative will speed things back up.

Also, great points on all the priors; I appreciate the feedback.

You are right, we don’t need that pm.Normal(). The below works for me and is much quicker, taking just a few minutes to sample.

with pm.Model() as m2:
    #a = pm.Gamma('a', 5, 1.5)
    #b = pm.Beta('b', 8, 41)
    
    g, A, s = trend_model(m2, df1['t'], changepoints_prior_scale=0.2, n_changepoints=4)
    x_monthly, beta_monthly = seasonality_model(m2, df1, 'yearly')
    x_daily, beta_daily = seasonality_model(m2, df1, 'weekly')
    x_hourly, beta_hourly = seasonality_model(m2, df1, 'daily')

    y = pm.Deterministic('y', g + det_dot(x_monthly, beta_monthly)
                              + det_dot(x_daily, beta_daily)
                              + det_dot(x_hourly, beta_hourly))
    
    #sigma = pm.HalfCauchy('sigma', 10.)
    #seasonality = pm.Normal('seasonality', mu=y, sigma=sigma, shape=len(df1))
    
    #lam = pm.Gamma('lam', a, b)
    lam = pm.HalfNormal('lam', 20.)
    obs = pm.Poisson('obs', lam + y, observed=df1.y)
    
    trace = pm.sample(1000)
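To sanity check the fit afterwards (a standard PyMC3 step, not something from the thread):

pm.summary(trace)  # inspect r_hat and effective sample sizes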

Is it reasonable to change the model structure like this? We know that the data comes from an overdispersed Poisson process, so I don't think this makes much sense.

PyMC3 does have a NegativeBinomial distribution if you want to model the overdispersion directly.
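For example, a minimal sketch of swapping the likelihood in the model above (the HalfNormal prior on alpha is just an illustrative choice, not something tested in this thread):

with pm.Model() as m3:
    g, A, s = trend_model(m3, df1['t'], changepoints_prior_scale=0.2, n_changepoints=4)
    x_monthly, beta_monthly = seasonality_model(m3, df1, 'yearly')
    x_daily, beta_daily = seasonality_model(m3, df1, 'weekly')
    x_hourly, beta_hourly = seasonality_model(m3, df1, 'daily')

    y = pm.Deterministic('y', g + det_dot(x_monthly, beta_monthly)
                              + det_dot(x_daily, beta_daily)
                              + det_dot(x_hourly, beta_hourly))

    lam = pm.HalfNormal('lam', 20.)
    # alpha controls the overdispersion: Var(obs) = mu + mu**2 / alpha
    alpha = pm.HalfNormal('alpha', 10.)
    # as with the Poisson version, mu = lam + y must stay positive
    obs = pm.NegativeBinomial('obs', mu=lam + y, alpha=alpha, observed=df1.y)

    trace = pm.sample(1000)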
