Time Varying Overdispersed Poisson Process

I’m getting the following error in my code:

TypeError: For compute_test_value, one input test value does not have the requested type.

The error when converting the test value to that variable type:
Wrong number of dimensions: expected 0, got 1 with shape (38994,).

Any ideas on how to get around this? Any feedback on the approach would also be appreciated! I'm borrowing a lot of the code here from Ritchie Vink's blog.

# Trend

import numpy as np
import pymc3 as pm

def det_dot(a, b):
    """
    Dot product via broadcasting, since theano's dot product and the NUTS
    sampler don't work well together on large matrices.

    :param a: (np matrix)
    :param b: (theano vector)
    """
    return (a * b[None, :]).sum(axis=-1)
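As a quick sanity check (my own example, not from the blog), det_dot reproduces a standard matrix-vector product when given plain numpy inputs:

A = np.random.rand(6, 4)
b = np.random.rand(4)
assert np.allclose(det_dot(A, b), A @ b)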

def trend_model(m, t, n_changepoints=25, changepoints_prior_scale=0.05,
                growth_prior_scale=5, changepoint_range=0.8):
    """
    The piecewise linear trend with changepoint implementation in PyMC3.
    :param m: (pm.Model)
    :param t: (np.array) MinMax scaled time.
    :param n_changepoints: (int) The number of changepoints to model.
    :param changepoints_prior_scale: (flt/None) The scale of the Laplace prior on the delta vector.
                                     If None, a hierarchical prior is set.
    :param growth_prior_scale: (flt) The standard deviation of the prior on the growth.
    :param changepoint_range: (flt) Proportion of history in which trend changepoints will be estimated.
    :return g, A, s: (tt.vector, np.array, tt.vector)
    """
    s = np.linspace(0, changepoint_range * np.max(t), n_changepoints + 1)[1:]
    
    # * 1 casts the boolean to integers
    A = (t[:, None] > s) * 1

    with m:
        # initial growth
        k = pm.Normal('k', 0, growth_prior_scale)
        
        if changepoints_prior_scale is None:
            changepoints_prior_scale = pm.Exponential('tau', 1.5)
        
        # rate of change
        delta = pm.Laplace('delta', 0, changepoints_prior_scale, shape=n_changepoints)
        # offset (rebinding m shadows the model object, which is fine since
        # we are already inside its context)
        m = pm.Normal('m', 0, 5)
        gamma = -s * delta
        
        g = (k + det_dot(A, delta)) * t + (m + det_dot(A, gamma))
    return g, A, s


def det_trend(k, m, delta, t, s, A):
    return (k + np.dot(A, delta)) * t + (m + np.dot(A, (-s * delta)))
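For reference, the trend these two functions implement is the Prophet-style piecewise linear function

$$g(t) = (k + \mathbf{A}\boldsymbol{\delta})\,t + \big(m + \mathbf{A}(-\mathbf{s} \odot \boldsymbol{\delta})\big),$$

where $\mathbf{A}_{ij} = \mathbb{1}[t_i > s_j]$ flags which changepoints $s_j$ are active at time $t_i$, $\boldsymbol{\delta}$ holds the rate changes at the changepoints, and $k$ and $m$ are the base growth rate and offset.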

# Seasonality

def fourier_series(t, p=365.25, n=10):
    # 2 pi n / p
    x = 2 * np.pi * np.arange(1, n + 1) / p
    # 2 pi n / p * t
    x = x * t[:, None]
    x = np.concatenate((np.cos(x), np.sin(x)), axis=1)
    return x
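As a quick shape check (my own example): with n harmonics the design matrix gets 2 * n columns, the cosine terms followed by the sine terms:

t = np.linspace(0, 1, 5)
x = fourier_series(t, p=1., n=3)
assert x.shape == (5, 6)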


def seasonality_model(m, df, period='yearly', seasonality_prior_scale=10):
    if period == 'yearly':
        n = 10
        # rescale the period, as t is also scaled
        p = 365.25 / (df['ds'].max() - df['ds'].min()).days
    elif period == 'weekly':
        n = 3
        # rescale the period, as t is also scaled
        p = 7 / (df['ds'].max() - df['ds'].min()).days
    else:  # daily
        n = 4
        # rescale the period, as t is also scaled; a daily period is 1 day
        p = 1 / (df['ds'].max() - df['ds'].min()).days
    
    x = fourier_series(df['t'], p, n)
    
    with m:
        beta = pm.Normal(f'beta_{period}', mu=0, sd=seasonality_prior_scale, shape=2 * n)
    
    return x, beta


import numpy as np
import pandas as pd
import scipy.stats as stats
import pymc3 as pm


def sim_data():
    ds = pd.date_range('2018-06-01 00:00:00', '2020-08-21 08:30:00',freq='1800S')
    monthly = dict(zip(range(12), [1.2, 0.4, 0.9, 1.25, 0.75, -0.5, 0.4, 0.1,-1.5, -1, -0.5, 1.25]))
    daily = dict(zip(range(7), [0.7, 1, 0.85, 0.1, -0.5, -1.35, -1]))
    hourly = dict(zip(range(24), [-2,-3.35,-4,-4.3, -5,-4.75,-4,-3.15,-1.3,0.35,2.2,3.4,3.65,3.8,3.9,3.85,3.2,2.2,1.6,1.45,1.55,1.65,0.75,-0.4]))

    df = pd.DataFrame({'ds':ds, 'month':ds.month - 1, 'day':ds.dayofweek, 'hour':ds.hour })

    seasonality = (df.month.map(monthly) + df.day.map(daily) + df.hour.map(hourly))
    lam = 10
    p = 1 / 6  # negative binomial success probability
    df['y'] = stats.nbinom(lam + seasonality, p).rvs(len(seasonality))
    return df
    
df1 = sim_data()


df1['t'] = (df1['ds'] - df1['ds'].min()) / (df1['ds'].max() - df1['ds'].min())
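As a quick check (my own addition), the half-hourly grid matches the shape reported in the error above:

assert len(df1) == 38994  # the shape (38994,) from the TypeError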


with pm.Model() as m2:

    a = pm.Normal('a', 5, 1.5)
    b = pm.Beta('b', 16, 84)
    
    g, A, s = trend_model(m2, df1['t'], changepoints_prior_scale=None)
    x_monthly, beta_monthly = seasonality_model(m2, df1, 'yearly')
    x_daily, beta_daily = seasonality_model(m2, df1, 'weekly')
    x_hourly, beta_hourly = seasonality_model(m2, df1, 'daily')

    y = pm.Deterministic('y', g + det_dot(x_monthly, beta_monthly)
                              + det_dot(x_daily, beta_daily)
                              + det_dot(x_hourly, beta_hourly))
    
    
    sigma = pm.HalfCauchy('sigma', 0.5)
    seasonality = pm.Normal('seasonality', mu=y, sd=sigma)
    
    lam = pm.Gamma('lam', a, b)
    obs = pm.Poisson('obs', lam + seasonality, observed=df1.y)
    
    trace = pm.sample(1000)

Hi Kyle

This should work:

seasonality = pm.Normal('seasonality', mu=y, sigma=sigma, shape=len(df1))

though I'm not sure why it is required, as usually the shapes broadcast fine.
Also, I changed sd to sigma, since sd will be deprecated soon.

After that I noticed that the next lines fail: k is not defined, and neither is df1.user_id. In addition, a and b are not used at all?

Ah whoops, I just edited my code: k=a, theta=b, and observed was meant to be df1.y. Just ran the code and it works, thank you!

I have a follow-up question if you're up for it: I just started running it and it's incredibly slow (3-4+ hours for sampling). Do you have any recommendations, after seeing this code, on how to improve it? I had originally figured this method would be a little faster to sample than a linear model with indexed categorical variables, since it leverages Fourier series and therefore fewer degrees of freedom, but maybe not.

Indeed it’s very slow on my machine as well.

I noticed a few things:

  • Both lam and m inside trend_model model the same thing, the global mean; you only need one of them. I set m to zero.
  • For lam, the global mean, do you need the hyper-priors a and b? I'm not sure what information they add; I would change lam to something like HalfNormal(10.). If you do want to keep the hyper-priors, then you need to ensure a and b are strictly positive.
  • The seasonality prior scale is quite wide; try lowering it to 5 or 2.
  • Use theano functions where possible instead of numpy, e.g.:
import theano.tensor as tt
def fourier_series(t, p=365.25, n=10):
    # 2 pi n / p
    x = 2 * np.pi * tt.arange(1, n + 1) / p
    # 2 pi n / p * t
    x = x * t[:, None]
    x = tt.concatenate((tt.cos(x), tt.sin(x)), axis=1)
    return x
  • You have 25 changepoints, but your (fake) data doesn't have any trend; perhaps lower the number of changepoints.
  • Do you need such high orders on the Fourier series? You have order 10 for yearly, 3 for weekly, and 4 for daily.

Even after all of these changes, the sampler still estimates a run time of a couple of hours for me.

My eventual goal is to build a threshold-based anomaly detection model for count data with a time-varying Poisson process. Since the data is overdispersed, I made it a negative binomial by mixing the Poisson with a gamma distribution, so I don't want to drop that part without knowing it's unnecessary. Right now I'm trying to simulate fake data to make sure I recover the right parameter values; then I'll add in fake anomalies, and then I'll try to apply this to a few real datasets.

It turns out the shape=len(df1) argument is contributing quite a bit to the run time, as it's fitting a parameter for every single observation. I'm trying to find another way around it, and I have a feeling an alternative will speed things back up.

Also, great points on all the priors; I appreciate the feedback.

You are right, we don’t need that pm.Normal(). The below works for me and is much quicker, taking just a few minutes to sample.

with pm.Model() as m2:
    #a = pm.Gamma('a', 5, 1.5)
    #b = pm.Beta('b', 8, 41)
    
    g, A, s = trend_model(m2, df1['t'], changepoints_prior_scale=0.2, n_changepoints=4)
    x_monthly, beta_monthly = seasonality_model(m2, df1, 'yearly')
    x_daily, beta_daily = seasonality_model(m2, df1, 'weekly')
    x_hourly, beta_hourly = seasonality_model(m2, df1, 'daily')

    y = pm.Deterministic('y', g + det_dot(x_monthly, beta_monthly)
                              + det_dot(x_daily, beta_daily)
                              + det_dot(x_hourly, beta_hourly))
    
    #sigma = pm.HalfCauchy('sigma', 10.)
    #seasonality = pm.Normal('seasonality', mu=y, sigma=sigma, shape=len(df1))
    
    #lam = pm.Gamma('lam', a, b)
    lam = pm.HalfNormal('lam', 20.)
    obs = pm.Poisson('obs', lam + y, observed=df1.y)
    
    trace = pm.sample(1000)
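To sanity check the fit afterwards (a standard PyMC3 step, not something from the thread):

pm.summary(trace)  # inspect r_hat and effective sample sizes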

Is it reasonable to change the model structure like this? We know that the data comes from an overdispersed Poisson process, so I don't think this makes much sense.

PyMC3 does have a NegativeBinomial distribution if you want to model the overdispersion directly.
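For example, a minimal sketch of swapping the likelihood in the model above (the HalfNormal prior on alpha is just an illustrative choice, not something tested in this thread):

with pm.Model() as m3:
    g, A, s = trend_model(m3, df1['t'], changepoints_prior_scale=0.2, n_changepoints=4)
    x_monthly, beta_monthly = seasonality_model(m3, df1, 'yearly')
    x_daily, beta_daily = seasonality_model(m3, df1, 'weekly')
    x_hourly, beta_hourly = seasonality_model(m3, df1, 'daily')

    y = pm.Deterministic('y', g + det_dot(x_monthly, beta_monthly)
                              + det_dot(x_daily, beta_daily)
                              + det_dot(x_hourly, beta_hourly))

    lam = pm.HalfNormal('lam', 20.)
    # alpha controls the overdispersion: Var(obs) = mu + mu**2 / alpha
    alpha = pm.HalfNormal('alpha', 10.)
    # as with the Poisson version, mu = lam + y must stay positive
    obs = pm.NegativeBinomial('obs', mu=lam + y, alpha=alpha, observed=df1.y)

    trace = pm.sample(1000)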
