Regression model sampling solely one sample every 5th second

steward · August 25, 2022, 2:21pm

After trying to get pymc3 to work on windows 10 for several days without luck i decided to try v4. The installation seemed to work and i was no longer facing the “using numpy c-api based implementation for blas functions” . warning.

The main issue faced when using pymc3 was the speed, 1 sample was drawn every 4/5 seconds for my model using the NUTS-sampler.

The setup of the model before sampling usually took like 10 minutes or more.

The same issue appears to still be apperant for me in v4 making me think that the model itself is the problem. This is even more apparent if you let it run and get posteriors which are really far from the truth(the simulated values).

Maybe someone can point out whats the problem, it is probably the model but could also be my computer not matching up but most likely its the model due to my inexperience

**UPDATE: . I tried to run the code in GLM: Negative Binomial Regression — PyMC example gallery which took 177 seconds on my computer compared to their 47, quite a large difference still.
First off i post the code and below you find the installations so you can reproduce:

import numpy as np
import math
import pandas as pd
from dateutil.tz import gettz
from datetime import datetime
import pymc as pm
from typing import List, Dict, Union
import aesara as theano
import aesara.tensor as tt
import arviz as az
import matplotlib.pyplot as plt

def make_seasonality(
    series_len, seasonality, method="discrete", order=3, duration=1, scale=0.05, seed=1
):
    if seasonality > 1:
        if method == "fourier":
            t = np.arange(0, series_len)
            out = []
            for i in range(1, order + 1):
                x = 2.0 * i * np.pi * t / seasonality
                out.append(np.sin(x))
                out.append(np.cos(x))
            out = np.column_stack(out)
            b = np.random.default_rng(seed).normal(0, scale, order * 2)
            seas = np.matmul(out, b)
        else:
            # initialization
            seas = []
            iterations = math.ceil(series_len / duration)
            # initialize vector to be repeated
            init_seas = np.zeros(seasonality)
            init_seas[:-1] = np.random.default_rng(seed).normal(
                0, scale, seasonality - 1
            )
            init_seas[seasonality - 1] = -1 * np.sum(init_seas)
            for idx in range(iterations):
                seas += [init_seas[idx % seasonality]] * duration
            seas = np.array(seas[:series_len])
    else:
        seas = np.zeros(series_len)
    return seas

def generatorcustom(n, verbose=True):
    def adstock(x: List, theta: float):
        x_decayed = [0 for _ in x]
        x_decayed[0] = x[0]

        for xi in range(1, len(x_decayed)):
            x_decayed[xi] = x[xi] + theta * x_decayed[xi - 1]

        return x_decayed

    def saturation(x, theta):
        return np.array(x) ** theta

    seed = 42
    np.random.seed(seed)
    # make the time varing coefs
    tau = np.arange(1, n + 1) / n
    data = pd.DataFrame({
        'tau': tau,
        'date': pd.date_range(start='1/1/2018', periods=n),
        'saturationx1': 0.34,
        'saturationx2': 0.65,
        'adstockx1': 0.8,
        'adstockx2': 0.32,
        'x1': 0 + np.abs(np.random.normal(400, 200, size=n)), #np.random.normal(0, 10, size=n),
        'x2': 0 + np.abs(np.random.normal(400, 200, size=n)), #np.random.normal(0, 10, size=n),
        'trend': np.cumsum(np.concatenate((np.array([1]), np.random.normal(0, 0.1, n - 1)))),
        'seasonality7': make_seasonality(series_len=n, seasonality=7, duration=1, method='discrete', seed=2020),
        'seasonality30': make_seasonality(series_len=n, seasonality=30, duration=1, method='discrete', seed=2020),
    })

    x1_ = data['x1']
    adstockx1_ = data['adstockx1']
    saturationx1_ = data['saturationx1']
    x2_ = data['x2']
    adstockx2_ = data['adstockx2']
    saturationx2_ = data['saturationx2']

    x1adstocked = adstock(x1_, adstockx1_[0])
    x1 = saturation(x1adstocked, saturationx1_)
    x2adstocked = adstock(x2_, adstockx2_[0])
    x2 = saturation(x2adstocked, saturationx2_)

    response = math.e ** (data['trend']) * math.e ** (data['seasonality7'] + data['seasonality30']) * x1 * x2

    df = pd.DataFrame.from_dict(data={'x1': x1, 'x2': x2, 'response': response, 'date': data['date']})
    df.index = df['date']

    df.plot()
    plt.show()
    df['response'].plot()
    plt.show()
    df['x1'].plot()
    plt.show()
    df['x2'].plot()
    plt.show()

    return df

def removeleadingandtrailingnans(frame: pd.DataFrame, indicator: Union[List, str]="facost") -> pd.DataFrame:
    # remove leading and trailing NaNs
    return frame.loc[frame[indicator].first_valid_index():frame[indicator].last_valid_index()]

class TorchStandardScaler:
    def __init__(self, with_std=False):
        self.with_std = with_std

    def fit(self, x):
        self.mean = x.mean()
        self.std = x.std()

    def transform(self, x):
        x -= self.mean
        if not self.with_std:
            return x
        x /= (self.std + 1e-7)
        return x

    def fittransform(self, x):
        self.fit(x=x)
        return self.transform(x=x)

    def inverse_transform(self, x):
        x += self.mean
        if not self.with_std:
            return x
        x *= (self.std + 1e-7)
        return x

def getprocessedframe(frame, nowianatimezone, horizon, applylog=True, applynormalizing=True, applylogtotalconv=True):
    if 'totalconvresponse' not in frame:
        frame['totalconvresponse'] = frame['goconvresponse'] + frame['faconvresponse']

    frame = removeleadingandtrailingnans(frame=frame,
                                         indicator='facost')
    frame = removeleadingandtrailingnans(frame=frame,
                                         indicator='gocost')

    frame = frame[frame['totalconvresponse'].notna()]  # changes the coefficients completely w.r.t go and fb

    s = pd.DatetimeIndex(frame.index)
    frame.index = s
    daterange = pd.date_range(start=s.tolist()[0],
                              end=s.tolist()[-1])
    frame = frame.reindex(daterange)

    frame['date'] = frame.index

    frame = frame.fillna(0.0)
    frame['day'] = pd.to_datetime(frame.index).day.tolist()
    frame['month'] = pd.to_datetime(frame.index).month.tolist()
    data = frame
    data['facost'] += 1
    data['gocost'] += 1
    data['day'] += 1
    data['month'] += 1
    data['totalconvresponse'] += 1
    data['roas'] = data['totalconvresponse'] / (data['facost'] + data['gocost'])
    if applylogtotalconv:
        frame[['totalconvresponse']] = frame[['totalconvresponse']].apply(np.log)
    if applylog:
        regressors = ['facost', 'gocost']
        frame[["totalconvresponse"] + regressors] = frame[["totalconvresponse"] + regressors].apply(np.log)
    if applynormalizing:
        frame['unprocessedfacost'] = frame['facost'].values
        fascaler = TorchStandardScaler(with_std=False)
        frame['facost'] = fascaler.fittransform(frame['facost'].values.reshape(-1, 1))

        frame['unprocessedgocost'] = frame['gocost'].values
        goscaler = TorchStandardScaler(with_std=False)
        frame['gocost'] = goscaler.fittransform(frame['gocost'].values.reshape(-1, 1))
    else:
        fascaler = None
        goscaler = None
    return data, (fascaler, goscaler)


def adstock_geometric_theano_pymc3(x, theta):
    x = tt.as_tensor_variable(x)

    def adstock_geometric_recurrence_theano(index, input_x, decay_x, theta):
        return tt.set_subtensor(decay_x[index], tt.sum(input_x + theta * decay_x[index - 1]))

    len_observed = x.shape[0]

    x_decayed = tt.zeros_like(x)
    x_decayed = tt.set_subtensor(x_decayed[0], x[0])

    output, _ = theano.scan(
        fn=adstock_geometric_recurrence_theano,
        sequences=[tt.arange(1, len_observed), x[1:len_observed]],
        outputs_info=x_decayed,
        non_sequences=theta,
        n_steps=len_observed - 1
    )

    return output[-1]


def runknot(frame, nowianatimezone,
            horizon=30, target='totalconvresponse'):

    data, (_, _) = getprocessedframe(frame=frame,
                                     nowianatimezone=nowianatimezone,
                                     horizon=horizon,
                                     applynormalizing=True,
                                     applylog=True,
                                     applylogtotalconv=False)

    data['dayofweek'] = data.index.day_of_week.tolist()
    data['dayinmonth'] = data.index.day.tolist()

    n_order = 3
    periods_weekday = data["dayofweek"] / 7
    fourier_features_weekday = pd.DataFrame(
        {
            f"{func}_order_{order}": getattr(np, func)(2 * np.pi * periods_weekday * order)
            for order in range(1, n_order + 1)
            for func in ("sin", "cos")
        }
    )

    n_order = 3
    periods_dayinmonth = data['dayinmonth'] / 30
    fourier_features_dayinmonth = pd.DataFrame(
        {
            f"{func}_order_{order}": getattr(np, func)(2 * np.pi * periods_dayinmonth * order)
            for order in range(1, n_order + 1)
            for func in ("sin", "cos")
        }
    )

    t = np.array((data.index - data.index.min()) / (data.index.max() - data.index.min()))

    START_ANALYSIS_INDEX = data.index.tolist()[0]
    END_ANALYSIS_INDEX = data.index.tolist()[-1]

    knots = ['25-2', '2-9', '9-16', '16-25']  # 4 weeks split

    response_mean = []

    date = data['date'].to_numpy()
    coords = {'date': date, 'fourier_mode': np.arange(2 * n_order)}

    with pm.Model(coords=coords) as model_2:
        facost = data['unprocessedfacost'].values
        gocost = data['unprocessedgocost'].values

        adstockfa = pm.Beta('faadstock', 3, 3)
        adstockgo = pm.Beta('goadstock', 3, 3)

        facostadstocked = adstock_geometric_theano_pymc3(x=facost, theta=adstockfa)
        falogged = np.log1p(facostadstocked)
        fafullyprocessed = falogged / facost.mean()

        gocostadstocked = adstock_geometric_theano_pymc3(x=gocost, theta=adstockgo)
        gologged = np.log1p(gocostadstocked)
        gofullyprocessed = gologged / gocost.mean()

        saturationsgo = [1 for _ in gocost]
        saturationsfa = [1 for _ in facost]

        # seasonality weekday
        b_fourier_weekday = pm.Laplace(name="b_fourier_weekday", mu=0, b=2,
                                       dims="fourier_mode")

        # seasonality dayinmonth
        b_fourier_dayinmonth = pm.Laplace(name="b_fourier_dayinmonth", mu=0, b=2,
                                          dims="fourier_mode")

        # trend
        b_trend = pm.Normal(name="b_trend", mu=0, sigma=2)

        for knot in knots:
            firstidx, lastidx = (int(x) for x in knot.split('-'))
            if firstidx > lastidx:
                mask = (data.index.day > firstidx) | (data.index.day <= lastidx)
            else:
                mask = (data.index.day > firstidx) & (data.index.day <= lastidx)

            saturationgo = pm.Beta(f'gosaturation-knot-{knot}', 3, 3)
            saturationfa = pm.Beta(f'fasaturation-knot-{knot}', 3, 3)
            indexes = np.nonzero(mask)[0]
            for idx in indexes:
                saturationsgo[idx] = saturationgo
                saturationsfa[idx] = saturationfa

        fa = fafullyprocessed.dot(saturationsfa)
        go = gofullyprocessed.dot(saturationsgo)

        response_mean.append(fa)
        response_mean.append(go)

        trend = pm.Deterministic(name="trend", var=b_trend * t, dims="date")
        seasonalityweekday = pm.Deterministic(name="seasonality_dayinweek",
                                              var=pm.math.dot(fourier_features_weekday, b_fourier_weekday),
                                              dims="date")
        seasonalitydayinmonth = pm.Deterministic(name="seasonality_dayinmonth",
                                                 var=pm.math.dot(fourier_features_dayinmonth, b_fourier_dayinmonth),
                                                 dims="date")
        mu = pm.Deterministic(name="mu",
                              var=np.sum(response_mean) + trend + seasonalityweekday + seasonalitydayinmonth,
                              dims="date")

        sigma = pm.HalfNormal(name="sigma", sigma=10)

        nu = pm.Gamma(name="nu", alpha=25, beta=2)

        likelihood = pm.StudentT(name="outcome",
                                 nu=nu,
                                 mu=mu,
                                 sigma=sigma,
                                 observed=data['totalconvresponse'],
                                 dims="date")

    with model_2:
        idata = pm.sample(1000,
                          cores=1,
                          # tune=1000,
                          # step=None,
                          # target_accept=0.95,
                          return_inferencedata=True)

        idata.extend(pm.sample_prior_predictive())
        idata.extend(pm.sample_posterior_predictive(idata))

    import arviz as az
    az.plot_trace(idata)
    az.plot_forest(idata)
    az.plot_posterior(idata)

def run():
    df = generatorcustom(n=500)
    df['facost'] = df['x1']
    df['gocost'] = df['x2']
    df['totalconvresponse'] = df['response']

    now = datetime.now(tz=gettz(name='Europe/Helsinki')).date()

    runknot(frame=df,
            nowianatimezone=now,
            horizon=30)

if __name__ == '__main__':
    run()

installations:

Name Version Build Channel

aeppl 0.0.35 pypi_0 pypi
aesara 2.8.2 py39h7171602_0 conda-forge
aesara-base 2.8.2 py39hcbf5309_0 conda-forge
arviz 0.12.1 pypi_0 pypi
blas 2.116 mkl conda-forge
blas-devel 3.9.0 16_win64_mkl conda-forge
bzip2 1.0.8 h8ffe710_4 conda-forge
ca-certificates 2022.6.15 h5b45459_0 conda-forge
cachetools 5.2.0 pypi_0 pypi
cairo 1.16.0 h0ac17fb_1012 conda-forge
cftime 1.6.1 pypi_0 pypi
cloudpickle 2.1.0 pypi_0 pypi
cons 0.4.5 pyhd8ed1ab_0 conda-forge
cycler 0.11.0 pypi_0 pypi
deprecat 2.1.1 pypi_0 pypi
dill 0.3.5.1 pypi_0 pypi
etuples 0.3.5 pyhd8ed1ab_0 conda-forge
expat 2.4.8 h39d44d4_0 conda-forge
fastprogress 1.0.3 pypi_0 pypi
filelock 3.8.0 pyhd8ed1ab_0 conda-forge
font-ttf-dejavu-sans-mono 2.37 hab24e00_0 conda-forge
font-ttf-inconsolata 3.000 h77eed37_0 conda-forge
font-ttf-source-code-pro 2.038 h77eed37_0 conda-forge
font-ttf-ubuntu 0.83 hab24e00_0 conda-forge
fontconfig 2.14.0 hce3cb01_0 conda-forge
fonts-conda-ecosystem 1 0 conda-forge
fonts-conda-forge 1 0 conda-forge
fonttools 4.37.1 pypi_0 pypi
freetype 2.12.1 h546665d_0 conda-forge
fribidi 1.0.10 h8d14728_0 conda-forge
getopt-win32 0.1 h8ffe710_0 conda-forge
gettext 0.19.8.1 ha2e2712_1008 conda-forge
graphite2 1.3.13 1000 conda-forge
graphviz 5.0.1 had6c3a3_0 conda-forge
gts 0.7.6 h7c369d9_2 conda-forge
harfbuzz 5.1.0 h27de254_0 conda-forge
icu 70.1 h0e60522_0 conda-forge
intel-openmp 2022.1.0 h57928b3_3787 conda-forge
jpeg 9e h8ffe710_2 conda-forge
kiwisolver 1.4.4 pypi_0 pypi
lerc 4.0.0 h63175ca_0 conda-forge
libblas 3.9.0 16_win64_mkl conda-forge
libcblas 3.9.0 16_win64_mkl conda-forge
libdeflate 1.13 h8ffe710_0 conda-forge
libffi 3.4.2 h8ffe710_5 conda-forge
libgd 2.3.3 h891f43f_3 conda-forge
libglib 2.72.1 h3be07f2_0 conda-forge
libiconv 1.16 he774522_0 conda-forge
liblapack 3.9.0 16_win64_mkl conda-forge
liblapacke 3.9.0 16_win64_mkl conda-forge
libpng 1.6.37 h1d00b33_4 conda-forge
libpython 2.2 py39hcbf5309_1 conda-forge
libsqlite 3.39.2 h8ffe710_1 conda-forge
libtiff 4.4.0 h92677e6_3 conda-forge
libwebp 1.2.4 h8ffe710_0 conda-forge
libwebp-base 1.2.4 h8ffe710_0 conda-forge
libxcb 1.13 hcd874cb_1004 conda-forge
libzlib 1.2.12 h8ffe710_2 conda-forge
llvmlite 0.38.1 py39ha0cd8c8_0 conda-forge
logical-unification 0.4.5 pyhd8ed1ab_0 conda-forge
m2w64-binutils 2.25.1 5 conda-forge
m2w64-bzip2 1.0.6 6 conda-forge
m2w64-crt-git 5.0.0.4636.2595836 2 conda-forge
m2w64-gcc 5.3.0 6 conda-forge
m2w64-gcc-ada 5.3.0 6 conda-forge
m2w64-gcc-fortran 5.3.0 6 conda-forge
m2w64-gcc-libgfortran 5.3.0 6 conda-forge
m2w64-gcc-libs 5.3.0 7 conda-forge
m2w64-gcc-libs-core 5.3.0 7 conda-forge
m2w64-gcc-objc 5.3.0 6 conda-forge
m2w64-gmp 6.1.0 2 conda-forge
m2w64-headers-git 5.0.0.4636.c0ad18a 2 conda-forge
m2w64-isl 0.16.1 2 conda-forge
m2w64-libiconv 1.14 6 conda-forge
m2w64-libmangle-git 5.0.0.4509.2e5a9a2 2 conda-forge
m2w64-libwinpthread-git 5.0.0.4634.697f757 2 conda-forge
m2w64-make 4.1.2351.a80a8b8 2 conda-forge
m2w64-mpc 1.0.3 3 conda-forge
m2w64-mpfr 3.1.4 4 conda-forge
m2w64-pkg-config 0.29.1 2 conda-forge
m2w64-toolchain 5.3.0 7 conda-forge
m2w64-toolchain_win-64 2.4.0 0 conda-forge
m2w64-tools-git 5.0.0.4592.90b8472 2 conda-forge
m2w64-windows-default-manifest 6.4 3 conda-forge
m2w64-winpthreads-git 5.0.0.4634.697f757 2 conda-forge
m2w64-zlib 1.2.8 10 conda-forge
matplotlib 3.5.3 pypi_0 pypi
minikanren 1.0.3 pyhd8ed1ab_0 conda-forge
mkl 2022.1.0 h6a75c08_874 conda-forge
mkl-devel 2022.1.0 h57928b3_875 conda-forge
mkl-include 2022.1.0 h6a75c08_874 conda-forge
mkl-service 2.4.0 py39h0fb9bb3_0 conda-forge
msys2-conda-epoch 20160418 1 conda-forge
multipledispatch 0.6.0 py_0 conda-forge
netcdf4 1.6.0 pypi_0 pypi
numba 0.55.2 py39hb8cd55e_0 conda-forge
numpy 1.22.1 pypi_0 pypi
openssl 3.0.5 h8ffe710_1 conda-forge
packaging 21.3 pypi_0 pypi
pandas 1.4.3 pypi_0 pypi
pango 1.50.9 h480d202_0 conda-forge
patsy 0.5.2 pypi_0 pypi
pcre 8.45 h0e60522_0 conda-forge
pillow 9.2.0 pypi_0 pypi
pip 22.2.2 pyhd8ed1ab_0 conda-forge
pixman 0.40.0 h8ffe710_0 conda-forge
pthread-stubs 0.4 hcd874cb_1001 conda-forge
pymc 4.1.6 pypi_0 pypi
pymc3 3.11.5 pypi_0 pypi
pyparsing 3.0.9 pypi_0 pypi
python 3.9.13 hcf16a7b_0_cpython conda-forge
python-dateutil 2.8.2 pypi_0 pypi
python-graphviz 0.20.1 pyh22cad53_0 conda-forge
python_abi 3.9 2_cp39 conda-forge
pytz 2022.2.1 pypi_0 pypi
scipy 1.7.3 pypi_0 pypi
semver 2.13.0 pypi_0 pypi
setuptools 65.3.0 py39hcbf5309_0 conda-forge
six 1.16.0 pyh6c4a22f_0 conda-forge
sqlite 3.39.2 h8ffe710_1 conda-forge
statsmodels 0.13.2 pypi_0 pypi
tbb 2021.5.0 h2d74725_1 conda-forge
theano-pymc 1.1.2 pypi_0 pypi
tk 8.6.12 h8ffe710_0 conda-forge
toolz 0.12.0 pyhd8ed1ab_0 conda-forge
typing_extensions 4.3.0 pyha770c72_0 conda-forge
tzdata 2022c h191b570_0 conda-forge
ucrt 10.0.20348.0 h57928b3_0 conda-forge
vc 14.2 hb210afc_6 conda-forge
vs2015_runtime 14.29.30037 h902a5da_6 conda-forge
wheel 0.37.1 pyhd8ed1ab_0 conda-forge
wrapt 1.14.1 pypi_0 pypi
xarray 2022.6.0 pypi_0 pypi
xarray-einstats 0.3.0 pypi_0 pypi
xorg-kbproto 1.0.7 hcd874cb_1002 conda-forge
xorg-libice 1.0.10 hcd874cb_0 conda-forge
xorg-libsm 1.2.3 hcd874cb_1000 conda-forge
xorg-libx11 1.7.2 hcd874cb_0 conda-forge
xorg-libxau 1.0.9 hcd874cb_0 conda-forge
xorg-libxdmcp 1.1.3 hcd874cb_0 conda-forge
xorg-libxext 1.3.4 hcd874cb_1 conda-forge
xorg-libxpm 3.5.13 hcd874cb_0 conda-forge
xorg-libxt 1.2.1 hcd874cb_2 conda-forge
xorg-xextproto 7.3.0 hcd874cb_1002 conda-forge
xorg-xproto 7.0.31 hcd874cb_1007 conda-forge
xz 5.2.6 h8d14728_0 conda-forge
zlib 1.2.12 h8ffe710_2 conda-forge
zstd 1.5.2 h7755175_4 conda-forge

steward · August 27, 2022, 9:59am

one obvious error in the code is
df = pd.DataFrame.from_dict(data={‘x1’: x1, ‘x2’: x2, ‘response’: response, ‘date’: data[‘date’]})
x1 and x2 is already adstocked and saturated here…
should be:
df = pd.DataFrame.from_dict(data={‘x1’: data[‘x1’], ‘x2’: data[‘x2’], ‘response’: response, ‘date’: data[‘date’]})

will update how it effects the runtime… considering the mismatch of the data-generation-process and the model it may improve it a little bit.

steward · August 29, 2022, 12:14am

in general the model does not match the data-generation-process(the simulation) which is the reason behind the poor posterior estimates, reviewing the generated trend and the modelling of it in the model we can easily see a mismatch where as the simulated trend is much more complex than what the modelling of it in the model is able to capture, it effects the runtime aswell but not as drastic as one would believe.

Topic		Replies	Views
Time Varying Overdispersed Poisson Process Questions	7	943	August 28, 2020
I can't get pymc3 to work Questions bug	4	2588	January 15, 2021
Bayesian VAR example notebook: extremely low sampling rate	18	1423	May 23, 2023
Seeds: Random effect logistic regression Questions	31	1296	March 29, 2022
Poor Performance of pyMC5 vs pyMC3 for large number of variables Development	10	2139	January 25, 2023

Regression model sampling solely one sample every 5th second

Name Version Build Channel

Related topics