Exponential, Logistic and Generative Bayesian Models with Gamma Prior

Hi everyone,
Apologies if this question is trivial. I am building on Thomas Wiecki's work on Bayesian modeling of COVID-19. I was successful with pymc3 but ran into issues with pymc. My code is below; I'd be grateful for any help. Link to my github

Welcome!

I can't quite tell what the issue is here. Can you provide a small piece of code that doesn't do what you want? You can also check out the FAQ on how to maximize the chances of getting help, and highlight what your specific problem is.

Dear Christian,
Thanks for the reply, I am very grateful. I am modeling COVID-19 cases in the Mano River Union in West Africa, and I want to apply Bayesian statistics to exponential, logistic, and generative models with Gamma priors. My challenge is with pymc: the code ran under pymc3, but under pymc I get error messages, and I want to port the models to pymc. Attached are my files from Google Colab.
Kindly assist me.
Best regards

import numpy as np
import matplotlib.pyplot as plt

def C(K, r, t, C_0):
  A = (K-C_0)/C_0
  return K / (1 + A * np.exp(-r * t))

def d_C(K, r, t, C_0):
  c = C(K, r, t, C_0)
  return r * c * (1 - c / K)

r = 0.2
K = 1000
C_0 = 100

t = np.linspace(1,60)

fig, ax = plt.subplots(1, 2, figsize=(12,5))
for r in np.linspace(0.1, 0.5, 5):
  ax[0].plot(t, C(K, r, t, C_0), label=f"r = {r:0.2f}")
ax[0].legend(loc="best")
ax[0].set(xlabel="Days since...", ylabel="Cases")
for K in np.linspace(100, 10000, 10):
  ax[1].plot(t, C(K, r, t, C_0), label=f"K = {K:0.1f}")
ax[1].legend(loc="best")
ax[1].set(xlabel="Days since...", ylabel="Cases")

fig, ax = plt.subplots(1, 2, figsize=(12,5))
for r in np.linspace(0.1, 0.5, 5):
  ax[0].plot(t, d_C(K, r, t, C_0), label=f"r = {r:0.2f}")
ax[0].legend(loc="best")
ax[0].set(xlabel="Days since...", ylabel="Cases")
for K in np.linspace(100, 10000, 10):
  ax[1].plot(t, d_C(K, r, t, C_0), label=f"K = {K:0.1f}")
ax[1].legend(loc="best")
ax[1].set(xlabel="Days since...", ylabel="Cases")

!pip install pymc3


!pip install arviz==0.11.0

import numpy as np
import pymc3 as pm
import pandas as pd
import matplotlib.pyplot as plt
import theano

# Load data
df = pd.read_csv("https://opendata.ecdc.europa.eu/covid19/casedistribution/csv/", parse_dates=["dateRep"], infer_datetime_format=True, dayfirst=True)
df = df.rename(columns={'dateRep': 'date', 'countriesAndTerritories': 'country'}) # Sane column names
df = df.drop(["day", "month", "year", "geoId"], axis=1) # Not required


country = "Sierra_Leone"
# Filter for country (probably want separate models per country, even maybe per region)
sorted_country = df[df["country"] == country].sort_values(by="date")
# Cumulative sum of data
country_cumsum = sorted_country[["cases", "deaths"]].cumsum().set_index(sorted_country["date"])
# Filter out data with less than 100 cases
country_cumsum = country_cumsum[country_cumsum["cases"] >= 100]
days_since_100 = range(len(country_cumsum))

# Pull out population size per country
populations = {key: df[df["country"] == key].iloc[0]["popData2019"] for key in df["country"].unique()}

def model_factory(country: str, x: np.ndarray, y: np.ndarray):
  with pm.Model() as model:
    t = pm.Data(country + "x_data", x)
    confirmed_cases = pm.Data(country + "y_data", y)

    # Intercept - We fixed this at 100.
    C_0 = pm.Normal("C_0", mu=100, sigma=10)

    # Growth rate: 0.2 is approx value reported by others
    r = pm.Normal("r", mu=0.2, sigma=0.5)

    # Total number of cases. Depends on the population, more people, more infections.
    # It also depends on the total number of people infected.
    # Start with a prior based upon a guess of the initial population, very weak
    proportion_infected = 5e-05 # This value comes from the rough projection that 80000 will be infected in China
    p = populations[country]
    K = pm.Normal("K", mu=p * proportion_infected, sigma=p*0.1)

    # Logistic growth curve
    growth = C(K, r, t, C_0)

    # Likelihood error
    eps = pm.HalfNormal("eps")

    # Likelihood - counts here, so Poisson or negative binomial would be natural, but they caused issues; Lognormal tends to work better.
    pm.Lognormal(country, mu=np.log(growth), sigma=eps, observed=confirmed_cases)
  return model
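For reference, here is a minimal sketch of how the same factory might look ported to modern pymc (v4/v5), where PyTensor replaces theano, pm.MutableData replaces pm.Data, and pm.math replaces the numpy calls inside the model. This is an illustration of the migration pattern under those assumptions, not code tested against this exact notebook:

import pymc as pm
import arviz as az

def model_factory_v5(country: str, x, y, population: float):
  with pm.Model() as model:
    t = pm.MutableData(country + "_x_data", x)
    confirmed_cases = pm.MutableData(country + "_y_data", y)

    C_0 = pm.Normal("C_0", mu=100, sigma=10)
    r = pm.Normal("r", mu=0.2, sigma=0.5)
    proportion_infected = 5e-05
    K = pm.Normal("K", mu=population * proportion_infected, sigma=population * 0.1)

    # Logistic growth curve written with pm.math so it works on PyTensor tensors
    A = (K - C_0) / C_0
    growth = K / (1 + A * pm.math.exp(-r * t))

    eps = pm.HalfNormal("eps")
    pm.Lognormal(country, mu=pm.math.log(growth), sigma=eps, observed=confirmed_cases)
  return model

# In pymc, pm.sample returns an ArviZ InferenceData object by default, and the
# old pm.traceplot / pm.plot_posterior calls move to ArviZ:
# with model_factory_v5(country, train_x, train_y, populations[country]):
#     idata = pm.sample()
#     az.plot_trace(idata)
#     az.plot_posterior(idata)
#     ppc = pm.sample_posterior_predictive(idata)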

r = 0.2
K = 10000 # This should look something a bit like China
C_0 = 100
t = np.linspace(1,60)
y_synthetic = C(K, r, t, C_0)
plt.plot(t, y_synthetic)
plt.gca().set_yscale("log")

# Training
with model_factory(country, t, y_synthetic) as model:
    train_trace = pm.sample()
    pm.traceplot(train_trace)
    pm.plot_posterior(train_trace)
    ppc = pm.sample_posterior_predictive(train_trace)
    fig, ax = plt.subplots(figsize=(10, 8))
    ax.plot(t, ppc[country].T, ".k", alpha=0.05)
    ax.plot(t, y_synthetic, color="r")
    ax.set_yscale("log")
    ax.set(xlabel="Days since 100 cases", ylabel="Confirmed Cases", title=f"{country} - Posterior predictive on the training set")

train_x = days_since_100[:-5]
train_y = country_cumsum["cases"].astype('float64').values[:-5]
hold_out_x = days_since_100[-5:]
hold_out_y = country_cumsum["cases"].astype('float64').values[-5:]
# Training
with model_factory(country, train_x, train_y) as model:
    train_trace = pm.sample()
    pm.traceplot(train_trace)
    pm.plot_posterior(train_trace)
    ppc = pm.sample_posterior_predictive(train_trace)
    fig, ax = plt.subplots(figsize=(10, 8))
    ax.plot(train_x, ppc[country].T, ".k", alpha=0.05)
    ax.plot(train_x, train_y, color="r")
    ax.set_yscale("log")
    ax.set(xlabel="Date", ylabel="Confirmed Cases", title=f"{country} - Posterior predictive on the training set")

with model_factory(country, hold_out_x, hold_out_y) as test_model:
    ppc = pm.sample_posterior_predictive(train_trace)
    fig, ax = plt.subplots(figsize=(10, 8))
    ax.plot(hold_out_x, ppc[country].T, ".k", alpha=0.05)
    ax.plot(hold_out_x, hold_out_y, color="r")
    ax.set_yscale("log")
    ax.set(xlabel="Date", ylabel="Confirmed Cases", title=f"{country} - Posterior predictive on the holdout set")

# Generate the predicted number of cases (assuming the output is normally distributed)
predicted_cases = ppc[country].mean(axis=0).round()
print(predicted_cases)
def error(actual, predicted):
  return predicted - actual

def print_errors(actuals, predictions):
  for n in [1, 5]:
    act, pred = actuals[n-1], predictions[n-1]
    err = error(act, pred)
    print(f"{n}-day cumulative prediction error: {err} cases ({100 * err / act:.1f} %)")

print_errors(hold_out_y, predicted_cases)

new_x = [hold_out_x[-1] + 1, hold_out_x[-1] + 5]
new_y = [0, 0]
# Predictive model
with model_factory(country, new_x, new_y) as test_model:
    ppc = pm.sample_posterior_predictive(train_trace)
predicted_cases = ppc[country].mean(axis=0).round()
print("\n")
print(f"Based upon this model, tomorrow's number of cases will be {predicted_cases[0]}. In 5 days time there will be {predicted_cases[1]} cases.")

country = "Liberia"
# Filter for country (probably want separate models per country, even maybe per region)
sorted_country = df[df["country"] == country].sort_values(by="date")
# Cumulative sum of data
country_cumsum = sorted_country[["cases", "deaths"]].cumsum().set_index(sorted_country["date"])
# Filter out data with less than 100 cases
country_cumsum = country_cumsum[country_cumsum["cases"] >= 100]
days_since_100 = range(len(country_cumsum))

# Pull out population size per country
populations = {key: df[df["country"] == key].iloc[0]["popData2019"] for key in df["country"].unique()}
train_x = days_since_100[:-5]
train_y = country_cumsum["cases"].astype('float64').values[:-5]
hold_out_x = days_since_100[-5:]
hold_out_y = country_cumsum["cases"].astype('float64').values[-5:]
# Training
with model_factory(country, train_x, train_y) as model:
    train_trace = pm.sample()
    pm.traceplot(train_trace)
    pm.plot_posterior(train_trace)
    ppc = pm.sample_posterior_predictive(train_trace)
    fig, ax = plt.subplots(figsize=(10, 8))
    ax.plot(train_x, ppc[country].T, ".k", alpha=0.05)
    ax.plot(train_x, train_y, color="r")
    ax.set_yscale("log")
    ax.set(xlabel="Date", ylabel="Confirmed Cases", title=f"{country} - Posterior predictive on the training set")

with model_factory(country, hold_out_x, hold_out_y) as test_model:
    ppc = pm.sample_posterior_predictive(train_trace)
    fig, ax = plt.subplots(figsize=(10, 8))
    ax.plot(hold_out_x, ppc[country].T, ".k", alpha=0.05)
    ax.plot(hold_out_x, hold_out_y, color="r")
    ax.set_yscale("log")
    ax.set(xlabel="Date", ylabel="Confirmed Cases", title=f"{country} - Posterior predictive on the holdout set")





# Generate the predicted number of cases (assuming the output is normally distributed)
predicted_cases = ppc[country].mean(axis=0).round()
print(predicted_cases)
def error(actual, predicted):
  return predicted - actual

def print_errors(actuals, predictions):
  for n in [1, 5]:
    act, pred = actuals[n-1], predictions[n-1]
    err = error(act, pred)
    print(f"{n}-day cumulative prediction error: {err} cases ({100 * err / act:.1f} %)")

print_errors(hold_out_y, predicted_cases)

new_x = [hold_out_x[-1] + 1, hold_out_x[-1] + 5]
new_y = [0, 0]
# Predictive model
with model_factory(country, new_x, new_y) as test_model:
    ppc = pm.sample_posterior_predictive(train_trace)
predicted_cases = ppc[country].mean(axis=0).round()
print("\n")
print(f"Based upon this model, tomorrow's number of cases will be {predicted_cases[0]}. In 5 days time there will be {predicted_cases[1]} cases.")

country = "Guinea"
# Filter for country (probably want separate models per country, even maybe per region)
sorted_country = df[df["country"] == country].sort_values(by="date")
# Cumulative sum of data
country_cumsum = sorted_country[["cases", "deaths"]].cumsum().set_index(sorted_country["date"])
# Filter out data with less than 100 cases
country_cumsum = country_cumsum[country_cumsum["cases"] >= 100]
days_since_100 = range(len(country_cumsum))

# Pull out population size per country
populations = {key: df[df["country"] == key].iloc[0]["popData2019"] for key in df["country"].unique()}
train_x = days_since_100[:-5]
train_y = country_cumsum["cases"].astype('float64').values[:-5]
hold_out_x = days_since_100[-5:]
hold_out_y = country_cumsum["cases"].astype('float64').values[-5:]
# Training
with model_factory(country, train_x, train_y) as model:
    train_trace = pm.sample()
    pm.traceplot(train_trace)
    pm.plot_posterior(train_trace)
    ppc = pm.sample_posterior_predictive(train_trace)
    fig, ax = plt.subplots(figsize=(10, 8))
    ax.plot(train_x, ppc[country].T, ".k", alpha=0.05)
    ax.plot(train_x, train_y, color="r")
    ax.set_yscale("log")
    ax.set(xlabel="Date", ylabel="Confirmed Cases", title=f"{country} - Posterior predictive on the training set")



with model_factory(country, hold_out_x, hold_out_y) as test_model:
    ppc = pm.sample_posterior_predictive(train_trace)
    fig, ax = plt.subplots(figsize=(10, 8))
    ax.plot(hold_out_x, ppc[country].T, ".k", alpha=0.05)
    ax.plot(hold_out_x, hold_out_y, color="r")
    ax.set_yscale("log")
    ax.set(xlabel="Date", ylabel="Confirmed Cases", title=f"{country} - Posterior predictive on the holdout set")

# Generate the predicted number of cases (assuming the output is normally distributed)
predicted_cases = ppc[country].mean(axis=0).round()
print(predicted_cases)
def error(actual, predicted):
  return predicted - actual

def print_errors(actuals, predictions):
  for n in [1, 5]:
    act, pred = actuals[n-1], predictions[n-1]
    err = error(act, pred)
    print(f"{n}-day cumulative prediction error: {err} cases ({100 * err / act:.1f} %)")

print_errors(hold_out_y, predicted_cases)

Generative Bayesian modeling of the COVID-19 pandemic: The model underlying rt.live

(c) 2020 Thomas Wiecki

This blog post explains the model powering https://rt.live with its nowcast of the effective reproduction number. The model was developed by Kevin Systrom and Thomas Vladeck, with technical support provided by various PyMC3 developers, including myself.

The model underwent several revisions and will continue to improve. However, I believe that the model is at a point where it is valuable to explain it to the world. First, because I believe it is one of the best models currently out there, but also because it is a great case-study for generative Bayesian modeling where several challenges needed to be overcome. This post describes version 1.0.0 of the model.

But let’s start with the basics.

What is Rt?

R_0 (“R-naught”) describes the reproduction factor of a disease – i.e. how many other people one infected person passes the disease to. If this quantity is larger than 1 we have an epidemic on our hands, as is the case for COVID-19.

R_0 assumes, however, that no counter-measures are being implemented to curb the spread of the virus. Thus, the more critical measure to track is R_e(t) – the time-varying effective reproduction factor, i.e. how many people one infected person infects on a given day t.

As lockdowns and social distancing measures are put in place, we expect this quantity to drop, ideally below the critical value of 1, because then, over time, the disease would just whimper out.

Usually we’d extract R_e(t) from something like an SIR (susceptible-infected-recovered) model or an SEIR (which adds an exposed category), which are classic epidemiological compartment models.

An SIR model is in fact what rt.live used in the beginning. However, SIR/SEIR models are also just approximations of the real thing and come with quite a few assumptions baked in. The current model is simpler and makes fewer assumptions. In addition, the SIR model is described by an ODE, which causes various technical problems: solving the ODE is quite time-intensive, and while closed-form approximations exist and are faster, we found them to be quite unreliable.

Instead, the current model uses a simple generative logic to explain how an initial pool of infected people spreads the disease at each time-point, according to the current reproduction factor.

The generative model

A core property of Bayesian modeling is that the models we tend to build are generative. That means we are defining how some hidden, unobservable causes we are interested in give rise to observable data. This aspect isn’t always so obvious with Bayesian models (think regression models) but it is a very powerful and intuitive concept once understood.

The rt.live model makes this generative aspect particularly obvious. So in this post I will explain how the model links hidden causes to observable data, which we can then invert using Bayes' formula to infer the most likely hidden causes given certain observed data. For this reason, Bayesian statistics used to be called “inverse probability”.

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import pymc3 as pm
import pandas as pd
import numpy as np
import seaborn as sns
sns.set_context('talk')
from scipy import stats
from matplotlib import pyplot as plt
from covid.models.generative import GenerativeModel
from covid.patients import get_delay_distribution
from covid.data import summarize_inference_data, get_and_process_covidtracking_data
%config InlineBackend.figure_format = 'retina'

df = get_and_process_covidtracking_data(
    run_date=pd.Timestamp.today() - pd.Timedelta(days=1))

Let’s assume that for an idealized disease we start with a single infected patient (primary infection) on day 0, who is infectious for a single day only and on that day goes on to infect 2 people (secondary infections), who become sick the next day. This disease thus has a reproduction factor R_0 of 2. We could write that on day t the number of newly infected y_t is:

$$ y_t = y_{t-1} \cdot R_0 $$

Quite simple. This logic gives rise to the classic exponential growth pattern we see in epidemics:

Code optimized for readability, not speed

n_days = 10
ts = np.arange(n_days)
R0 = 2
y = np.zeros(n_days)

y[0] = 1 # starting with 1 infected
for t in ts[1:]:
    y[t] = y[t-1] * R0

fig, ax = plt.subplots(figsize=(12, 8))
ax.plot(ts, y);
ax.set(xlabel='time', ylabel='newly infected',
       title='Growth of idealized disease with R_0 = 2'); sns.despine();

However, as we discussed, we care more about the effective reproduction rate as a function of time, R_e(t). We can just swap that into our idealized generative model:

$$ y_t = y_{t-1} \cdot R_e(t) $$

n_days = 10
ts = np.arange(n_days)
Rt = np.linspace(2, 1, n_days) # Assuming Re(t) goes from 2 to 1
y = np.zeros(n_days)

y[0] = 1 # starting with 1 infected
for t in ts[1:]:
    y[t] = y[t-1] * Rt[t]

fig, ax = plt.subplots(figsize=(12, 8))
ax.plot(ts, y);
ax.set(xlabel='time', ylabel='newly infected',
       title='Growth of idealized disease with R_e(t) going from 2 to 1');
sns.despine();

You can see that on the last day, where R_e(t) is 1, we get the same number of newly infected as on the previous day, because every infected person infects just one more person.

The implicit assumption in the generative process above is that an infected person is only infectious for a single day and that it then takes just one day to infect other people.

In reality, the time it takes for the primary person to infect others follows a distribution. They might infect one person the next day, two the day after etc. This delay distribution is officially known as the “generation time” and we will model it with a probability distribution from this study (the study actually provides an estimate for something closely related to the generation time but it’s fine to use it for this).

The paper includes a simple formulation of how to derive the generation time distribution for COVID-19:

mean_si = 4.7
std_si = 2.9
mu_si = np.log(mean_si ** 2 / np.sqrt(std_si ** 2 + mean_si ** 2))
sigma_si = np.sqrt(np.log(std_si ** 2 / mean_si ** 2 + 1))
generation_time = stats.lognorm(scale=np.exp(mu_si), s=sigma_si)

fig, ax = plt.subplots(figsize=(12, 8))
ax.plot(ts, generation_time.pdf(ts))
ax.set(xlabel='days from one person infecting another',
       ylabel='probability density',
       title='Generation time'); sns.despine();

In order to include this effect in our generative model we need to do a convolution. Intuitively, instead of the new cases on day t depending only on the new cases on day t-1, they now depend on the new cases on (potentially) all previous days, because it could have taken, say, 5 days between the time a person got infected and the time they infected another person. We need to take all of these previously infected people into account, along with the probability with which they infect people today.

We accomplish this by weighting the number of newly infected people i days ago – y_{t-i} – by the generation time g_i for that particular delay as well as the effective reproduction number on that day R_e(t-i):

$$ y_t = \sum_{i=1}^{M}y_{t-i} R_e(t-i) g_i $$

For further details on this generative process see this post: Effective reproduction number estimation.

Updating our generative process model accordingly we get:

n_days = 30
ts = np.arange(0, n_days)
Rt = np.linspace(2, 1, n_days) # Assuming Re(t) goes from 2 to 1
y = np.zeros(n_days)
y[0] = 1 # starting with 1 infected

for t in ts[1:]:
    # loop over previous days
    for i in range(1, t+1):
        y[t] += y[t - i] * Rt[t - i] * generation_time.pdf(i)

fig, ax = plt.subplots(figsize=(12, 8))
ax.plot(ts, y);
ax.set(xlabel='time', ylabel='newly infected',
       title='Growth of idealized disease with R_e(t) going from 2 to 1');
sns.despine();

As you can see, taking the delay between one person passing the disease onto the next into account slows the spread significantly. The longer the generation time, the slower the spread.

Getting to the number of infected

OK, we’re getting close now. We have a generative model of how people transmit the disease from one person to the next. However, we don’t have data on when people transmitted the disease; we have data on who got a positive test result. So we need to delay this function even further, by when an infected patient actually shows up as a positive test in our data.

To do this we will use the distribution of the delay between infection and a confirmed positive test, also known as the onset delay distribution. To estimate this distribution we can use data from the Open COVID-19 Data Working Group, which asked COVID-19 patients how long ago their symptoms started (Xu et al., 2020). However, symptom onset isn’t the same as when someone got infected, because there is an incubation period during which the virus is spreading in the body while no symptoms are noticeable. To fix this we just add an incubation period to the beginning of the onset delay distribution (rt.live assumes 5 days). You can see this offset in the flat region from days 1-5 in the next plot.
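As a toy illustration of that shift (the symptom-to-test delay curve below is a made-up gamma shape, purely hypothetical; the real distribution comes from get_delay_distribution()):

incubation_days = 5  # rt.live's assumed incubation offset
days = np.arange(25)
symptom_to_test = stats.gamma(a=3, scale=2).pdf(days)  # hypothetical delay shape
symptom_to_test /= symptom_to_test.sum()  # normalize to a probability distribution
# Prepend zeros: no positive test can occur during the incubation period
p_delay_toy = np.concatenate([np.zeros(incubation_days), symptom_to_test])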

p_delay = get_delay_distribution()
ax = p_delay.plot(figsize=(12, 8))
ax.set(title="Onset delay distribution",
       ylabel="Probability of getting positive test",
       xlabel="Days since infection");
sns.despine();

Now all we need to do is convolve the function we have above (how many people got infected each day) with the onset delay distribution to get how many people will show up with a positive test on each day.

fig, ax = plt.subplots(figsize=(12, 8))
ax.plot(ts, np.convolve(y, p_delay)[:n_days])
ax.set(xlabel='day', ylabel='number of positive tests',
       title='Growth of number of positive tests');
sns.despine();

Adjusting for number of tests performed

So we’re done, right? We have a generative process that gives rise to the data we actually observe (number of positive tests on every day) and we’d just need to write this in PyMC3 and hit the inference button™.

Yes, almost. There is one last thing included in the model that is quite nifty. When looking at the raw number of positive tests, it’s clear that that number will be influenced by how many people you tested: the more you test, the more cases you will uncover.

This is important to model because there is huge variability in the number of tests being done over time (tests ramp up generally as more testing capacity is created, and usually fewer tests are done on weekends). This would bias our estimate of R_e(t) if not accounted for.

Thus, the last trick in the model is to multiply the number of positive tests from the generative process by the test exposure e_t (a normalized quantity proportional to the number of tests performed). Intuitively, if we test twice as much, we expect twice as many positive tests to show up.

Thus, the expected number of positive tests \tilde{z_t} will be:

$$ \tilde{z_t} = z_t \cdot e_t $$

where z_t is the output of the generative model with the delays applied.
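As a toy numerical example of this adjustment (both series below are made up):

tests_performed = np.array([1000., 1200., 800., 2000., 1800.])  # hypothetical daily test counts
exposure = tests_performed / tests_performed.max()  # e_t: normalized test exposure
z = np.array([10., 14., 18., 25., 33.])  # hypothetical generative-model output z_t
z_tilde = z * exposure  # expected number of positive tests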

Summarizing the generative process

  1. Primary infection occurs (this is the time-point we want Rt to relate to).
  2. Generation time passes until secondary infection occurs.
  3. Onset time passes until the secondary infected person develops symptoms and tests positive. This gives the number of positive tests we’d expect if testing were constant.
  4. Multiply the number of positive tests (if tests were constant) by the testing exposure to get the number of expected positives. This is the model output we use to fit the data.

The Bayesian model in PyMC3

The model in PyMC3 follows the above generative process directly and you can check it out in the source.

I’m not going to describe the code here, as there are a few technicalities that make it not so easy to look at, mostly related to the recursive function defined above. But don’t be discouraged by that and take a look for yourself; if you know a bit of PyMC3 and Theano you should be able to figure it out quickly with the above logic.

The main addition is that we place a random-walk prior on R_e(t), which injects into the model the knowledge that the reproduction factor does not change hugely from day to day. If you want to learn more about random-walk priors you can check out my blog post on non-stationary Bayesian neural networks.
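For illustration, the core of such a prior can be written in a few lines of PyMC3. This is a minimal sketch of the idea with a made-up innovation scale, not the exact parameterization used in the rt.live source:

with pm.Model():
    # Small daily innovations keep R_e(t) from jumping wildly between days
    log_rt = pm.GaussianRandomWalk("log_rt", sigma=0.035, shape=n_days)
    rt = pm.Deterministic("rt", pm.math.exp(log_rt))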

Instead, let’s see how to actually use the model interactively. As our example region we use Massachusetts.

region = "MA"
model_data = df.loc[region]

To run the model we just instantiate it and call .sample():

gm = GenerativeModel(region, model_data)
gm.sample()

result = summarize_inference_data(gm.inference_data)

fig, ax = plt.subplots(figsize=(12, 8))
result.infections.plot(c="C2", label="Expected primary infections")
result.test_adjusted_positive.plot(c="C0", label="Expected positive tests if tests were constant")
result.test_adjusted_positive_raw.plot(c="C1", alpha=.5, label="Expected positive tests", style="--")
gm.observed.positive.plot(c="C7", alpha=.7, label="Reported positive tests")
fig.set_facecolor("w")
ax.legend();
ax.set(title=f"rt.live model inference for {region}", ylabel="number of cases")
sns.despine();

The plot above shows exactly what’s going on, but let’s apply the logic backwards this time. First, we have the observed data of “Reported positive tests” in grey. This is what our model tries to explain with the “Expected positive tests”, the output of the model, which is derived from the “Expected positive tests if tests were constant”. That number we convolve with the onset delay distribution to get to the infection onset, i.e. when people actually got infected. From there we use the generation time distribution to arrive at when the actual transmission of the disease occurred.

As you can see, we are using our domain knowledge of the disease and testing mechanics as well as derived distributions that describe these mechanics to back out the thing we are actually interested in but can’t observe directly. A beautiful example of the power of statistical inference.

Our R_e(t) estimate is then based on when the actual transmission occurred, not when we observed the positive test:

fig, ax = plt.subplots(figsize=(12, 8))

ax.set(title=f"Effective reproduciton number for {region}“, ylabel=”R_e(t)")
samples = gm.trace[“r_t”]
x = result.index
cmap = plt.get_cmap(“Reds”)
percs = np.linspace(51, 99, 40)
colors = (percs - np.min(percs)) / (np.max(percs) - np.min(percs))
samples = samples.T

result[“median”].plot(c=“k”, ls=‘-’)

for i, p in enumerate(percs[::-1]):
upper = np.percentile(samples, p, axis=1)
lower = np.percentile(samples, 100-p, axis=1)
color_val = colors[i]
ax.fill_between(x, upper, lower, color=cmap(color_val), alpha=.8)

ax.axhline(1.0, c=“k”, lw=1, linestyle=“–”)
sns.despine();

What’s next?

The model handles quite a few issues that most other models do not. But like any model there are many ways to improve it. Including death data as well as hospitalizations will certainly improve the R_e(t) estimate. So would a hierarchical model that pools information across states.

Of course, we would also like to apply the model to other countries. The main difficulty here is that we do not have data on the number of tests performed for most countries. Michael Osthege has run the model successfully for Germany, however.

Another really interesting direction is inclusion of covariates, like lock-downs, cell-phone data, social distancing, mask-usage etc.

Interested in helping out? The model as well as the dashboard are open source, and contributions are welcome: GitHub - rtcovidlive/covid-model.