Calculating the cumulative sum of a modeled random count variable

Hi,

I’m working on modeling how many times an event takes place in a timeframe (like new-customers-per-week).

real_mu = 3
real_dist = poisson(real_mu)
counts = real_dist.rvs([52])
cumulative = counts.cumsum()
cumulative
dt = pd.DataFrame({"events":counts, "cumulative": cumulative})

mdl = pm.Model()

with mdl:
    mu = pm.Uniform("mu",0,10)
    count = pm.Poisson("count", mu, observed=dt['events'])
    trace = pm.sample()
    post = pm.sample_posterior_predictive(trace)

I’d like to be able to report the likely cumulative counts (like how many new customers will we have after 10 weeks).

I believe I’ve come up with a way to generate these with a fairly brute-force technique of simulating a lot of observed counts and adding up the possible cumulative counts. It seems pretty gnarly though.

# Sample a bunch of possible event counts for each of the timesteps I'm interested in.
possible_mus = gaussian_kde(trace.posterior['mu'].stack(n=['chain','draw']))
timesteps = range(0,52)
dt = []
for t in timesteps:
    mus = possible_mus.resample(10)
    counts = poisson(mus).rvs(10)
    dt.append(counts)

# Then calculate cumulative counts by adding the week's possible counts to a running total of the previous week's possible cumulative totals.
cumulative = []
for idx, week in enumerate(dt):
    # for each week, add up all the prior possible counts.
    print("calculating week",idx)
    if idx==0:
        cumulative.append(week)
    else:
        prev = cumulative[-1]
        curr = [i+j for i in week for j in prev]
        # The size of the list of possible cumulative values explodes, so i randomly sample it to keep it to a manageable size.
        max_size = 100000
        if len(curr) > max_size:
            curr = random.sample(curr, max_size)
        print(" Adding ", len(curr), " possible cumulative counts")
        cumulative.append(curr)
        
# Turn it into a dataframe
dt = {"t":[], "min":[],"med":[],"max":[]}
for i, c in enumerate(cumulative):
    dt["t"].append(i)
    dt["min"].append(np.quantile(c, 0.2))
    dt["med"].append(np.quantile(c, 0.5))
    dt["max"].append(np.quantile(c, 0.8))
    
dt = pd.DataFrame.from_dict(dt)

# The plot looks pretty good, i think, with growing uncertainty over time, which is what I'd expect.
plt.plot(dt.index, dt["min"], color = "green", "Most")
plt.plot(dt.index, dt["med"], color= "blue", "Median")
plt.plot(dt.index, dt["max"], color="red", "Fewest")
plt.title("Possible cumulative event counts")

I’d appreciate feedback about this approach. But I’d also really like to know if there’s a better way to calculate this. I can calculate the distribution of the count of some event with pymc, so can pymc also help me calculate the distribution of the cumulative sum of that count variable?

Thanks!

Hello,

It seems like you might be able to just use the .cumsum() method, together with the posterior samples you have already sampled in post, to save yourself a lot of work. For example:

(post.posterior_predictive['count']
             .cumsum(dim='count_dim_0')
             .stack(sample = ['chain', 'draw'])
             .quantile([0.05, 0.5, 0.95], dim='sample')
             .plot
             .line(x='count_dim_0'));

Here, I picked out the predicted counts from the posterior predictive, took the cumulative sum over the “count_dim_0” dimension (which is the time dimension, it’s useful to rename these using coords!), stacked all the samples from all chains into a single dimension called “sample”, computed quantiles, then plotted. Here are the results:

I don’t think it necessary to look at every possible combination of trajectories as you do, because your samples are all independent in time anyway. If you need more trajectories, rather than building them by permuting existing trajectories, it is easier to just draw more samples from posterior. In this case you can do either, but be aware that if you were to add some kind of time-series dynamics to your model, it would not be legit to jump across trajectories like this.

You also don’t need to manually build the posterior by sampling mu from a kde then sampling a poisson, you already have all this using post = pm.sample_posterior_predictive(trace).

Thanks! I didn’t realize I’d be able to see the uncertainty in the cumulative sum with a cumulative sum of the posterior predictive.

2 Likes