Hi,
I’m working on modeling how many times an event takes place in a timeframe (like new-customers-per-week).
# Simulate a year of weekly event counts from a known rate.
import numpy as np
import pandas as pd
from scipy.stats import poisson

real_mu = 3
real_dist = poisson(real_mu)
counts = real_dist.rvs(52)
cumulative = counts.cumsum()
dt = pd.DataFrame({"events": counts, "cumulative": cumulative})
# Fit a Poisson rate to the observed weekly counts.
import pymc as pm

mdl = pm.Model()
with mdl:
    mu = pm.Uniform("mu", 0, 10)
    count = pm.Poisson("count", mu, observed=dt["events"])
    trace = pm.sample()
    post = pm.sample_posterior_predictive(trace)
I’d like to be able to report the likely cumulative counts (e.g. how many new customers we’ll have after 10 weeks).
I believe I’ve come up with a way to generate these with a fairly brute-force technique: simulate a lot of possible weekly counts and add up the possible cumulative totals. It seems pretty gnarly, though.
# Sample a bunch of possible event counts for each of the timesteps I'm interested in.
from scipy.stats import gaussian_kde

possible_mus = gaussian_kde(trace.posterior["mu"].stack(n=["chain", "draw"]).values)
timesteps = range(52)
weekly_counts = []
for t in timesteps:
    # Draw 10 plausible rates from the posterior, then one count per rate.
    mus = possible_mus.resample(10)[0]
    counts = poisson(mus).rvs()
    weekly_counts.append(counts)
# Then calculate cumulative counts by adding each week's possible counts to a running
# total of the previous week's possible cumulative totals.
import random

cumulative = []
for idx, week in enumerate(weekly_counts):
    # For each week, add up all the prior possible counts.
    print("calculating week", idx)
    if idx == 0:
        cumulative.append(list(week))
    else:
        prev = cumulative[-1]
        curr = [i + j for i in week for j in prev]
        # The list of possible cumulative values explodes in size, so I randomly
        # sample it to keep it manageable.
        max_size = 100000
        if len(curr) > max_size:
            curr = random.sample(curr, max_size)
        print("  adding", len(curr), "possible cumulative counts")
        cumulative.append(curr)
# Turn it into a dataframe of 20th/50th/80th percentiles per week.
summary = {"t": [], "min": [], "med": [], "max": []}
for i, c in enumerate(cumulative):
    summary["t"].append(i)
    summary["min"].append(np.quantile(c, 0.2))
    summary["med"].append(np.quantile(c, 0.5))
    summary["max"].append(np.quantile(c, 0.8))
summary = pd.DataFrame.from_dict(summary)
# The plot looks pretty good, I think, with growing uncertainty over time, which is what I'd expect.
import matplotlib.pyplot as plt

plt.plot(summary["t"], summary["min"], color="green", label="20th percentile")
plt.plot(summary["t"], summary["med"], color="blue", label="median")
plt.plot(summary["t"], summary["max"], color="red", label="80th percentile")
plt.title("Possible cumulative event counts")
plt.legend()
I’d appreciate feedback about this approach. But I’d also really like to know if there’s a better way to calculate this. I can calculate the distribution of the count of some event with pymc, so can pymc also help me calculate the distribution of the cumulative sum of that count variable?
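To make the question concrete, here's the kind of shortcut I'm imagining, as a rough sketch only: take the posterior predictive draws of count and cumulative-sum each draw along the week dimension, then summarize across draws. I'm not at all sure this is statistically the right thing to do, and the names pp_counts / pp_cumulative below are just placeholders of mine (this assumes pm.sample_posterior_predictive returned an InferenceData object, which I believe is the default in recent PyMC versions):

# A sketch only: cumulative-sum each posterior predictive draw over weeks,
# then take quantiles across chains and draws.
pp_counts = post.posterior_predictive["count"].values   # shape (chain, draw, week)
pp_cumulative = pp_counts.cumsum(axis=-1)               # running total within each draw
q20, q50, q80 = np.quantile(pp_cumulative, [0.2, 0.5, 0.8], axis=(0, 1))

Is something along those lines even valid, or is there a more direct way?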
Thanks!