Observing the aggregation of a variable, not the variable itself

I'm new to probabilistic programming, and after trying to find a solution to a problem I'm working on, I can't figure out how to move forward.

I’m trying to model how many people will have been exposed to something over time, in order to predict out into the future.

As a toy example, say I have a website and I change its logo on Jan. 1, and I want to know how many users will have seen the new logo (at least once) by Mar. 31. Suppose I observe two weeks of data, so for each day from Jan. 1 through Jan. 14 I have the number of visitors who had not visited yet that year, i.e. the number of users newly exposed to the logo that day. Users are not homogeneous: some come every day, some once a week, some once a month, and so on. So I'm modeling this problem as:

  • Every user has a probability p_i of visiting the site on any given day (days are independent), drawn from a beta distribution; this probability is intrinsic to the user and doesn't change over time
  • Every day we flip a coin with probability p_i for each user; if we get heads they visit the site (and see the new logo for the first, second, …, or nth time), and if we get tails they do not
  • The first success (visiting the site) is the day they are newly exposed; in other words, the days until first exposure are the failures until the first success (a geometric distribution, the simplest case of the negative binomial); a small sketch of what this implies for the daily counts follows this list
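
If I've got the math right, integrating each user's p out of the Beta(α, β) prior gives the probability that a user's first visit falls on day k (counting the day of the change as day 0) as B(α + 1, β + k) / B(α, β), so the expected daily counts under the true parameters can be checked directly (a quick scipy sketch, just a sanity check, not part of the model):

import numpy as np
from scipy.special import beta as beta_fn

# P(first visit falls on day k), k = 0, 1, 2, ...,
# when each user's daily visit probability is drawn from Beta(α, β)
α, β, N = 0.2, 0.5, 10_000
k = np.arange(14)
p_first_visit = beta_fn(α + 1, β + k) / beta_fn(α, β)
print(np.round(N * p_first_visit))  # roughly matches the generated counts below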

Here’s some generated data:

import numpy as np
import pandas as pd
import pymc3 as pm

true_pvisit_α = .2
true_pvisit_β = .5
total_pop = 10_000  # True pop is many millions
n_days = 14
MAX_DAYS = 90  # I want to predict how many users will be 
               # newly exposed each day, up to MAX_DAYS days out

df = pd.DataFrame({
    'date': pd.date_range('1/1/2020', periods=n_days, freq='D'),
    'newly_exposed': np.zeros(n_days)
})

with pm.Model() as true_model:
    true_pvisit = pm.Beta('true_pvisit', alpha=true_pvisit_α, beta=true_pvisit_β, shape=total_pop)
    # Coin flip for everyone in the population, for every day in data
    true_has_visit = pm.Binomial('true_has_visit', n=1, p=true_pvisit, shape=(n_days, total_pop))
    true_visits = pm.sample_prior_predictive(samples=1, random_seed=42)['true_has_visit']
    # If they visit at all, get the first day of visit; otherwise use -1
    first_visit_days = np.where(true_visits.max(axis=0) > 0, 
                                true_visits.argmax(axis=0),
                                -1)
    # Counts per day of first visit (excluding non-visitors)
    first_visitors = np.unique(first_visit_days, return_counts=True)[1][1:]

df.loc[:first_visitors.shape[0]-1, 'newly_exposed'] = first_visitors
print(df)

And the resulting data:

         date  newly_exposed
0  2020-01-01           2861
1  2020-01-02            849
2  2020-01-03            497
3  2020-01-04            298
4  2020-01-05            231
5  2020-01-06            170
6  2020-01-07            147
7  2020-01-08            129
8  2020-01-09            113
9  2020-01-10            104
10 2020-01-11             82
11 2020-01-12             97
12 2020-01-13             80
13 2020-01-14             60

The problem I’m having is that I only observe the totals: that is, the number of newly exposed users each day.

So I’ve gotten this far in terms of defining the model:

with pm.Model() as model:
    p_trig_α = pm.Exponential('p_trig_α', 10)
    p_trig_β = pm.Exponential('p_trig_β', 10)
    p_trig = pm.Beta('p_trig', alpha=p_trig_α, beta=p_trig_β, shape=total_pop)

    # α is the number of days observed for each user, which is just the full period
    days_α = (df.date.max() - df.date.min()).days
    # μ = mean of the negative binomial
    μ = pm.Deterministic('μ', p_trig * days_α / (1 - p_trig))
    days = pm.NegativeBinomial('days', alpha=days_α, mu=μ, shape=total_pop)

But now I’m stuck. Where do I put the observations? The days RV has a lot of very large numbers, which makes sense since some users will never be exposed, but I can’t figure out how to do something like:

obs = pm.Something('obs', np.unique(np.where(days < MAX_DAYS, days, np.nan), return_counts=True)[1], observed=data)
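
On draws of days I can compute counts in the shape of my observations outside the model, something like the numpy sketch below (on prior samples only; as far as I can tell this kind of binning step isn't something I can condition on inside the model, which is exactly my problem):

with model:
    days_draw = pm.sample_prior_predictive(samples=1, random_seed=42)['days'].ravel()
# Bin each user's simulated day of first exposure into daily counts,
# dropping users whose first exposure falls beyond the prediction horizon
simulated_counts = np.bincount(days_draw[days_draw < MAX_DAYS], minlength=MAX_DAYS)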

It seems like this would be easy to model as something like a Poisson if all the users had the same probability of visiting (see the sketch just after this paragraph), but that doesn't capture the fact that we exhaust the frequent-visitor population early on and are then left with a long tail of infrequent visitors. I've also tried to frame this as a survival analysis problem, but I need to predict forward in time, so for example with https://docs.pymc.io/notebooks/survival_analysis.html I'd need a way to make tau a function of the days since the logo change, which seems to put me right back where I started, since that's exactly what I'm trying to do above.
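
For reference, the homogeneous version I have in mind looks something like this (a sketch only; with a single shared p, everyone who hasn't visited yet has the same chance of showing up, so the expected number of newly exposed users on day k is just total_pop * (1 - p)^k * p):

with pm.Model() as homogeneous_model:
    # One shared daily visit probability for every user (the simplification I don't want)
    p_visit = pm.Beta('p_visit', alpha=1, beta=1)
    day_idx = np.arange(n_days)
    # Expected newly exposed on day k: miss k days in a row, then visit
    expected_new = total_pop * (1 - p_visit) ** day_idx * p_visit
    obs = pm.Poisson('obs', mu=expected_new, observed=df.newly_exposed.values.astype(int))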

Thanks for any pointers!