New to probabilistic programming, and after trying to find a solution to a problem I’m working on, I can’t figure out how to move forward.

I’m trying to model how many people will have been exposed to something over time, in order to predict out into the future.

As a toy example, let’s say I have a website, and I change my logo on the website on Jan. 1, and I want to know: how many users will have seen the new logo (*at least once*) by Mar. 31. Let’s say I observe 2 weeks of data, so I have the number of visitors *who had not visited yet this year* every day from Jan. 1 through Jan 14. And users are not homogeneous: some come every day, and some come once a week, and some come once a month, etc. So, I’m modeling this problem as:

- All users have a probability of visiting the site on a given day (each day is independent), drawn from a beta distribution; this is intrinsic to the user and doesn’t change over time
- Every day we flip a coin with probability p_i for each user; if we get
*heads*they visit the site (and see the new logo, for the first, or second, …, or nth time), and if we get*tails*they do not - The
*first*success (visit the site) is the day they are*newly exposed*; in other words we can model the*days until first exposed*as a negative binomial model (failures until first success)

Here’s some generated data:

```
true_pvisit_α = .2
true_pvisit_β = .5
total_pop = 10_000 # True pop is many millions
n_days = 14
MAX_DAYS = 90 # I want to predict how many users will be
# newly exposed each day, up to MAX_DAYS days out
df = pd.DataFrame({
'date': pd.date_range('1/1/2020', periods=n_days, freq='D'),
'newly_exposed': np.zeros(n_days)
})
with pm.Model() as true_model:
true_pvisit = pm.Beta('true_pvisit', alpha=true_pvisit_α, beta=true_pvisit_β, shape=total_pop)
# Coin flip for everyone in the population, for every day in data
true_has_visit = pm.Binomial('true_has_visit', n=1, p=true_pvisit, shape=(n_days, total_pop))
true_visits = pm.sample_prior_predictive(samples=1, random_seed=42)['true_has_visit']
# If they visit at all, get the first day of visit; otherwise use -1
first_visit_days = np.where(true_visits.max(axis=0) > 0,
true_visits.argmax(axis=0),
-1)
# Counts per day of first visit (excluding non-visitors)
first_visitors = np.unique(first_visit_days, return_counts=True)[1][1:]
df.loc[:first_visitors.shape[0]-1, 'newly_exposed'] = first_visitors
print(df)
```

And the resulting data:

```
date newly_exposed
0 2020-01-01 2861
1 2020-01-02 849
2 2020-01-03 497
3 2020-01-04 298
4 2020-01-05 231
5 2020-01-06 170
6 2020-01-07 147
7 2020-01-08 129
8 2020-01-09 113
9 2020-01-10 104
10 2020-01-11 82
11 2020-01-12 97
12 2020-01-13 80
13 2020-01-14 60
```

The problem I’m having is that I only observe the *totals*: that is, the number of *newly* exposed users each day.

So I’ve gotten this far in terms of defining the model:

```
with pm.Model() as model:
p_trig_α = pm.Exponential('p_trig_α', 10)
p_trig_β = pm.Exponential('p_trig_β', 10)
p_trig = pm.Beta('p_trig', alpha=p_trig_α, beta=p_trig_β, shape=total_pop)
# α is days observed for each user, which is just the full period
days_α = df_.days.max() - df_.days.min()
# μ = mean of neg. bin.
μ = pm.Deterministic('μ', p_trig * days_α / (1 - p_trig))
days = pm.NegativeBinomial('days', alpha = days_α, mu=μ, shape=total_pop)
```

But now I’m stuck. Where do I put the observations? The `days`

RV has a lot of very large numbers, which makes sense since some users will never be exposed, but I can’t figure out how to do something like:

```
obs = pm.Something('obs', np.unique(np.where(days < MAX_DAYS, days, np.nan), return_counts=True)[1], observed=data)
```

It seems like this would be easy to model as a like Poisson if all the users had the same probability of visiting, but that doesn’t capture the fact that we exhaust the frequent visitor population pretty soon, and then have a long tail of infrequent visitors. I’ve also tried to model this as a survival analysis problem, but I need to predict data going forward, so for example with https://docs.pymc.io/notebooks/survival_analysis.html I’d need to come up with a way to make tau a function of the *days since the logo change*, which seems to me like it puts me right back in the same place, since that’s exactly what I’m trying to do above, fundamentally.

Thanks for any pointers!