Modelling advice: how to estimate the true rate over time from many average rates over different time periods?

Hello, all. I have a bit of a modelling question (i.e., how to specify my problem correctly in PyMC), and would appreciate any advice the community can give me.

In a recent project, I was using Twitter’s free API to scrape tweets around certain hashtags, and some of those hashtags turned out to have a much higher average tweet rate than others. For example, IBM runs at roughly 2,000 tweets/day, whereas Apple (both the company and the fruit) produces upwards of 50,000 tweets/day.

In the end, my data looked like this:

“IBM --> 28k tweets over 14 days”,
“Apple --> 13 million tweets over 6 days”,
“Cisco --> 10k tweets over 10 days”,
“barnacles --> 36 million tweets over 3 days”,
…

Naturally, I wanted to model this in a way that would allow me to estimate, for any hashtag, the expected number of tweets over an arbitrary time period, based on a sample of that hashtag’s tweets over some number of days (such as the samples above).

So, if Cisco has 10k tweets/10 days, how many tweets will it have over 6 months?
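(Naively, 10k tweets / 10 days = 1,000 tweets/day, so roughly 1,000 × 180 ≈ 180,000 tweets over 6 months; what I really want is a principled uncertainty around that kind of extrapolation.)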

The normal distribution seems like a reasonable likelihood, since a mean for the estimate and a standard deviation for its uncertainty seem adequate, and the problem seems symmetric. I could then summarize each sample of data by a single rate (e.g., 1k tweets/day) and work with that somehow, but my main question is: wouldn’t that necessarily destroy the information about the variation of tweets within a day?
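For concreteness, here is a rough sketch of the kind of model I was imagining (the daily counts and priors below are made-up placeholders, not real data):

import numpy as np
import pymc3 as pm

# made-up daily counts for one hashtag (placeholder for real scraped data)
daily_counts = np.array([950., 1100., 1020., 980., 1210., 890., 1050.])

with pm.Model() as naive_model:
    mu = pm.Normal('mu', 1000., 500.)      # average tweets/day
    sigma = pm.HalfNormal('sigma', 500.)   # day-to-day spread
    obs = pm.Normal('obs', mu, sigma, observed=daily_counts)
    trace = pm.sample(1000, tune=1000)

# extrapolate the daily mean to a 6-month (~180-day) total
print(np.percentile(trace['mu'] * 180, [5, 50, 95]))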


There is clearly some variation in tweet counts over any potential time base (over a day, over an hour, over a 15-minute interval, …), and wouldn’t that need to be factored in somehow?

If anyone could give me any advice regarding this question, I would very greatly appreciate it.

Until then, I will be posting my own progress.

I wouldn’t say “destroy” so much as “ignore” or “gloss over.”
You are right that modeling at the daily level will gloss over any intra-day variation or patterns, but that might be OK, depending on what questions you are trying to answer. The one example question you provided (“if Cisco has 10k tweets/10 days, how many tweets will it have over 6 months?”) makes it sound like you are focused on tweet counts over longer periods of time, and therefore you could easily ignore intra-day patterns and model at the tweets/day or even tweets/week level.

As an analogy, most people budget monthly. Monthly spend is often front-loaded, due to rent or mortgage, but that’s usually safe to ignore. If you wanted to know why you went over-budget in a particular month, you might dig down into spend/week or spend/day. But if you’re just trying to understand avg. spend per month and month-to-month variation, you can safely aggregate to monthly frequency.

Also, you might look at the Poisson or negative binomial distributions instead of the normal.
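For example, a quick sketch of a Poisson model at the daily level might look like this (the counts and priors here are just placeholders):

import numpy as np
import pymc3 as pm

# placeholder daily counts for one hashtag
daily_counts = np.array([950, 1100, 1020, 980, 1210, 890, 1050])

with pm.Model() as poisson_daily:
    # expected tweets/day; weakly informative prior with mean ~2000
    rate = pm.Gamma('rate', alpha=2., beta=1./1000.)
    obs = pm.Poisson('obs', mu=rate, observed=daily_counts)
    # a negative binomial would add over-dispersion, e.g.:
    # obs = pm.NegativeBinomial('obs', mu=rate, alpha=pm.HalfNormal('disp', 10.), observed=daily_counts)
    trace = pm.sample(1000, tune=1000)

# implied total over ~180 days
print(np.percentile(trace['rate'] * 180, [5, 50, 95]))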

Hmm. Your distinction between “destroying” and “ignoring/glossing over” is very useful to me, thank you.

If, fundamentally (as you are implying, and I now agree), I am not committing any error by modelling tweets/day as the base and extrapolating that to the larger period I am investigating (6 months), would I nevertheless be more accurate if I chose the smallest period possible, such as tweets/15 minutes, and extrapolated that to 6 months?

Do you think measuring tweets/time at the smallest possible level would reduce the bias of the estimate?

Going to such an extreme with a small timescale, I imagine the variance will jump up in return, but I would appreciate thoughts on this regardless.
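As a quick sanity check of my own, I simulated a constant-rate process (plain numpy, purely hypothetical numbers) to see whether the bin size changes the estimate of the overall rate:

import numpy as np

rng = np.random.RandomState(0)
true_rate_per_day = 1000.
days = 14
bins_per_day = 96  # 15-minute bins

# simulate a constant-rate Poisson process at 15-minute resolution
counts_15min = rng.poisson(true_rate_per_day / bins_per_day, size=days * bins_per_day)
counts_daily = counts_15min.reshape(days, bins_per_day).sum(axis=1)

# both resolutions give the same estimate of tweets/day:
# total tweets divided by total days
print(counts_15min.mean() * bins_per_day)
print(counts_daily.mean())

At least in this over-simplified constant-rate case, the finer bins do not change the estimate of the overall rate at all; they would only matter once there is real structure within the day.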

Thanks for mentioning the Poisson and negative binomial distributions; I will look into those as well!

If you are going to model at the smallest possible level, you have to account for the phenomena at that level.

As a concrete example for your case: “barnacles” is an English word, so it is likely tweeted more when the English-speaking world is awake than when it is asleep. So if you try to estimate the tweets/15 minutes rate, you will have a bimodal distribution. E.g., the 15 minute periods during the day might have a mean rate of 20 tweets/15 minutes, and the 15 minute periods during the night might have a mean rate of 1 tweet/15 minutes. But if you aggregate to the weekly level, you will have a unimodal distribution.

So you can model at either level, but it’s easier to do so at the higher level with the unimodal distribution.

I suggest you start by plotting histograms of tweets/unit of time for a single hashtag, with unit of time in {15 mins, 1hr, 1day, 1 week, 1 month} and tell us what you see.
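Something along these lines would do it; here I generate fake timestamps so the snippet runs on its own, but in practice `timestamps` would come from your scraped data:

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

# fake ~1,000 tweets/day over two weeks, standing in for real scraped timestamps
rng = np.random.RandomState(1)
offsets_s = rng.uniform(0, 14 * 24 * 3600, size=14 * 1000)
timestamps = pd.to_datetime('2019-01-01') + pd.to_timedelta(offsets_s, unit='s')

tweets = pd.Series(1, index=timestamps).sort_index()

# histogram of counts per bin, for several bin widths
for freq in ['15T', '1H', '1D']:
    counts = tweets.resample(freq).sum()
    plt.figure()
    plt.hist(counts, bins=30)
    plt.title('tweets per {}'.format(freq))
plt.show()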

I might take a Poisson process approach to these data. Denote by

r_t^{(i)}

the instantaneous tweet rate for hashtag i; then the number of observed tweets in a given period looks like

N(t_b, t_e) \sim \mathrm{Pois}(r_{b,e}), \qquad r_{b,e} = \int_{t_b}^{t_e} r_t \, dt
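In particular, if the rate is a constant r over the window, this reduces to r_{b,e} = r \cdot (t_e - t_b), i.e. the simple rate-times-duration extrapolation you would do by hand.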

You’ve got a resolution of 1 minute (maybe 1 second); so in that case you’d have

N_t^{(i)} \sim \mathrm{Pois}(r_t^{(i)})

for the t-th minute (or second?) and i-th hashtag. Under this framework, the modeling all has to do with the functional terms you introduce into r_t, such as \cos(2\pi t / 60) for an hourly effect when t is measured in minutes. Just to play around:

import pymc3 as pm
import numpy as np
import theano
import theano.tensor as tt
from matplotlib import pyplot as plt

# let a "tick" be one second
DAYS=7
SEC_MIN=60
MIN_HR=60
HR_DAY=24

N_TICK = DAYS*HR_DAY*MIN_HR*SEC_MIN


# The test functions below use cos(pi * (freq * t + phase)), so a frequency f
# corresponds to a period of 2/f ticks: 4 / ticks-per-day gives a 12-hour cycle
# and 2 / ticks-per-day gives a 24-hour cycle.
HALF_DAILY_FREQ = 4./(HR_DAY*MIN_HR*SEC_MIN)
DAILY_FREQ = HALF_DAILY_FREQ/2

# phase offsets for each hour of the day, expressed as fractions of a day
HOURLY_PHASES = SEC_MIN*MIN_HR*np.arange(24, dtype=np.float32)/(HR_DAY*MIN_HR*SEC_MIN)

# amplitudes of the half-daily harmonics; only the hour-20 phase is switched on
ALPHAS = [0.]*24
ALPHAS[20] = 1e-3

# a strong daily "happy hour" component, phased around hour 17
alpha_happy_hour = 5e-1

# baseline rate, in tweets per tick (i.e., per second)
base_rate = 3e-3

time_index = np.arange(N_TICK)
with pm.Model() as mod:
    # per-tick noise on the log rate
    rate_noise = pm.Normal('noise', 0, base_rate/2., shape=N_TICK)
    rate = np.log(base_rate) + rate_noise
    # half-daily harmonics, one per hourly phase
    for i in range(len(ALPHAS)):
        test_fx = tt.cos(np.pi * (HALF_DAILY_FREQ * time_index + HOURLY_PHASES[i]))
        rate = rate + test_fx * ALPHAS[i]
    # daily "happy hour" component
    test_fx = tt.cos(np.pi * (DAILY_FREQ * time_index + HOURLY_PHASES[17]))
    rate = rate + test_fx * alpha_happy_hour
    # exponentiate to get a positive Poisson rate per tick
    rate = pm.Deterministic('rate', tt.exp(rate))
    tweets = pm.Poisson('tweets', rate, shape=N_TICK)
    # simulate 25 trajectories from the prior
    runs = pm.sample_prior_predictive(25)

# one simulated rate trajectory, with vertical lines at midnight (black) and noon (blue)
plt.plot(runs['rate'][0,:]);
midnights = np.arange(DAYS)*HR_DAY*MIN_HR*SEC_MIN
noons = (0.5 + np.arange(DAYS))*HR_DAY*MIN_HR*SEC_MIN
for m in midnights:
    plt.plot([m, m], [-1, 1], 'k-')
for n in noons:
    plt.plot([n, n], [-1, 1], 'b-')
mn = np.min(runs['rate'][0,:])
mx = np.max(runs['rate'][0,:])
plt.ylim([mn, mx])

# the corresponding simulated per-second tweet counts
plt.figure();
plt.plot(runs['tweets'][0,:50000]);

[Figures: a simulated rate trajectory over the week with midnight/noon markers, and the corresponding simulated per-second tweet counts.]

From this it should be clear that, in this model, the sub-hourly effects are just noise and average out, and that the hourly effects (by virtue of being periodic) also average out at the daily level.

import itertools

def hourfx(t):
    # group key: which hour a (count, tick) pair belongs to
    return int(t[1]/(SEC_MIN*MIN_HR))

def dayfx(t):
    # group key: which day a (count, tick) pair belongs to
    return int(t[1]/(SEC_MIN*MIN_HR*HR_DAY))

# aggregate the simulated per-second counts up to hourly and daily totals
hourly_tweets = itertools.groupby(zip(runs['tweets'][0,:], time_index), hourfx)
hourly_tweets = [np.sum([x[0] for x in v]) for k, v in hourly_tweets]

daily_tweets = itertools.groupby(zip(runs['tweets'][0,:], time_index), dayfx)
daily_tweets = [np.sum([x[0] for x in v]) for k, v in daily_tweets]

plt.plot(hourly_tweets)
plt.plot(daily_tweets)

[Figures: the aggregated hourly and daily tweet counts.]

One can start introducing test functions just as in a linear model:

days = DAYS
hours_day = HR_DAY

# one daily-cycle frequency (time is now indexed in hours), with three candidate phases
want_freqs = [1./hours_day]
want_phase = np.array([4., 15., 22.])/hours_day

time_index = np.arange(days*hours_day)
with pm.Model() as hourly_model:
    log_base_rate = pm.Normal('base_rate_log', -3, 3.)
    rate = log_base_rate
    # one Laplace-regularized coefficient per (frequency, phase) test function
    for i, f in enumerate(want_freqs):
        for j, p in enumerate(want_phase):
            test_alpha = pm.Laplace('alpha_{}_{}'.format(i, j), 0., 1.)
            test_fx = tt.cos(2*np.pi * (f*time_index + p))
            rate = rate + test_alpha*test_fx
    tweets = pm.Poisson('hourly_tweets', tt.exp(rate), observed=np.array(hourly_tweets))
    trace = pm.sample(600, chains=5, init='jitter+adapt_diag')
pm.traceplot(trace)

And looking at the posterior:

with hourly_model:
    post = pm.sample_posterior_predictive(trace)

# posterior predictive median and 90% interval for the hourly counts
median_tweets = np.median(post['hourly_tweets'], axis=0)
u90_tweets = np.percentile(post['hourly_tweets'], 95, axis=0)
l90_tweets = np.percentile(post['hourly_tweets'], 5, axis=0)

plt.plot(median_tweets)
plt.plot(u90_tweets, 'grey')
plt.plot(l90_tweets, 'grey')
plt.plot(hourly_tweets, 'red', alpha=0.5)

[Figure: posterior predictive median and 90% interval against the observed hourly counts.]

Additionally, one can introduce lags and differences such as

L_k r_t = r_{t-k}, \qquad \Delta r_t = r_t - r_{t-1} = r_t - L_1 r_t

and make the rate of the next step depend on the number of tweets observed:

# lagged observed counts, zero-padded at the start
l1 = np.array([0] + hourly_tweets[:-1])
l2 = np.array([0, 0] + hourly_tweets[:-2])
l3 = np.array([0, 0, 0] + hourly_tweets[:-3])
l4 = np.array([0, 0, 0, 0] + hourly_tweets[:-4])

with pm.Model() as lagged_hourly:
    log_base_rate = pm.Normal('base_rate_log', -3, 3.)
    rate = log_base_rate
    # autoregressive coefficients on log(1 + lagged counts)
    a1 = pm.Normal('a1', 0., 1.)
    a2 = pm.Normal('a2', 0., 1.)
    a3 = pm.Normal('a3', 0., 1.)
    a4 = pm.Normal('a4', 0., 1.)
    rate = rate + a1 * np.log(1 + l1) + a2 * np.log(1 + l2) + a3 * np.log(1 + l3) + a4 * np.log(1 + l4)
    tweets = pm.Poisson('hourly_tweets', tt.exp(rate), observed=np.array(hourly_tweets))
    trace2 = pm.sample(600, chains=5, init='jitter+adapt_diag')
pm.traceplot(trace2);

with lagged_hourly:
    post = pm.sample_posterior_predictive(trace2)
    

median_tweets = np.median(post['hourly_tweets'], axis=0)
u90_tweets = np.percentile(post['hourly_tweets'], 95, axis=0)
l90_tweets = np.percentile(post['hourly_tweets'], 5, axis=0)

plt.plot(median_tweets)
plt.plot(u90_tweets, 'grey')
plt.plot(l90_tweets, 'grey')
plt.plot(hourly_tweets, 'red', alpha=0.5)

[Figure: posterior predictive median and 90% interval for the lagged model against the observed hourly counts.]