How can I apply Bayesian statistics and PyMC3 to the following problem?

Hi, this may be a noob question, but I am new to Bayesian statistics and I’d like to understand how to solve the following traditional frequentist problem using Bayesian methods.
I have some A/B test data. I have found some tutorials online on how to tell whether the two variants are statistically different from each other, but I want to go further: I want to understand whether there is a ‘day_of_the_month’ factor influencing my data. Have a look at the following code:

import numpy as np
import pandas as pd

# Simulated daily visit counts for each variant over a 31-day month
variant_a_visits = np.random.normal(100, 20, 31).round().astype(int)
variant_b_visits = np.random.normal(80, 30, 31).round().astype(int)
days_of_the_month = list(np.arange(1, 32))
df = pd.DataFrame({'variant_a': variant_a_visits, 'variant_b': variant_b_visits}, index=pd.Index(days_of_the_month))

How can I test whether I am likely to get more visits during a specific period of the month (beginning, middle, or end) or on a specific day? Thanks

Hi Dario,

I’m just starting out myself, but here is my noob answer to your question.

Proceed as if this were a simple t-test or, better yet, its Bayesian alternative.
The difference is that, when you code your model for estimating the difference in means, each of your random variables gets a shape parameter equal to len(days_of_month). This turns each variable into a vector of variables, so you end up testing for differences on each day individually.

You might also want to compare this approach with the most basic difference-in-means test using Bayes factors. The great thing about Bayesian statistics is that this comparison will show whether the extra parameters (the days of the month) are actually adding any value.

There are a few ways to improve on my suggestion if the results seem odd. One would be to use a hierarchical model, which is really easy to build in a Bayesian framework such as PyMC3.

I’m curious to hear suggestions from more experienced users.

I drafted a potential approach in this notebook. I modelled the number of visits per day with a Poisson distribution (whose sole parameter is $\lambda$, the average number of events [1]). Instead of modelling each day individually, we use a rolling regression. That is, we assume that the average counts on consecutive days are correlated, as in $\lambda_t = \lambda_{t-1} + \mathcal{N}(0, \sigma)$. We can do that using the GaussianRandomWalk (see here and here). What you obtain is the estimated $\lambda$ for each day (but you could tweak the model and estimate a $\lambda$ for each meaningful period of the month). I started off by modelling only one time series. You could then move on to incorporate the other one and compare the parameters across models.
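
To make the "one $\lambda$ per period of the month" variant concrete, here is a minimal sketch. It is not the notebook's model: instead of PyMC3 and a GaussianRandomWalk, it swaps in a conjugate Gamma-Poisson update, so it runs with the Python standard library only. The period boundaries (days 1-10, 11-20, 21-31), the prior values, the helper names, and the simulated counts are all assumptions made for illustration.

```python
import random

def period_of_month(day):
    """Map a day of the month (1-31) to a coarse period label (assumed cut points)."""
    if day <= 10:
        return "beginning"
    if day <= 20:
        return "middle"
    return "end"

def gamma_poisson_posterior(counts, prior_shape=1.0, prior_rate=0.01):
    """Posterior Gamma(shape, rate) for a Poisson rate under a Gamma(shape, rate) prior."""
    return prior_shape + sum(counts), prior_rate + len(counts)

def sample_rate(shape, rate, n_draws, rng):
    # random.gammavariate takes (shape, scale), so the scale is 1 / rate.
    return [rng.gammavariate(shape, 1.0 / rate) for _ in range(n_draws)]

rng = random.Random(0)
n_draws = 5000

# Hypothetical daily visit counts for days 1..31 (a stand-in for df['variant_a']).
visits = [max(0, round(rng.gauss(100, 20))) for _ in range(31)]

# Group the daily counts by period of the month.
by_period = {}
for day, count in enumerate(visits, start=1):
    by_period.setdefault(period_of_month(day), []).append(count)

# One posterior per period, then Monte Carlo draws from each posterior.
posteriors = {p: gamma_poisson_posterior(c) for p, c in by_period.items()}
draws = {p: sample_rate(s, r, n_draws, rng) for p, (s, r) in posteriors.items()}

# Posterior probability that the end-of-month rate exceeds the beginning-of-month rate.
p_end_gt_begin = sum(e > b for e, b in zip(draws["end"], draws["beginning"])) / n_draws

for p, (s, r) in posteriors.items():
    print(p, "posterior mean rate:", round(s / r, 1))
print("P(lambda_end > lambda_beginning) =", p_end_gt_begin)
```

A probability near 0.5 suggests no clear difference between the two periods, while values close to 0 or 1 point to one period having the higher rate. Note that counts simulated from a normal distribution with standard deviation 20 are more spread out than a Poisson with mean 100 would be, so this simple model is likely to be overconfident on such data.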

This is just one idea. What I would do, however, is add more predictors to your data. For example, you could mark which days are workdays and which are weekends (maybe you get more visits on weekends?). Another approach is to decompose the time series into its structural components, such as trend and seasonality (there are many techniques for that, for example ARIMA).
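
As a small illustration of the workday/weekend predictor, the sketch below derives a weekend flag from the day of the month. The data in the question carry no year or month, so the calendar month used here (January 2023) is purely an assumption, and `is_weekend` is a hypothetical helper name.

```python
from datetime import date

# Assumption: the 31 rows belong to January 2023; substitute the real month.
YEAR, MONTH = 2023, 1

def is_weekend(day_of_month, year=YEAR, month=MONTH):
    """Return True if the given calendar day is a Saturday or a Sunday."""
    return date(year, month, day_of_month).weekday() >= 5  # Monday=0 ... Sunday=6

# One boolean flag per day of the month, usable as a predictor column.
weekend_flags = {day: is_weekend(day) for day in range(1, 32)}
weekend_days = [day for day, flag in weekend_flags.items() if flag]
print(weekend_days)
```

The flag could then be added as a column next to the visit counts and used as a predictor, for example by letting the log of the Poisson rate depend on an intercept plus a weekend coefficient.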
