# Compare means between Lognormal distributions

Dear pymc experts,

I have following distributions of same page views with two different features (clusters ‘c1’ and ‘c2’):

I want to know whether the difference between the mean of these two distributions is \neq 0. Following BEST (Bayesian Estimation Supersedes the T-Test — PyMC3 3.10.0 documentation) but instead of using a t distribution, using a Lognormal I built this model:

with pm.Model() as compare_dist:
group1_mu = pm.Normal('mu1', mu=mean_dict['c1'], sd=std_dict['c1']*2)
group2_mu = pm.Normal('mu2', mu=mean_dict['c2'], sd=std_dict['c2']*2)
group1_sd = pm.Uniform('sd1', lower=0, upper=500)
group2_sd = pm.Uniform('sd2', lower=0, upper=500)
group1 = pm.Lognormal("group1", mu=group1_mu, sd=group1_sd, observed=df.loc[df.clusters=='c1', 'hits'])
group2 = pm.Lognormal("group2", mu=group2_mu, sd=group2_sd, observed=df.loc[df.clusters=='c2', 'hits'])
group1_log_mean = pm.Deterministic('group1_log_mean', group1_mu + group1_sd**2/2)
group2_log_mean = pm.Deterministic('group2_log_mean', group2_mu + group2_sd**2/2)
group1_mean = pm.Deterministic('group1_mean', np.exp(group1_log_mean))
group2_mean = pm.Deterministic('group2_mean', np.exp(group2_log_mean))
diff_of_mus = pm.Deterministic("difference of mus", group1_mean - group2_mean)
diff_of_stds = pm.Deterministic("difference of stds", group1_sd - group2_sd)
diff_of_means = pm.Deterministic("difference of means", group1_mean - group2_mean)
diff_of_log_means = pm.Deterministic("difference of log means", group1_log_mean - group2_log_mean)


For which I get:

The above distributions show the differences of \mu, \sigma, the expected value (mean) and the log of the mean. So the parameters between the two groups are clearly different. Does my model makes sense? Based on the distribution on the bottom left, can I conclude that ‘c2’ has on average 15 more page views per day than ‘c1’?

1 Like

It is fair to say that “given this model, the means are probably different by 10 - 20”. A few other random thoughts if you want to go deeper:

• It probably won’t change much, but best practice is now to use something like pm.HalfNormal('sd1', 10) instead of a Uniform('sd1', 0, 500).
• Why a log normal? If that changes to some other distribution, do you get the same results? (you could get into model comparison with this)
• Are your convergence results fine? This means no divergences, rhat very near 1, reasonable effective sample size?
• The first plot you show, the distributions look very visually similar! I’d be prepared when presenting these results to put this in context: what is the raw difference in the means? I would guess that the dataset is pretty large, so maybe that is swamping the priors. Which is fine! it means you could add more features to the model, or just give you confidence that you have enough data to not need Bayes here.