Hope you are having a nice day!
I am a PyMC beginner, and I am trying to wrap my head around using PyMC to compare two groups of data in the context of A/B testing.
I have two groups of data: group A and group B. Group A has 400 datapoints, group B has 390. Each datapoint is a single float representing the revenue from that datapoint. This data is stored in pandas DataFrames called data_A and data_B, each with a revenue column.
I want to build a simple model that would allow me to compare the posteriors of these two groups, and make claims like “Group A has higher revenue than Group B X% of the time, and the distribution of the difference between the two groups looks like this”.
In order to do that, I define a simple model:
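
    import pymc as pm  # assumes PyMC v4+; older releases use "import pymc3 as pm"

    with pm.Model() as revenue_model:
        # Priors on the per-group standard deviations
        sigma_A = pm.HalfNormal("sigma_A", 1000)
        sigma_B = pm.HalfNormal("sigma_B", 1000)  # was named "sigma_A", which clashes with the variable above

        # Priors on the per-group means
        mean_A = pm.Normal("mean_A", mu=5000, sigma=1000)
        mean_B = pm.Normal("mean_B", mu=5000, sigma=1000)

        # Likelihoods for the observed revenue in each group
        revenue_A = pm.Normal("revenue_A", mu=mean_A, sigma=sigma_A, observed=data_A["revenue"])
        revenue_B = pm.Normal("revenue_B", mu=mean_B, sigma=sigma_B, observed=data_B["revenue"])

        trace = pm.sample()
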
I can then plot the trace to see the posterior sigma and mean parameters for A and B. All four will have chains x draws shape (in my case with defaults it’s 4 x 1000).
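Because `mean_A` and `mean_B` share the same chains x draws shape, their draws can be paired up and subtracted directly. Below is a minimal sketch of that idea using only the Python standard library. The draws are synthetic stand-ins generated with `random.gauss` (the values 5100, 5000 and 60 are made up for illustration), not output from the model above; with a real trace, and assuming `pm.sample()` returns an InferenceData object, the corresponding draws would presumably come from `trace.posterior["mean_A"]` and `trace.posterior["mean_B"]`.

```python
import random

random.seed(0)

N_CHAINS, N_DRAWS = 4, 1000  # the default chains x draws shape mentioned above


def fake_draws(mu, sd):
    """Hypothetical stand-in for posterior draws, shaped chains x draws (nested lists)."""
    return [[random.gauss(mu, sd) for _ in range(N_DRAWS)] for _ in range(N_CHAINS)]


# Made-up numbers, NOT results from the revenue model
mean_A_draws = fake_draws(5100.0, 60.0)
mean_B_draws = fake_draws(5000.0, 60.0)

# Pair the draws elementwise (same chain, same draw) and flatten the chains
diff = [
    a - b
    for chain_a, chain_b in zip(mean_A_draws, mean_B_draws)
    for a, b in zip(chain_a, chain_b)
]

# Share of paired draws in which group A's mean exceeds group B's mean
prob_A_higher = sum(d > 0 for d in diff) / len(diff)

print(f"number of paired draws: {len(diff)}")
print(f"P(mean_A > mean_B) is roughly {prob_A_higher:.3f}")
```

Note that this compares the group means, so it speaks to "which group has the higher average revenue", not to the revenue of any individual datapoint.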
But when I draw samples from the posterior predictive, revenue_A has a shape of 4 x 1000 x 400 and revenue_B has a shape of 4 x 1000 x 390. I understand that PyMC draws a posterior predictive sample for each observed datapoint, which is why the last dimension matches the number of observations in each group.
However, I can’t compare these directly (e.g. subtract one from the other to get a distribution of the difference) because they have different shapes.
- Is this the right approach to answer the question I am asking?
- If so, how could I go about comparing the posterior distributions of revenue for the two groups?
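One possible way around the shape mismatch, sketched with stand-in data rather than real PyMC output: average each posterior predictive draw over its observation axis. That leaves one simulated average revenue per (chain, draw) for each group, and those shapes do line up. The helpers `fake_predictive` and `mean_over_observations` and all numbers below are hypothetical, and whether this summary is more appropriate than comparing `mean_A` and `mean_B` directly is part of the question.

```python
import random
from statistics import fmean

random.seed(1)

N_CHAINS, N_DRAWS = 2, 50    # kept small so pure Python stays fast
N_OBS_A, N_OBS_B = 400, 390  # group sizes from the question


def fake_predictive(mu, sd, n_obs):
    """Hypothetical stand-in for posterior predictive draws: chains x draws x n_obs."""
    return [
        [[random.gauss(mu, sd) for _ in range(n_obs)] for _ in range(N_DRAWS)]
        for _ in range(N_CHAINS)
    ]


def mean_over_observations(pred):
    """Collapse the observation axis, leaving a chains x draws nested list."""
    return [[fmean(obs) for obs in chain] for chain in pred]


# Made-up numbers, NOT results from the revenue model
pred_A = fake_predictive(5100.0, 1000.0, N_OBS_A)
pred_B = fake_predictive(5000.0, 1000.0, N_OBS_B)

avg_A = mean_over_observations(pred_A)  # chains x draws
avg_B = mean_over_observations(pred_B)  # chains x draws, same shape as avg_A

# The observation axes differed (400 vs 390), but the per-draw averages pair up
diff = [a - b for ca, cb in zip(avg_A, avg_B) for a, b in zip(ca, cb)]
prob_A_higher = sum(d > 0 for d in diff) / len(diff)

print(f"paired draws: {len(diff)}")
print(f"share of draws where A's simulated average exceeds B's: {prob_A_higher:.2f}")
```

The same reduction on real output would be a mean over the observation dimension of each posterior predictive variable; the exact dimension names depend on the PyMC version and are not shown here.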
Thank you very much in advance for your advice and guidance!
Edit: Any suggestions about “best practices” or structuring my code differently are most welcome too!