# Bayesian T-Test to compare posterior distribution from two separate models?

I would like to compare the effects of a variable included in two otherwise identical models fit to two different groups. In this case, two separate models make more sense than a hierarchical model. From the traces, I have the posterior distributions for the effects, which are approximately Gaussian. I currently take these two posterior distributions and run a separate Bayesian t-test to estimate the degree of difference between them.

This approach seems inefficient, and I worry that by running a separate test on the posterior samples I am losing information in the second Bayesian t-test model. Is this the common approach for testing differences in effects, or am I missing something obvious?

So you have 2 groups and in each group you have a dichotomous factor? And you are estimating a difference for each group? And then comparing the 2 differences? Did I get all that correct?

If so, I would construct a single model with 2 separate parameters (one for the difference in the one group, the second for the difference in the second group). Your trace will then contain samples that reflect the joint distribution of difference 1 and difference 2 and you can then directly ask about p(diff_1 > diff_2). Does that make sense/work?
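For example, once both difference parameters live in a single trace, the posterior probability is just the fraction of draws in which one exceeds the other. A minimal sketch, with random numpy arrays standing in for the hypothetical `diff_1` and `diff_2` columns of a trace:

```python
import numpy as np

rng = np.random.default_rng(12354)

# Stand-ins for the posterior draws of the two difference parameters;
# in practice these would come from the trace, e.g. trace["diff_1"].
diff_1 = rng.normal(loc=0.8, scale=0.3, size=4000)
diff_2 = rng.normal(loc=0.5, scale=0.3, size=4000)

# p(diff_1 > diff_2) is simply the fraction of joint posterior draws
# in which diff_1 exceeds diff_2.
p_greater = np.mean(diff_1 > diff_2)
print(p_greater)
```

No second model needed; the comparison is a one-liner over the samples.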

If you dump in some code, we could sketch out a more concrete solution.

Hi cluhmann,

Thanks for the response!

First, to respond directly to your questions:
I am not interested in comparing the variables themselves but rather their coefficients.
Second, I considered a hierarchical model, but separate models better reflect my hypothesis, as I believe the dependent variable is driven by different variables in each country.

Let me clarify: I have 2 groups representing data from 2 different countries. In each country, I am estimating the proportion of actions a person in that group will take (Z). (The actions are not independent of one another, so count models are ruled out.) To estimate the proportion in each group, I use a beta regression model with, for example, two explanatory variables (A, B). What I would like to know is: what is the probability that the regression coefficient for A (b_A) in Country 1 is larger than the regression coefficient for A (b_A) in Country 2?

Thus my question: how can I best calculate p(b_A1 > b_A2)?

Right now, using the distributions for each regression coefficient (b_A) generated in the trace, I have been estimating a Bayesian t-test following this example, but (1) it seems very inefficient to fit yet another model to the trace data, and (2) I worry that I am somehow ignoring a connection that the two regression coefficients share. Any thoughts you have would be greatly appreciated.
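In case it helps, here is a sketch of the direct comparison I am wondering about as an alternative to the t-test: since the two models are fit to independent data sets, the posterior draws from the two traces could arguably be paired up directly (shuffled to avoid ordering artifacts). Random arrays stand in for the hypothetical `trace1["b_A"]` and `trace2["b_A"]`:

```python
import numpy as np

rng = np.random.default_rng(12354)

# Hypothetical stand-ins for the b_A posterior draws from each
# country's separately fitted model.
b_A1 = rng.normal(loc=0.4, scale=0.1, size=4000)  # e.g. trace1["b_A"]
b_A2 = rng.normal(loc=0.2, scale=0.1, size=4000)  # e.g. trace2["b_A"]

# If the two posteriors are independent, shuffling and pairing the
# draws yields Monte Carlo samples of (b_A1, b_A2), and the
# comparison is again just a fraction of draws.
p = np.mean(rng.permutation(b_A1) > rng.permutation(b_A2))
print(p)
```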

Please see below for the model, where the variables correspond to the names in parentheses in the text above.

A simplified model:

```python
RSEED = 12354
import pandas as pd
from numpy import random
import pymc3 as pm

random.seed(RSEED)

# Sample data
A = random.randint(0, 100, 500) * 0.01
B = random.randint(1, 6, 500)
Z = random.randint(1, 1000, 500) * 0.001
df = pd.DataFrame({"A": A, "B": B, "Z": Z})

with pm.Model() as model1:
    # Priors
    intercept = pm.Normal("intercept", mu=0, sd=10)
    b_A = pm.Normal("b_A", mu=0, sd=5)
    b_B = pm.Normal("b_B", mu=0, sd=5)
    # Model error
    model_err = pm.HalfNormal("model_err", sd=0.5)
    y_est = pm.invlogit(b_A * df["A"] + b_B * df["B"] + intercept)
    # Likelihood
    y = pm.Beta("y", mu=y_est, sd=model_err, observed=df["Z"])
    # Inference
    trace = pm.sample(2000, tune=1000, random_seed=RSEED)
```

Ok, I think I understand. Feel free to correct me if not. I assume the synthetic data and model you provided are for a single country. And then you run that same model twice, each time using data from a different country?

Below I have modified your example to merge what I believe to be your 2 models (1 model per country) that each include 4 parameters (an intercept, 2 slopes, and an error term) into a single model with 4 pairs of parameters. As I mentioned, including two `b_A` parameters (1 per country) in a single trace will allow you to straightforwardly compare the posterior of each parameter. I have also modified the synthetic data, both to double the number of rows, and to add in a `country` column that dummy codes which country the data is associated with. This variable is then used to select the appropriate half of each parameter pair from the model.

So let me know if I misunderstood anything or if you have any questions.

```python
RSEED = 12354
import pandas as pd
import numpy as np
from numpy import random
import pymc3 as pm

random.seed(RSEED)

# Sample data

# country 1
A = random.randint(0, 100, 500) * 0.01
B = random.randint(1, 6, 500)
Z = random.randint(1, 1000, 500) * 0.001
country = 0 * np.ones(500, dtype=int)
df1 = pd.DataFrame({"A": A, "B": B, "Z": Z, "country": country})

# country 2
A = random.randint(0, 100, 500) * 0.01
B = random.randint(1, 6, 500)
Z = random.randint(1, 1000, 500) * 0.001
country = 1 * np.ones(500, dtype=int)
df2 = pd.DataFrame({"A": A, "B": B, "Z": Z, "country": country})
df = pd.concat([df1, df2], ignore_index=True)

with pm.Model() as model1:
    # Priors: one entry per country (shape=2)
    intercept = pm.Normal("intercept", mu=0, sd=10, shape=2)
    b_A = pm.Normal("b_A", mu=0, sd=5, shape=2)
    b_B = pm.Normal("b_B", mu=0, sd=5, shape=2)

    # Model error
    model_err = pm.HalfNormal("model_err", sd=0.5, shape=2)

    # Linear combination of predictors/parameters; the country column
    # selects the appropriate half of each parameter pair
    linear = (
        intercept[df["country"]]
        + b_A[df["country"]] * df["A"]
        + b_B[df["country"]] * df["B"]
    )

    # Likelihood
    y = pm.Beta(
        "y", mu=pm.invlogit(linear), sd=model_err[df["country"]], observed=df["Z"]
    )

    # Inference
    trace = pm.sample(2000, tune=1000, random_seed=RSEED)
```
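Once that model is sampled, the quantity you asked about falls out of the trace directly. With the `shape=2` parameterization, `trace["b_A"]` is an `(n_draws, 2)` array, one column per country. A sketch, with a random array standing in for the real trace here:

```python
import numpy as np

rng = np.random.default_rng(12354)

# Stand-in for trace["b_A"]: one column per country under the
# shape=2 parameterization.
b_A_draws = rng.normal(loc=[0.4, 0.2], scale=0.1, size=(4000, 2))

# p(b_A1 > b_A2): fraction of joint posterior draws where country 1's
# coefficient exceeds country 2's.
p = np.mean(b_A_draws[:, 0] > b_A_draws[:, 1])
print(p)
```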