How to recreate PPC from linear model coefficients?

EDIT: I worked it out by trying to emulate the shape of sample_ppc. I iterated over each sample in X_i and chose the appropriate corresponding beta weight. A moderator can feel free to delete this post.

I have a hierarchical Bayesian model (thanks @junpenglao!) that models gene expression of a single sample against a background cohort and the results that I get out of sample_ppc are great.

However, I’d like to use the coefficients in the linear model to recapitulate the distribution I get out of sample_ppc so I can get a posterior for genes that I did not train with (due to runtime limitations), but my attempts produce a different distribution and I’m not sure why.

From sample_ppc (posterior is Laplacian, ppp is posterior predictive p-value relative to the vertical line):

If I use the trace coefficients and sampling or averages from X_i, I get something like the below for the same gene:

No matter how I use the trace coefficients, I can’t seem to recreate the result I get from sample_ppc.

My process is:

  1. Calculate expected value of linear model by iterating over trace and sampling from X_i. Or using mean of trace coefs and mean of different X_i groups.
  2. Use np.random.laplace or scipy.stats.laplace.rvswith loc set to the above and scale set to the model error parameter in the trace.
  3. Plot

My X_i are different sizes which is why I randomly sample or take the average over X_i.

Thank you for any help or feedback, example code is below where t is the trace and classes are the different X_i groups.

gene = 'PARP1'
zs = []
for i, a in enumerate(t['a']):
    z = a
    for j, g in enumerate(classes):
        samples = np.random.choice(normal[normal['tissue'] == g][gene])
        z += t['b'][i, j] * samples

z = np.random.laplace(loc=zs, scale=t['eps'], size=(500, 2000)).ravel()

z_true = sample[gene]
ppp = round(sum(z_true < z) / len(z), 4)

plt.title(f'Gene: {gene} - PPP: {ppp}')

Glad you found the solution! I will leave this post here as we always like a post answer itself :wink:

1 Like

I have encountered a similar problem trying to reproduce the mean of the ppc using the coefficients from a simple linear model. I’m using the means of the traces for each beta applied to the data to try to produce the predicted value, as you would in a standard OLS regression framework. The resulting predicted values are highly correlated with the mean of the ppc but the values are off by a constant. I’m trying to understand why this is the case. It sounds like a similar situation to what @jvivian describes; I’d be interested to get more details on his solution.