Hey there PyMC community! I've been exploring some recent research showing that a skew-t model fits gene enrichment data rather well. I found a thread on this forum (here) discussing how to handle skew-t models in PyMC with a custom function; it uses a logp function named logp_skewt (which I presume computes the log-likelihood?). To understand it better, I built and tested a small toy example. I hit a snag, though: another thread (here) mentions that sampling these models can be quite slow. Can anyone help me speed things up? Am I using PyMC correctly here, and if not, what am I missing? I'm a bit nervous because my real data is considerably more complex: a hierarchical model with roughly 12,000 groups and 35,000 observations.
Here’s a quick look at my toy example, which includes the data generation process, prior check, and PyMC model:
# Data Generation Process
X ~ NB(mu = 10, alpha = 0.3)
b ~ Normal(mu = 2.5, sigma = 0.2)
y = b * X
For my prior predictive check, I ran the code below with a few different parameter settings.
Skew-t parameters (the priors I tried, matching the code below): nu ~ Gamma(2, 0.1), sigma ~ Exponential(lam = 1/std(X)^2), alpha ~ HalfCauchy(3).
Looking at the priors plot, it's clear that the sigma prior has an outsized effect on the implied data. Can anyone suggest a better choice for the sigma prior in this situation? Any input would be greatly appreciated!
Fig_prior_skewt.pdf (19.8 KB)
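For what it's worth, one alternative I've been toying with for the scale prior (purely a guess on my part, not something from the paper) is a half-normal tied to the spread of the simulated response:

# Untested idea: scale the prior by the spread of y instead of 1/std(X)^2
sigma = pm.HalfNormal("sigma", sigma=np.std(Y_rand))

But I don't know whether that's any more principled, hence the question.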
Here’s a snippet for you:
# Code for the prior predictive check
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pymc as pm
import pytensor.tensor as pt

RANDOM_SEED = 42  # placeholder seed, used throughout

def min_mean_max(arr):
    # Key the summary stats 0/1/2 so they can be indexed in the plot grid below
    return {0: np.mean(arr), 1: np.min(arr), 2: np.max(arr)}

def skewT_test(sigma_value, nu_value, alpha_value, X_rand, Y_rand):
    # Symbolic inputs for the skew-t log-density
    mu = pt.scalar("mu")
    nu = pt.scalar("nu")
    sigma = pt.scalar("sigma")
    alpha = pt.scalar("alpha")
    value = pt.scalar("value")
    # Student-t log-pdf plus the skewing log-cdf term
    rv_t = pm.logp(pm.StudentT.dist(nu, mu=mu, sigma=sigma), value)
    rv_T = pm.logcdf(pm.StudentT.dist(nu + 1, mu=mu, sigma=sigma),
                     alpha * value * pm.math.sqrt((nu + 1) / (mu**2 + nu)))
    pdf = []
    # Evaluate the density point by point (slow; see the aside after the plot code)
    for x, y in zip(X_rand, Y_rand):
        pdf_temp = (rv_t.eval({nu: nu_value, mu: x, sigma: sigma_value, value: y})
                    + rv_T.eval({nu: nu_value, mu: x, sigma: sigma_value, alpha: alpha_value, value: y}))
        pdf.append(np.exp(pdf_temp))
    df = pd.DataFrame({"x": Y_rand, "y": pdf})
    df = df.sort_values(by=["y"])
    return df
X_rand = pm.draw(pm.NegativeBinomial.dist(mu=10, alpha=0.3), draws=50, random_seed=RANDOM_SEED)
b = pm.Normal.dist(2.5, 1, shape=50)
Y_rand = X_rand * b.eval()

nu_dist = pm.Gamma.dist(2, 0.1, shape=50)
nu_dict = min_mean_max(nu_dist.eval())
sigma_value = pm.Exponential.dist(1 / np.std(X_rand) ** 2, shape=50)
sigma_dict = min_mean_max(sigma_value.eval())
alpha_value = pm.HalfCauchy.dist(3, shape=50)
alpha_dict = min_mean_max(alpha_value.eval())
fig, axes = plt.subplots(3, 3, figsize=(11.5, 5))
for i in range(3):
    for j in range(3):
        df = skewT_test(sigma_value=sigma_dict[j], nu_value=nu_dict[i],
                        alpha_value=alpha_dict[i], X_rand=X_rand, Y_rand=Y_rand)
        axes[i, j].bar(df["x"].values, df["y"].values)
        axes[i, j].set_title(f"alpha: {np.round(alpha_dict[i], 1)}\n"
                             f"nu: {np.round(nu_dict[i], 1)}\n"
                             f"sigma: {np.round(sigma_dict[j], 1)}")
plt.subplots_adjust(wspace=0.5, hspace=1)
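An aside on speed: I suspect the per-point .eval() calls inside skewT_test are what make this prior check crawl. Below is a sketch of what I think a vectorized alternative would look like, compiling the graph once over vector inputs (my own guess, not verified to reproduce the numbers above):

import pytensor

# Compile the skew-t density once over vector inputs instead of calling
# .eval() separately for every (x, y) pair.
mu_v = pt.vector("mu")
value_v = pt.vector("value")
nu_s = pt.scalar("nu")
sigma_s = pt.scalar("sigma")
alpha_s = pt.scalar("alpha")

logp = pm.logp(pm.StudentT.dist(nu_s, mu=mu_v, sigma=sigma_s), value_v) + pm.logcdf(
    pm.StudentT.dist(nu_s + 1, mu=mu_v, sigma=sigma_s),
    alpha_s * value_v * pm.math.sqrt((nu_s + 1) / (mu_v**2 + nu_s)),
)
skewt_pdf = pytensor.function([nu_s, mu_v, sigma_s, alpha_s, value_v], pt.exp(logp))
# e.g. skewt_pdf(nu_value, X_rand.astype("float64"), sigma_value, alpha_value, Y_rand.astype("float64"))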
And here is the PyMC model:
# The PyMC model
from pytensor.tensor import TensorVariable

def logp_skewt(value: TensorVariable, nu: TensorVariable, mu: TensorVariable,
               sigma: TensorVariable, alpha: TensorVariable) -> TensorVariable:
    # Skew-t log-likelihood, following the custom-logp recipe from the linked thread
    return (pm.math.log(2)
            + pm.logp(pm.StudentT.dist(nu, mu=mu, sigma=sigma), value)
            + pm.logcdf(pm.StudentT.dist(nu, mu=mu, sigma=sigma), alpha * value)
            - pm.math.log(sigma))
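If I'm reading logp_skewt right, it corresponds to this log-density (my transcription, so please correct me if I've garbled it):

$$\log f(y) = \log 2 + \log t_{\nu}(y \mid \mu, \sigma) + \log T_{\nu}(\alpha y \mid \mu, \sigma) - \log \sigma$$

where $t_{\nu}$ and $T_{\nu}$ are the Student-t pdf and cdf with location $\mu$ and scale $\sigma$. One thing that confuses me: my prior-check code above instead uses $T_{\nu+1}$ evaluated at $\alpha y \sqrt{(\nu+1)/(\mu^2+\nu)}$ (which looks closer to the Azzalini form of the skew-t), so the two versions don't match. Which form did the original thread intend?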
with pm.Model() as model1:
    beta = pm.Normal("beta", 2.5, 1)
    mu = beta * X_rand
    sigma = pm.Exponential("sigma", 1 / np.std(Y_rand))
    nu = pm.Gamma("nu", 2, 0.1)
    alpha = pm.HalfCauchy("alpha", 3)
    count = pm.CustomDist("count", nu, mu, sigma, alpha, logp=logp_skewt, observed=Y_rand)
    trace = pm.sample(1000, tune=1000, target_accept=0.95, random_seed=RANDOM_SEED)
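One thing that might be worth checking before sampling (if I understand the API correctly, Model.point_logps is available in recent PyMC versions) is whether the custom logp is finite at the model's initial point:

# Any -inf here would point at a prior or logp problem before sampling starts
print(model1.point_logps())

I also noticed that pm.sample accepts a nuts_sampler argument (e.g. "numpyro" or "nutpie"); would switching backends be a sensible way to speed things up once I move to the full 12,000-group model?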
Eager to hear your thoughts and advice on this. Thanks, everyone!