"Improving RNA Sequencing Read Count Predictions: Seeking Suggestions for Enhancing Negative Binomial Distribution Model Performance"

In the proposed methodology, read counts (RNA sequencing) are analyzed using a negative binomial distribution model, as the observed variance significantly exceeds the mean. To evaluate the model’s performance in recovering read count parameters, a single-group scenario is considered. Noise reduction in read count prediction is the primary objective, and to achieve this, a synthetic dataset is created using the following code:

import numpy as np

xreal = np.random.randint(1, 50, 10)

By incorporating plausible values for the parameters alpha_noise, beta_noise, gamma_noise, and the dispersion rj, the predicted read count is obtained:

alpha_noise = 0.7
beta_noise = 0.3
gamma_noise = 0.25
# dispersion
rj = 5

import pymc as pm

xpred = xreal * alpha_noise + xreal * beta_noise + xreal * gamma_noise
# .eval() draws once from the distribution; in PyMC >= 4, pm.draw() is the
# recommended way to sample from a free distribution
xExpirm = pm.NegativeBinomial.dist(mu=xpred, alpha=rj).eval()
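The overdispersion assumption motivating the negative binomial can be checked numerically against its mean-variance relation. A minimal sketch with NumPy (the mean and dispersion values here are illustrative, not taken from the dataset above):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, r = 20.0, 5.0  # illustrative mean and dispersion

# NumPy parameterizes the negative binomial by (n, p);
# with mean mu and dispersion r, p = r / (r + mu)
draws = rng.negative_binomial(r, r / (r + mu), size=100_000)

# For a negative binomial, Var = mu + mu**2 / r, which exceeds the mean
print(draws.mean(), draws.var())
```

If the empirical variance of the real read counts were close to (or below) their mean, a Poisson likelihood would be the simpler choice.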

To account for the absence of real counts prior to noise addition, xi parameters are incorporated into the model. Furthermore, a regularized coefficient is included to address potential differences in noise between parameters. Prior information for the noise parameters is as follows: alpha_noise ranges from 0 to 1, beta_noise lies between 0.25 and 0.5, and gamma_noise has no specific prior knowledge beyond falling within the range of 0 to 1. Additionally, the dispersion parameter lacks substantial prior information.
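One thing worth checking is whether the Beta priors actually encode this stated knowledge: a Beta(3, 2), as used in the model below, has mean 0.6, which puts most of its mass above the 0.25 to 0.5 range described for beta_noise. The interval mass can be inspected with scipy.stats; the Beta(6, 10) shown as an alternative is a hypothetical choice, not from the original model:

```python
from scipy import stats

# The model's Beta(3, 2) prior has mean 0.6 -- only a minority of its
# mass falls inside the stated 0.25-0.5 range for beta_noise
print(stats.beta(3, 2).cdf(0.5) - stats.beta(3, 2).cdf(0.25))

# A Beta(6, 10) (hypothetical alternative, mean 0.375) concentrates
# most of its mass inside that interval instead
print(stats.beta(6, 10).cdf(0.5) - stats.beta(6, 10).cdf(0.25))
```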

After running the model, the predictions obtained are unsatisfactory. Could anyone provide guidance or suggestions on potential errors or improvements that could be made to enhance the model’s performance?

with pm.Model() as model:
    alpha_noise = pm.Beta("alpha_noise", 3, 2)
    beta_noise = pm.Beta("beta_noise", 3, 2)
    gamma_noise = pm.Beta("gamma_noise", 3, 2)
    # Prior for the unknown xi values
    xi = pm.Uniform("xi", lower=0, upper=np.max(xExpirm), shape=len(xExpirm))
    # Regularized coefficients
    reg_param = 0.1  # adjust this parameter based on your knowledge of the problem
    reg_alpha = pm.Laplace("reg_alpha", mu=alpha_noise, b=1 / reg_param)
    reg_beta = pm.Laplace("reg_beta", mu=beta_noise, b=1 / reg_param)
    reg_gamma = pm.Laplace("reg_gamma", mu=gamma_noise, b=1 / reg_param)

    count_data_est = reg_alpha * xi + reg_beta * xi + reg_gamma * xi

    r = pm.Exponential("r", 0.5)
    obs_data = pm.NegativeBinomial("obs_data", mu=count_data_est, alpha=1/r, observed=xpred)
    trace = pm.sample(15_000, tune=20_000, target_accept=0.9)

One suggestion is to start with simulated data. There are two aspects to this; the first is simulating data outside of PyMC so that you have a generating model with known properties, and then see if the model can recover the parameters. The second piece is to use sample_prior_predictive with the model above to generate simulated datasets with the model structure and prior choices you have. Both of these should provide clues as to why your model is not working as expected.
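The first of these — simulating from a known generating process outside PyMC — might look like the sketch below (the parameter values are the ones used earlier in the thread; the variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

# Ground-truth parameters the model should recover
# (values from the thread's example)
alpha_true, beta_true, gamma_true = 0.7, 0.3, 0.25
r_true = 5  # dispersion

xreal = rng.integers(1, 50, size=10)
mu = xreal * (alpha_true + beta_true + gamma_true)

# Negative binomial draws with mean mu and dispersion r_true;
# NumPy's (n, p) parameterization gives p = r / (r + mu)
sim_counts = rng.negative_binomial(r_true, r_true / (r_true + mu))
```

Fitting the model to sim_counts and comparing the posterior against the known truths tells you whether the parameters are identifiable at all. Note, for instance, that the three noise coefficients enter the mean only through their sum here, so the likelihood cannot distinguish them individually.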

As a side note, sampling for 15K iterations and tuning for 20K is very much overkill. 1000 samples is plenty to summarize the posterior (unless you have serious autocorrelation, in which case you need to fix the model) and a couple thousand is usually plenty of tuning (depending on how challenging your model is to sample).


I appreciate your response and thank you for taking the time. I am pleased to report that I have improved the predictions by simplifying the model. Specifically, I have removed the latent predictor variable and instead incorporated a log-link function. The revised model is below for your review.

import aesara.tensor as at

with pm.Model() as model:
    alpha_noise = pm.Beta("alpha_noise", 2, 1)
    beta_noise = pm.Beta("beta_noise", 2, 5)
    gamma_noise = pm.Beta("gamma_noise", 2, 5)

    # Log-link: the expected count is exp of the linear predictor
    count_data_est = at.exp(alpha_noise + beta_noise + gamma_noise)

    r = pm.Exponential("r", 0.5)
    obs_data = pm.NegativeBinomial("obs_data", mu=count_data_est, alpha=1/r, observed=xExpirm)

    trace = pm.sample(1000, tune=1000, target_accept=0.9)