I am currently evaluating student performance in an online exam based on the PyMC "NBA Foul Analysis with Item Response Theory" example. This works fine and provides student ability and question difficulty as desired.

Here is a minimal working example of my code with toy data.

```python
import arviz as az
import matplotlib.pyplot as plt
import pandas as pd
import pymc as pm

# string id for student
student = ['S1','S1','S1','S2','S2','S2','S3','S3','S3','S4','S4','S4']
# 3 pools of questions, students get 1 at random from each
question = ['Q1a','Q2a','Q3a','Q1b','Q2b','Q3b','Q1a','Q2b','Q3a','Q1b','Q2a','Q3b']
# correct=1, wrong=0
correct = [1,0,0, 1,1,0, 1,1,0, 1,1,1]
# marks given to each pool, not used at the moment
marks = [1,2,4,1,2,4,1,2,4,1,2,4]
data = pd.DataFrame(dict(student=student, question=question, correct=correct, marks=marks))

# integer codes for indexing, plus the unique labels for coords
student_observed, student = pd.factorize(data['student'], sort=True)
question_observed, question = pd.factorize(data['question'], sort=True)
coords = {"student": student, "question": question}

with pm.Model(coords=coords) as model_toy:
    # Data
    correct_observed = pm.Data("correct_observed", data['correct'], mutable=False)
    # Hyperpriors
    sigma_theta = pm.HalfCauchy("sigma_theta", 2.5)
    mu_b = pm.Normal("mu_b", 0.0, 10.0)
    sigma_b = pm.HalfCauchy("sigma_b", 2.5)
    # Priors (non-centred parameterisation)
    delta_theta = pm.Normal("delta_theta", 0.0, 1.0, dims="student")
    delta_b = pm.Normal("delta_b", 0.0, 1.0, dims="question")
    # Deterministics
    ability = pm.Deterministic("ability", delta_theta * sigma_theta, dims="student")
    difficulty = pm.Deterministic("difficulty", delta_b * sigma_b + mu_b, dims="question")
    eta = pm.Deterministic("eta", ability[student_observed] - difficulty[question_observed])
    # Likelihood
    y = pm.Bernoulli("y", logit_p=eta, observed=correct_observed)
    trace = pm.sample(1000, tune=500)

f, axs = plt.subplots(1, 2, figsize=(6, 4))
az.plot_forest(trace, var_names=["ability"], combined=True, ax=axs[0],
               labeller=az.labels.NoVarLabeller())
az.plot_forest(trace, var_names=["difficulty"], combined=True, ax=axs[1],
               labeller=az.labels.NoVarLabeller())
```

During the exam each student is given a random selection of questions drawn from different pools (different question lengths and topic areas). We would hope that question difficulty within each pool is nominally the same, so the test is fair, but the IRT analysis indicates this isn't true. While the ability score in IRT accounts for this, students pass based on their absolute mark, and it's reasonable for them to ask what their score might have been with a different selection of questions.

Given that the IRT model provides the ability of students and difficulty of the question, leading to a probability of answering correctly, I feel like I should be able to rework it to estimate the distribution of exam scores.
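To make that concrete, in this 1PL/Rasch-style model the probability of a correct answer is just the logistic of ability minus difficulty. A tiny sketch with made-up numbers (not values from the actual trace):

```python
from scipy.special import expit  # expit(x) = 1 / (1 + exp(-x)), the logistic function

# Hypothetical posterior means, purely for illustration
ability = 0.8      # one student's ability
difficulty = -0.3  # one question's difficulty (easier than average)

p_correct = expit(ability - difficulty)  # logistic(1.1), about 0.75
```

So once ability and difficulty draws are in hand, the per-question success probability is a cheap deterministic transform.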

At this point I can't see a strategy for progressing. It seems like, having run the IRT analysis, I would need to fix the ability/difficulty distributions and run something with the `question_observed` indices varying according to certain rules (do any 2 from Q1-5, any 3 from Q6-11, and so on), then propagate this through to predict scores. It's the varying `question_observed`, or its equivalent, that's most confusing to me.
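Here is roughly what I imagine, done outside PyMC in plain numpy: for each posterior draw, resample the question assignment according to the pool rules, then simulate correct/incorrect through the Bernoulli likelihood. The posterior draws below are random stand-ins (in practice they would come from `trace.posterior["ability"]` and `trace.posterior["difficulty"]`):

```python
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(0)

# Stand-ins for posterior draws; replace with flattened draws from the trace
n_draws = 2000
ability = rng.normal(0.5, 0.3, size=n_draws)            # one student, n_draws samples
difficulty = rng.normal(0.0, 0.5, size=(n_draws, 5))    # a pool of 5 questions
marks_per_question = 1                                  # marks for this pool
n_pick = 2                                              # rule: "do any 2 from this pool"

scores = np.empty(n_draws)
for d in range(n_draws):
    # resample the question assignment for this draw
    idx = rng.choice(5, size=n_pick, replace=False)
    # success probability for each assigned question
    p = expit(ability[d] - difficulty[d, idx])
    # simulate correctness and accumulate marks
    scores[d] = (rng.random(n_pick) < p).sum() * marks_per_question

# `scores` now approximates this student's mark distribution for the pool,
# mixing over posterior uncertainty and random question assignment
```

Repeating this per pool and summing would (I think) give the full exam-score distribution, but I'm not sure if this is the right approach or whether it should be done inside the model with something like posterior predictive sampling.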

Can anyone recommend a way forward?