Predicting student exam scores

DrMP · October 17, 2023, 9:54am

I am currently evaluating student performance in an online exam based on the NBA Foul Analysis with Item Response Theory example provided. This works fine and provides student ability and question difficulty as desired.

Here is a minimal working example of my code with toy data.

import pandas as pd
import matplotlib.pyplot as plt
import arviz as az  # needed for the az.plot_forest calls below
import pymc as pm

#string id for student
student=['S1','S1','S1','S2','S2','S2','S3','S3','S3','S4','S4','S4'] 
#3 pools of questions,  students get 1 at random from each
question=['Q1a','Q2a','Q3a','Q1b','Q2b','Q3b','Q1a','Q2b','Q3a','Q1b','Q2a','Q3b']
#correct=1,wrong=0
correct=[1,0,0, 1,1,0, 1,1,0, 1,1,1] 
#marks given to each pool, not used at the moment
marks=[1,2,4,1,2,4,1,2,4,1,2,4]

data = pd.DataFrame(dict(student=student,question=question,correct=correct,marks=marks))
  
student_observed, student = pd.factorize(data['student'], sort=True)
question_observed, question = pd.factorize(data['question'], sort=True)
  
coords = {"student": student, "question": question}
  
with pm.Model(coords=coords) as model_toy:

    # Data
    correct_observed = pm.Data("correct_observed", data['correct'], mutable=False)

    # Hyperpriors
    sigma_theta = pm.HalfCauchy("sigma_theta", 2.5)
    mu_b = pm.Normal("mu_b", 0.0, 10.0)
    sigma_b = pm.HalfCauchy("sigma_b", 2.5)

    # Priors
    delta_theta = pm.Normal("delta_theta", 0.0, 1.0, dims="student")
    delta_b = pm.Normal("delta_b", 0.0, 1.0, dims="question")

    # Deterministic
    ability = pm.Deterministic("ability", delta_theta * sigma_theta, dims="student")
    difficulty = pm.Deterministic("difficulty", delta_b * sigma_b  + mu_b, dims="question")
    eta = pm.Deterministic("eta", ability[student_observed] - difficulty[question_observed])

    # Likelihood
    y = pm.Bernoulli("y", logit_p=eta, observed=correct_observed)

    trace = pm.sample(1000, tune=500)

f,axs = plt.subplots(1,2,figsize=(6,4))
az.plot_forest(trace, var_names=["ability"], combined=True,ax=axs[0],labeller=az.labels.NoVarLabeller())
az.plot_forest(trace, var_names=["difficulty"], combined=True,ax=axs[1],labeller=az.labels.NoVarLabeller())

During the exam, each student is given a random selection of questions drawn from different pools (which differ in question length and topic area). We would hope that question difficulty within each pool is nominally the same so the test is fair, but the IRT analysis indicates this isn’t true. While the ability score in IRT accounts for this, students pass based on their absolute mark, and it’s reasonable for them to ask how their score might have differed with a different selection of questions.

Given that the IRT model provides the ability of students and difficulty of the question, leading to a probability of answering correctly, I feel like I should be able to rework it to estimate the distribution of exam scores.

At this point I can’t see a strategy for progressing. It seems like having run the IRT analysis I would need to fix the ability/difficulty distributions and run something with the ‘question_observed’ indices varying according to certain rules (do any 2 from Q1-5, any 3 from Q6-11, and so on), then propagate this through to predict scores. I think it’s the varying ‘question_observed’, or equivalent, that’s most confusing to me.

Can anyone recommend a way forward?

Simon · October 17, 2023, 8:07pm

Would it be too simplistic to do something like this: $\sum_i m_i p_{i,k}$, where $m_i$ are the marks and $p_{i,k}$ are the posterior probabilities of a correct answer for one student, for posterior samples $k = 1, \dots, S$? You could do that for each student, maybe even add it as a pm.Deterministic inside the model. In a way, it’s simply using the estimated probability of answering correctly as a weight for each maximum mark. Not the fanciest solution, but maybe useful in case other options are too difficult or time-consuming to implement.
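
A minimal sketch of this weighted sum, done outside the model with NumPy. The posterior draws below are random placeholders (not fitted results), and the names ability_s1, difficulty and marks are assumptions for illustration; with a real trace the draws would be extracted from trace.posterior instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder posterior draws (hypothetical values, NOT fitted results).
# With a real trace, something like
#   trace.posterior["ability"].stack(sample=("chain", "draw")).values
# would give an array of shape (student, sample) to slice from.
n_samples, n_questions = 2000, 6
ability_s1 = rng.normal(0.5, 0.8, size=n_samples)                 # one student, shape (S,)
difficulty = rng.normal(0.0, 1.0, size=(n_samples, n_questions))  # shape (S, Q)

# Maximum marks m_i for each question considered (assumed values).
marks = np.array([1, 2, 4, 1, 2, 4])

# p[k, i]: probability of a correct answer to question i in posterior draw k,
# using the same logit link as the model (ability - difficulty).
p = 1.0 / (1.0 + np.exp(-(ability_s1[:, None] - difficulty)))

# Expected score for each posterior draw k: sum_i m_i * p[k, i].
expected_score = p @ marks  # shape (S,)

print("mean expected score:", expected_score.mean())
print("3%-97% interval:", np.percentile(expected_score, [3, 97]))
```

The result is a distribution of expected scores (one value per posterior draw), so it reflects uncertainty in ability and difficulty but not the right/wrong randomness of individual answers.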

DrMP · October 18, 2023, 10:50am

Thanks. If I understood correctly, that would estimate the exam score if the student took all the questions? They actually get a subset of all the questions, and that subset would change if they were able to resit the exam. There are several hundred students who each get about 40 out of 120 questions pulled at random from different pools. It’s this part I can’t see how to do in pymc.

I suppose I can imagine an approach outside of pymc where I loop through each student, generate many random selections of questions, and do as you suggest to generate a distribution of scores. I was hoping there was already something in pymc that would let me avoid the inevitable errors made by amateurs.
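
One way the loop described above might look for a single student, as a rough sketch rather than a tested recipe. The pool layout (any 2 from Q1-5, any 3 from Q6-11), the marks per pool, and the function name simulate_resits are all assumptions for illustration, and the posterior draws are random placeholders standing in for values extracted from the fitted trace.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical exam layout: pool -> (question indices, number drawn, marks each).
pools = {
    "pool1": (np.arange(0, 5), 2, 1),   # any 2 from Q1-5, 1 mark each
    "pool2": (np.arange(5, 11), 3, 2),  # any 3 from Q6-11, 2 marks each
}
n_questions = 11

# Placeholder posterior draws (NOT fitted results) for one student.
n_samples = 1000
ability = rng.normal(0.3, 0.5, size=n_samples)                    # shape (S,)
difficulty = rng.normal(0.0, 1.0, size=(n_samples, n_questions))  # shape (S, Q)


def simulate_resits(ability, difficulty, pools, n_sims, rng):
    """Simulate total marks for n_sims hypothetical resits by one student."""
    scores = np.empty(n_sims)
    for s in range(n_sims):
        # Use one joint posterior draw so ability and difficulty stay consistent.
        k = rng.integers(len(ability))
        total = 0
        for questions, n_drawn, mark in pools.values():
            # Random selection of questions from this pool, without repeats.
            picked = rng.choice(questions, size=n_drawn, replace=False)
            p = 1.0 / (1.0 + np.exp(-(ability[k] - difficulty[k, picked])))
            # Simulate right/wrong answers and add up the marks earned.
            total += mark * rng.binomial(1, p).sum()
        scores[s] = total
    return scores


scores = simulate_resits(ability, difficulty, pools, n_sims=2000, rng=rng)
max_score = sum(n_drawn * mark for _, n_drawn, mark in pools.values())

print("maximum possible score:", max_score)
print("mean simulated score:", scores.mean())
print("3%-97% interval:", np.percentile(scores, [3, 97]))
```

Each simulated resit combines three sources of variation: posterior uncertainty, the random question selection, and the Bernoulli outcome of each answer. Repeating this per student gives a distribution of scores that can be compared with the mark actually awarded.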

Simon · October 18, 2023, 11:37am

Sorry if I’m missing something, but wouldn’t that defeat the purpose of an IRT analysis? The point of this type of analysis is to provide a heuristic so you can remove the questions that are too difficult or too easy. Let’s say you discard 20; then you can predict the scores on the remaining 100. These should have the same relative difficulty, so you can estimate the score per question for the entire 100 questions, irrespective of which 40 random questions each student receives from those 100. As you mentioned, you could generate N random selections of 40 questions and see whether scores remain relatively constant. You can do that with the posterior distributions rather than adding a function inside the model, as it will already be a bit heavy with 120 questions and hundreds of students (though it’s always possible to optimise the model by reparametrising). Not exactly the same, but maybe what I’ve implemented here may help: GitHub - SimonErnesto/IRT_GRM_Bayesian: Bayesian approach to item response theory (IRT) implementing a graded response model (GRM) on questionnaire data.
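
The check on whether scores stay roughly constant across random selections could be sketched as follows. This version ignores pools and marks and simply counts expected correct answers over N random selections of 40 from 120 questions; the posterior draws are random placeholders and all variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Placeholder posterior draws (NOT fitted results) for one student.
n_samples, n_questions, n_given = 500, 120, 40
ability = rng.normal(0.0, 0.5, size=n_samples)                    # shape (S,)
difficulty = rng.normal(0.0, 1.0, size=(n_samples, n_questions))  # shape (S, Q)

# Posterior-mean probability of a correct answer for each question, shape (Q,).
p_mean = (1.0 / (1.0 + np.exp(-(ability[:, None] - difficulty)))).mean(axis=0)

# N random selections of 40 distinct questions, one selection per row.
n_selections = 1000
selections = np.array(
    [rng.choice(n_questions, size=n_given, replace=False) for _ in range(n_selections)]
)

# Expected number of correct answers for each selection.
expected_correct = p_mean[selections].sum(axis=1)  # shape (N,)

print("mean over selections:", expected_correct.mean())
print("spread (std) over selections:", expected_correct.std())
```

A small spread would suggest that the particular selection matters little for this student; a large spread would indicate that the draw of questions by itself moves the expected score noticeably.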

DrMP · October 18, 2023, 12:27pm

Well, the IRT analysis isn’t being used to evaluate students in an official way. They get a score based on correct answers and on the basis of that they get a pass/fail/grade. I don’t see that changing anytime soon.

The IRT analysis is my attempt to quantify the effect of students getting different sets of questions and so losing or gaining when their marks are put on an absolute scale that implicitly assumes all questions are equal. Using the IRT to estimate ranges of scores is my attempt to present the data in a way that I think will fit more easily into people’s heads.

Thanks for the advice. I think I can see a way forward now.
