Aim
Inference of a set of students’ abilities (measured by ‘probability of correctly answering easy questions’ and ‘probability of correctly answer hard questions’) from their test scores, making allowance for the fact that different tests might be differentially hard and have different numbers of marks available.
Data
A set of students’ tests scores, across a range of tests, along with the maximum available mark in each test. Test scores are given as raw marks (not percentages). Note that occasionally students don’t take a particular test so there are some missing data values.
Modelling
Let i label a particular student and \alpha label a particular test. Let N_\alpha be the maximum number of marks available on test \alpha (which is known).
Let H_\alpha be the number of ‘hard’ marks available, and then E_\alpha = N_\alpha - H_\alpha is the corresponding number of ‘easy’ marks available (we crudely model each test as having some ‘hard’ marks, and some ‘easy’ marks).
Then, define p^i_E as the probability that student i answers an easy mark correctly. Similarly, define p^i_H as the probability that student i answers a hard mark correctly.
Priors
H_\alpha \sim U_\mathbb{Z}[0, N_\alpha]
p^i_E \sim U[0,1]
p^i_H|p^i_E\sim U[0,p^i_E]
Sampling distribution
Let (n^i_\alpha)_E be the number of easy marks obtained by student i on test \alpha, and let (n^i_\alpha)_H be the number of hard marks obtained by student i on test \alpha (both of these quantities are unknown a priori).
We posit the model
(n^i_\alpha)_H \sim Bi( H_\alpha, p^i_H ) and
(n^i_\alpha)_E \sim Bi( E_\alpha, p^i_E ).
Then
n^i_\alpha = (n^i_\alpha)_E + (n^i_\alpha)_H
is the total number of marks obtained by student i on test \alpha. These are the observed test scores i.e. \{n^i_\alpha\} are the data.
Code so far
data = pd.read_csv( ‘test_data.csv’, header = [0,1], index_col = 0 )
S = len(data.index) ## Number of students
T = len(data.columns) ## Number of tests
N = np.array( data.columns.get_level_values(1).values, dtype = int ) ## Max available marks for each test
n = data.to_numpy()
with pm.Model() as basic_model:
H = pm.DiscreteUniform(‘H’, lower = np.zeros(T, dtype = int), upper = N, shape = T )
E = N - H
pE = pm.Uniform(‘pE’, lower = 0., upper = 1., shape=S)
pH = pm.Uniform(‘pH’, lower = 0., upper = pE, shape=S)
nE = pm.Binomial('nE', n=E[None,:], p=pE[:,None], shape = (S, T) )
nH_obs = n - nE ## Observed number of hard questions answered correctly, given nE
nH = pm.Binomial('nH', n=H[None,:], p=pH[:,None], shape = (S, T), observed = nH_obs )
trace = pm.sample(10000, tune = 25000, discard_tuned_samples = True )
The code runs until I run the sampling routines, where it breaks with the follow error message:
Multiprocess sampling (4 chains in 4 jobs)
CompoundStep
CompoundStep
Metropolis: [nE]
Metropolis: [H]
NUTS: [pH, pE]
Sampling 4 chains: 0%| | 0/140000 [00:00<?, ?draws/s]/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/numpy/core/fromnumeric.py:3118: RuntimeWarning: Mean of empty slice.
out=out, **kwargs)
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/numpy/core/fromnumeric.py:3118: RuntimeWarning: Mean of empty slice.
out=out, **kwargs)
Bad initial energy, check any log probabilities that are inf or -inf, nan or very small:
nH -inf
pymc3.parallel_sampling.RemoteTraceback:
“”"
Traceback (most recent call last):
File “/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pymc3/parallel_sampling.py”, line 160, in _start_loop
point, stats = self._compute_point()
File “/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pymc3/parallel_sampling.py”, line 191, in _compute_point
point, stats = self._step_method.step(self._point)
File “/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pymc3/step_methods/compound.py”, line 27, in step
point, state = method.step(point)
File “/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pymc3/step_methods/arraystep.py”, line 247, in step
apoint, stats = self.astep(array)
File “/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pymc3/step_methods/hmc/base_hmc.py”, line 144, in astep
raise SamplingError(“Bad initial energy”)
pymc3.exceptions.SamplingError: Bad initial energy
Any help on resolving this (perhaps my model is ill-specified?) would be most appreciated.
Cheers.