Exam data - inference of student performance

Aim

Inference of a set of students’ abilities (measured by ‘probability of correctly answering easy questions’ and ‘probability of correctly answering hard questions’) from their test scores, making allowance for the fact that different tests might be differentially hard and have different numbers of marks available.

Data

A set of students’ tests scores, across a range of tests, along with the maximum available mark in each test. Test scores are given as raw marks (not percentages). Note that occasionally students don’t take a particular test so there are some missing data values.

Modelling

Let i label a particular student and \alpha label a particular test. Let N_\alpha be the maximum number of marks available on test \alpha (which is known).

Let H_\alpha be the number of ‘hard’ marks available, and then E_\alpha = N_\alpha - H_\alpha is the corresponding number of ‘easy’ marks available (we crudely model each test as having some ‘hard’ marks, and some ‘easy’ marks).

Then, define p^i_E as the probability that student i answers an easy mark correctly. Similarly, define p^i_H as the probability that student i answers a hard mark correctly.

Priors

H_\alpha \sim U_\mathbb{Z}[0, N_\alpha]
p^i_E \sim U[0,1]
p^i_H|p^i_E\sim U[0,p^i_E]

Sampling distribution

Let (n^i_\alpha)_E be the number of easy marks obtained by student i on test \alpha, and let (n^i_\alpha)_H be the number of hard marks obtained by student i on test \alpha (both of these quantities are unknown a priori).

We posit the model

(n^i_\alpha)_H \sim Bi( H_\alpha, p^i_H ) and
(n^i_\alpha)_E \sim Bi( E_\alpha, p^i_E ).

Then

n^i_\alpha = (n^i_\alpha)_E + (n^i_\alpha)_H

is the total number of marks obtained by student i on test \alpha. These are the observed test scores i.e. \{n^i_\alpha\} are the data.
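The generative model above can be sanity-checked with a quick forward simulation. This is a sketch only, with toy sizes and made-up parameters (nothing here comes from the actual test data):

```python
import numpy as np

rng = np.random.default_rng(0)
S, T = 3, 2                        # toy number of students, tests
N = np.array([20, 30])             # max marks available on each test

H = rng.integers(0, N + 1)         # hard marks per test: U_Z[0, N_alpha]
E = N - H                          # easy marks per test
pE = rng.uniform(0, 1, size=S)     # easy-mark probability per student
pH = rng.uniform(0, pE)            # hard-mark probability, constrained pH <= pE

nE = rng.binomial(E[None, :], pE[:, None])  # easy marks obtained, shape (S, T)
nH = rng.binomial(H[None, :], pH[:, None])  # hard marks obtained, shape (S, T)
n = nE + nH                        # observed total scores
```

Simulating like this and checking that `n` never exceeds `N` is a cheap way to confirm the broadcasting in the binomial draws matches the intended (student, test) layout.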

Code so far

import numpy as np
import pandas as pd
import pymc3 as pm

data = pd.read_csv('test_data.csv', header=[0, 1], index_col=0)
S = len(data.index)    ## Number of students
T = len(data.columns)  ## Number of tests
N = np.array(data.columns.get_level_values(1).values, dtype=int)  ## Max available marks for each test
n = data.to_numpy()

with pm.Model() as basic_model:
    H = pm.DiscreteUniform('H', lower=np.zeros(T, dtype=int), upper=N, shape=T)
    E = N - H
    pE = pm.Uniform('pE', lower=0., upper=1., shape=S)
    pH = pm.Uniform('pH', lower=0., upper=pE, shape=S)

    nE = pm.Binomial('nE', n=E[None, :], p=pE[:, None], shape=(S, T))

    nH_obs = n - nE  ## Observed number of hard marks obtained, given nE

    nH = pm.Binomial('nH', n=H[None, :], p=pH[:, None], shape=(S, T), observed=nH_obs)

    trace = pm.sample(10000, tune=25000, discard_tuned_samples=True)

The code runs until I run the sampling routine, where it breaks with the following error message:

Multiprocess sampling (4 chains in 4 jobs)
CompoundStep

CompoundStep

Metropolis: [nE]
Metropolis: [H]
NUTS: [pH, pE]
Sampling 4 chains: 0%| | 0/140000 [00:00<?, ?draws/s]/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/numpy/core/fromnumeric.py:3118: RuntimeWarning: Mean of empty slice.
out=out, **kwargs)
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/numpy/core/fromnumeric.py:3118: RuntimeWarning: Mean of empty slice.
out=out, **kwargs)

Bad initial energy, check any log probabilities that are inf or -inf, nan or very small:
nH -inf
pymc3.parallel_sampling.RemoteTraceback:
"""
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pymc3/parallel_sampling.py", line 160, in _start_loop
point, stats = self._compute_point()
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pymc3/parallel_sampling.py", line 191, in _compute_point
point, stats = self._step_method.step(self._point)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pymc3/step_methods/compound.py", line 27, in step
point, state = method.step(point)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pymc3/step_methods/arraystep.py", line 247, in step
apoint, stats = self.astep(array)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pymc3/step_methods/hmc/base_hmc.py", line 144, in astep
raise SamplingError("Bad initial energy")
pymc3.exceptions.SamplingError: Bad initial energy

Any help on resolving this (perhaps my model is ill-specified?) would be most appreciated.

Cheers.

Hi,

  • Bad initial energy associated with Mean of empty slice usually signals missing values somewhere. You mentioned that some tests do have missing values, and indeed the error message mentions nH, so I’d bet there are NaNs in nH_obs. PyMC3 handles and infers them automatically, but you have to wrap them into a numpy masked array – see this NB, Code 14.7, for an example.

  • For more details on this common error message, you can take a look at the FAQ.

  • Finally, I think I’d use more informative priors – Uniforms are usually not a good choice.
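The masked-array wrapping mentioned in the first bullet can be sketched with numpy alone. The scores below are made up for illustration:

```python
import numpy as np

# Hypothetical raw scores: two students, three tests, two entries missing (NaN)
raw = np.array([[12., 7., np.nan],
                [9., np.nan, 15.]])

# Mask the missing entries; passing a masked array as `observed`
# makes PyMC3 create imputation RVs for the masked cells automatically
masked = np.ma.masked_invalid(raw)

print(masked.mask)          # True where data are missing
print(masked.compressed())  # only the actually observed values
```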

Hope this helps :vulcan_salute:

Super - thanks. Your post made me realise that I wasn’t passing a test_value to the sampler. This meant it was starting in a region of zero probability (logp = -inf), so the sampler breaks, naturally.

This helpful post mentioned using a test_value for a binomial distribution.

So the code works now, provided there are no missing values in the data. However, if I make the observed data n a masked array, then the code breaks:

File "main.py", line 43, in <module>
nH_obs = n - nE  ## Observed number of hard questions answered correctly, given nE
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/theano/tensor/var.py", line 230, in __rsub__
return theano.tensor.basic.sub(other, self)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/theano/gof/op.py", line 615, in __call__
node = self.make_node(*inputs, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/theano/tensor/elemwise.py", line 480, in make_node
inputs = list(map(as_tensor_variable, inputs))
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/theano/tensor/basic.py", line 194, in as_tensor_variable
return constant(x, name=name, ndim=ndim)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/theano/tensor/basic.py", line 232, in constant
x_ = scal.convert(x, dtype=dtype)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/theano/scalar/basic.py", line 284, in convert
assert type(x_) in [np.ndarray, np.memmap]
AssertionError

It looks like I am not allowed to subtract a pymc3.model.FreeRV object from a masked numpy array (but a normal ndarray is fine); do you know if there is a reason for this?

Here’s the full code:

data = pd.read_csv('test_data.csv', header=[0, 1], index_col=0)
data.fillna(-1, inplace=True)
data = data.astype('int32')
print(data)
S = len(data.index)    ## Number of students
T = len(data.columns)  ## Number of tests
N = np.array(data.columns.get_level_values(1).values, dtype=int)  ## Max available marks for each test
n = data.to_numpy()
n = np.ma.masked_equal(n, value=-1)

with pm.Model() as model:

    H = pm.DiscreteUniform('H', lower = np.zeros(T, dtype = int), upper = N, shape = T, testval = N//2 )
    E = N - H
    pE = pm.Uniform('pE', lower = 0., upper = 1., shape=S)
    pH = pm.Uniform('pH', lower = 0., upper = pE, shape=S)
    nE = pm.Binomial('nE', n=E[None,:], p=pE[:,None], shape = (S, T), testval = n//2 )
    nH_obs = n - nE  ## Observed number of hard questions answered correctly, given nE
    nH = pm.Binomial('nH', n=H[None,:], p=pH[:,None], shape = (S, T), observed = nH_obs )
    trace = pm.sample(5000, tune = 10000, discard_tuned_samples = True )

Many thanks for taking the time to look at this.


It looks like you’re inferring the difficulty of the questions at the same time as you’re inferring the probability of answering the questions. But that way, it seems like you don’t have any observed variable: nE and nH_obs are declared as random variables in your model.

Don’t you observe these variables in your data, though? I mean, do you know whether a question is classified as easy or hard before inference?

Yes, I was hoping to use the model to infer the number of difficult/easy questions on each test from the data, rather than having to impose this manually / a priori.

The way I had initially coded this up, I had

nE = pm.Binomial('nE', n=E[None, :], p=pE[:, None], shape=(S, T))
nH = pm.Binomial('nH', n=H[None, :], p=pH[:, None], shape=(S, T))

and then

n = pm.Deterministic('n', nE + nH, observed=n)

since n are the actual observed test scores, but then I remembered that in pymc3 you can’t set a deterministic rv to have observed values, so I tried to get around it (a hack!?) using the method shown in my code above (also here below):

nE = pm.Binomial('nE', n=E[None, :], p=pE[:, None], shape=(S, T))
nH_obs = n - nE  ## Observed number of hard questions answered correctly, given nE
nH = pm.Binomial('nH', n=H[None, :], p=pH[:, None], shape=(S, T), observed=nH_obs)

I don’t know if this is a sensible way to specify the model in pymc3. Perhaps there is a better approach that avoids creating these fake ‘observations’ nH_obs? My nH_obs is ‘observed’ in the sense that it incorporates the actual observed data n, but it is also a random variable, since nE is itself a random variable. Perhaps I’m thinking about this all wrong!

Nevertheless, as mentioned, the line

nH_obs = n - nE

breaks the code if the student test data n is passed as a masked numpy array, but works if it’s a standard numpy array. So at the minute the code can’t handle missing values in the data…

Really appreciate your time/thoughts on this - thanks.

Interesting…
And you do the same for E and H. So the model tries to infer the number and proportion of easy and hard marks (E, nE, pE, H, nH, pH) from just the total scores n. I’m wondering if this will be identifiable :thinking:
My advice would be: if you know the number of difficult/easy questions on each test, use this information instead of trying to infer it.

That being said, maybe @junpenglao will be able to answer the big-picture questions:

  1. Is it possible to subtract an RV from a masked numpy array?
  2. Why isn’t it possible to set a Deterministic to have observed values?

In general, a latent discrete variable is better modelled as a continuous mixture (i.e., marginalize the discrete variable out); that way you can get rid of the mask and the indexing. Some pointers on converting a discrete latent variable into a mixture can be found in: Frequently Asked Questions
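As an illustration of that marginalization, given H the latent split (nE, nH) can be summed out, so the likelihood of a total score n is just the convolution of the two binomial pmfs. A minimal numpy sketch with toy numbers (not the thread’s actual data; one could go further and also sum over H weighted by its prior):

```python
import math
import numpy as np

def binom_pmf(k, n, p):
    # Binomial pmf computed directly from the definition
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def score_pmf(N, H, pE, pH):
    """P(total score) when nE ~ Bi(N - H, pE) and nH ~ Bi(H, pH):
    the convolution of the two binomial pmfs, i.e. nE marginalized out."""
    E = N - H
    pe = np.array([binom_pmf(k, E, pE) for k in range(E + 1)])
    ph = np.array([binom_pmf(k, H, pH) for k in range(H + 1)])
    return np.convolve(pe, ph)  # length N + 1, one entry per possible score

pmf = score_pmf(N=10, H=4, pE=0.9, pH=0.4)
```

Because each factor pmf sums to one, the convolved pmf does too, which is a quick correctness check on the marginal likelihood.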

Not out of the box. Note that PyMC3 treats a masked numpy array in the observed by indexing into the non-masked values to compute the log_prob, and indexing into the masked values to create new random variables. You can in principle do the same yourself: index into the two sets of positions and do two separate subtractions.
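That indexing split could look something like this (a numpy-only sketch using toy scores; the PyMC3 parts are only indicated in comments):

```python
import numpy as np

# Toy scores with -1 marking missing entries, as in the code above
n = np.ma.masked_equal(np.array([[5, -1],
                                 [3,  4]]), value=-1)

obs_idx = np.nonzero(~n.mask)  # cells where a score was recorded
mis_idx = np.nonzero(n.mask)   # cells with missing data

n_observed = n.data[obs_idx]   # plain ndarray, safe to subtract an RV from
# In the model one would then write, e.g.:
#   nH_obs = n_observed - nE[obs_idx]   (the observed part)
# and declare a separate unobserved Binomial for the cells in mis_idx.
```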

A Deterministic does not have a log density you can evaluate.


Super clear, thanks Junpeng :wink: