Mixture of Uniform with possible contaminants

Hi PyMC3 community,
I’ve been struggling with this model for a while now. A Jupyter notebook is here and the data are here: markers.csv (8.0 MB)

I am trying to infer the true spindle start (true_spindle_starts) and end (not modeled in the attached notebook) of up to 5 ‘sleep spindles’ in a 25-second epoch of EEG data.

I have a number of raters with varying, unknown expertise (rater_expertise) who, with some noise, mark the start (and end, not analyzed here) points of what they think are true spindles (marker_starts). Sometimes they mark noise as a spindle (i.e. that mark is a contaminant). For each mark, they also give their confidence (conf) that the spindle is real [low=0.1, med=0.5, high=0.99].

See the picture below. The first plot shows the raters’ spindle marks (marker_starts are the leading edges). Colors indicate confidence; dashed lines are contaminants. The second plot is the EEG, with red marking the true spindle (true_spindle_starts is the leading edge):

I have created a rather complicated model in which each rater’s spindle start mark (marker_start) is drawn either a) from a Normal distribution centered on one of the up to 5 possible true spindle starts (true_spindle_starts), with sd=rater_expertise, or b) from a Uniform distribution across the whole 0-25 second epoch. A Bernoulli variable marker_is_from_true_spindle controls whether a rater’s marker is real or a contaminant, where p(marker_is_from_true_spindle=1) depends on conf. A categorical variable mapping_from_marker_to_true_spindle controls the mapping between each true spindle start (true_spindle_starts) and each marker start (marker_start).
The code will hopefully make this clearer. mapping_from_marker_to_true_spindle is bounded between 0 (no spindles in an epoch) and number_of_true_spindles, the number of true spindles in an epoch. See the code for more details.
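
To make the generative story concrete, here is a minimal simulation sketch in plain Python (not the notebook’s PyMC3 code). It assumes, purely for illustration, that p(marker is from a true spindle) equals conf itself and that the mapping to a true spindle is uniform over the spindles present; the function name simulate_markers and all numeric values are made up.

```python
import random

EPOCH_SECONDS = 25.0  # epoch length stated in the post


def simulate_markers(true_spindle_starts, rater_expertise, confs, rng):
    """Simulate one marker start per (rater, confidence) pair.

    Assumptions for this sketch only: P(marker is from a true spindle)
    is taken to be the confidence itself, and the mapping to a true
    spindle is uniform over the spindles present.
    """
    markers = []
    for rater, sd in enumerate(rater_expertise):
        for conf in confs:
            # Bernoulli indicator: real spindle mark vs. contaminant
            from_true = len(true_spindle_starts) > 0 and rng.random() < conf
            if from_true:
                # Categorical mapping from this marker to a true spindle
                mapping = rng.randrange(len(true_spindle_starts))
                # Normal noise around the true start, sd = rater expertise
                start = rng.gauss(true_spindle_starts[mapping], sd)
            else:
                mapping = None
                # Contaminant: uniform over the whole epoch
                start = rng.uniform(0.0, EPOCH_SECONDS)
            markers.append(
                {
                    "rater": rater,
                    "conf": conf,
                    "marker_is_from_true_spindle": from_true,
                    "mapping_from_marker_to_true_spindle": mapping,
                    "marker_start": start,
                }
            )
    return markers


# Hypothetical values, chosen only to exercise the function
rng = random.Random(0)
markers = simulate_markers(
    true_spindle_starts=[4.0, 12.5, 20.0],
    rater_expertise=[0.1, 0.3, 0.8],
    confs=[0.1, 0.5, 0.99],
    rng=rng,
)
```

Simulating from the model like this gives data with known true_spindle_starts, which can be a quick way to check whether the inference recovers them.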

To get the locations of the real spindles (true_spindle_starts), I run the model 3 times. First, I fit for number_of_true_spindles. I then take the mode of number_of_true_spindles, set that as observed, and run again to find marker_is_from_true_spindle and mapping_from_marker_to_true_spindle. Finally, I run one last time to get true_spindle_starts.
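
The “take the mode” step between passes can be sketched as follows. This is only an illustration: posterior_mode is a hypothetical helper, and the draws are made-up numbers standing in for posterior samples of number_of_true_spindles, not output from the notebook.

```python
from collections import Counter


def posterior_mode(draws):
    """Return the most frequent value among discrete posterior draws.

    Ties are broken in favor of the value encountered first.
    """
    counts = Counter(draws)
    return counts.most_common(1)[0][0]


# Made-up draws standing in for samples of number_of_true_spindles
draws = [2, 3, 3, 1, 3, 2, 3, 4]
number_of_true_spindles_hat = posterior_mode(draws)
```

The resulting value would then be fixed (treated as observed) in the next pass.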

My problems are:

  • A crazy amount of divergences…
  • Gelman-Rubin statistics greater than 1.4
  • Clearly incorrect spindle locations being inferred
  • Z often is stuck at 1 for all chains.
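
For context on the Gelman-Rubin numbers, here is a small sketch of the classic (non-split) statistic for a single scalar parameter, written in plain Python rather than with the sampler’s own diagnostics. The function name gelman_rubin and the toy chains are illustrative assumptions.

```python
import math
import statistics


def gelman_rubin(chains):
    """Classic (non-split) potential scale reduction factor.

    `chains` is a list of equal-length lists of draws of one scalar.
    """
    n = len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    # W: average within-chain sample variance
    within = statistics.fmean([statistics.variance(c) for c in chains])
    # B: n times the sample variance of the chain means
    between = n * statistics.variance(chain_means)
    # Pooled estimate of the posterior variance
    pooled = (n - 1) / n * within + between / n
    return math.sqrt(pooled / within)


# Toy chains: two chains sitting in different regions vs. two that agree
stuck_chains = [[0.0, 1.0, 0.0, 1.0], [10.0, 11.0, 10.0, 11.0]]
mixed_chains = [[0.0, 1.0, 0.0, 1.0], [1.0, 0.0, 1.0, 0.0]]
r_stuck = gelman_rubin(stuck_chains)
r_mixed = gelman_rubin(mixed_chains)
```

Values well above 1 (as with the stuck chains here) mean the chains disagree about where the posterior mass is, which would be consistent with chains settling on different mappings or spindle locations.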

Thanks in advance to anyone who has the time to help me out. Any and all comments appreciated!

P.S. Discourse won’t allow uploading .ipynb files, so I had to link to a GitHub repo. It would be nice to be able to drag and drop Python notebooks here.

Could you change your model to something… simple? I have to admit that I don’t understand your model, even when the model can be shown like this

In fact, could you make your model with fewer variables? I am tempted to say less data but I see that you are using a small part of markers.csv.

Thanks for having a go at my model; I realize it’s not the simplest thing to interpret.
For the purpose of debugging, I’ve cut it down to just the true spindle starts (true_spindle_starts and marker_start) and updated the original question, code and variable names to be a little more descriptive.
I’ve also fit just a single epoch of data.

Thanks for trying. I’m going to reparameterize this, and we’ll see if that helps. I’ll repost the solution if it works.