Censored time series

Hi all,

I am analyzing the activity of neurons in a mouse, recorded using a technique called calcium imaging. Like most neuroscientists, I am interested in what these neurons encode or are “tuned” for. I have a range of candidate variables that are either binary (e.g. poking/not poking a hole) or continuous (e.g. running speed).

I have attempted to build a GLM to estimate regression coefficients for these variables. Two things make this a bit complicated: (1) the experimental technique only allows us to see the signal when it is positive, and (2) we are measuring a time series, so the samples are dependent in time. Here’s a representative example:
[image: representative time series from a single neuron]

My current attempt is to ignore the temporal dependencies and use a Tobit (censored-Normal) likelihood in a GLM:

import pymc3 as pm
import theano.tensor as tt


class Tobit(pm.Normal):
    '''My implementation of the Tobit model (see https://en.wikipedia.org/wiki/Tobit_model):
    the latent signal is Normal, but we only observe it when it is positive.'''

    def random(self, point=None, size=None):
        # Draw from the latent Normal and censor everything below zero.
        samples = pm.Normal.random(self, point=point, size=size)
        return samples * (samples > 0)

    def logp(self, value):
        # Uncensored observations contribute the Normal density;
        # censored (zero) observations contribute log P(latent <= 0).
        lp = tt.switch(value > 0,
                       pm.Normal.logp(self, value),
                       pm.distributions.dist_math.normal_lcdf(self.mu, self.sd, 0))
        return lp
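
For context, I plug it into the GLM roughly like this (X and y are placeholders for my design matrix and a single neuron’s fluorescence trace):

# X: (n_timepoints, n_predictors) design matrix, y: observed trace (placeholder names).
with pm.Model() as model:
    beta = pm.Normal('beta', mu=0, sd=1, shape=X.shape[1])
    intercept = pm.Normal('intercept', mu=0, sd=1)
    sigma = pm.HalfNormal('sigma', sd=1)
    mu = intercept + tt.dot(X, beta)
    obs = Tobit('obs', mu=mu, sd=sigma, observed=y)
    trace = pm.sample()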

This gives me sensible coefficients eventually, but the geometry of using this seems to be giving the NUTS sampler a hard time: most chains slow down from ~100 samples/s to less than 1 sample/s after a few hundred samples, making the run time very long. I briefly tested the same model in Stan, with very similar run times. I guess this means I should try to re-parametrize my problem or model the censored values in some other way. I have played around with different ideas (like having the means come from an AR(1) time series, or using a zero-inflated gamma model) for a bit, but the sampler always gets stuck/slow or returns nonsense.
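
For concreteness, the AR(1)-mean idea looks roughly like this (just a sketch of the direction I tried, with placeholder names and priors):

# Sketch: the mean is the regression term plus a latent AR(1) drift,
# and the Tobit likelihood still censors everything below zero.
with pm.Model() as ar_model:
    beta = pm.Normal('beta', mu=0, sd=1, shape=X.shape[1])
    k = pm.Uniform('k', lower=-1, upper=1)        # AR(1) coefficient
    tau = pm.Gamma('tau', alpha=2, beta=1)        # innovation precision
    drift = pm.AR1('drift', k=k, tau_e=tau, shape=len(y))
    sigma = pm.HalfNormal('sigma', sd=1)
    mu = tt.dot(X, beta) + drift
    obs = Tobit('obs', mu=mu, sd=sigma, observed=y)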

Does anyone happen to have any experience in modeling censored time series, or have any clever ideas on how to model them? As a side note, I have ~70000 time points from ~4000 neurons (the ~100 samples/s figure above was for fitting a single neuron’s time series), so I am very willing to sacrifice statistical rigor for performance in this case.

Thanks!

LOL, don’t let your reviewer see that :wink:

The model seems fine, and if there is no divergence I would stick with that for now. I would also try modeling all neurons together (more information usually helps convergence).

Also, how large is your predictor matrix in the GLM? Have you tried some preconditioning (e.g., PCA) on it first?

Thanks a lot for your reply!

Hopefully the reviewers are not hanging out here :wink: . Although I’ve seen multiple papers in high-ranked journals using OLS on this type of data without commenting on the distribution at all, so maybe they should :slight_smile:

The problem with sticking to the current model is that fitting one neuron now takes ~15 minutes, so 4000 neurons would be a challenge. But your suggestion of fitting all neurons together (or rather the groups of ~500 from each animal) sounds very interesting. If convergence were good, would it be reasonable to try to fit a 70000x20 predictor matrix to a 70000x500 observed matrix in pymc3? Could it be that the run time scales better than linearly with multiple observations?
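
To make the question concrete, the shape-based formulation I have in mind would be roughly this (a sketch; X is the shared 70000x20 design matrix and Y the 70000x500 matrix of traces from one animal, both placeholder names):

# One coefficient vector per neuron, all sharing the same design matrix.
# X: (n_timepoints, n_predictors), Y: (n_timepoints, n_neurons).
with pm.Model() as joint_model:
    beta = pm.Normal('beta', mu=0, sd=1, shape=(X.shape[1], Y.shape[1]))
    sigma = pm.HalfNormal('sigma', sd=1, shape=Y.shape[1])
    mu = tt.dot(X, beta)                    # (n_timepoints, n_neurons)
    obs = Tobit('obs', mu=mu, sd=sigma, observed=Y)
    trace = pm.sample()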

Most binary variables (but not all) in the predictor matrix (currently ~20 variables, but I’d like to include some more) are mutually exclusive, so I don’t believe PCA would help much.

More data and better priors do wonders - they make your model run faster.

I agree that with binary variables PCA probably won’t help; however, you might still benefit from a change of parameterization (e.g., instead of coding them as 0 and 1, code them as -1 and 1).
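
For example, something as simple as this (binary_cols is a placeholder for the indices of the binary columns in the design matrix):

# Recode binary predictors from {0, 1} to {-1, 1} before fitting.
X[:, binary_cols] = 2 * X[:, binary_cols] - 1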

Thanks again for the hints. I’ve played around a bit with different priors, and changed the Normal distribution above to a Cauchy to guard against outliers.
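
Concretely, the Cauchy version of the censored likelihood looks roughly like this (a sketch; I write the CDF at zero analytically rather than relying on erfcx-based code):

import numpy as np

class CensoredCauchy(pm.Cauchy):
    '''Cauchy analogue of the Tobit class above: positive values contribute the
    Cauchy density, zeros contribute log P(latent <= 0) via the analytic CDF.'''
    def logp(self, value):
        # Cauchy CDF at 0 is 0.5 + arctan((0 - alpha) / beta) / pi
        log_cdf_at_zero = tt.log(0.5 + tt.arctan(-self.alpha / self.beta) / np.pi)
        return tt.switch(value > 0,
                         pm.Cauchy.logp(self, value),
                         log_cdf_at_zero)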

However, I still feel the run time is suspiciously slow, so I’ve looked into it some more. In a profiler I noticed that the erfcx function eats 60-70% of the run time when running on a GPU (which seems consistent with the Theano documentation saying there is no C code generated for erfcx). That shouldn’t be a problem for me now that I’m using Cauchy anyway, but it might still be useful for others to know.

A bigger mystery is the following stats I get from the traces:
[image: table of per-chain sampler statistics (step size, tree size)]
How could it be that the step size for chain 2 is 2-3 orders of magnitude smaller than the others? This seems to cause the tree size to be much bigger, which takes longer to run. I guess I could use more tuning steps, but wouldn’t the tuning also take a lot of time to run if the step size is way off for some reason?
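
(For reference, the sampler call I would be adjusting is roughly this; the exact values are just guesses:)

# Longer tuning so the per-chain step-size adaptation has more time to settle;
# target_accept left moderate, since pushing it higher shrinks the step size further.
# (In older PyMC3 versions, target_accept goes in nuts_kwargs instead.)
trace = pm.sample(draws=1000, tune=2000, target_accept=0.9, chains=4)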

I also have a small question about the sign convention for the log likelihoods. Currently the model_logp is positive and the energy is negative. Is that what’s expected or have I made an error somewhere?

I have calculated the model_logp in some directions around the posterior mean and it looks smooth and fairly uncorrelated, so I don’t believe there are any tricky funnels messing things up.

What you see is usually an indication that there are some pathological areas in your parameter space, usually areas with high curvature. The sampler adapted to the curvature in that area, which results in a small step size and lots of steps - random-walk-like behavior.

Did you try without the censoring? Maybe the problem is that there are too many values being censored?
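
i.e., as a quick check, something like a plain Normal likelihood on the same design matrix (a sketch, with placeholder names):

# Same linear predictor but a plain Normal likelihood, ignoring the censoring,
# just to see whether the sampler still slows down without the censored branch.
with pm.Model() as uncensored_model:
    beta = pm.Normal('beta', mu=0, sd=1, shape=X.shape[1])
    sigma = pm.HalfNormal('sigma', sd=1)
    mu = tt.dot(X, beta)
    obs = pm.Normal('obs', mu=mu, sd=sigma, observed=y)
    trace = pm.sample()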