I have birthing data on a particular type of whale. I’m trying to figure out how many years it takes to get to the next birth event, given that it just gave birth. Every year, researchers go out at a particular site in the ocean to observe whales. In any year, whales are spotted with a calf, indicating they gave birth. They could also be spotted to be alone, indicating they did not give birth. However, sometimes whales aren’t spotted for whatever reason (e.g. moved to a different ocean, which might still mean they could have given birth, etc.). It’s been proposed that the birthing interval be thought of as binomially distributed. However, I’m not sure how to take the censoring into account via PyMC3. Any thoughts on how to do this?
Here’s my thought process to get a frequentist MLE estimate through optimization.
Find all the whales that we have observed to have given birth.
Produce data given some proposed parameters. For an individual, we start out at the earliest birth event, then we keep generating birth events until we reach the end of the window of observation. We do this for all the individuals.
For the simulated data, we only look at the times that the whales were actually observed (so we discard parts of the simulated data where whales weren’t actually observed).
Find the error between the simulated data and the observed data. Currently, I’m using the mean squared error of the number of times births were observed in the span of time compared to the actual.
Do 4 a bunch of times to get an aggregate measure of fit given the specified parameters.
Do a brute-force search (repeating steps 2-5) through a space of possibilities to find parameters that fit best with the given data.
The nice thing about the above approach is that sample data from the best-fit distribution seems to align closely with the actual data. However, the problem with this approach is that it only gives me a point estimate; I would ideally want some sort of credible interval because we would like to compare distributions across time (i.e. are whales at this certain area reproducing less often compared to before?).
Actually, what you describe is a process very similar to approximate Bayesian computation. And you can get a posterior distribution from it actually: in your step 5, if you put a threshold of how much mean squared error is acceptable, you can accept or reject that propose parameters. Repeat that many (e.g., 100000) times you have your posterior distribution. An important modification is that in step 6 you dont do brute-force search, but generate sample from the prior.
But I think there should be ways to do formulate your problem into a Bayesian model that you can directly sampled from. How about model the birth event as a temporal point process:
birth rate ~ Gamma
Latent birth event for whale i = [0, 0, 0, …, .1, .25, .65, 1, .65, .25, .1, 0, 0, …] with the interval determined by the Gamma distributed birth rate
Observed event ~ Bernoulli(Latent birth event for whale i[actually year that has observation])
Maybe @fonnesbeck knows some ecological model that typically deal with this.
thank you for the detailed explanation. For me this sounds conceptually similar to a survival analysis (data can be converted to event lists; “survival rate” would be “birth rate”), for which there are some examples (by @AustinRochford, which were very useful for me for a seminar once):
There exists a variant with time varying effects, which would be ideal to capture the temporal aspect of your study.
And you would have to include occupancy (@junpenglao, is that what you meant by “latent birth event”?), and of course hierarchy level to model measurements from the same individuals or areas.
but could that not be covered by grouping the observations accordingly?
The example I used to calculate with it was insects in different compartments, with conditions varying between compartments. And just like insects could repeatedly die for a single compartment, a whale can give birth multiple times.
Anyways, it was just an idea (maybe a bad one), but I think it’s still worth checking structure and priors of these models.
Edit: another theoretical framework that just came to my mind is “inter-birth interval”, as in “inter-spike interval” ( wikipedia has some references). I think this can have advantages over rates in some cases, but it might collide with the censoring here.
Thanks for the response! Just for the sake of clarity with the ABC approach, let’s assume that we’re using a binomial distribution with parameters n and p. In the proposed version of step 5, we’ll have tons of accepted values for n and p. Let’s say that I have 100k possible pairs of parameters. So in step 6, I will just iterate through those 100k pairs and calculate the average calving interval for each pair (i.e. np since that’s the mean of binomial distribution). So at the end, I will have 100k samples of possible values for the average calving interval (i.e. posterior distribution). Does that sound right?
Then for the Bayesian model that you’re proposing, what does each value in Latent birth event for whale i mean?
Nope, you will compare the simulation result (n and p in this case) with a distance function you have in mind. And only accept the cases where the distance is small. So you will likely ends up with samples much less than 100k.
And then you can use the accepted sample to compute your HPD interval.
Thanks for the link to the general introduction. I read the section up to rejection ABC algorithm and it makes much more sense now why it’s an approximation of the posterior distribution, and how that algorithm is close to what I was trying to do. I’ll use that for now and see what results I get. Thanks a lot!
Glad I can help. Keep me posted of what solution you come up with - we (pymc_devs) also in the process of implementing ABC as one of the inference method (work mostly done by @agustinaarroyuelo and @aloctavodia), so it would be interesting to compare your result with what can get from PyMC3 soon.
Also, you might want to have a look at https://github.com/elfi-dev/elfi. I have never actually use this library, but the developers are from Aki Vehtari’s group, and they do really solid work.
Sorry to be late to respond here. This does sound like survival analysis to me, since you are modeling time-to-event, where an observation process would have to be accounted for. it doesn’t matter that you can have multiple births per individual, as they happen consecutively (never in the same year). You’d be modeling an underlying “survival” process using imperfect observation data.
For what it’s worth, here is the model i used to prepare for a course.
It was supposed to be done in R, via a linear regression. I found that dissatisfying, thus turned to the excellent example by Austin Rochford.
The code is not well cleaned (I did not care to much, sorry).
data: termites.txt (2.3 KB)
Model: TermitesExport.py (8.3 KB)
The biggest part is overhead for data organization, which, in retrospective, could maybe have been more elegant. I suggest you start from the model and check the structure of the data from there.
The model misses occupancy, and the data would not be sufficient for it. But that should be simple: just add a \delta on the observations in the model, analogous to the explanation here.
Hope this helps! Certainly @fonnesbeck has more sophisticated examples.
I’m breaking down the problem into smaller parts. For now, I’m gonna model each year separately. I have 115 individuals but I’m only using 25 of them for now to iterate the model quickly. Later, I’m gonna use all the individuals once I’m sure my model specification is sound. However, I’m getting the following error:
ERROR:pymc3:The estimated number of effective samples is smaller than 200 for some parameters.
Here’s the model specification that I came up with. Don’t mind the age variable for now. It could technically be negative (i.e. whale wasn’t born yet, and will be alive in the future), but I didn’t include that for now for the sake of simplicity.