Dealing with large unaffected cohort in parametric survival analysis

Hi there. I’m looking for some advice in better specifying my model, since it’s fitting to the data correctly, but not in the way that I want.

I have a project fitting multiple survival models to epidemiological data, with the hope of arbitrating between a number of hypotheses of disease causation, each of which implies one of the generative models I’m trying to fit. The models are a combination of one of a set of five hazard distributions, and one of four possibilities for susceptibility: uniform susceptibility, a proportional hazards frailty model with one of two different frailty distributions, and a mixture-cure model where only a subset of the population are susceptible to the condition.

My dataset is the age at onset of the disease for about 5000 people affected by it, and the disease has a lifetime prevalence of about 1%, which implies an unaffected population of about 499 000 for the sample population. My intention in fitting the models is to fit both the shape of the age at onset distribution, and the true lifetime prevalence to select the model that best describes the combination of both. Currently, the general form of my model is this:

  1. Make synthetic dataset by adding dummies for the unaffected population whose ages at onset are marked as right-censored at the oldest observed age.
  2. From the dataset, define the arrays d_t, the number with disease onset in that year; and n_t, the number still at risk that year.
  3. Define survival probability at time t (S_t), and time t+1 (S_t1) based on hazard and susceptibility for this model.
  4. Define p_event_t= 1 - S_t1 / S_t.
  5. Define likelihood distribution, pm.Binomial(‘like’, n=n_t, p=p_t, observed=d_t

I hoped that this would get the model to fit both the shape and the size of the censored population. In looking at the posterior predictive for the models though, they all predict the lifetime prevalence well, but many look nothing like the shape of the incidence distribution for affected people, even though the models are capable of producing that shape. I assume that this is the enormous number of censored observations (99%) dominating the data of the affected people. In model comparison, the models all come out as very similar too, despite the fact that some reflect the shape of the incidence distribution well and others don’t, again, I assume that this is because they all fit the number of unaffected people well.

I’m at a loss as to how I can get fitting the shape and the censored number for this model to have similar weight in fitting the model, and therefore in model comparison. Is anybody able to suggest a method that would better reflect my aim please?

Hey There!

Is there a reason why you are trying to motivate a Binomial likelihood? It’s typical to use a Poisson process to derive probability estimates for a given observation interval ie in cox regression; because the poission likelihood and the censored partial likelihood are equivalent to a constant The Cox Proportional Hazards Model in Glum — glum documentation . I haven’t seen any binomial approximations specifically-off the top of my mind this would require some extra steps that might not be the most fun and make it a bit harder to debug

You could also use a Bernoulli approximation-but note that this is also a trick-if you were to model the probability for survival of each individual in a given interval for all observed intervals, then despite non independence the likelihood is also proportional to a Bernoulli model(just think of creating an augmented dataset where we create a design matrix for all time and individuals up to censoring/event). This would be appropriate in the non parametric context (perhaps you are using something like…BART or splines or whatever). Mculloch has a nice paper applying Bayesian boosting-and it relies on this trick (with some cool application of probit regression to generate individual level survival probabilities that form the likelihood in question)

Because you’re asking causal questions; do you have a DAG of covariates?

Thanks so much for your response. Survival analysis is fairly well outside my usual lane of experimental psychology/computational psychiatry, so the only reason I’ve been trying to make the binomial likelihood work is ignorance. The Poisson looks like the answer to my problem.

With regard to the DAG, I don’t have any covariate data, so am only fitting the ages of onset, which I have from an epidemiological study. The DAG would therefore just be the combination of a person’s susceptibility and their hazard at a given time as contributors to the outcome of developing the condition. Each of the hazard distributions I’m using summarizes a different possible process to cause the disease which is sometimes assumed or discussed: exponential for a single random event, lognormal for a single random event which is more probable in mid-life, gamma for a discrete set of required events, and inverse Gaussian for a ‘disease-proneness’ which can wax and wane as the sum of innumerable events, with the recent ones being most salient.

Of course!

I’m a little confused; is age a covariate or not? Is it the recorded age at the time of event? If so, then you could still use this for the mean of the poisson process and pass it into the cox formulation (note-the notes i passed to you are motivated from mle-but if we just use a flat prior the logic follows similarly) If there are no covariates, you could save yourself some time and consider something like a kaplan meier estimate that you can do quick and dirty.

I’m admittedly slightly confused by the second half of your last paragraph; the survival function and the hazard contain the same information-you can move seamlessly between the culm hazard, hazard, and survival. In the language of the dag, we would just say that a potential series of given nodes contribute to the hazard, and we can use this to predict the likelihood of event (and simply integrate to approximate the surivival function). I think I’m confused by the distinction between susceptibility and hazard in the paragraph-and to me the hazard describes just that in context

Sorry, I just looked back at my original post, and I’ve made it ambiguous by using the term proportional hazards without qualification. These are parametric proportional hazards models, not Cox proportional hazards models. Hazard is determined by the parametric hazard distribution, with age as the timescale, age at disease onset as the observed variable, and the distribution parameters as rv’s.

I’ve been using the word susceptibility to try to take in a slightly larger concept than frailty and avoid using frailty for things that aren’t (perhaps there’s a generic term I don’t know). Essentially it takes the place of frailty, so each model has the form hazard(t | susceptibility) = susceptibility * hazard(t). For two of my ‘susceptibilities’, these are frailty distributions, but the mixture-cure formulation and no frailty (ie susceptibility = 1) cases come under this umbrella and aren’t strictly frailty models.

Given I have the number of people first getting the condition at each age and the population size, it seems I have all I need to pass to the Poisson likelihood.

I hope this clarifies things. If not, it might be a sign that there’s something fundamental that I’ve misunderstood.

Yes I think that clarified it! so your hazard at for a given individual is a function of age, and onset age?

You should still be able to use a Poisson process. the ‘non parametric’ portion of the Cox model comes from the baseline hazard having a very flexible formulation. It’s only called a semi-parametric model because the baseline hazard is multiplied by an exponential transformation of the covariates (your parametric part!). Interestingly, you are NOT limited to linear relationships-but that works for most applications (there is nothing really stopping you from throwing a gaussian process in there or whatever)

So to confirm-your panel data is organized as a matrix of subjects 1….n and time measurements 1…j for each individual, so you have at most n x j rows? Note under piecewise approximations to the survival curve that generally this is robust to censoring (the partial likelihood under many formulations is just factored into the (interval specific) individual likelihood when an event occurs when it does not occur-therefore you can have less than n x j rows)

To make sure we’re on the same page, can you provide a brief outline of your generative model?

I should disclaim I usually don’t see frailty presented this way (but this is because I generally see it associated with….the Cox and other parametric/semi parametric model). Frailty to me implies we are affording the likelihood of a single individual at a time t to be scaled by some constant if they belong to a sub-group of interest). But I should also disclaim I’ve seen it used…in a non precise manner