Hi everyone,
I’m Vybhav Chaturvedi, interested in the Survival Models project for GSoC 2026. I’ve been actively contributing to PyMC’s infrastructure and wanted to share what I’ve done and where I think the project could go.
Contributions
My contributions have focused on the `logccdf` / censoring logic that the Survival Models project would build on:
Merged:
- pymc#8166 — Added a numerically stable `logccdf` to the LogNormal distribution. `pm.Censored` calls `logccdf` internally to compute the likelihood contribution of right-censored observations (\log S(t)). The naive approach `log(1 - CDF(t))` fails when the CDF rounds to 1.0 in `float64`; this PR implements it via `log_erfc` to maintain full precision even deep in the tail. Without this, `pm.Censored(LogNormal.dist(...))` would silently produce garbage for far-tail observations.
- pytensor#1945 — Fixed `Pow.impl` returning complex values for fractional powers of negative floats at the scalar level. I noticed this while working on the `logccdf` routine: it surfaced in distribution tests (Kumaraswamy), where `(-1.0)**0.01` produced a complex number. The fix replaces `x**y` with `np.power(x, y)` in the scalar `Pow.impl`, aligning the scalar path with the tensor path, which already uses `np.power` via `nfunc_spec`.
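To make the discrepancy concrete, here is a minimal standalone reproduction of the underlying semantics (plain NumPy, not PyTensor code itself): Python's `**` promotes a negative float raised to a fractional power to a complex result, while `np.power` stays in the real domain and returns `nan`.

```python
# Minimal reproduction of the semantics behind pytensor#1945 (standalone,
# not PyTensor code): Python `**` vs np.power for negative fractional powers.
import numpy as np

x, y = -1.0, 0.01
py_result = x ** y              # Python promotes to a complex number
with np.errstate(invalid="ignore"):
    np_result = np.power(x, y)  # NumPy stays real and returns nan

print(type(py_result).__name__, np_result)
```

This is why aligning the scalar path with `np.power` keeps complex values from silently leaking into real-valued graphs.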
Open (awaiting review):
- pymc#8181 — Added `logccdf` test coverage for Censored distributions. The censoring module (`pymc/logprob/censoring.py`) already uses `_logccdf_helper` internally for right-censored `logp`, but the `logccdf` dispatch for censored distributions had no direct tests. This PR mirrors the existing `test_censored_logcdf_continuous` and `test_censored_logcdf_discrete` structure, covering all four censoring configurations (uncensored, left, right, interval) for both continuous and discrete base distributions.
What I explored
I built a notebook around a problem I’ve actually worked on. Before returning to academia, I worked as a healthcare consultant for three years, and I’m well-versed in, and passionate about, care gaps and unmet needs. Recently, I developed a clinical trial data quality platform (Javelin.ai) for a Novartis-sponsored challenge that processes subject-level data across multi-site global studies and computes a Data Quality Index per subject using ICH E6(R2) weighted features. So modeling time-to-adverse-event with site-level heterogeneity and censoring is something I’ve dealt with directly.
The notebook fits a hierarchical Weibull AFT model using `pm.Censored` with site-level random effects, includes a three-way model comparison (Weibull vs Exponential vs LogNormal via LOO-CV), and demonstrates the numerical stability problem that my `logccdf` PR solved. I specifically used LogNormal as one of the comparison models to show that `pm.Censored(LogNormal.dist(...))` now works correctly for censored observations in the tail: at t = 500, the naive `log(1 - CDF(t))` returns `-inf`, while the stable implementation via `log_erfc` stays accurate.
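The failure mode is easy to reproduce outside PyMC. The LogNormal `logccdf` reduces to the standard normal survival function on the log scale, so a SciPy sketch of the same computation (with illustrative parameters, not the notebook's) shows it directly:

```python
# Naive vs. log-space survival computation for a LogNormal tail, mirroring
# the logccdf stability issue (SciPy stand-in; parameters are illustrative).
import numpy as np
from scipy import stats

t = 500.0
mu, sigma = 0.0, 0.5                       # hypothetical LogNormal params
z = (np.log(t) - mu) / sigma               # standardized log-scale value

with np.errstate(divide="ignore"):
    naive = np.log(1.0 - stats.norm.cdf(z))  # CDF rounds to 1.0 -> -inf
stable = stats.norm.logsf(z)                 # computed in log space, finite

print(naive, stable)
```

The naive version underflows to `-inf` as soon as the CDF rounds to 1.0 in `float64`, while the log-space path stays finite arbitrarily deep in the tail.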
The notebook also revealed a practical challenge: with the numpyro sampler, the hierarchical censored model produces ~1000 divergences under the centered parameterization. This is something a high-level survival module should handle transparently, e.g. by defaulting to a non-centered parameterization for the random effects.
API Prototype
After going through the codebase for inspiration (the `censored_data`, `survival_analysis`, and `weibull_aft` examples) and building models of my own, I noticed one bottleneck: there is currently a lot of boilerplate. Every distributional assumption requires rebuilding the same model block: `pm.Data` containers, priors, the linear predictor, `pm.Censored` wrapping. This was manageable when I was comparing three distributions, but at scale it means a lot of redundant effort.
Building on what the wiki already mentions, I prototyped a formula-based API inspired by CausalPy:
```python
from pymc_survival import WeibullAFT, LogNormalAFT, SurvivalModel

# one-line model specification -- no pm.Model() block needed
m = WeibullAFT(
    formula="time ~ treatment + dqi_score + (1 | site_id)",
    data=df,
    event_col="event",
).fit()

m.summary()
m.plot_survival(groups={"Control": {"treatment": 0}, "Treatment": {"treatment": 1}})

# swap distribution without rewriting the model
m_ln = LogNormalAFT(formula=..., data=df, event_col="event").fit()

# model comparison
SurvivalModel.compare({"Weibull": m, "LogNormal": m_ln})
```
The base `SurvivalModel` class handles formula parsing (`(1 | group)` for random intercepts), `pm.Censored` construction, and ArviZ integration. Subclasses only implement `_make_dist()` and `_survival_function()`: adding a new distribution is roughly 15 lines. The prototype runs end-to-end with Weibull, Exponential, and LogNormal.
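To illustrate the subclass contract, here is a stripped-down sketch. This is hypothetical: `SurvivalModel`, `_make_dist`, and `_survival_function` are names from my prototype, and the stub base class below omits the formula parsing, `pm.Censored` wrapping, and fitting machinery.

```python
# Hypothetical sketch of the subclass contract from my prototype; the real
# base class also handles formula parsing, pm.Censored, and ArviZ output.
import numpy as np

class SurvivalModel:
    """Stub standing in for the prototype's base class."""
    def _make_dist(self, eta, **params):
        raise NotImplementedError
    def _survival_function(self, t, eta, **params):
        raise NotImplementedError

class WeibullAFT(SurvivalModel):
    def _make_dist(self, eta, alpha):
        import pymc as pm  # lazy import keeps the sketch importable
        # AFT parameterization: scale = exp(linear predictor)
        return pm.Weibull.dist(alpha=alpha, beta=np.exp(eta))

    def _survival_function(self, t, eta, alpha):
        # Weibull survival: S(t) = exp(-(t / scale)^alpha)
        return np.exp(-(t / np.exp(eta)) ** alpha)
```

With `alpha = 1` this reduces to the Exponential survival function, which makes a quick sanity check for any new subclass.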
Questions for the mentors
Three design questions I’d appreciate feedback on before finalizing my proposal:
- Truncation: The wiki mentions truncation alongside censoring. Left truncation (delayed entry) is logically and statistically different from censoring: it requires conditioning on T > t_{\text{entry}}. Should this be handled within the survival module’s model construction, or is a more general wrapper along the lines of `pm.Truncated` the better place for it?
- Semi-parametric models: For Cox PH / piecewise exponential, the baseline hazard is discretized into intervals, which is architecturally different from parametric AFT models that just need a distribution + `pm.Censored`. Should these share the same `SurvivalModel` base class (with flags or overrides), or would a separate class hierarchy be cleaner?
- Formula library: My prototype parses formulas with a regex-based hack. For production-level code we would want something more rigorous; can we use Bambi, or a similar library?
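To make the first question concrete, left truncation is just a renormalization of each subject’s likelihood by the probability of surviving to entry. A SciPy sketch (parameters and times are illustrative):

```python
# Left truncation (delayed entry): condition on T > t_entry by subtracting
# log S(t_entry) from each subject's log-likelihood contribution.
import numpy as np
from scipy import stats

dist = stats.weibull_min(c=1.5, scale=10.0)  # illustrative event-time dist
t_entry, t_obs = 2.0, 8.0                    # enters at t=2, event at t=8

logp_plain = dist.logpdf(t_obs)
logp_truncated = dist.logpdf(t_obs) - dist.logsf(t_entry)
```

Since log S(t_entry) < 0, the truncated contribution is strictly larger than the plain one, reflecting that the subject was guaranteed to survive past entry. Whether this subtraction lives in the survival module or a general-purpose wrapper is exactly the design question.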
Happy to discuss any of this further. My proposal will focus on the parametric models + formula API (if applicable) as the core deliverable, with piecewise exponential / Cox PH as a stretch goal.
About me
I’m a Master’s student at IIT Kharagpur (Data Science/ML). I have two published papers in Springer journals on Brain-Computer Interfacing and physiological signal processing (Link1, Link2) with 150+ combined citations, plus an Innovation Patent (No. 2021101097). My coursework covers Probability, Stochastic Processes, Time Series Analysis, and Machine Learning.