Hi everyone,
I’m Vybhav Chaturvedi, interested in the Survival Models project for GSoC 2026. I’ve been actively contributing to PyMC’s infrastructure and wanted to share what I’ve done and where I think the project could go.
Contributions
My contributions have focused on the `logccdf` / censoring logic that the Survival Models project would build on:
Merged:
- pymc#8166 — Added a numerically stable `logccdf` to the LogNormal distribution. `pm.Censored` calls `logccdf` internally to compute the likelihood contribution of right-censored observations (\log S(t)). The naive approach `log(1 - CDF(t))` fails when the CDF rounds to 1.0 in `float64`; this PR implements it via `log_erfc` to maintain full precision even deep in the tail. Without this, `pm.Censored(LogNormal.dist(...))` would silently produce garbage for far-tail observations.
- pytensor#1945 — Fixed `Pow.impl` returning complex values for fractional powers of negative floats at the scalar level. I noticed this while working on the `logccdf` routine: it surfaced in distribution tests (Kumaraswamy), where `(-1.0)**0.01` produced a complex number. The fix replaces `x**y` with `np.power(x, y)` in the scalar `Pow.impl`, aligning the scalar path with the tensor path, which already uses `np.power` via `nfunc_spec`.
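To make the discrepancy concrete, here is a minimal standalone reproduction of the underlying semantics (plain NumPy, not PyTensor code itself): Python's `**` promotes a negative float raised to a fractional power to a complex result, while `np.power` stays in the real domain and returns `nan`.

```python
# Minimal reproduction of the semantics behind pytensor#1945 (standalone,
# not PyTensor code): Python `**` vs np.power for negative fractional powers.
import numpy as np

x, y = -1.0, 0.01
py_result = x ** y              # Python promotes to a complex number
with np.errstate(invalid="ignore"):
    np_result = np.power(x, y)  # NumPy stays real and returns nan

print(type(py_result).__name__, np_result)
```

This is why aligning the scalar path with `np.power` keeps complex values from silently leaking into real-valued graphs.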
Open (awaiting review):
- pymc#8181 — Added `logccdf` test coverage for Censored distributions. The censoring module (`pymc/logprob/censoring.py`) already uses `_logccdf_helper` internally for right-censored `logp`, but the `logccdf` dispatch for censored distributions had no direct tests. This PR mirrors the existing `test_censored_logcdf_continuous` and `test_censored_logcdf_discrete` structure, covering all four censoring configurations (uncensored, left, right, interval) for both continuous and discrete base distributions.
What I explored
I built a notebook around a problem I’ve actually worked on. Before returning to academia, I worked as a healthcare consultant for three years, and I’m well-versed in, and passionate about, care gaps and unmet needs. Recently, I developed a clinical trial data quality platform (Javelin.ai) for a Novartis-sponsored challenge that processes subject-level data across multi-site global studies and computes a Data Quality Index per subject using ICH E6(R2) weighted features. So modeling time-to-adverse-event with site-level heterogeneity and censoring is something I’ve dealt with directly.
The notebook fits a hierarchical Weibull AFT model using `pm.Censored` with site-level random effects, includes a three-way model comparison (Weibull vs Exponential vs LogNormal via LOO-CV), and demonstrates the numerical stability problem that my `logccdf` PR solved. I specifically used LogNormal as one of the comparison models to show that `pm.Censored(LogNormal.dist(...))` now works correctly for censored observations in the tail: at t = 500, the naive `log(1 - CDF(t))` returns `-inf`, while the stable implementation via `log_erfc` stays accurate.
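The failure mode is easy to reproduce outside PyMC. The LogNormal `logccdf` reduces to the standard normal survival function on the log scale, so a SciPy sketch of the same computation (with illustrative parameters, not the notebook's) shows it directly:

```python
# Naive vs. log-space survival computation for a LogNormal tail, mirroring
# the logccdf stability issue (SciPy stand-in; parameters are illustrative).
import numpy as np
from scipy import stats

t = 500.0
mu, sigma = 0.0, 0.5                       # hypothetical LogNormal params
z = (np.log(t) - mu) / sigma               # standardized log-scale value

with np.errstate(divide="ignore"):
    naive = np.log(1.0 - stats.norm.cdf(z))  # CDF rounds to 1.0 -> -inf
stable = stats.norm.logsf(z)                 # computed in log space, finite

print(naive, stable)
```

The naive version underflows to `-inf` as soon as the CDF rounds to 1.0 in `float64`, while the log-space path stays finite arbitrarily deep in the tail.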
The notebook also revealed a practical challenge: with the numpyro sampler, the hierarchical censored model produces ~1000 divergences under the centered parameterization. This is something a high-level survival module should handle transparently, e.g. by defaulting to a non-centered parameterization for the random effects.
API Prototype
After going through the codebase for inspiration (the `censored_data`, `survival_analysis`, and `weibull_aft` examples) and building models of my own, I noticed one bottleneck: there is currently a lot of boilerplate. Every distributional assumption requires rebuilding the same model block: `pm.Data` containers, priors, the linear predictor, `pm.Censored` wrapping. This was manageable when I was comparing three distributions, but at scale it means a lot of redundant effort.
Building on what the wiki already mentions, I prototyped a formula-based API inspired by CausalPy:
```python
from pymc_survival import WeibullAFT, LogNormalAFT, SurvivalModel

# one-line model specification -- no pm.Model() block needed
m = WeibullAFT(
    formula="time ~ treatment + dqi_score + (1 | site_id)",
    data=df,
    event_col="event",
).fit()

m.summary()
m.plot_survival(groups={"Control": {"treatment": 0}, "Treatment": {"treatment": 1}})

# swap distribution without rewriting the model
m_ln = LogNormalAFT(formula=..., data=df, event_col="event").fit()

# model comparison
SurvivalModel.compare({"Weibull": m, "LogNormal": m_ln})
```
The base `SurvivalModel` class handles formula parsing (`(1 | group)` for random intercepts), `pm.Censored` construction, and ArviZ integration. Subclasses only implement `_make_dist()` and `_survival_function()`: adding a new distribution is roughly 15 lines. The prototype runs end-to-end with Weibull, Exponential, and LogNormal.
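To illustrate the subclass contract, here is a stripped-down sketch. This is hypothetical: `SurvivalModel`, `_make_dist`, and `_survival_function` are names from my prototype, and the stub base class below omits the formula parsing, `pm.Censored` wrapping, and fitting machinery.

```python
# Hypothetical sketch of the subclass contract from my prototype; the real
# base class also handles formula parsing, pm.Censored, and ArviZ output.
import numpy as np

class SurvivalModel:
    """Stub standing in for the prototype's base class."""
    def _make_dist(self, eta, **params):
        raise NotImplementedError
    def _survival_function(self, t, eta, **params):
        raise NotImplementedError

class WeibullAFT(SurvivalModel):
    def _make_dist(self, eta, alpha):
        import pymc as pm  # lazy import keeps the sketch importable
        # AFT parameterization: scale = exp(linear predictor)
        return pm.Weibull.dist(alpha=alpha, beta=np.exp(eta))

    def _survival_function(self, t, eta, alpha):
        # Weibull survival: S(t) = exp(-(t / scale)^alpha)
        return np.exp(-(t / np.exp(eta)) ** alpha)
```

With `alpha = 1` this reduces to the Exponential survival function, which makes a quick sanity check for any new subclass.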
Questions for the mentors
Three design questions I’d appreciate feedback on before finalizing my proposal:
- Truncation: The wiki mentions truncation alongside censoring. Left truncation (delayed entry) is logically and statistically different from censoring: it requires conditioning on T > t_{\text{entry}}. Should this be handled within the survival module’s model construction, or is a more general wrapper along the lines of `pm.Truncated` the better place for it?
- Semi-parametric models: For Cox PH / piecewise exponential, the baseline hazard is discretized into intervals, which is architecturally different from parametric AFT models that just need a distribution + `pm.Censored`. Should these share the same `SurvivalModel` base class (with flags or overrides), or would a separate class hierarchy be cleaner?
- Formula library: My prototype parses formulas with a regex-based hack. For production-level code we would want something more rigorous; can we use Bambi, or a similar library?
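To make the first question concrete, left truncation is just a renormalization of each subject’s likelihood by the probability of surviving to entry. A SciPy sketch (parameters and times are illustrative):

```python
# Left truncation (delayed entry): condition on T > t_entry by subtracting
# log S(t_entry) from each subject's log-likelihood contribution.
import numpy as np
from scipy import stats

dist = stats.weibull_min(c=1.5, scale=10.0)  # illustrative event-time dist
t_entry, t_obs = 2.0, 8.0                    # enters at t=2, event at t=8

logp_plain = dist.logpdf(t_obs)
logp_truncated = dist.logpdf(t_obs) - dist.logsf(t_entry)
```

Since log S(t_entry) < 0, the truncated contribution is strictly larger than the plain one, reflecting that the subject was guaranteed to survive past entry. Whether this subtraction lives in the survival module or a general-purpose wrapper is exactly the design question.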
Happy to discuss any of this further. My proposal will focus on the parametric models + formula API (if applicable) as the core deliverable, with piecewise exponential / Cox PH as a stretch goal.
About me
I’m a Master’s student at IIT Kharagpur (Data Science/ML). I have two published papers in Springer journals on Brain-Computer Interfacing and physiological signal processing (Link1, Link2) with 150+ combined citations, plus an Innovation Patent (No. 2021101097). My coursework covers Probability, Stochastic Processes, Time Series Analysis, and Machine Learning.