My name is Michele Di Giovanni (GitHub: mikjkd). I recently graduated with a PhD in computer science, and I’m interested in applying to the GSoC “Spatial modeling” project for PyMC.
I’d like to discuss the project scope with the potential mentors, @bwengals and @fonnesbeck, to decide which algorithm would be the best fit and most valuable to implement this year.
I’m particularly interested in BYM/BYM2 models, but I’m flexible based on PyMC’s priorities and constraints.
Thanks for getting in touch about GSoC. We are excited for another fun summer of PyMC development! It is still early days in the process and we are basically in a holding pattern until we know how many slots we will get in the program this year. Once we do, we will put out a call for students, at which point you would register with GSoC and submit a proposal, and our team would select the top proposals according to the number of slots we get. We will post more details once we know more.
Hi Chris,
Thanks for the update — that makes perfect sense.
I also reached out because I understand it’s good practice to get in touch with organization members before submitting a proposal, mainly to align on what would be most useful for PyMC and what a strong proposal should contain.
In the meantime I’ll keep contributing to PyMC where I can, and I’ll start drafting a proposal around spatial modeling (likely ICAR/CAR and a BYM-style implementation), so I’m ready when the call opens.
If you have any pointers on what you’d like to see in proposals for this topic (scope, milestones, API/design preferences), I’d really appreciate it.
Hi @ricardov94 and @daniel-saunders-phil, from my reading of #7713 and the references linked there, I understand that the main reason behind Ricardo’s pushback is that “sampling from ICAR”, in its usual formulation, can be interpreted as sampling from an improper distribution, i.e. not a proper probability density on $\mathbb{R}^n$. This is mathematically ambiguous and potentially misleading to users. Providing an rng_fn in that setting risks giving users the impression that they are drawing from a well-defined prior on $\mathbb{R}^n$, when in fact the distribution is only defined up to an arbitrary gauge/constraint.
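To make the "improper" part concrete, here is a short sketch of the usual argument. The notation is my own assumption, not taken from #7713: $\phi$ is the spatial effect, $W$ a symmetric binary adjacency matrix, $D$ the diagonal matrix of neighbor counts, and $i \sim j$ ranges over each pair of neighbors once.

$$
p(\phi) \propto \exp\left(-\tfrac{1}{2}\sum_{i \sim j}(\phi_i - \phi_j)^2\right) = \exp\left(-\tfrac{1}{2}\,\phi^\top Q\,\phi\right), \qquad Q = D - W .
$$

Every row of $Q$ sums to zero, so $Q\mathbf{1} = 0$ and therefore

$$
p(\phi + c\,\mathbf{1}) = p(\phi) \quad \text{for all } c \in \mathbb{R}, \qquad \int_{\mathbb{R}^n} \exp\left(-\tfrac{1}{2}\,\phi^\top Q\,\phi\right) d\phi = \infty .
$$

The density is constant along the direction $\mathbf{1}$, which is why it cannot be normalized on $\mathbb{R}^n$ and why some constraint has to be chosen before "a draw" means anything.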
I realize that one can still generate draws from a singular Gaussian measure by working in the identifiable subspace (e.g. via an SVD/EIG decomposition and fixing the nullspace component), as described for instance here.
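As a rough illustration of what "working in the identifiable subspace" could look like, here is a minimal NumPy sketch. It is not PyMC code and not a proposed rng_fn: the function names, the toy 4-node path graph, and the choice of precision $Q = D - W$ are all assumptions made for the example. It drops the zero-eigenvalue direction and draws a standard Gaussian in the remaining eigenbasis.

```python
import numpy as np


def icar_precision(W):
    """Assumed ICAR precision Q = D - W for a symmetric binary adjacency W."""
    W = np.asarray(W, dtype=float)
    return np.diag(W.sum(axis=1)) - W


def draw_in_identifiable_subspace(W, size, rng, tol=1e-10):
    """Hypothetical helper: draws from the singular Gaussian with the
    nullspace component of Q fixed to zero."""
    Q = icar_precision(W)
    eigvals, eigvecs = np.linalg.eigh(Q)
    keep = eigvals > tol  # discard the (numerically) zero eigenvalues
    # Independent N(0, 1/lambda_k) coordinates along the kept eigenvectors
    z = rng.standard_normal((size, int(keep.sum())))
    return (z / np.sqrt(eigvals[keep])) @ eigvecs[:, keep].T


# Toy example: path graph 0 - 1 - 2 - 3 (connected, so one zero eigenvalue)
W = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])
rng = np.random.default_rng(0)
draws = draw_in_identifiable_subspace(W, size=5000, rng=rng)

# For a connected graph the dropped direction is the constant vector,
# so every draw sums to zero up to floating-point error.
print(np.abs(draws.sum(axis=1)).max())
```

Note that the gauge is baked in here: fixing the nullspace component to zero is a choice, and for a connected graph it coincides with the sum-to-zero constraint. The covariance of these draws is the pseudo-inverse of $Q$.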
However, at the moment I’m struggling to come up with concrete user-facing examples where this would be clearly appropriate and not misleading, unless we also make the gauge/constraint explicit (and document it very prominently). In other words, the math can be made operational, but I’m not convinced it maps cleanly to a typical “prior predictive” use case for ICAR as currently implemented.
Happy to be corrected if there are common workflows where users explicitly want “ICAR draws under a fixed gauge/constraint”.
Also, thanks @daniel-saunders-phil for putting together the notebooks and the feature list — they’re really helpful for understanding the intended direction. Are there any specific papers / references you have in mind that could be turned into a concrete implementation task (beyond rng_fn), or any feature from that list that would be a good next step to tackle?
Hi @mikjkd, thanks a lot! I appreciate your perspective on the sensibility of the draw function.
I don’t have a strong opinion on what to do about the draw function in ICAR. I’m thinking about this from a more high-level point of view. But I’ve noticed people stay away from ICAR because putting it in your model breaks typical prior sampling and out-of-sample prediction workflows. They pick GPs instead because our forward sampling support is alright. So if there is something to be done to rectify the situation, wonderful. If that’s just a fundamental limit on ICAR, that’s okay too, let’s just close that whole inquiry.
Question about the API for the fixed-constraint bit: is a zero-sum constraint the same thing as the kind of constraint you are talking about? One thing I find confusing about the sampling discussion around ICAR is that, when working through the algebra, the covariance matrix is clearly singular. But is it still singular after you apply the zero-sum constraint? I thought the zero-sum bit is precisely what allows you to take valid MCMC samples. And if you can take MCMC samples, then surely there must exist a forward sampling method that takes valid draws from the same distribution, even if we don’t know what the method is yet.
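One way to poke at the singularity question numerically is the small NumPy check below. It is only a sketch on a toy graph, with assumptions I should flag: a connected 5-node ring, precision $Q = D - W$, and a hypothetical orthonormal basis `B` for the zero-sum subspace built with a QR decomposition. It says nothing about how PyMC implements the constraint internally.

```python
import numpy as np

n = 5
# Toy ring graph: node i is adjacent to i - 1 and i + 1 (mod n)
W = np.zeros((n, n))
for i in range(n):
    W[i, (i + 1) % n] = 1.0
    W[(i + 1) % n, i] = 1.0

Q = np.diag(W.sum(axis=1)) - W  # assumed ICAR precision
ones = np.ones(n)

# On R^n the precision is singular: the constant vector is in its nullspace.
eigvals = np.linalg.eigvalsh(Q)
print("smallest eigenvalue of Q:", eigvals[0])

# Orthonormal basis B (n x (n-1)) for the zero-sum subspace {x : sum(x) = 0}.
# The first QR column is proportional to `ones`; the rest are orthogonal to it.
basis, _ = np.linalg.qr(np.column_stack([ones, np.eye(n)[:, : n - 1]]))
B = basis[:, 1:]

# Restricted to that subspace, the precision has full rank n - 1.
Q_reduced = B.T @ Q @ B
print("smallest eigenvalue of reduced Q:", np.linalg.eigvalsh(Q_reduced).min())

# Mapped back to R^n, the covariance still has rank n - 1 (singular as an
# n x n matrix), but it is a proper Gaussian on the (n-1)-dim subspace.
cov_full = B @ np.linalg.inv(Q_reduced) @ B.T
print("rank of covariance in R^n:", np.linalg.matrix_rank(cov_full))
```

In this toy case the covariance stays singular as an $n \times n$ matrix, while the distribution restricted to the zero-sum subspace has a positive-definite precision, so the two statements are not in conflict. Whether that restricted distribution is the right thing to expose as a "prior draw" is the open question in this thread.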
My very high-level reading is that it’s akin to a flat prior. You can’t plausibly sample from the prior, but after conditioning on data, the posterior becomes proper and you can sample from it.
But the posterior is no longer flat. There’s obviously no rng for flat.
It could be different, in that sampling from this parametrization is correct (and not just a trick to avoid numpy raising an error). But I haven’t seen the evidence this is the case.
I saw that the GSoC slots have now been allocated. Would it be possible to go a bit more in depth with you and the potential mentors on what would be most valuable to implement this year for the Spatial modeling project? I’m happy to adapt to PyMC’s priorities and constraints.