GSoC Project - 2023 - Spatial Modeling

Hi everyone, I’m Daniel. I’m a philosophy PhD student based in Vancouver, where most of my research work has been in evolutionary game theory and the foundations of behavioral science. I started learning about Bayesian stats a few years ago out of curiosity and quickly fell in love (with the software, the ideas, or McElreath’s baritone voice, I can’t tell). I’ve been playing around with the pymc library and helping folks out with their models on Discourse ever since.

I’m interested in developing better infrastructure for spatial modeling. I’ve spent the last several weeks surveying pymc’s spatial capabilities and comparing them to other PPLs. It seems like we have great support for Gaussian processes to handle continuous-space problems. However, when it comes to discrete spaces (what the literature calls areal data), we don’t have as much. Users can build their own areal models from scratch, but that can be pretty difficult and involved for someone who is just looking to tackle a specific applied problem and isn’t a specialist in pymc. Stan has well-developed case studies (and a revised journal version of the case study too) for the conditional autoregressive (CAR), intrinsic conditional autoregressive (ICAR), and Besag-York-Mollié (BYM) models, and even a reparameterization of the BYM model that is more compatible with HMC samplers.
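
To make the ICAR idea concrete, here is a minimal pure-Python sketch of the pairwise-difference form of the ICAR log-density (up to an additive constant), which penalizes squared differences between neighboring regions. The edge list, the four-region example, and the function names are my own assumptions for illustration; this is not code from the Stan case study or from pymc.

```python
# Hypothetical helper names and toy data, for illustration only.

def icar_logp(phi, edges):
    """Unnormalized ICAR log-density.

    Computes -0.5 * sum over neighbor pairs (i, j) of (phi[i] - phi[j])**2.
    """
    return -0.5 * sum((phi[i] - phi[j]) ** 2 for i, j in edges)


def soft_sum_to_zero_logp(phi, scale=0.001):
    """Unnormalized Normal(0, scale) penalty on the mean of phi.

    The pairwise form ignores the overall level of phi, so some constraint
    like this is needed to pin it down. The scale is an assumed value.
    """
    mean = sum(phi) / len(phi)
    return -0.5 * (mean / scale) ** 2


# Four regions arranged in a line: 0 - 1 - 2 - 3 (each pair listed once).
edges = [(0, 1), (1, 2), (2, 3)]

phi_smooth = [-0.3, -0.1, 0.1, 0.3]   # neighbors have similar values
phi_rough = [-0.3, 0.3, -0.3, 0.3]    # neighbors alternate in sign

logp_smooth = icar_logp(phi_smooth, edges)
logp_rough = icar_logp(phi_rough, edges)

# Shifting every region by the same constant leaves the pairwise term unchanged.
phi_shifted = [p + 5.0 for p in phi_smooth]
logp_shifted = icar_logp(phi_shifted, edges)
```

The smooth field gets a higher log-density than the rough one, and the shifted field scores the same as the unshifted one, which is why the prior needs the extra sum-to-zero term.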

Junpeng Lao developed a pymc implementation of a CAR model several years ago for our examples page. The example has several implementations of CAR mixed in with lots of great advice on debugging divergences, scan vs. vectorization, and comparisons between Stan, WinBUGS and pymc. Since then, we’ve gained other dedicated notebooks on divergences, pytensor and building custom distributions. It would be helpful to separate out these topics with a dedicated notebook on spatial models.

I imagine a solid GSoC project could start out by updating Junpeng’s notebook. Then we could expand our capabilities to match Stan with ICAR, BYM, and the reparameterized BYM model. The Stan paper provides a nice roadmap for what the scope of this project might be. The main open question for me is whether we should implement these models as pymc distributions (like our current AR module) or whether this work should be published just in the form of notebooks and oriented toward teaching materials. From my initial reading of the Stan paper, it sounds like there is a lot of work involved in getting efficient computation and a good parameterization for this family of models, and that work could be standardized in our library so users don’t have to repeat it afresh each time.
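
As an example of the kind of parameterization detail that could be standardized, here is a rough pure-Python sketch of how the reparameterized BYM model is usually described: one overall scale `sigma`, and a mixing weight `rho` that splits the variance between a spatially structured effect `phi` and an unstructured effect `theta`. The function name and the toy values are hypothetical, and `scaling_factor` is assumed to be precomputed from the neighborhood graph (I don’t compute it here).

```python
import math


def bym2_effect(phi, theta, sigma, rho, scaling_factor):
    """Combine structured (phi) and unstructured (theta) effects per region.

    Returns sigma * (sqrt(rho / scaling_factor) * phi + sqrt(1 - rho) * theta).
    scaling_factor is an assumed, precomputed constant for the graph.
    """
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must be in [0, 1]")
    if scaling_factor <= 0.0:
        raise ValueError("scaling_factor must be positive")
    w_spatial = math.sqrt(rho / scaling_factor)
    w_iid = math.sqrt(1.0 - rho)
    return [sigma * (w_spatial * p + w_iid * t) for p, t in zip(phi, theta)]


# Toy values for three regions (made up for illustration).
phi = [0.5, -0.2, -0.3]
theta = [1.0, 0.0, -1.0]

all_iid = bym2_effect(phi, theta, sigma=2.0, rho=0.0, scaling_factor=1.0)
all_spatial = bym2_effect(phi, theta, sigma=2.0, rho=1.0, scaling_factor=1.0)
```

With `rho=0` the result is just `sigma * theta`, and with `rho=1` it is `sigma * phi` (after scaling), so `rho` reads as the share of variation that is spatial. Whether something like this lives inside a distribution or in a notebook helper is exactly the open question above.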

I’m curious to hear your thoughts about the plan and scope of the project. More? Less? Should something else take priority?

Hi Daniel, thanks for your interest in participating in GSoC with PyMC. Tagging @fonnesbeck and @bwengals for them to weigh in if they have time.

Improving PyMC’s functionality for spatial, especially areal, data analysis sounds like a great idea, especially since the models that you’ve mentioned seem more available in Stan and in other R packages (CARBayes, for instance). My prof from my course on spatial data analysis was one of my strongest influences to dive deeper into Bayesian statistics, especially in the context of modelling areal data.

Just wanted to highlight two things:

  • The deadline for GSoC applications is April 4 at 18:00 UTC, which is in 2 days. You’re welcome to share a proposal if you want us to give some preliminary comments, but the deadline is near so we cannot guarantee anything.
  • Porting Junpeng’s notebook to v5 sounds like a great first goal! You could also push the addition of the ICAR random variable. We do have CAR and we’ve had substantial progress on ICAR (PR 4851). Not sure what the current state of that is… Perhaps @fonnesbeck would know better. (I’m also forgetting the differences between CAR and ICAR.)
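
On the CAR vs. ICAR difference, one common way to state it: the proper CAR prior has precision matrix tau * (D - alpha * W), where W is the adjacency matrix, D is the diagonal matrix of neighbor counts, and alpha is kept strictly below 1. ICAR is the limiting case alpha = 1, where the precision matrix is singular, so the prior is improper and needs something like a sum-to-zero constraint. A small pure-Python sketch follows; the helper name and the four-region adjacency matrix are made up for illustration.

```python
def car_precision(W, alpha, tau=1.0):
    """Build the precision matrix tau * (D - alpha * W) as nested lists.

    W is a symmetric 0/1 adjacency matrix with a zero diagonal;
    D is the diagonal matrix of neighbor counts (row sums of W).
    """
    n = len(W)
    degree = [sum(row) for row in W]
    return [
        [tau * ((degree[i] if i == j else 0.0) - alpha * W[i][j]) for j in range(n)]
        for i in range(n)
    ]


# Four regions arranged in a line: 0 - 1 - 2 - 3.
W = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

Q_icar = car_precision(W, alpha=1.0)  # ICAR limit
Q_car = car_precision(W, alpha=0.9)   # proper CAR

# Each row sum is one entry of Q times a vector of ones.
icar_row_sums = [sum(row) for row in Q_icar]
car_row_sums = [sum(row) for row in Q_car]
```

With alpha = 1 every row sums to zero, so the constant vector is in the null space and the matrix cannot be inverted. With alpha = 0.9 every row sum is positive, which for this symmetric matrix means it is strictly diagonally dominant and therefore positive definite.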

Let us know if you have any other questions!

Hi Larry, I appreciate the feedback!

It’s really good to know about the in-progress ICAR distribution. It looks like work stalled on that feature so finishing it up would be another good objective early in the GSoC project.

I put together the proposal write-up. Admittedly, there isn’t much time left in the formal application period, so if no one has time to take a look, it’s all good. However, I think we could just work on refining the scope informally over the next month or so, and I could start work off on the right foot.

Hi @daniel-saunders-phil, I just added some comments. Overall, good structure and, if you have time, I think that the deliverables section can be more specific.

Good luck!! 🙂

Very nice of you to get comments back so quickly! I tuned things up a bit with more math and more concrete deliverables.

It’s a really good question about whether BYM should be its own RV or whether there is a different way of packaging the goods together. I think I just need to spend more time looking at the RV-distribution structure to get a sense of how these pieces are typically built up.

Sorry I was past the time for feedback, but I’m very glad @larryshamalama could help – thank you a ton!

This will follow up on some really great work @conorhassan did in this area last year. He focused on the Leroux model and has a notebook on using the CAR prior (“Conditional autoregressive prior example notebook” by conorhassan, Pull Request #417 in pymc-devs/pymc-examples) that stalled. It’s basically done, but unfortunately I think we both got wrapped up in other stuff before the finish line! I’m meaning to see if I can help with the little things it needs in order to be merged before the next GSoC starts. I believe Conor also spent some time with that ICAR branch too, but it needs a lot of fixing since it dates back to PyMC3 (please correct me if I’m wrong!).

I’ve also since found this neat little library for Stan: geostan (Bayesian Spatial Analysis). Thanks for submitting an application!

Hi Bill, great to hear from you!

It’s wonderful to hear about Conor’s work. It sounds like both of us ended up thinking along similar lines after reviewing the existing pymc capacities. His work provides a really nice infrastructure for expanding into spatial models. I’m going to spend some time reviewing what he did so I can just build directly on top of it, rather than repeating any work.
