Tagging prospective mentors @RavinKumar and @colcarroll.
It may also be interesting to take a look at the discussions in two pymc-devs/pymc3 GitHub issues: "Feature request: Named RV dimensions" (Issue #4565) and "shape vs size keyword argument" (Issue #4552).
As the project idea is not very detailed, I just want to add that the computational backend for pymc3>=4.0 is and will remain Aesara, and implementing this new xarray-backed backend will not change that. What I personally still don’t know is how we should go about implementing it. I see two main paths we could take, each with its own pros and cons.
One option would be to keep all sampling and calculations in pure Aesara, and use xarray to initialize (and preallocate) a dataset when pymc3 starts sampling, then update it at every iteration with the corresponding sampling results. I think this is not too different from what the current backends do, and it would have the advantage of integrating easily with any xarray-backed format; e.g. using dask-backed xarray datasets we could probably sample models that don’t fit in memory “easily”.
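To make the first option a bit more concrete, here is a rough sketch of the preallocate-then-fill idea. It is only an illustration under my own assumptions: `preallocate_trace`, the variable names and the `school` coordinate are all made up, the "sampler step" is just random numbers standing in for whatever Aesara would compute, and everything is NumPy-backed and in memory (no dask).

```python
import numpy as np
import xarray as xr


def preallocate_trace(var_dims, coords, chains, draws):
    """Hypothetical helper: build a NaN-filled Dataset with one
    (chain, draw, *dims) array per model variable."""
    data_vars = {}
    for name, dims in var_dims.items():
        shape = (chains, draws) + tuple(len(coords[d]) for d in dims)
        data_vars[name] = (("chain", "draw") + tuple(dims), np.full(shape, np.nan))
    all_coords = {"chain": np.arange(chains), "draw": np.arange(draws), **coords}
    return xr.Dataset(data_vars, coords=all_coords)


# Made-up model description: a scalar and a vector over "school".
coords = {"school": ["a", "b", "c"]}
var_dims = {"mu": (), "theta": ("school",)}
n_chains, n_draws = 2, 5

trace = preallocate_trace(var_dims, coords, chains=n_chains, draws=n_draws)

rng = np.random.default_rng(0)
for chain in range(n_chains):
    for draw in range(n_draws):
        # Stand-in for one sampler step; in the real thing these values
        # would come out of the compiled Aesara functions.
        point = {"mu": rng.normal(), "theta": rng.normal(size=3)}
        for name, value in point.items():
            # Positional assignment by dimension name, in place.
            trace[name][dict(chain=chain, draw=draw)] = value
```

After the loop, `trace` is a regular `xarray.Dataset` with labeled `chain`, `draw` and `school` dimensions, so anything that consumes xarray objects could read it directly.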
Another option could be to integrate Aesara as a valid xarray data structure, so that we can do everything with xarray’s API while still using Aesara for the actual calculations. This approach is probably much more complicated to make work, but it could allow natively supporting operations with labeled dims and coords (as discussed in one of the issues above). Even if only a subset of operations were supported, it would probably still be very powerful.
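For anyone not familiar with what "operations with labeled dims and coords" look like, here is a tiny example using plain NumPy-backed xarray. To be clear, this does not involve Aesara at all; it only shows the kind of user-facing behaviour the second option would be aiming for, and the array names and coordinates are made up.

```python
import numpy as np
import xarray as xr

# A (group, school) array and a per-school offset, both with labels.
theta = xr.DataArray(
    np.arange(6.0).reshape(2, 3),
    dims=("group", "school"),
    coords={"group": ["x", "y"], "school": ["a", "b", "c"]},
)
offset = xr.DataArray(
    [10.0, 20.0, 30.0],
    dims="school",
    coords={"school": ["a", "b", "c"]},
)

# Broadcasting is resolved by dimension name, not by axis position.
shifted = theta + offset

# Reductions also refer to dimensions by name.
group_mean = shifted.mean(dim="school")
```

Here `shifted` keeps the `("group", "school")` dims and `group_mean` is indexed by `group` only, without any manual reshaping or axis bookkeeping.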