It sounds like you have, for each point observation, some vector of covariates X that goes along with the point coordinates, right? I think that taking the mean or median of within-bin covariate values is probably the most straightforward thing to do but I can’t give you a theoretical justification for why that would be. I think that’s what the authors of this example do. It looks like they made an error in their problem formulation, though since the intensity surface \lambda should read \lambda(s) = \alpha + \beta X(s) + u(s) instead of \beta X(u).
Conceptually, the more elegant thing to do would be to model each covariate as a random surface (though not as a point process itself, just a standard GP), and then model correlation structure between all P covariates and the intensity function as a P+1-dimensional multivariate Gaussian process. I am skeptical over whether or not this would be much better in terms of getting an accurate and reliable predictive PDF, however.