Hello, I am working with gas emissions and trying to apply a gaussian plume model to MCMC Bayesian analysis using PyMC. I have (lets say) 8000 line dataframe containing gas concentration data, wind information etc… describing a day of data collection and used as input for the physics model. I historically would execute no problem using a defined gas emission source location (x, y, and z coordinates), feeding two simple parameters into the MCMC model: Emission_rate, and Background concentration. With the main goal of determining the Emission Rate.
mu = Emission rate * coupling coefficient + background
Mu is the estimate of gas concentration at each measurement time. The real readings are used to establish the Likelihood.
I would now like to feed the source location into the MCMC model as a parameter distribution (or three, one for each coordinate x, y, and z). To do so, I would need to define the Gaussian Plume model within my pm.Model definition (I assume), as the sampled source location is needed to calculate a key ‘child’ parameter (coupling coefficient) used in the mu calculation outlined above. I am trying to set these ‘parent’/‘child’ processes up as pm.Deterministic parameters but keep hitting issues with the execution of my script (most recently: ValueError: order must be one of ‘C’, ‘F’, ‘A’, or ‘K’ (got ‘rot_points’)). If not that error, then my pm.Deterministic definitions often fail because they need actual numerical values rather than the ‘parent’ parameter name.
Further more, to create the coupling coefficients that are needed for each row of the dataframe, I loop through all rows of the dataframe and calculate the coupling coefficients row by row using (among other known parameters) the ‘parent’ x, y, and z parameters described a few lines above in the same pm.Model definition.
1 - Can you have df.iterrows() functions in pm.Model definitions?
2 - Is my approach Naive/not possible?
3 - Any other comments welcome.
I hope this makes sense, and I look forward to hearing back and thoughts/comments.
You can loop in PyMC, but it’s often better to try to first think about whether another approach is possible. Could you post some code that shows the model you are trying to implement, even if it’s just numpy/pandas instead of PyMC? I’m not familiar with a guassian plume model, but from a quick google search it looks like it’s just a function of spatial coordinates. If that’s the case, you don’t need to loop at all, you can just directly do the computations on the vectors of coordinate data. If you have something like this equation:
Then you can just add the coordinate columns to your model as pm.MutableData, define random variables/parameters, and just type in this formula directly without any loops, using vectors in place of variables x, y, and z.
I am starting to think about another approach to my problem. You are spot on with the GP model. My current problem lies in the idea that the way my code currently executes is that the GP model is calculated separately and generates ‘coupling coefficients’ (cc) for each emission source in a specified location (effectively the GP equation without the Q emission rate component). C (x, y, z) is solved by summing the products of Q X cc for each emission source, then adding a background value. This is done for each time point in a time series and compared to observed values. The value of Q (an unknown) for each source is passed into the MCMC calculation as a LogNormal Distribution of possible values. The background (unknown) is also passed in as a Normal distribution.
What I am trying to do is pass in the source locations as distributions for x and y to be sampled in the MCMC run, but as the location information is required for the GP calculation of the cc’s, I would need to calculate those on the fly in the MCMC itself. I need to find a way to represent the x/y location data as one MCMC parameter, then loop through a dataframe (the time series) and somehow return the a list of cc’s which are treated as Deterministic parameters, dependent on the x/y distributions.
I know that is a bunch of text. Sorry for no code at this point (there are limits to what I can post). I plan to go off and have a good rethink, but in the event my ramblings make sense to you and you can think of a direction to send me in I would again appreciate the feedback.
Everything you’re saying sounds doable, you just need to replace “loop over” with “compute over a vector” in your planning. As long as there’s no inter-temporal dependencies that require recursive computation, you can treat the time dimension as just another index value and do the vector computation over the whole dataframe in one shot. If you’re not familiar with vectorized computation with numpy, it would be worthwhile to check out some resources on that, because pytensor (the pymc computational backend) works exactly the same way. If you can write down your model in numpy (without loops) it’s a basically a one-to-one mapping to a PyMC model, and you can drop in random variables wherever you need.
So C(x,y,z) = Q \cdot cc + \epsilon? In that case you can use C(x,y,z) to parameterize some distribution, for example as the mean of the normal (if \epsilon is mean zero normal)
Do you mean that you want some uncertainty about the locations themselves? Anything (or everything!) that goes into the GP formula can be PyMC random variables, and you can have as many observed variables as you wish scattered throughout the model. So if you have a collection of noisy location observations for a single location over time, it would be easy to use that as observed data and estimate the underlying true value.