New to PyMC3 (used PyMC2 for ages, then Stan for a long time, now curious to try PyMC3), and wondering what the “best” way is to handle a problem I run into often.
I’d like to use a Bayesian Network to synthesize different data sources that all measure things a little differently. One tricky aspect is that the marginal probabilities I get from those sources all use slightly different methods for discretizing the independent variables. For instance, one gives values by wealth quartiles while the other uses quintiles; or they give marginals by age group, but use different (non-overlapping) bins in reporting.
So for a simplified example, say I have information on p(A | X') and p(B | X*), where X' and X* are different discretized transformations of the underlying continuous variable X. And I want to then, e.g., estimate the joint distribution of A and B.
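To make the mismatch between the two discretizations concrete, here is a minimal sketch in plain Python (not PyMC3). It assumes X is measured on the percentile scale, so that it is uniform on [0, 1] and quartiles and quintiles are equal-width bins; the helper names `bin_edges` and `overlap_matrix` are made up for illustration. It computes p(X* = j | X' = i), i.e. how each quartile spreads over the quintiles.

```python
from fractions import Fraction


def bin_edges(n_bins):
    """Edges of n equal-probability bins on the percentile scale [0, 1]."""
    return [Fraction(k, n_bins) for k in range(n_bins + 1)]


def overlap_matrix(edges_a, edges_b):
    """P(X in b-bin j | X in a-bin i), assuming X is uniform on [0, 1]."""
    rows = []
    for lo_a, hi_a in zip(edges_a[:-1], edges_a[1:]):
        row = []
        for lo_b, hi_b in zip(edges_b[:-1], edges_b[1:]):
            # Length of the intersection of the two bins (zero if disjoint).
            shared = max(Fraction(0), min(hi_a, hi_b) - max(lo_a, lo_b))
            row.append(shared / (hi_a - lo_a))
        rows.append(row)
    return rows


quartiles = bin_edges(4)  # plays the role of X'
quintiles = bin_edges(5)  # plays the role of X*
p_quintile_given_quartile = overlap_matrix(quartiles, quintiles)

for row in p_quintile_given_quartile:
    print([str(p) for p in row])
```

Each row sums to 1, and the off-diagonal mass is exactly the part of the problem that gets lost if X' and X* are treated as unrelated categorical variables.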
I could imagine a few ways that might work to model that:
1. Include X as a continuous random variable in the model and then create X' and X* using discretizing transformations.
2. Include both X' and X* as random variables by using a multivariate distribution that attempts to incorporate the relationships between them as e.g. a correlation matrix.
3. Instead start by simulating individuals from the population directly using the marginal probabilities.
My intuition is that option 1 is the most correct way, but that I would run into performance issues due to the discretization (or at least that was my experience with Stan in the past). Any advice (and suggestions/examples of how to implement it) would be greatly appreciated!
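For reference, here is a rough sketch of what the first approach amounts to once the discretization is summed out analytically instead of sampled. It is plain Python rather than PyMC3, and everything in it is an assumption for illustration: the conditional probabilities are invented numbers, X is taken to be uniform on the percentile scale, A and B are binary, and A and B are assumed independent given X.

```python
from bisect import bisect_right
from fractions import Fraction

# Hypothetical inputs (made-up numbers):
# P(A=1 | wealth quartile) and P(B=1 | wealth quintile).
p_a_given_quartile = [Fraction(1, 10), Fraction(2, 10), Fraction(3, 10), Fraction(4, 10)]
p_b_given_quintile = [Fraction(1, 2), Fraction(2, 5), Fraction(3, 10),
                      Fraction(1, 5), Fraction(1, 10)]

quartile_edges = [Fraction(k, 4) for k in range(5)]
quintile_edges = [Fraction(k, 5) for k in range(6)]


def bin_index(x, edges):
    """Index i of the bin [edges[i], edges[i+1]) that contains x."""
    return bisect_right(edges, x) - 1


# Merge both sets of edges so X' and X* are constant inside each cell.
cells = sorted(set(quartile_edges) | set(quintile_edges))

joint = {(a, b): Fraction(0) for a in (0, 1) for b in (0, 1)}
for lo, hi in zip(cells[:-1], cells[1:]):
    width = hi - lo      # P(X in cell), since X is assumed uniform on [0, 1]
    mid = (lo + hi) / 2  # any interior point identifies both bins
    pa = p_a_given_quartile[bin_index(mid, quartile_edges)]
    pb = p_b_given_quintile[bin_index(mid, quintile_edges)]
    # Assumption: A and B are independent given X.
    joint[(1, 1)] += width * pa * pb
    joint[(1, 0)] += width * pa * (1 - pb)
    joint[(0, 1)] += width * (1 - pa) * pb
    joint[(0, 0)] += width * (1 - pa) * (1 - pb)

for outcome, prob in sorted(joint.items()):
    print(outcome, prob)
```

Because the sum runs over a handful of cells, no discrete variable has to be sampled, which is the part that tends to be slow. Whether conditional independence given X is reasonable depends on the data sources; without it the joint is not identified from the two sets of marginals alone.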