Hello! I am trying to build a Bayesian network and fit it to some training data using PyMC. I have been reading through the documentation and this forum a lot, and I think I am almost there, but I have become stuck trying to generate out-of-sample predictions conditioned on new observed data.
I have tried to simplify the problem as much as possible, but I have had to keep some complexity to ensure that it is representative of the real network.
What I expect
I want to define the structure of the network, then use training data to parametrize the relationships and variables.
I then want to use the parametrized network to generate out-of-sample predictions conditioned on new data points.
What happens
The probability distributions appear to update correctly given the training data, but when I provide out-of-sample data, the result in the trace does not seem to be conditioned on the new data points.
I have read that the posterior cannot be sampled for observed variables, and maybe this is what is causing the problem, but if so I am not sure how to model this network in PyMC.
I have also read a few posts about taking the posterior from one model and using it to define the prior of the next - is this what I should be doing?
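My understanding of that approach is something like the sketch below (adapted from the PyMC "Updating Priors" example notebook; from_posterior is just a helper name I made up, and the KDE-based interpolation is my assumption about how it is meant to work):

import numpy as np
import pymc as pm
from scipy import stats

def from_posterior(name, samples):
    # Hypothetical helper, not part of the PyMC API: smooth the posterior
    # samples with a KDE, then reuse the smoothed density as an
    # interpolated prior in the next model
    smin, smax = samples.min(), samples.max()
    width = smax - smin
    x = np.linspace(smin - 0.1 * width, smax + 0.1 * width, 100)
    y = stats.gaussian_kde(samples)(x)
    return pm.Interpolated(name, x, y)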
Please see the code snippets below. The full code and data can be viewed here. I am very new to both Bayesian modelling and PyMC, so I really appreciate your time and help, and apologise if the question is very basic!
DAG
The model should have the following structure:
risk_factor -> expression_of_risk_factor -> disease
direct_cause -> disease
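To make this concrete, a hypothetical data-generating process consistent with the DAG might look like the following (the probabilities here are made up purely for illustration; the real values are what I want the model to learn):

import numpy as np

rng = np.random.default_rng(42)
n = 1000

# root nodes (illustrative probabilities only)
risk_factor = rng.binomial(1, 0.3, size=n)
direct_cause = rng.binomial(1, 0.2, size=n)

# P(expression_of_risk_factor | risk_factor): one row per risk_factor state
p_expression = np.array([[0.7, 0.2, 0.1],   # risk_factor = 0
                         [0.1, 0.3, 0.6]])  # risk_factor = 1
expression_of_risk_factor = np.array(
    [rng.choice(3, p=p_expression[rf]) for rf in risk_factor]
)

# P(disease | direct_cause, expression_of_risk_factor): a (2, 3) table
p_disease = np.array([[0.05, 0.2, 0.5],
                      [0.3, 0.6, 0.9]])
disease = rng.binomial(1, p_disease[direct_cause, expression_of_risk_factor])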
Model specification
I have specified the model as follows:
import arviz as az
import numpy as np
import pymc as pm

def get_model():
    with pm.Model() as model:
        # data containers for each node (df holds the training data)
        direct_cause_data = pm.Data("direct_cause_data", df["direct_cause"])
        risk_factor_data = pm.Data("risk_factor_data", df["risk_factor"])
        disease_data = pm.Data("disease_data", df["disease"])
        expression_of_risk_factor_data = pm.Data(
            "expression_of_risk_factor_data", df["expression_of_risk_factor"]
        )

        # weakly informative priors (Beta(2, 2) is not quite flat)
        p_direct_cause = pm.Beta("p_direct_cause", alpha=2, beta=2)
        p_risk_factor = pm.Beta("p_risk_factor", alpha=2, beta=2)
        # P(disease | direct_cause, expression): one Beta per cell of a (2, 3) table
        p_disease_given_direct_cause_expression_of_risk_factor = pm.Beta(
            "p_disease_given_direct_cause_expression_of_risk_factor",
            alpha=2,
            beta=2,
            shape=(2, 3),
        )
        # P(expression | risk_factor): one categorical distribution per risk_factor state
        p_expression_of_risk_factor_given_risk_factor = pm.Dirichlet(
            "p_expression_of_risk_factor_given_risk_factor", a=np.ones((2, 3))
        )

        # observed variables
        risk_factor_observed = pm.Bernoulli(
            "risk_factor_observed",
            p=p_risk_factor,
            observed=risk_factor_data,
            shape=risk_factor_data.shape,
        )
        direct_cause_observed = pm.Bernoulli(
            "direct_cause_observed",
            p=p_direct_cause,
            observed=direct_cause_data,
            shape=direct_cause_data.shape,
        )
        expression_of_risk_factor_observed = pm.Categorical(
            "expression_of_risk_factor_observed",
            p=p_expression_of_risk_factor_given_risk_factor[risk_factor_observed],
            shape=expression_of_risk_factor_data.shape,
            observed=expression_of_risk_factor_data,
        )
        disease_observed = pm.Bernoulli(
            "disease_observed",
            p=p_disease_given_direct_cause_expression_of_risk_factor[
                direct_cause_observed, expression_of_risk_factor_observed
            ],
            observed=disease_data,
            shape=disease_data.shape,
        )
    return model
Sampling from the model
model = get_model()
with model:
    trace = pm.sample_prior_predictive(1000)
    trace.extend(pm.sample(1000))
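As I mentioned above, the parameter posteriors do seem to update with the training data; for example, I can summarise the learned tables with:

import arviz as az

# check that the learned tables have moved away from their priors
az.summary(
    trace,
    var_names=[
        "p_disease_given_direct_cause_expression_of_risk_factor",
        "p_expression_of_risk_factor_given_risk_factor",
    ],
)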
Out-of-sample prediction
def get_prediction(
    direct_cause: int,
    risk_factor: int,
    expression_of_risk_factor: int,
    trace: az.InferenceData,
) -> az.InferenceData:
    with get_model():
        pm.set_data(
            {
                "direct_cause_data": [direct_cause],
                "risk_factor_data": [risk_factor],
                # masked array for the node I want to predict
                "disease_data": np.ma.masked_values(np.array([-1]), value=-1),
                "expression_of_risk_factor_data": [expression_of_risk_factor],
            }
        )
        return pm.sample_posterior_predictive(
            trace, var_names=["disease_observed"], predictions=True
        )
low_risk_pred = get_prediction(0, 0, 0, trace).predictions.disease_observed.mean()
high_risk_pred = get_prediction(1, 1, 2, trace).predictions.disease_observed.mean()
These predictions are both ~0.33 (perhaps not coincidentally, the same as the overall prevalence of disease_observed in the training data).
For context, based on the training data, I would expect these queries to return roughly 0.1 and 0.9 respectively.
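For comparison, I believe I can read the learned conditional table straight out of the posterior samples, and those entries are roughly what I would expect the predictions above to return (a sketch, assuming the default (chain, draw, 2, 3) layout of the samples):

# posterior samples of P(disease | direct_cause, expression_of_risk_factor)
p = trace.posterior["p_disease_given_direct_cause_expression_of_risk_factor"].values

print(p[..., 0, 0].mean())  # direct_cause=0, expression=0: the low-risk query
print(p[..., 1, 2].mean())  # direct_cause=1, expression=2: the high-risk query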