Hello! I am trying to build a Bayesian network and fit it to some training data using PyMC. I have been reading through the documentation and this forum a lot, and I think I am almost there, but I have become stuck trying to generate out-of-sample predictions conditioned on new observed data.
I have tried to simplify the problem as much as possible, but I have had to keep some complexity to ensure that it is representative of the real network.
What I expect
I want to define the structure of the network, then use training data to parametrize the relationships and variables.
I then want to use the parametrized network to generate out-of-sample predictions conditioned on new data points.
What happens
The probability distributions appear to update correctly given the training data, but when I provide out-of-sample data, the result in the trace does not seem to be conditioned on the new data points.
I have read that the posterior cannot be sampled for observed variables, and maybe this is what is causing the problem, but if so I am not sure how to model this network in PyMC.
I have also read a few posts about taking the posterior from one model and using it to define the prior of the next - is this what I should be doing?
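My understanding of that approach is something like the sketch below (adapted from the PyMC "Updating Priors" example notebook; from_posterior is just a helper name I made up, and the KDE-based interpolation is my assumption about how it is meant to work):

import numpy as np
import pymc as pm
from scipy import stats

def from_posterior(name, samples):
    # Hypothetical helper, not part of the PyMC API: smooth the posterior
    # samples with a KDE, then reuse the smoothed density as an
    # interpolated prior in the next model
    smin, smax = samples.min(), samples.max()
    width = smax - smin
    x = np.linspace(smin - 0.1 * width, smax + 0.1 * width, 100)
    y = stats.gaussian_kde(samples)(x)
    return pm.Interpolated(name, x, y)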
Please see the code snippets below. The full code and data can be viewed here. I am very new to both Bayesian modelling and PyMC, so I really appreciate your time and help, and apologise if the question is very basic!
DAG
The model should have the following structure:
risk_factor -> expression_of_risk_factor -> disease
direct_cause -> disease
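To make this concrete, a hypothetical data-generating process consistent with the DAG might look like the following (the probabilities here are made up purely for illustration; the real values are what I want the model to learn):

import numpy as np

rng = np.random.default_rng(42)
n = 1000

# root nodes (illustrative probabilities only)
risk_factor = rng.binomial(1, 0.3, size=n)
direct_cause = rng.binomial(1, 0.2, size=n)

# P(expression_of_risk_factor | risk_factor): one row per risk_factor state
p_expression = np.array([[0.7, 0.2, 0.1],   # risk_factor = 0
                         [0.1, 0.3, 0.6]])  # risk_factor = 1
expression_of_risk_factor = np.array(
    [rng.choice(3, p=p_expression[rf]) for rf in risk_factor]
)

# P(disease | direct_cause, expression_of_risk_factor): a (2, 3) table
p_disease = np.array([[0.05, 0.2, 0.5],
                      [0.3, 0.6, 0.9]])
disease = rng.binomial(1, p_disease[direct_cause, expression_of_risk_factor])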
Model specification
I have specified the model as follows:
import arviz as az
import numpy as np
import pymc as pm

def get_model():
    with pm.Model() as model:
        # data containers for each node (df holds the training data)
        direct_cause_data = pm.Data("direct_cause_data", df["direct_cause"])
        risk_factor_data = pm.Data("risk_factor_data", df["risk_factor"])
        disease_data = pm.Data("disease_data", df["disease"])
        expression_of_risk_factor_data = pm.Data(
            "expression_of_risk_factor_data", df["expression_of_risk_factor"]
        )

        # weakly informative priors (Beta(2, 2) is not quite flat)
        p_direct_cause = pm.Beta("p_direct_cause", alpha=2, beta=2)
        p_risk_factor = pm.Beta("p_risk_factor", alpha=2, beta=2)
        # P(disease | direct_cause, expression): one Beta per cell of a (2, 3) table
        p_disease_given_direct_cause_expression_of_risk_factor = pm.Beta(
            "p_disease_given_direct_cause_expression_of_risk_factor",
            alpha=2,
            beta=2,
            shape=(2, 3),
        )
        # P(expression | risk_factor): one categorical distribution per risk_factor state
        p_expression_of_risk_factor_given_risk_factor = pm.Dirichlet(
            "p_expression_of_risk_factor_given_risk_factor", a=np.ones((2, 3))
        )

        # observed variables
        risk_factor_observed = pm.Bernoulli(
            "risk_factor_observed",
            p=p_risk_factor,
            observed=risk_factor_data,
            shape=risk_factor_data.shape,
        )
        direct_cause_observed = pm.Bernoulli(
            "direct_cause_observed",
            p=p_direct_cause,
            observed=direct_cause_data,
            shape=direct_cause_data.shape,
        )
        expression_of_risk_factor_observed = pm.Categorical(
            "expression_of_risk_factor_observed",
            p=p_expression_of_risk_factor_given_risk_factor[risk_factor_observed],
            shape=expression_of_risk_factor_data.shape,
            observed=expression_of_risk_factor_data,
        )
        disease_observed = pm.Bernoulli(
            "disease_observed",
            p=p_disease_given_direct_cause_expression_of_risk_factor[
                direct_cause_observed, expression_of_risk_factor_observed
            ],
            observed=disease_data,
            shape=disease_data.shape,
        )
    return model
Sampling from the model
model = get_model()
with model:
    trace = pm.sample_prior_predictive(1000)
    trace.extend(pm.sample(1000))
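As I mentioned above, the parameter posteriors do seem to update with the training data; for example, I can summarise the learned tables with:

import arviz as az

# check that the learned tables have moved away from their priors
az.summary(
    trace,
    var_names=[
        "p_disease_given_direct_cause_expression_of_risk_factor",
        "p_expression_of_risk_factor_given_risk_factor",
    ],
)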
Out-of-sample prediction
def get_prediction(
    direct_cause: int,
    risk_factor: int,
    expression_of_risk_factor: int,
    trace: az.InferenceData,
) -> az.InferenceData:
    with get_model():
        pm.set_data(
            {
                "direct_cause_data": [direct_cause],
                "risk_factor_data": [risk_factor],
                # masked array for the node I want to predict
                "disease_data": np.ma.masked_values(np.array([-1]), value=-1),
                "expression_of_risk_factor_data": [expression_of_risk_factor],
            }
        )
        return pm.sample_posterior_predictive(
            trace, var_names=["disease_observed"], predictions=True
        )
low_risk_pred = get_prediction(0, 0, 0, trace).predictions.disease_observed.mean()
high_risk_pred = get_prediction(1, 1, 2, trace).predictions.disease_observed.mean()
These predictions are both ~0.33 (perhaps not coincidentally, the same as the overall prevalence of disease_observed in the training data).
For context, based on the training data, I would expect these queries to return roughly 0.1 and 0.9 respectively.
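For comparison, I believe I can read the learned conditional table straight out of the posterior samples, and those entries are roughly what I would expect the predictions above to return (a sketch, assuming the default (chain, draw, 2, 3) layout of the samples):

# posterior samples of P(disease | direct_cause, expression_of_risk_factor)
p = trace.posterior["p_disease_given_direct_cause_expression_of_risk_factor"].values

print(p[..., 0, 0].mean())  # direct_cause=0, expression=0: the low-risk query
print(p[..., 1, 2].mean())  # direct_cause=1, expression=2: the high-risk query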