Hello! I am trying to build a Bayesian network and fit it to some training data using PyMC. I have been reading through the documentation and this forum a lot, and I think I am almost there, but I have become stuck trying to generate out-of-sample predictions conditioned on new observed data.

I have tried to simplify the problem as much as possible but I have had to keep some complexity to ensure that it is representative of the real network.

**What I expect**

I want to define the structure of the network, then use training data to parametrize the relationships and variables.

I then want to use the parametrized network to generate out-of-sample predictions conditioned on new data points.

**What happens**

The probability distributions appear to update correctly given the training data, but when I provide out-of-sample data, the result in the trace does not seem to be conditioned on the new data points.

I read that the posterior cannot be sampled for observed variables, and maybe this is what is causing the problem, but if that is the case I am not sure how to model this network using PyMC.

I have also read a few posts about taking the posterior from one model and using it to define the prior of the next - is this what I should be doing?
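For instance, is the idea something like moment-matching a Beta distribution to the first model's posterior samples and then using that as the prior in a second, prediction-only model? (This is just my guess at what those posts mean; the helper below is my own, not from PyMC.)

```python
import numpy as np

def beta_from_samples(samples: np.ndarray) -> tuple[float, float]:
    """Moment-match a Beta(alpha, beta) distribution to posterior samples."""
    m, v = samples.mean(), samples.var()
    common = m * (1.0 - m) / v - 1.0
    return m * common, (1.0 - m) * common

# e.g. in a second, prediction-only model:
# alpha, beta = beta_from_samples(trace.posterior["p_direct_cause"].values.ravel())
# p_direct_cause = pm.Beta("p_direct_cause", alpha=alpha, beta=beta)
```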

Please see the code snippets below. The full code and data can be viewed here. I am very new to both Bayesian modelling and PyMC, so I really appreciate your time and help, and apologise if the question is very basic!

**DAG**

The model should have the following structure:

```
risk_factor -> expression_of_risk_factor -> disease
direct_cause -> disease
```
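For illustration only (this is not my actual data, and the probability tables below are made up), data with this structure could be simulated like so:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000

risk_factor = rng.binomial(1, 0.3, size=n)
direct_cause = rng.binomial(1, 0.2, size=n)

# P(expression | risk_factor): rows = risk_factor (0/1), columns = 3 expression levels
p_expr = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.3, 0.6]])
expression_of_risk_factor = np.array(
    [rng.choice(3, p=p_expr[r]) for r in risk_factor]
)

# P(disease | direct_cause, expression): shape (2, 3)
p_dis = np.array([[0.05, 0.2, 0.5],
                  [0.4, 0.7, 0.95]])
disease = rng.binomial(1, p_dis[direct_cause, expression_of_risk_factor])

df = pd.DataFrame({
    "risk_factor": risk_factor,
    "direct_cause": direct_cause,
    "expression_of_risk_factor": expression_of_risk_factor,
    "disease": disease,
})
```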

**Model specification**

I have specified the model as follows:

```
import numpy as np
import pymc as pm

# df is the training DataFrame (columns: direct_cause, risk_factor,
# expression_of_risk_factor, disease)

def get_model():
    with pm.Model() as model:
        # data containers for each node
        direct_cause_data = pm.Data("direct_cause_data", df["direct_cause"])
        risk_factor_data = pm.Data("risk_factor_data", df["risk_factor"])
        disease_data = pm.Data("disease_data", df["disease"])
        expression_of_risk_factor_data = pm.Data(
            "expression_of_risk_factor_data", df["expression_of_risk_factor"]
        )
        # flat priors
        p_direct_cause = pm.Beta("p_direct_cause", alpha=2, beta=2)
        p_risk_factor = pm.Beta("p_risk_factor", alpha=2, beta=2)
        p_disease_given_direct_cause_expression_of_risk_factor = pm.Beta(
            "p_disease_given_direct_cause_expression_of_risk_factor",
            alpha=2,
            beta=2,
            shape=(2, 3),
        )
        p_expression_of_risk_factor_given_risk_factor = pm.Dirichlet(
            "p_expression_of_risk_factor_given_risk_factor", a=np.ones((2, 3))
        )
        # observed variables
        risk_factor_observed = pm.Bernoulli(
            "risk_factor_observed",
            p=p_risk_factor,
            observed=risk_factor_data,
            shape=risk_factor_data.shape,
        )
        direct_cause_observed = pm.Bernoulli(
            "direct_cause_observed",
            p=p_direct_cause,
            observed=direct_cause_data,
            shape=direct_cause_data.shape,
        )
        expression_of_risk_factor_observed = pm.Categorical(
            "expression_of_risk_factor_observed",
            p=p_expression_of_risk_factor_given_risk_factor[risk_factor_observed],
            shape=expression_of_risk_factor_data.shape,
            observed=expression_of_risk_factor_data,
        )
        disease_observed = pm.Bernoulli(
            "disease_observed",
            p=p_disease_given_direct_cause_expression_of_risk_factor[
                direct_cause_observed, expression_of_risk_factor_observed
            ],
            observed=disease_data,
            shape=disease_data.shape,
        )
    return model
```

**Sampling from the model**

```
with model:
    trace = pm.sample_prior_predictive(1000)
    trace.extend(pm.sample(1000))
```

**Out-of-sample prediction**

```
import arviz as az

def get_prediction(
    direct_cause: int,
    risk_factor: int,
    expression_of_risk_factor: int,
    trace: az.InferenceData,
) -> az.InferenceData:
    with get_model():
        pm.set_data(
            {
                "direct_cause_data": [direct_cause],
                "risk_factor_data": [risk_factor],
                # masked array for the node I want to predict
                "disease_data": np.ma.masked_values(np.array([-1]), value=-1),
                "expression_of_risk_factor_data": [expression_of_risk_factor],
            }
        )
        return pm.sample_posterior_predictive(
            trace, var_names=["disease_observed"], predictions=True
        )

low_risk_pred = get_prediction(0, 0, 0, trace).predictions.disease_observed.mean()
high_risk_pred = get_prediction(1, 1, 2, trace).predictions.disease_observed.mean()
```

These predictions are both ~0.33 (perhaps not coincidentally, the same as the overall prevalence of `disease_observed` in the training data).

For context, based on the training data, I would expect these queries to return roughly 0.1 and 0.9 respectively.
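For reference, I arrived at those expectations from the conditional frequencies in the training data, along these lines (in the DAG, disease depends only on `direct_cause` and `expression_of_risk_factor`, so I condition on those two):

```python
import pandas as pd

def empirical_disease_rate(df: pd.DataFrame, direct_cause: int, expression: int) -> float:
    """Frequency of disease among training rows matching the given parent values."""
    mask = (df["direct_cause"] == direct_cause) & (
        df["expression_of_risk_factor"] == expression
    )
    return df.loc[mask, "disease"].mean()

# low-risk query:  empirical_disease_rate(df, 0, 0)
# high-risk query: empirical_disease_rate(df, 1, 2)
```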