Forgive me if these are dumb questions; I'm still learning. I'm trying to model the data generating process for a disease.
My dataset has health metrics (just one here for simplicity) and the disease state (true or false). If a person (sample) has the disease, the health metric changes; otherwise the metric stays at normal levels:
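To make that setup concrete, here is a small NumPy simulation of the data generating process I have in mind (the parameter values below are made up just for illustration; in the real model they are latent):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Illustrative values only -- in the model these are latent parameters.
p_disease = 0.3   # prevalence
mu = 5.0          # normal (healthy) metric level
rho = 1.5         # multiplicative shift when diseased
sigma = 1.0       # observation noise

has_disease = rng.binomial(1, p_disease, size=n)
# Diseased samples have mean rho * mu; healthy samples have mean mu.
metric1 = rng.normal(np.where(has_disease == 1, rho * mu, mu), sigma)
```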
```python
import pymc3 as pm
import theano.tensor as tt

with pm.Model() as model:
    p_disease = pm.Uniform('p_disease', lower=0, upper=1)
    disease = pm.Bernoulli('disease', p=p_disease, observed=has_disease)

    μ_metric1 = pm.Normal('μ_metric1', 0, sigma=10)
    ρ_metric1 = pm.Normal('ρ_metric1', 1.5, sigma=2)
    σ_metric1 = pm.HalfCauchy('σ_metric1', 1)

    data_metric1 = pm.Data('data_metric1', X_train)
    metric1 = pm.Normal('metric1',
                        # If there is disease, the metric changes; if not, it stays at normal levels.
                        mu=tt.switch(tt.eq(disease, 1), ρ_metric1 * μ_metric1, μ_metric1),
                        sigma=σ_metric1,
                        observed=data_metric1)

    trace = pm.sample()
```
I have a few questions in order of priority:
- Is it possible to predict whether someone has a disease (binary classification) using this model, given health metric data? Am I formulating the wrong model? Note: I have another model that uses logistic regression that works correctly, but I’m curious about this particular model.
- If I put the observed variable `has_disease` into its own `pm.Data()`, I run into optimization issues. What's going wrong?
- Is there a way to handle missing data and still get predictions? `pm.Data()` doesn't like it, and I use it during the prediction step, which looks like this:
```python
with model:
    pm.set_data({"data_metric1": X_test})
    predictions = pm.sample_posterior_predictive(trace)
```
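To clarify the first question: what I think I want amounts to Bayes' rule over the two mixture components, p(disease | metric) ∝ p(metric | disease) · p(disease). A plug-in sketch in plain NumPy, substituting point estimates (e.g. posterior means from the trace; the numbers below are made up), would look roughly like this:

```python
import numpy as np

# Plug-in point estimates, e.g. posterior means from the trace (made-up values here).
p_disease, mu, rho, sigma = 0.3, 5.0, 1.5, 1.0

def normal_pdf(x, loc, scale):
    """Density of a Normal(loc, scale) distribution."""
    return np.exp(-0.5 * ((x - loc) / scale) ** 2) / (scale * np.sqrt(2 * np.pi))

def p_disease_given_metric(x):
    """Bayes' rule over the two components: diseased mean rho*mu, healthy mean mu."""
    lik_disease = normal_pdf(x, rho * mu, sigma)
    lik_healthy = normal_pdf(x, mu, sigma)
    num = lik_disease * p_disease
    return num / (num + lik_healthy * (1 - p_disease))
```

So a metric value near `rho * mu` should give a high disease probability, and a value near `mu` a low one. Is that the right way to think about classification with this model?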
Thanks in advance!