Thanks Brandon,
Sorry, my confusion keeps me from expressing my question/concern clearly, but I will try again.
For my primitive gender/lung/breast cancer example, the proper way to do it is to think about my assumptions first (e.g. independent Binomial distributions for lung and breast cancer, separately for each gender), then build the model from these assumptions, and then read results off the model while keeping those assumptions in mind. Using this approach I got something like this:
import pymc as pm
import numpy as np

# Data setup in matrices
# Rows represent genders (male, female), columns represent diseases (lung cancer, breast cancer)
n_patients = np.array([[1200, 1200],   # Male patients for lung and breast cancer
                       [2400, 2400]])  # Female patients for lung and breast cancer
n_cases = np.array([[1000, 200],       # Lung and breast cancer cases in males
                    [800, 1600]])      # Lung and breast cancer cases in females

with pm.Model() as model:
    # Prior distributions for the probabilities of diseases in each gender
    # Shape (2, 2): 2 genders x 2 diseases
    p_diseases = pm.Beta('p', alpha=1, beta=1, shape=(2, 2))
    # Binomial likelihoods
    observations = pm.Binomial('obs', n=n_patients, p=p_diseases, observed=n_cases)
    # Posterior distribution
    trace = pm.sample(10000, return_inferencedata=False, cores=1)

# Output the summary of the trace
#print(pm.summary(trace))

# Extract the posterior samples for lung cancer probability in males
p_lung_cancer_males = trace.get_values('p')[:, 0, 0]  # Assuming index 0,0 corresponds to males and lung cancer

# Number of new patients
N = 1200

# Simulate observations from the posterior (one Binomial draw per posterior sample)
simulated_cases = np.random.binomial(n=N, p=p_lung_cancer_males)

# Calculate the 95% credible interval
ci_lower = np.percentile(simulated_cases, 2.5)
ci_upper = np.percentile(simulated_cases, 97.5)
print(f"95% credible interval for the number of lung cancer cases among {N} randomly chosen male patients: [{ci_lower}, {ci_upper}]")
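As a sanity check on the numbers above, I think the same interval can be obtained without MCMC, because a Beta(1, 1) prior combined with a Binomial likelihood gives a Beta(1 + cases, 1 + patients - cases) posterior in closed form. Below is a minimal NumPy-only sketch under exactly the same independence assumptions; the seed and the number of draws (n_draws) are arbitrary choices of mine, not something the model requires.

```python
import numpy as np

# Same counts as in the PyMC model: rows = genders (male, female),
# columns = diseases (lung cancer, breast cancer).
n_patients = np.array([[1200, 1200],
                       [2400, 2400]])
n_cases = np.array([[1000, 200],
                    [800, 1600]])

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
n_draws = 20000                 # arbitrary number of posterior draws

# Beta(1, 1) prior + Binomial likelihood -> Beta(1 + cases, 1 + non-cases) posterior.
alpha_post = 1 + n_cases
beta_post = 1 + n_patients - n_cases
p_draws = rng.beta(alpha_post, beta_post, size=(n_draws, 2, 2))

# Simulated case counts among N new patients for every gender/disease cell.
N = 1200
simulated = rng.binomial(n=N, p=p_draws)

# 95% interval per cell; index [0, 0] is males / lung cancer.
lower, upper = np.percentile(simulated, [2.5, 97.5], axis=0)
print(f"males, lung cancer: [{lower[0, 0]}, {upper[0, 0]}]")
```

If the sampler is working, the interval for cell [0, 0] should be close to the one computed from the trace. This shortcut only holds for this simple conjugate setup, so it is a check on the toy example and not a replacement for the model.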
The problem with this approach is that my dataset is very complex, and it does not seem realistic to come up with valid assumptions and keep them in mind all the time. In your example, that would be the logic of the getp() method.
E.g. if I add a third gender, "no information", and try to cover the scenario where a patient may have breast and lung cancer at the same time, I will need to rebuild the entire model and also adjust the way I use the model to answer my question(s).
All this makes it very hard, to the point of not being feasible. I was wondering whether there is another approach, more like "traditional" ML, where you just choose a model, train it, and easily get answers. Or some way to "train" a statistical model and get answers from it without worrying too much about the underlying assumptions.