Need help with basic model - starting point

Hi everyone,

I am just starting out with probabilistic programming and PyMC, and it seems I am stuck at the first step. I am trying to figure out something very simple first and then add complexity. I would appreciate it if you could help me and point me in the right direction.

Let's say I have aggregated data for patients, broken down by gender, for two cancer diagnoses:

diagnosis      gender   number
lung cancer    male       1000
lung cancer    female      800
breast cancer  female     1600
breast cancer  male        200

I am trying to build a statistical model for this distribution and later use it to answer questions such as: "according to the model, with 95% confidence, how many lung cancer patients are there among N (say 100) randomly chosen male patients?"

The model and question are very primitive, but I am trying to figure out a generic solution that can be used for a large aggregated dataset with many dimensions and various questions.

Thank you for your help.

Hi, welcome to the forum! The fundamental things to identify when starting out are: prior(s) to encapsulate our initial belief, a likelihood to encapsulate our model, and data for updating our belief to get our posterior. So for a diagnostic problem like the one you describe, a logistic regression sounds most appropriate, since your outcome is discrete (cancer/non-cancer) and you want to model the probability of this outcome given inputs, where the inputs could be any other covariates you want to incorporate to aid modelling, e.g. race, socioeconomic status, etc.

PyMC has a few logistic regression examples in its gallery, and sometimes the best way to learn is to figure things out yourself, so I also don't want to give you too much and spoil the thrill! I'll point out a couple of examples/resources from a quick search; if you have any further questions or get stuck, then by all means reply again!


Thanks Brandon, this is very useful.

I think I more or less understand the theory behind it; however, at the practical level I am a bit confused:

  1. Let's say I use my numbers and a Binomial regression with two dimensions; do I then need to fit/train the model? I mean, my numbers perfectly define the distribution parameters, so I could just define the distribution by its parameters.

  2. Let's say I got the distribution, either by fitting or by setting the parameters; is there a generic way to answer my question "how many lung cancer patients among N male patients?" All the examples I have seen so far use sampling and then measure those samples to get the result. I would expect something simpler, like "get COUNT() where gender='male'" with N = 100.

I am asking this because for my primitive example one can get a precise answer using just basic arithmetic, but when it comes to writing code it becomes bloated with fitting, sampling, etc. Of course, I can hardcode an arithmetic formula and get a precise result, but it will not be generic. Is there a way to keep it simple and straightforward and then expand to a real use case with many dimensions?
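For reference, the "basic arithmetic" answer mentioned above can be written in a few lines with scipy, assuming (as a sketch) that p is simply taken to be the observed fraction of lung cancer among male patients:

```python
from scipy.stats import binom

# Observed fraction of lung cancer among the 1200 male patients
p = 1000 / (1000 + 200)
N = 100  # number of randomly chosen male patients

# Central 95% interval for the count of lung cancer patients among N males
lower, upper = binom.interval(0.95, N, p)
print(lower, upper)
```

This treats the observed proportion as the exact parameter, which is precisely the "not really modelling anything" shortcut discussed below; a Bayesian model instead carries the uncertainty in p through to the answer.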

BR,
Ruslan

If I'm understanding correctly, then it sounds like you would be plugging in your data to define your binomial distribution and drawing from that, in which case you aren't really modelling anything and are just doing statistical analysis of your data. To keep things generic and manageable, it is common to use a pandas DataFrame to track things like gender etc., so that when it comes to counting across other dimensions you can use 'sum'. I don't use pandas too much myself, so you would probably have to check the API for more specifics, or somebody else may reply to this thread :sweat_smile:. Additionally, using ArviZ to output something like a confidence interval via arviz.summary() is a good way to avoid getting mired in writing out arithmetic code yourself.
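As a hedged sketch of the pandas side (column names taken from the table earlier in the thread), counting across dimensions with 'sum' might look like:

```python
import pandas as pd

data = {
    "diagnosis": ["lung cancer", "lung cancer", "breast cancer", "breast cancer"],
    "gender": ["male", "female", "female", "male"],
    "number": [1000, 800, 1600, 200],
}
df = pd.DataFrame(data)

# Total patients per gender, summing over all diagnoses
per_gender = df.groupby("gender")["number"].sum()
print(per_gender["male"])  # 1200

# Count for one specific cell, e.g. lung cancer among males
mask = (df["diagnosis"] == "lung cancer") & (df["gender"] == "male")
print(df.loc[mask, "number"].sum())  # 1000
```

The same groupby/mask pattern extends to more dimensions by adding more columns and more conditions.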

To keep things practical, something like this could be done:

import pymc as pm
import arviz as az
import pandas as pd

# Aggregated counts from the table above
data = {
    "diagnosis": ["lung cancer", "lung cancer", "breast cancer", "breast cancer"],
    "gender": ["male", "female", "female", "male"],
    "number": [1000, 800, 1600, 200],
}
dataframe = pd.DataFrame(data)
# print(dataframe)

n = 100  # number of patients AKA trials
p = getp()  # getp() is left undefined here; it should derive p from the dataframe

with pm.Model() as model:
    binom = pm.Binomial("binom", n=n, p=p)
    trace = pm.sample()

az.plot_trace(trace)
print(az.summary(trace))

I haven't defined getp() since, as I mentioned, I don't use pandas too much (hopefully someone with a bit more pandas experience can reply quickly) and it would take me a bit of time to figure out the code to write. But it would be something along the lines of: take your question criteria as input (e.g. men, lung cancer), get the value from the number column that satisfies those criteria, sum over any free undefined dimensions, then divide by the total number of patients to get a probability to plug in. Just as an example, though, you could run the above code with p=.5 plugged in to see that it works.
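Purely as a hypothetical sketch of the getp() described above (the function name, signature, and event/given split are assumptions made for illustration, not an established API), it could look like:

```python
import pandas as pd

data = {
    "diagnosis": ["lung cancer", "lung cancer", "breast cancer", "breast cancer"],
    "gender": ["male", "female", "female", "male"],
    "number": [1000, 800, 1600, 200],
}
dataframe = pd.DataFrame(data)

def getp(df, event, given):
    """Fraction of patients matching `event` among those matching `given`.

    event, given: dicts mapping column name -> value, e.g.
    getp(df, event={"diagnosis": "lung cancer"}, given={"gender": "male"}).
    Any column not mentioned is summed over (a free dimension).
    """
    cond = pd.Series(True, index=df.index)
    for col, val in given.items():
        cond &= df[col] == val
    total = df.loc[cond, "number"].sum()  # denominator: all patients matching `given`
    for col, val in event.items():
        cond &= df[col] == val
    matching = df.loc[cond, "number"].sum()  # numerator: also matching `event`
    return matching / total

p = getp(dataframe, event={"diagnosis": "lung cancer"}, given={"gender": "male"})
print(p)  # 1000 / 1200
```

Adding more dimensions to the dataset then only means adding more columns; the criteria dicts stay the same shape.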

If I'm misunderstanding, and binomial regression is what you are interested in rather than using PyMC for this kind of statistical analysis, then by all means reply again! I don't know as much about binomial regression, so I can offer my two cents, or someone with more experience can chime in on the matter :sweat_smile:

Thanks Brandon,

Sorry, my confusion does not allow me to express my question/concern clearly. But I will try again:

For my primitive gender/lung/breast cancer example, to do it properly I need to think about my assumptions first (e.g. independent Binomial distributions for lung/breast, separately for each gender), then model with these assumptions, and then get results from the model while taking the assumptions into account. Using this approach I got something like this:

import pymc as pm
import numpy as np

# Data setup in matrices
# Rows represent genders (male, female), columns represent diseases (lung cancer, breast cancer)
n_patients = np.array([[1200, 1200],   # Male patients for lung and breast cancer
                       [2400, 2400]])  # Female patients for lung and breast cancer
n_cases = np.array([[1000, 200],       # Lung and breast cancer cases in males
                    [800, 1600]])      # Lung and breast cancer cases in females

with pm.Model() as model:
    # Prior distributions for the probabilities of diseases in each gender
    # Shape (2, 2): 2 genders x 2 diseases
    p_diseases = pm.Beta('p', alpha=1, beta=1, shape=(2, 2))
    
    # Binomial likelihoods
    observations = pm.Binomial('obs', n=n_patients, p=p_diseases, observed=n_cases)
    
    # Posterior distribution
    trace = pm.sample(10000, return_inferencedata=False, cores=1)

# Output the summary of the trace
#print(pm.summary(trace))


# Extract the posterior samples for lung cancer probability in males
p_lung_cancer_males = trace.get_values('p')[:, 0, 0]  # Assuming index 0,0 corresponds to males and lung cancer

# Number of new patients (N = 100 in the original question)
N = 100

# Simulate posterior predictive observations
simulated_cases = np.random.binomial(n=N, p=p_lung_cancer_males)

# Calculate the 95% credible interval
ci_lower = np.percentile(simulated_cases, 2.5)
ci_upper = np.percentile(simulated_cases, 97.5)

print(f"95% credible interval for the number of lung cancer cases among {N} randomly chosen male patients: [{ci_lower}, {ci_upper}]")

The problem with this approach is that my dataset is very complex, and it does not seem realistic to create valid assumptions and always keep them in mind. In your example, this would be the logic of the getp() method.

E.g. if I add a third gender, 'no information', and try to cover the scenario where a patient may have breast and lung cancer at the same time, I will need to recreate the entire model as well as adjust how the model is used to answer my question(s).

All this makes it very hard and not feasible. I was thinking maybe there is another approach, like 'traditional' ML: you just choose an ML model, train it, and easily get answers. Or some way to 'train' a statistical model and get answers from it without worrying too much about the underlying assumptions.

Aye, this makes things a lot clearer. Addressing first the ideas around scaling up to include other factors and things like gender not being specified: missing data is a known issue in modelling, and there are ways around it, like imputation of missing data (Bayesian Missing Data Imputation — PyMC example gallery). [This link also has some useful working-with-pandas insights that you may be able to glean.] As for scaling up to include other factors, and the fact that people may have, e.g., both lung cancer and breast cancer: I would say this coding complexity stems from aggregating the data (by gender, in the example). So if it is possible to get the raw unaggregated dataset, shouldn't that simplify things? Your total number of patients would come from the entries (rows) of the pandas DataFrame, and you could also just count along the relevant diagnostic result of interest via pandas.

As for the modelling itself, I see you have opted for the beta-binomial model, which has cleared things up a bit. The core thing to think about with this approach is how you are treating p_diseases, which is where understanding the modelling process is essential. I will admit that in this approach (or even in general) it isn't immediately obvious to me, theory-wise, how to tackle multiple diagnoses for a single patient in this modelling picture, or whether it even needs to be tackled in a beta-binomial model.

As a note on the point about treating these factors independently vs. assuming a relationship between them: that falls under the umbrella of 'pooling' and hierarchical modelling (A Primer on Bayesian Methods for Multilevel Modeling — PyMC example gallery, Hierarchical Partial Pooling — PyMC example gallery). I think this likely ties in with the previous point, since understanding how to approach this modelling is needed to scale up to the bigger picture. This degree-of-pooling aspect is, I think, the answer to what you are saying about the assumption of modelling these things independently. I am not sure how to do this kind of hierarchical beta-binomial model; you can set hyperparameter priors on the alpha and beta arguments of your Beta distribution for 'p', but I'm not certain whether this is appropriate, nor how to map it to your problem. Perhaps looking into pooling could help make things a bit clearer; a quick search online shows there are hierarchical beta-binomial models about, so maybe have a look there too.

As for the final remark: ultimately, yes, once a good model for your problem is established, you can just apply it to other datasets (with metrics to evaluate performance, as in traditional ML, e.g. LOO cross-validation). But any model has its assumptions, so it isn't really a case of not worrying about the assumptions, but rather of acknowledging where the assumptions of your model lie, like the degree of pooling for instance; this can help you understand why the performance of a model may break down in certain scenarios. But this kind of modelling is done; it's just a case of defining a good model for your case :sweat_smile:

I think that covers all your points, some of them unfortunately I don’t have the immediate answers for but I hope I’ve given you some good leads to contribute to clearing up the confusion!


Thanks Brandon, a lot of information for me to digest.

Let me explore all these links and come back.