PyMC3 and Heterogeneous datasets

Hello everyone!
I am trying to build a classification model in PyMC3 that predicts mortality in cardiac patients from their clinical variables. My dataset contains 66 variables, both discrete and continuous. In fact, most of my variables are dichotomous, with a few continuous ones in between. This kind of dataset is proving difficult to model in PyMC3. I need some guidance on how to handle these mixed datasets. What is the proper way to instantiate the priors for dichotomous variables and then use them in conjunction with continuous variables to build a predictive model? Kindly point me towards any relevant sources or tutorials from which I can learn to model these kinds of mixed/heterogeneous datasets. Any insight from experienced members who have prior exposure to such datasets is also highly appreciated.

Thanks a lot!

Can you explain a bit more about your model? Is your dependent variable continuous or discrete? What have you tried so far, and what difficulties have you run into?

If you are in a simple GLM framework, i.e. trying to estimate y = f(X\beta), it shouldn’t be a problem at all that you have a mixture of continuous and discrete variables, because the model weights \beta are all continuous regardless.

You might be facing a data preprocessing problem and not necessarily a modeling problem. But it’s hard to know without more information.


First of all, accept my gratitude for replying to my query. To answer your first question: the target variable in my dataset is dichotomous, taking only 1s and 0s (it’s a classification problem). Furthermore, until now I have tried instantiating the betas for each continuous and discrete feature separately. For continuous variables I mostly used either normal or half-normal priors, and for discrete ones I used binomial priors. I then used the Bernoulli distribution as my likelihood, with the probability of success p given by the logistic function. But the problem is that while sampling, the chains diverge, which I have read somewhere is due to the mixed nature of the dataset. I don’t know how to treat this problem so that my chains converge and I can get predictions from the model.

The divergences are not because your data is mixed – mixed data is normal, and happens in many settings. Your model is not correctly specified. Some comments, then some concrete suggestions.

Although a variable in your dataset might only take the values zero and one, the beta coefficient on that variable has no such restriction. Consider a regression model first, predicting annual worker wages as a function of years of experience and gender:
\text{wage}_i = \beta_0 + \beta_1 \text{years\_experience}_i + \beta_2 \text{female}_i

So when the i-th individual in the dataset is male, \text{female}_i = 0. Is \beta_2 bounded between zero and one? Certainly not: we might think that there is a gender wage gap in this industry, so that \beta_2 is something like negative 100 dollars. That is, given this sample and holding years of experience constant, our model predicts that a female worker earns 100 dollars less than an otherwise identical male worker. Indeed, it wouldn’t make sense to restrict \beta_2 to between zero and one: doing so rules out the possibility of gender bias completely, and restricts the effect gender can have on one’s wages to between zero and one dollars.

As you can see, the scale of the \beta coefficients is connected to the scale of y, as well as to that of the variable it “goes with”. Consider years of experience. This is an integer-valued variable, but we don’t use a Poisson distribution to model \beta_1. We still use a normal, because we might expect that an extra year of experience gives, say, an extra 10.23 dollars, or any other continuously valued amount. That is, a one-year increase in experience can have a continuous range of effects on the wage one earns, despite the variable itself being integer-valued.

What about a classification setting? Suppose we instead want to predict whether a worker will be promoted in a given year, again given years of experience and gender: \text{promotion}_i = \beta_0 + \beta_1 \text{years\_experience}_i + \beta_2 \text{female}_i

Now do we need to constrain all the \beta coefficients to be between zero and one? Again no. Even if you did assign Bernoulli priors to all the betas, this doesn’t guarantee it all adds up to something between 0 and 1. This is why you use the logistic function to model the probability. In logit space, before the logistic transformation, the values of \beta can be anything in \mathbb{R}, then you squish them all into the 0-1 interval to make your prediction. Life is easier that way. The key point: although your predictions are between 0-1, in logit space, your variables all still have scale. You need to think about this scale when building your model!
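For concreteness, here is that squashing step in plain numpy (a toy illustration, separate from any model):

```python
import numpy as np

def logistic(z):
    """Map any real-valued logit onto a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Logits can be any real number; the outputs always land in (0, 1).
for z in [-100.0, -2.5, 0.0, 2.5, 100.0]:
    print(f"logit {z:>7.1f} -> probability {logistic(z):.4f}")
```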

Now the suggestions. You should not be putting binomial priors on any of your betas. They should all just be normal distributions, unless you have a compelling reason to do otherwise. If you don’t know the scale of the parameter effects, you can set the prior standard deviation to be large, or you can re-scale all your data using e.g. sklearn.preprocessing.StandardScaler.
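Putting this together, a minimal PyMC3 sketch might look like the following. The data is a synthetic stand-in and every name is illustrative; the point is the shape of the model: standardize the continuous columns, leave the 0/1 indicators alone, put normal priors on every coefficient, and parameterize the Bernoulli likelihood in logit space.

```python
import numpy as np
import pymc3 as pm
from sklearn.preprocessing import StandardScaler

# Toy stand-in data (replace with your own): 3 continuous features on very
# different scales, 3 dichotomous 0/1 indicators, and a binary target.
rng = np.random.default_rng(0)
n = 500
X_cont = rng.normal(size=(n, 3)) * np.array([5.0, 100.0, 0.1])
X_bin = rng.integers(0, 2, size=(n, 3)).astype(float)
y = rng.integers(0, 2, size=n)  # placeholder 0/1 labels

# Standardize only the continuous columns; leave the 0/1 indicators alone.
X = np.column_stack([StandardScaler().fit_transform(X_cont), X_bin])

with pm.Model() as model:
    # Normal priors on every coefficient, continuous and dichotomous alike.
    intercept = pm.Normal("intercept", mu=0.0, sigma=10.0)
    beta = pm.Normal("beta", mu=0.0, sigma=10.0, shape=X.shape[1])

    # The linear predictor lives in logit space and can be any real number...
    logit_p = intercept + pm.math.dot(X, beta)

    # ...and the Bernoulli likelihood squashes it into (0, 1).
    pm.Bernoulli("y_obs", logit_p=logit_p, observed=y)

    trace = pm.sample(1000, tune=1000, target_accept=0.9)
```

If divergences persist with a model like this, the usual next steps are raising target_accept or tightening the priors.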

You should also make sure that none of your indicator variables are perfectly collinear. In my example, we cannot have one variable for male and one for female; you need to drop one to be the “baseline”. If you have a categorical variable that takes one of n values encoded as “dichotomous” features, you can only include n-1 of them in your model.
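In pandas this is one call; the column and level names here are hypothetical:

```python
import pandas as pd

# Hypothetical categorical clinical variable with n = 3 levels.
df = pd.DataFrame({"chest_pain": ["typical", "atypical", "none", "typical"]})

# drop_first=True keeps n - 1 indicators; the dropped level ("atypical",
# first alphabetically) becomes the baseline, avoiding perfect collinearity
# with the intercept.
dummies = pd.get_dummies(df["chest_pain"], prefix="cp", drop_first=True)
print(dummies)
```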

I strongly recommend the Econometrics series by Ben Lambert on YouTube as a comprehensive primer on applied statistical modeling. You might check out the video about logit and probit models, since that’s the model family you are working with, as well as the videos on interpreting regression coefficients and on dummy variables.


Thank you so much for your reply. It has certainly put some things into perspective. I will definitely look into the resources you have listed.
Once again, thank you for taking the time to answer my queries!
