The divergences are not happening because your data mixes binary and continuous variables – that is normal, and happens in many settings. They are happening because your model is not correctly specified. Some comments first, then some concrete suggestions.
Although a variable in your dataset might only take the values zero and one, the beta coefficient on that variable has no such restriction. Consider a regression model first, predicting annual worker wages as a function of years of experience and gender:
\text{wage}_i = \beta_0 + \beta_1 \text{years_experience}_i + \beta_2 \text{female}_i
So when the i-th individual in the dataset is male, \text{female}_i = 0. Is \beta_2 bounded between zero and one? Certainly not: we might think there is a gender wage gap in this industry, so that \beta_2 is something like negative 100 dollars. That is, given this sample and holding years of experience constant, if the i-th individual were female rather than male, our model predicts she would earn 100 dollars less. Indeed, it wouldn’t make sense to restrict \beta_2 to between zero and one: doing so rules out the possibility of gender bias completely, and restricts the effect gender can have on one’s wages to between zero and one dollar.
As you can see, the scale of the \beta coefficients is connected to the scale of y, as well as to the scale of the variable each coefficient “goes with”. Consider years of experience. This is an integer-valued variable, but we don’t use a Poisson distribution to model \beta_1. We still use a normal, because we might expect that an extra year of experience gives, say, an extra 10.23 dollars, or any other continuously valued amount. That is, a one-year increase in experience can have a continuous range of effects on the wage one earns, even though the variable itself is integer valued.
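To make this concrete, here is a minimal sketch of the wage model with normal priors on every coefficient, including the one multiplying the 0/1 indicator. I’m assuming PyMC since you mention divergences (the same idea carries over to Stan or any other sampler), and the data and prior scales are made up for illustration:

```python
import numpy as np
import pymc as pm

# Simulated data: integer years of experience and a 0/1 gender indicator
rng = np.random.default_rng(0)
n = 500
years_experience = rng.integers(0, 30, size=n)
female = rng.integers(0, 2, size=n)
wage = 30_000 + 1_000 * years_experience - 100 * female + rng.normal(0, 2_000, size=n)

with pm.Model():
    # Normal priors on every coefficient -- even the one on the 0/1 variable.
    # The prior scales are guesses in dollars, not probabilities.
    beta0 = pm.Normal("beta0", mu=0, sigma=50_000)
    beta1 = pm.Normal("beta1", mu=0, sigma=5_000)
    beta2 = pm.Normal("beta2", mu=0, sigma=5_000)  # free to come out negative, e.g. a wage gap
    sigma = pm.HalfNormal("sigma", sigma=10_000)

    mu = beta0 + beta1 * years_experience + beta2 * female
    pm.Normal("wage", mu=mu, sigma=sigma, observed=wage)

    idata = pm.sample()
```

The posterior for beta2 will happily sit around -100 here, which a Bernoulli prior could never allow.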
What about a classification setting? Suppose we instead want to predict whether a worker will be promoted in a given year, again as a function of years of experience and gender:

\text{promotion}_i = \beta_0 + \beta_1 \text{years_experience}_i + \beta_2 \text{female}_i
Now do we need to constrain all the \beta coefficients to be between zero and one? Again, no. Even if you did assign Bernoulli priors to all the betas, that wouldn’t guarantee the whole sum lands between 0 and 1. This is why you use the logistic function to model the probability. In logit space, before the logistic transformation, the values of \beta can be anything in \mathbb{R}; the logistic function then squashes the linear predictor into the 0-1 interval to make your prediction. Life is easier that way. The key point: although your predictions are between 0 and 1, in logit space your variables all still have scale. You need to think about this scale when building your model!
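Here is a sketch of what that looks like in practice (again assuming PyMC, with simulated data and made-up names): the priors live on the whole real line in logit space, and only the linear predictor gets passed through the logistic function.

```python
import numpy as np
import pymc as pm

# Simulated promotion outcomes from a known logit-space linear predictor
rng = np.random.default_rng(1)
n = 500
years_experience = rng.integers(0, 30, size=n)
female = rng.integers(0, 2, size=n)
true_eta = -2.0 + 0.1 * years_experience - 0.5 * female
promotion = rng.binomial(1, 1 / (1 + np.exp(-true_eta)))

with pm.Model():
    # Normal priors in logit space: the betas can be any real number
    beta0 = pm.Normal("beta0", mu=0, sigma=2.5)
    beta1 = pm.Normal("beta1", mu=0, sigma=2.5)
    beta2 = pm.Normal("beta2", mu=0, sigma=2.5)

    eta = beta0 + beta1 * years_experience + beta2 * female  # unbounded
    p = pm.math.invlogit(eta)  # squashed into (0, 1)
    pm.Bernoulli("promotion", p=p, observed=promotion)

    idata = pm.sample()
```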
Now the suggestions. You should not be putting binomial priors on any of your betas. They should all just be normal distributions, unless you have a compelling reason to do otherwise. If you don’t know what the scale of the parameter effects is, you can set the prior standard deviation to be large, or you can re-scale all your data using e.g. sklearn.preprocessing.StandardScaler.
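A quick sketch of the re-scaling option; it is common to scale only the continuous predictors and leave the 0/1 indicators alone:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
years_experience = rng.integers(0, 30, size=(500, 1)).astype(float)
female = rng.integers(0, 2, size=(500, 1))

# Scale the continuous column to mean 0, sd 1; keep the indicator as-is
years_scaled = StandardScaler().fit_transform(years_experience)
X = np.column_stack([years_scaled, female])
```

With predictors on roughly unit scale, a modest prior like Normal(0, 2.5) already covers a wide range of plausible effects.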
You should also make sure that none of your indicator variables are perfectly collinear. In my example, we cannot have one variable for male and one for female; you need to drop one to serve as the “baseline”. If you have a categorical variable that takes one of n values, encoded as “dichotomous” features, you can only include n-1 of them in your model.
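If you are building those indicators from a categorical column with pandas, drop_first handles this for you (hypothetical column name):

```python
import pandas as pd

df = pd.DataFrame({"region": ["north", "south", "east", "north", "south"]})

# n = 3 categories -> 2 indicator columns; "east" becomes the baseline
dummies = pd.get_dummies(df["region"], drop_first=True)
print(dummies.columns.tolist())  # ['north', 'south']
```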
I strongly recommend the Econometrics series by Ben Lambert on YouTube as a comprehensive primer on applied statistical modeling. You might check out the videos on logit and probit models, since that’s the family of models you are working with, as well as those on the interpretation of regression coefficients and the overview of dummy variables.