There are quite a few places where the Stan model and the pymc3 model differ, but what usually makes the biggest difference for model convergence is the centering of the predictor matrix:
transformed data {
  int Kc = K - 1;
  matrix[N, K - 1] Xc;     // centered version of X
  vector[K - 1] means_X;   // column means of X before centering
  for (i in 2:K) {
    means_X[i - 1] = mean(X[, i]);
    Xc[, i - 1] = X[, i] - means_X[i - 1];
  }
}
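You should do the same for your input X in Python before passing it to the pymc3 model: drop the first column (the loop starts at 2, so it is skipped; presumably it is the intercept column) and subtract each remaining column's mean from that column.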
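A minimal NumPy sketch of the same centering step, assuming X is an (N, K) array whose first column is the intercept column. The function name `center_predictors` and the example data are made up for illustration; only `Xc` and `means_X` mirror the names in the Stan block.

```python
import numpy as np

def center_predictors(X):
    """Drop the first column of X and center the rest, like the Stan block."""
    X = np.asarray(X, dtype=float)
    # Column means of columns 2..K (Stan is 1-indexed, NumPy is 0-indexed)
    means_X = X[:, 1:].mean(axis=0)
    # Centered predictors, shape (N, K - 1)
    Xc = X[:, 1:] - means_X
    return Xc, means_X

# Hypothetical design matrix: a column of ones plus two predictors
X = np.array([[1.0, 2.0, 10.0],
              [1.0, 4.0, 20.0],
              [1.0, 6.0, 60.0]])
Xc, means_X = center_predictors(X)
```

Use `Xc` in place of X in the pymc3 model, with a separate intercept term. Note that the intercept you then estimate is the one for the centered predictors; to get it back on the original scale, subtract `means_X @ b` from it, where `b` is the vector of slope coefficients.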