Combination of Bayesian models in PyMC3

Why create missing variables in ‘speed’? Since the output of any Bayesian model is always a posterior distribution, isn’t it better to give all available data to the model? If this is done to test the model, wouldn’t it be better to split the data into train and test sets and then actually use the model to “predict” ‘speed’ on unseen data with unseen predictors?

The missing variable in speed is a placeholder for individuals whose speed you may want to estimate without running the exercise test. I realize now that for your use case it would also make sense to treat those same rows in HRmax and RecoveryTime as unknown, since you would have to do the exercise to have data on those as well. Within the toy dataset I created, the model sees all of the data - it just happens that some of the rows reflect patients for whom we want predictions.

As far as testing and training are concerned, we’re doing both simultaneously: sampling generates posterior estimates for the parameters (i.e. inference) and for the missing values (i.e. prediction) in one pass. With the model set up this way, the distribution of the missing values of speed properly reflects all sources of uncertainty, which may be neglected when assessing predictions with a non-Bayesian approach.

Following on from the previous question: are ‘hrmax’, ‘speed’ and ‘recovery’ fully equivalent here? That is, can I make rows missing in any of them (or all of them together)?

Yes - you can add NaNs to any of the entries of the dataframe and the resulting model should still work fine.
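One convenient pattern, assuming the data lives in a pandas DataFrame with the (hypothetical) column names from the discussion: convert each column's NaNs into a masked array, which is the form PyMC3 recognizes for imputation.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with NaNs scattered across different columns.
df = pd.DataFrame({
    "HRmax": [185.0, np.nan, 178.0],
    "Speed": [9.1, 8.7, np.nan],
    "RecoveryTime": [np.nan, 42.0, 39.0],
})

# masked_invalid turns NaNs into masked entries; any of these arrays
# can then be passed as `observed=` to a PyMC3 likelihood.
masked = {col: np.ma.masked_invalid(df[col].values) for col in df.columns}
print(masked["Speed"].mask.tolist())  # → [False, False, True]
```

The same row can be missing in one, two, or all three columns; each masked entry simply becomes its own imputed random variable.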

Adding categories of people means building a hierarchical linear model, but then again, three in one? I saw tutorials with a single model, which seemed pretty clear.

We wouldn’t need to repeat the model three times, but to implement a hierarchical linear regression efficiently we would need to make some assumptions about how to model the categories of people. Do you want to give each group a distinct intercept, or its own regression coefficients? Should this be done for all three regression equations, or are you only interested in one?

In essence, what I create is a simple linear model, and probably many uncomplicated machine learning methods will outperform it, right? How can one show the superiority of the Bayesian approach over ML or AI? What specific conditions would help?

In a research capacity, reporting the parameters of a linear model is much more straightforward for interpreting and communicating results than a model with many nonlinear parameters. If you have a large, well-understood dataset and you are solely concerned with prediction, then perhaps a machine learning model could be better for you. As far as generative models in ML go, the vast majority are designed for structured data such as time series or images, and they also require very large training datasets.