Let's say I have come up with 3 linear models for responses A, B and C, dependent on predictors a#, b# and c#, plus some interconnections:
A ~ linear(a1,a2,a3)
B ~ linear(b1,b2, A)
C ~ linear(c1,b2,A,B)
I have observed data for everything obviously.
General question: How do I design such a model and implement it in pymc3?
More detailed:
Do I make three models and then combine them somehow?
Or is this one model with three likelihoods?
You can write down a single joint model and estimate all the parameters simultaneously. I'd be happy to show how to do this with a few more pieces of information. I know you mentioned you have observed data, but some of the quantities must be unobserved. Of these, which need to be estimated? Also, are you assuming the standard set of prior distributions for the linear regression parameters?
Thanks in advance for the help. I will try to make it more concrete.
Assume a simplified situation: I noted down people's physiological parameters (age, BMI, resting heart rate, etc.) and asked them to perform exercises. From that I measured their HRmax, how well they perform (speed), and how long it takes them to recover (recoveryTime).
The last three are related to each other like this (it is my assumption, based on expert knowledge):
HRmax ~ Age + BMI + HRrest
Speed ~ Age + HRmax
RecoveryTime ~ HRrest + HRmax + Speed
I have N people-records with full information (that is what I mean by "I have observed data for everything"). In the end I am interested in predicting how well people will do the exercise (Speed), but RecoveryTime and HRmax are also of secondary interest, because I want to tell all three without asking someone to do the exercise, just from their physiology metrics. Is that possible, or do I have to select one response?
Maybe going slightly ahead of myself, another related question: the N people-records I have are divided into groups/categories, e.g. smokers and non-smokers or something else. I would like to see how predictions differ between those groups. Is it possible to incorporate that into such a model?
I would appreciate it if you could help me with an explanation, references, or by pointing out naive assumptions I have about Bayesian modeling.
Your explanation is clear, and you can implement this in PyMC3, but I want to make sure it isn't overkill for this problem. Is it perhaps the case that you want to treat speed as only partially observed? Then, as you mentioned, you could estimate what someone's recovery time would be without observing their speed.
Adding in categories/groups will be easy if there are only two levels in each category, since then we can just include a 0/1 indicator term for them, as sketched below. Do any of your categories have several levels?
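For instance, here is a minimal sketch (hypothetical names: a 0/1 is_smoker column and a design matrix X of the other predictors); the indicator simply enters the linear predictor as one extra term with its own coefficient:
# Hypothetical sketch: is_smoker is a 0/1 array of length n
is_smoker = df['is_smoker'].values
beta_smoker = pm.Normal('beta_smoker', sd=10)
mu = intercept + pm.math.dot(X, coefficients) + beta_smoker * is_smoker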
There is one categorical variable, but it can take 4 values (I simplified it in the explanation). If that complicates things a lot, I can also bin the 4 classes into 2, which would also increase the number of observations per class, which is good.
As I am just starting to learn PyMC, I think a simpler case is better for me to start with. I can expand later, once I understand why speed being partially observed is better than fully observed. I want to use an existing dataset to create a use case for a probabilistic modeling demo, so that my colleagues and I can evaluate whether this is worth investing time in.
Here's an answer for dealing with partial observation of speed, with no categorical variables. This is the simplest case that leads to a nontrivial model; with fully observed data, you're just left with three separate linear regressions, and there are numerous tutorials for that. There are a few moving parts here, so feel free to ask any follow-up questions.
import numpy as np
import pandas as pd
import pymc3 as pm
# Make some fake data
columns = ['hrmax','age','bmi','hrrest','speed','recovery']
n = 20
p = len(columns)
df = pd.DataFrame(data=np.random.randn(n,p), columns=columns)
# Make three of the rows missing for the speed variable
df.loc[[0, 10, 15], 'speed'] = np.nan
with pm.Model():
    # Place weak priors on the intercepts and error variances
    intercepts = pm.Normal('intercepts', sd=100, shape=3)
    error_stds = pm.HalfNormal('error_stds', sd=10, shape=3)

    # Since the number of coefficients differs for speed, we
    # can't just use one 2D array
    coefficients_hrmax = pm.Normal('coefficients_hrmax', sd=10, shape=3)
    coefficients_speed = pm.Normal('coefficients_speed', sd=10, shape=2)
    coefficients_recovery = pm.Normal('coefficients_recovery', sd=10, shape=3)

    mu_hrmax = intercepts[0] + pm.math.dot(df[['age', 'bmi', 'hrrest']].values, coefficients_hrmax)
    hrmax = pm.Normal('hrmax', mu=mu_hrmax, sd=error_stds[0], observed=df['hrmax'])

    # Passing observed=df['speed'] where some of the values are NaN
    # tells PyMC3 to impute the missing entries using the rest of
    # the probability model, sampling them from the posterior.
    mu_speed = intercepts[1] + pm.math.dot(df[['age', 'hrmax']].values, coefficients_speed)
    speed = pm.Normal('speed', mu=mu_speed, sd=error_stds[1], observed=df['speed'])

    # Even though the speed values are missing for some rows, their imputed
    # values are fed forward into the regression for recovery.
    mu_recovery = (intercepts[2]
                   + pm.math.dot(df[['hrrest', 'hrmax']].values, coefficients_recovery[0:2])
                   + speed * coefficients_recovery[2])
    recovery = pm.Normal('recovery', mu=mu_recovery, sd=error_stds[2], observed=df['recovery'])

    trace = pm.sample()
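After sampling, one way to inspect the imputed speeds: PyMC3's missing-value machinery exposes them as an auto-imputed variable named speed_missing. The summary call below is just a sketch of how you might check the fit.
speed_draws = trace['speed_missing']   # posterior draws for the 3 missing entries
print(speed_draws.mean(axis=0))        # posterior mean for each missing row
print(pm.summary(trace))               # convergence and parameter summaries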
Thanks for the code; I still need to interpret the results and tweak the priors. But indeed, I have follow-up questions:
Why create missing values in 'speed'? As the output of any Bayesian model is always a posterior distribution, isn't it better to give all available data to the model? If this is done to test the model, isn't it better to split the data into train and test sets and then actually use the model to 'predict' 'speed' on unseen data with unseen predictors?
Following the previous question, are 'hrmax', 'speed' and 'recovery' fully equivalent here? That is, can I make rows missing in any of them (or all together)?
Adding categories of people means making a hierarchical linear model, but then again three-in-one? I saw tutorials with one model; that seems pretty clear.
I also have broad questions not about this problem specifically (answer if you wish):
In essence, what I am creating is a simple linear model, and probably many uncomplicated machine learning methods will outperform it, right? How do I show the superiority of the Bayesian approach over ML or AI? What specific conditions would help? E.g.:
A low number of data points and well-specified priors favor the Bayesian case?
Also, a result expressed as uncertainty instead of one maximum-likelihood point estimate gives extra information.
The ability to 'generate' new data (but there are generative models in ML as well)?
Anything I am missing?
What I would like to try at some point is something similar to this: https://demo.bayesfusion.com/bayesbox.html
That is, graphical models with unlimited degrees of freedom, so that I can see the effect of tweaking any variable on all other variables (basically no response/predictor split). So far I have only seen such examples made in expensive proprietary software. Or am I misled by the fancy GUI, and this is essentially the same as what PyMC does, just with a front end on top of it?
If I can share my experience with the second part of the questions: I also asked myself whether there is a difference between the existing software and PPLs, and in Gelman's book it is mentioned that there are three ways of assigning probabilities: making subjective assessments, using empirical probabilities based on observed data, and constructing a parametric probability model.
People use PyMC3 mostly for writing the parametric type of model. However, it is possible to work in the other ways too, similar to BayesFusion. You can check the forum for more examples if you are interested in modeling your data that way, for example this post: Bayes Nets, Belief Networks, and PyMC.
Why create missing values in 'speed'? As the output of any Bayesian model is always a posterior distribution, isn't it better to give all available data to the model? If this is done to test the model, isn't it better to split the data into train and test sets and then actually use the model to 'predict' 'speed' on unseen data with unseen predictors?
The missing values in speed serve as placeholders for individuals whose speed you may want to estimate without doing the exercise test. I have now realized that for your use case it would also make sense to treat those same rows in HRmax and RecoveryTime as unknown, since you would have to do the exercise to have data on those as well. Within the toy dataset I created, the model is seeing all of the data; it just happens that some of the rows reflect patients for whom we want predictions.
As far as testing and training are concerned, we are doing both at the same time: generating posterior estimates for the parameters (i.e. conducting inference) and for the missing values (i.e. prediction) simultaneously. With the model set up this way, the distribution of the missing values of speed will properly reflect all sources of uncertainty, which may be neglected when assessing predictions with a non-Bayesian approach.
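To make that concrete, here is a hypothetical sketch (made-up values, same column order as the toy df above): append a new person whose three response columns are all NaN, and feed the imputed hrmax variable forward instead of the raw column, mirroring how speed is fed into the recovery regression.
# Hypothetical new person: physiology known, all three responses unknown
# (column order: hrmax, age, bmi, hrrest, speed, recovery)
df.loc[len(df)] = [np.nan, 0.4, -0.2, 1.1, np.nan, np.nan]
# Inside the model, mu_speed must then use the hrmax random variable
# (which now contains imputed entries) rather than the observed column:
mu_speed = (intercepts[1] + df['age'].values * coefficients_speed[0]
            + hrmax * coefficients_speed[1])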
Following the previous question, are 'hrmax', 'speed' and 'recovery' fully equivalent here? That is, can I make rows missing in any of them (or all together)?
Yes, you can add NaNs to any entries of the dataframe and the resulting model should still work, with the caveat that a column that also serves as a predictor (like hrmax in mu_speed) would then need to be fed forward via its random variable rather than the raw dataframe column, as in the sketch above.
Adding categories of people means making a hierarchical linear model, but then again three-in-one? I saw tutorials with one model; that seems pretty clear.
While we won't need to repeat this model three times to efficiently implement a hierarchical linear regression, we would need to make some assumptions about how to model the categories of people. Do you want to give each group a distinct intercept, or give them their own regression coefficients? Should this be done for all 3 regression equations, or are you only interested in one?
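For example, distinct intercepts per group might look like the following sketch (hypothetical group column coded 0..3; X_speed stands for the speed design matrix from the code above):
# Hypothetical sketch: group_idx codes each person's category as 0, 1, 2 or 3
group_idx = df['group'].values.astype(int)
group_intercepts = pm.Normal('group_intercepts', sd=100, shape=4)
mu_speed = group_intercepts[group_idx] + pm.math.dot(X_speed, coefficients_speed)
# Per-group slopes would instead use shape=(4, n_predictors) and index
# the rows of the coefficient matrix by group_idx.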
In essence, what I am creating is a simple linear model, and probably many uncomplicated machine learning methods will outperform it, right? How do I show the superiority of the Bayesian approach over ML or AI? What specific conditions would help?
In a research capacity, reporting the parameters of a linear model is much more straightforward for interpretation and communication of results than a model with many nonlinear parameters. If you have a large, well-understood dataset and you are solely concerned with prediction, then perhaps a machine learning model would serve you better. As for generative models in ML, the vast majority of these are designed for structured data such as time series or images, and they also require very large training datasets.