Hi,
I wanted to learn bayesian graphical modeling and pymc better so I did a case study with a dataset. In below, I explain the steps I took to learn the parameters of my model and ask questions (in bold) at points that I feel unsure. Thanks in advance for reading my questions. If the analysis is valid and interesting enough, I can add it as an example to the website. Let me know if you think this would be useful.
I used this dataset to estimate the posterior distribution of parameters that regress a students math exam score from other features. I am skipping my preprocessing steps where I split the data to train and test sets and identified some outlying rows. Some rows are as follows:
math score | reading score | writing score | gender_female | gender_male | race/ethnicity_group A | race/ethnicity_group B | race/ethnicity_group C | race/ethnicity_group D | race/ethnicity_group E | ... | parental level of education_high school | parental level of education_master's degree | parental level of education_some college | parental level of education_some high school | lunch_free/reduced | lunch_standard | test preparation course_completed | test preparation course_none | outlier | test sample | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.72 | 0.72 | 0.74 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
1 | 0.69 | 0.90 | 0.88 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
2 | 0.90 | 0.95 | 0.93 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |
3 | 0.47 | 0.57 | 0.44 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
4 | 0.76 | 0.78 | 0.75 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
5 rows × 22 columns
Scores are normalized to (0, 1) interval and other columns are binary. The column outlier is used to remove rows that are outliers, i.e., these rows are never used in the analysis. The column test_sample is to split the dataset to train/test splits, i.e., if test sample column is 1, the row is a test sample.
- Model
I used a bottom to top approach for defining distributions (Q1: How do you approach designing a graphical model for your data? Do you have a systematic approach?). At the bottom of my graph, I observe the labels, i.e., the math score. As this variable is bounded to the interval (0, 1), I used a beta distribution with the mean and sample size parametrization, i.e., the parameters I modeled are the mean and the sample size. Therefore, the math score is beta distributed given its parameters and is sampled for each student. For the sample size parameter, I used a single prior for all students. I think of sample size parameter of beta distribution as a variable that represent how informative a single observation is. The sample size parameter is non-negative so I tried two different distributions. A half normal distribution that is not heavy tailed and is therefore acting as regularization and an exponential distribution that is heavy tailed and is acting as a vague regularization. The mean parameter of the beta variables are sampled for each student and they are a nonlinear function of features and some latent parameters beta and beta0 such that mu is sigmoid(beta0 + beta @ features) where @ is the dot product. This way, higher inner products result in larger mean values. the mean, i.e., mu, is a deterministic stochastic variable because it is a deterministic function of its parents. for beta0 I used a flat prior, i.e., an uninformatice prior and for beta I used a multivariate normal distribution with zero mean and lambda*I (I is identity matrix) covariance. I also tried two different priors for lambda, i.e., halfnormal and exponential. These distributions result in the following code and graphical model.
N, D = X_train.shape
coords = {'Samples': list(range(N)), 'Features': feature_columns.to_list()}
with pm.Model(coords=coords) as model:
lam = pm.HalfNormal('lam', sigma=1)
beta0 = pm.Flat('beta0')
beta = pm.MvNormal('beta', mu=0, cov=lam*np.eye(D), shape=D, dims='Features')
mu = pm.Deterministic('mu', pm.math.sigmoid(beta0 + pm.math.dot(X_train, beta)), dims='Samples')
nu = pm.HalfNormal('nu', sigma=1)
math_score = pm.Beta('math_score', mu=mu, nu=nu, dims='Samples', observed=y_train)
Q2: Do you think the model/code is valid or it can be improved? If so, how?
- Results
I used 3 different samplers in addition to switching between halfnormal and exponential priors. The results were similar in terms of resulting distributions but the time they took varied significantly.
Sampling Method | Time (mins) with Exp Prior | Time (mins) with HalfN Prior |
---|---|---|
pm.sample() | 12.5 | 10 |
pm.sample(nuts_sampler=“numpyro”) | 18 | 9 |
pm.sample(nuts_sampler=“nutpie”) | 7 | 3 |
Note that I have a single GPU in my system so numpyro sequentially finished the 4 chains. Thats why its much slower. Looking at the results, the posterior distributions seem nicely converged, and so I conclude that the model was successfully learned from the data. Q3: What are your thoughts on this, looking at the figures? Additionally, do you think the table for timings makes sense?
For nutpie, I received the following warning:
/home/uhur/miniconda3/envs/pt/lib/python3.12/site-packages/pytensor/link/numba/dispatch/basic.py:388: UserWarning: Numba will use object mode to run AdvancedSetSubtensor's perform method
warnings.warn(
I believe this warning implies that the the library is using a slower method to sample from the posterior. Q4: Why am I getting this error and how can I resolve it?
Q5: In this model, each feature has an additive effect on the mean (mu). If had some domain knowledge about the data (a graph for features for example), how would I include this knowledge in to the graph? Would I also add observed nodes for features and define their relationships via the graph?
Q6: What is a standard way for me to evaluate this model on the test set? I am guessing that for each test sample, I make a prediction with each beta in the posterior samples and average the errors? Can I argue that the model with the least average error on test set is best?
Q7: Overall, what do you think of this analysis? What is missing?