Dirichlet-BART Modeling?

Hello,

I have data that looks something like the table clip below. Here’s some context of the data generation story. For some period of time (sample_id) a measurement (total_sample_input) is feed into a machine who’s control nobs are left at specific values (covariate variables) and the machine sorts exactly all the input measurement into distinct categories (output_category variables which sum up to total_sample_input). It is believed that the output categories follows a Dirichlet distribution which is causaly shaped by the control nob values. The relationship of the covariates on the output categories is believed to be a piecewise function rather than strictly linear.

How would you suggest modeling this situation? How could we infer/predict the probable distributions between output categories due to the covariate values? As a beginner to Bayesian modeling and PyMC, I would greatly appreciate your help.

sample_id total_sampe_input output_category1 output_category2 output_category3 covariate1 covariate2 covariate3
0 34124.49 25490.77301 3568.280369 5065.436624 0.827175228 3.263871777 3.908952995
1 33855.42 10600.98815 1922.761763 21331.67009 7.192615377 0.348475462 0.458909161
2 30155.25 25552.07541 2265.240505 2337.934086 6.497646449 1.205781579 0.296571973
3 45996.91 26231.19332 16543.08086 3222.635825 7.276013079 0.605861635 0.118125286
4 33108.02 12336.13038 9199.140636 11572.74898 6.876460974 0.191660237 0.93187879

Initially, I tried to apply the methods in the “Dirichlet Mixtures of Multinomials” example, but struggled to modify it to my situation since my data is not technically count data (the total_sample_input can be divided into fractions among the output_category), and because certain output categories can corelate with each other at certain covariate values.

I then started to apply the methods discussed in the “Categorical Regression” example because I thought BART might well capture the piecewise impact of the covariates on the output categories, but again I struggled modifying the article’s example to my non-count data.

Your help is greatly appreciated; Thanks!

It may be easier if you try to instead model a real-valued vector instead of a vector obeying the sum-to-1 constraint.

Would one of the linear transformations here be suitable for your data? You could transform them, and then model each d-1 dimensional vector as a draw from a multivariate Gaussian, performing multivariate linear regression.