Vectorisation of BART for multidimension hierarchical data

hsteptoe · April 26, 2024, 8:51pm

TL,DR: I’m trying to vectorise a BART model to be hierarchical, but I can’t get the coordinates/dimensions working

I have data that describes how a series of features give a variable response. The features space is large (n=34) with respect to the number of simulations (n=500) each with n=44 observations, making this an ill-posed problem. Nonetheless, I’m trying to build a BART model that can help me explore the effects of the features on the response.

The data frame looks as follows:
df.csv (2.2 MB)

	realization	y_riverflow	x_month	x_year	x_alpha_io	x_a_wl_io	x_bio_hum_cn	x_b_wl_io	x_dcatch_dlai_io	x_dqcrit_io	…	x_retran_l_io	x_retran_r_io	x_r_grow_io	x_rootd_ft_io	x_sigl_io	x_sorp	x_tleaf_of_io	x_tlow_io	x_tupp_io	x_l_vg_soil
0	1	0.147001	1	0	1.721491	0.504935	1.639253	1.111166	1.473436	1.925014	…	0.985717	1.307895	1.450263	2.372860	0.519003	0.376978	1.086885	1.061581	0.901236	0.246626
1	1	0.180318	2	0	1.721491	0.504935	1.639253	1.111166	1.473436	1.925014	…	0.985717	1.307895	1.450263	2.372860	0.519003	0.376978	1.086885	1.061581	0.901236	0.246626
2	1	0.267611	3	0	1.721491	0.504935	1.639253	1.111166	1.473436	1.925014	…	0.985717	1.307895	1.450263	2.372860	0.519003	0.376978	1.086885	1.061581	0.901236	0.246626
3	1	0.323996	4	0	1.721491	0.504935	1.639253	1.111166	1.473436	1.925014	…	0.985717	1.307895	1.450263	2.372860	0.519003	0.376978	1.086885	1.061581	0.901236	0.246626
4	1	0.361983	5	0	1.721491	0.504935	1.639253	1.111166	1.473436	1.925014	…	0.985717	1.307895	1.450263	2.372860	0.519003	0.376978	1.086885	1.061581	0.901236	0.246626
…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…
5275	10	0.166746	8	43	1.712617	0.990487	1.288422	1.771050	1.423148	1.198517	…	0.119383	4.858345	1.383394	0.920621	1.640639	1.951183	0.963208	1.030759	1.096198	0.218377
5276	10	0.147451	9	43	1.712617	0.990487	1.288422	1.771050	1.423148	1.198517	…	0.119383	4.858345	1.383394	0.920621	1.640639	1.951183	0.963208	1.030759	1.096198	0.218377
5277	10	0.132874	10	43	1.712617	0.990487	1.288422	1.771050	1.423148	1.198517	…	0.119383	4.858345	1.383394	0.920621	1.640639	1.951183	0.963208	1.030759	1.096198	0.218377
5278	10	0.117224	11	43	1.712617	0.990487	1.288422	1.771050	1.423148	1.198517	…	0.119383	4.858345	1.383394	0.920621	1.640639	1.951183	0.963208	1.030759	1.096198	0.218377
5279	10	0.195211	12	43	1.712617	0.990487	1.288422	1.771050	1.423148	1.198517	…	0.119383	4.858345	1.383394	0.920621	1.640639	1.951183	0.963208	1.030759	1.096198	0.218377

… in this case restricted to the first 10 realizations.

Here realization is the group level term: I expect the observations, y_riverflow, to be more similar within each realization than between realizations. The observations vary in time x_month and x_year, and then have 32 fixed parameters (the other x_ terms) that don’t vary in time, but do vary by realization.

I can formulate a basic (and poorly specified) BART model as follows:

x_cols = [col for col in df if col.startswith('x')]
y_cols = [col for col in df if col.startswith('y')]
y = df[y_cols].to_numpy().squeeze()
x = df[x_cols].to_numpy()

with pm.Model() as model: 
    w = pmb.BART("w", X=x, Y=y, m=200)
    
    sigma = pm.HalfNormal("sigma", 1.0)
    X = pm.Gamma("X", mu=w, sigma=sigma, observed=y)

    trace = pm.sample()

So for a hierarchical case, I assume you can’t define hyperpriors for the BART model, but attempting to vectorise the model, I’ve tried:

obs_idx = list(fdf.index)
rel_idx, rel = pd.factorize(fdf["realization"], sort=True)

coords={'obs_id': obs_idx,
       'realization': rel.to_numpy()}

with pm.Model(coords=coords) as model:
    realization_idx = pm.MutableData("realization_idx", rel_idx, dims="obs_id")
    
    w_rel = pmb.BART("w", X=x, Y=y, m=200, dims="realization")
    w = w_rel[realization_idx]
    
    sigma_rel = pm.HalfNormal("sigma", 1.0, dims="realization")
    sigma = sigma_rel[realization_idx]
    
    
    X = pm.Gamma("X", mu=w, sigma=sigma, observed=y, dims='obs_id')

    trace = pm.sample()

This also samples (slowly) but eventually fails with:

ValueError: conflicting sizes for dimension 'realization': length 5280 on the data but length 10 on coordinate 'realization'

So I’m not getting something quite right with the hierarchy dimensioning.

What is a better way to do this?
Thanks in advance for your advice!

ricardoV94 · April 27, 2024, 3:54am

CC @aloctavodia

jfhawkin · July 3, 2024, 10:03pm

I don’t have an answer, but I have a similar question. We now have PyMC-BART, which I’ve been playing with today. I’m also aware of work on hierarchical BART in Hierarchical Embedded Bayesian Additive Regression Trees.

This topic also raises the same question.

I’m trying to adapt the radon example to use BART and a hierarchical specification. Based on the BART examples, I have the below code so far.

data = pd.read_csv(pm.get_data("radon.csv"))
data["log_radon"] = data["log_radon"].astype(np.float64)
county_idx, counties = pd.factorize(data.county)
coords = {"county": counties, "obs_id": np.arange(len(county_idx))}

X = np.vstack((county_idx,data["floor"].to_numpy())).T
Y = data["log_radon"].to_numpy()

with pm.Model(coords=coords, check_bounds=False) as pymc_bart_model:
    # County effects
    mu = pmb.BART("mu", X=X, Y=np.log(Y), m=100, dims=["county", "obs_id"])

    sigma = pm.HalfNormal("sigma", sigma=1.5)
    pm.Normal(
        "log_radon", mu=mu, sigma=sigma, observed=data.log_radon.values, dims="obs_id"
    )

I get a similar error of

ValueError: Size length is incompatible with batched dimensions of parameter 0 mu:
len(size) = 1, len(batched dims mu) = 2. Size length must be 0 or >= 2

I was working off the categorical example Categorical regression — PyMC-BART, which I realize is for the response variable. Any ideas on how/if it is possible to implement a hierarchical structure with PyMC-BART as is?

Topic		Replies	Views
Specification questions translating model to BART version agnostic modeling , bart	1	32	July 24, 2024
Cyclical features with BART Questions bart	3	702	March 19, 2023
Categorical BART with Out of Sample Predictions Sharing prediction , bart	1	106	October 11, 2024
Implementing BART with CAR prior version agnostic bart	13	201	July 11, 2024
Balancing under/overfitting in BART model modeling , bart	1	125	September 4, 2024

Vectorisation of BART for multidimension hierarchical data

Related topics