Learning covariances using LKJCholeskyCov at the group level in a hierarchical multivariate Normal model

The way your model is written, groups A and B are generated by entirely different distributions. The most direct solution is to respect this fact in your PyMC model and use two observation distributions:

    # One LKJ Cholesky prior per group. This goes inside your `with pm.Model():` block;
    # `pm` is pymc, `pt` is pytensor.tensor, and `group_names`, `features`, `mu_group`,
    # `group_idx`, and `df_numeric` are taken from your original model.
    observed_vars = []
    for i, group in enumerate(group_names):
        # Prior on the standard deviations of each feature
        sd_dist = pm.Exponential.dist(1, shape=len(features))

        chol, _, _ = pm.LKJCholeskyCov(
            f"{group}_chol",
            n=len(features),
            eta=2,
            sd_dist=sd_dist,
            compute_corr=True,
            store_in_trace=False,
        )

        cov = pm.Deterministic(f"{group}_cov", chol.dot(chol.T))

        #######################
        # Specifying our data #
        #######################
        # Number of rows belonging to this group
        n = pt.eq(group_idx, i).sum()
        obs = pm.MvNormal(
            f"{group}_obs",
            mu=mu_group[i],
            chol=chol,
            observed=df_numeric.values[group_idx == i],
            shape=(n, len(features)),
        )
        observed_vars.append(obs)
        
    observed_data = pm.Deterministic('obs', pt.concatenate(observed_vars, axis=0))
    idata = pm.sample()
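
After sampling, each group's learned covariance is stored under its Deterministic name, so you can summarize it straight from the trace. For example (assuming group_names is ["A", "B"], per your question):

    # Posterior mean of each group's covariance matrix
    idata.posterior["A_cov"].mean(dim=("chain", "draw"))
    idata.posterior["B_cov"].mean(dim=("chain", "draw"))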

Alternatively, you could use block_diag to fuse the two covariance matrices together and ravel your mu_group, so that you have $\mu = \begin{bmatrix} \mu_{A,x} & \mu_{A,y} & \mu_{B,x} & \mu_{B,y} \end{bmatrix}^T$ and $\Sigma = \begin{bmatrix} \Sigma_A & 0 \\ 0 & \Sigma_B \end{bmatrix}$. I thought this was slightly less elegant because your data would need 4 columns (one for each group-feature pair), so you would have to pad it and decide what to do with the unobserved components (e.g. fill them with zeros). I believe the solution presented above more closely reflects your data generating process, but a rough sketch of the block-diagonal construction is below.
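
For completeness, here is a minimal sketch of that construction. It assumes your PyTensor version provides pytensor.tensor.slinalg.block_diag (recent releases do), that cov_A and cov_B are the two covariance matrices built as above, and that padded_data is a hypothetical 4-column array you have built from your observations:

    # Sketch only -- cov_A, cov_B, and padded_data are assumed names
    from pytensor.tensor.slinalg import block_diag

    mu_flat = mu_group.flatten()         # [mu_Ax, mu_Ay, mu_Bx, mu_By]
    big_cov = block_diag(cov_A, cov_B)   # 4x4 block-diagonal covariance
    obs = pm.MvNormal("obs", mu=mu_flat, cov=big_cov, observed=padded_data)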

One note: you can’t use pm.Data for the observed data in this case unless you split it by group ahead of time (e.g. make observed_A and observed_B containers). This is because PyMC only accepts leaf nodes as observed data (much to my chagrin).
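
If you do want Data containers (e.g. to swap in new data later), the pre-split version would look something like this sketch, where observed_A and observed_B are illustrative names, group_idx is assumed to be the numpy array of group labels, and the lines sit inside the model context:

    # Sketch: pre-split the data so each observed container is a leaf node
    data_A = pm.Data("observed_A", df_numeric.values[group_idx == 0])
    data_B = pm.Data("observed_B", df_numeric.values[group_idx == 1])
    # ...then pass data_A / data_B as `observed=` to the two MvNormals above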