PR proposal -- API for Multinomial-Softmax and Cholesky decomposition

Hi there! I had two ideas for PRs to PyMC and was wondering whether they would be useful:

  1. I noticed lots of people over-parametrize their Multinomial-Softmax models: they put all the categories in the model, while only N-1 of them are useful and the last one is entirely determined by the others. Over-parametrizing generally slows down sampling. At the same time, the over-parametrization is more intuitive: the trick of fixing the value of one category always feels weird.
    So, what do you think of modifying the API so that people can parametrize with all the categories but PyMC fixes one of them under the hood? That way, the API becomes more intuitive and the sampling speed is always optimized, without people having to artificially complicate their models.
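To make the identifiability point concrete, here is a minimal NumPy sketch of why one category is redundant: the softmax is invariant to shifting all coefficients by a constant, so one category can be pinned (e.g. to 0) without loss of generality. This is exactly the manual trick described above (names here are illustrative, not PyMC API):

```python
import numpy as np

def softmax(x):
    # numerically stable softmax
    e = np.exp(x - x.max())
    return e / e.sum()

alpha = np.array([-1.8, 0.5, 1.2])
# shifting every coefficient by the same constant leaves the
# probabilities unchanged, so only N-1 of them are identified
shifted = alpha - alpha[0]  # category 0 pinned to 0
assert np.allclose(softmax(alpha), softmax(shifted))
```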

  2. Same principle, but applied to the API for MvNormal with Cholesky decomposition. Doing this decomposition is all but unavoidable for sampling efficiency, but I find the API adds boilerplate code that harms models’ readability:

import numpy as np
import pymc3 as pm
import theano.tensor as tt

sd_dist = pm.Exponential.dist(1)
packed_chol = pm.LKJCholeskyCov(f"chol_cov_{p}", eta=4, n=2, sd_dist=sd_dist)
chol = pm.expand_packed_triangular(2, packed_chol, lower=True)

# extract rho and sds:
cov =, chol.T)
sigma_ab = pm.Deterministic("sigma_cluster", tt.sqrt(tt.diag(cov)))
corr = tt.diag(sigma_ab ** -1).dot( ** -1)))
r = pm.Deterministic("Rho", corr[np.triu_indices(2, k=1)])

Having an API that does the decomposition under the hood and returns the standard deviations and correlations directly in the trace (as Stan does) would be great, I think! What’s your take on that?

I’d be happy to work on these PRs, with the guidance of a mentor if possible, given the complexity of the code base.
Happy to discuss all that with you! PyMCheers :vulcan_salute:


I think both would be really helpful contributions! Each could be implemented either by adding an additional kwarg to the distribution, or by creating a new distribution that subclasses the original one. Thoughts?


Yeah, I was thinking about new kwargs too. Something like:

  1. For Multinomial, the user would provide the index of the pivot category (since the values of the parameters change depending on which category acts as the pivot):
alpha = pm.Normal("alpha", -1.8, 0.5, shape=Ncategories)
p = tt.nnet.softmax(alpha)
R = pm.Multinomial("R", n=N, p=p, pivot_id=0, observed=sim_R)

PyMC would then perform the transformation under the hood, i.e. the pivot_id-th entry of alpha would be replaced by a constant equal to the mean of alpha, so that p is effectively computed from only Ncategories - 1 free parameters.
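The backstage step described above could be sketched like this in plain NumPy (apply_pivot is a hypothetical helper illustrating the proposal, not an existing PyMC function):

```python
import numpy as np

def softmax(x):
    # numerically stable softmax
    e = np.exp(x - x.max())
    return e / e.sum()

def apply_pivot(alpha, pivot_id):
    # hypothetical backstage step: replace the pivot entry by a constant
    # (here the mean of alpha, as proposed above), so that only
    # Ncategories - 1 entries of alpha actually drive the probabilities
    out = alpha.copy()
    out[pivot_id] = alpha.mean()
    return out

alpha = np.array([-1.8, 0.5, 1.2, 0.1])
p = softmax(apply_pivot(alpha, pivot_id=0))
assert np.isclose(p.sum(), 1.0)
```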

  2. For Cholesky, something like this would be great:
sd_dist = pm.Exponential.dist(1)
packed_chol = pm.LKJCholeskyCov("chol_cov", eta=4, n=2, sd_dist=sd_dist)
ab_cluster = pm.MvNormal(
    "ab_cluster", mu=[-1.8, -0.5], chol=packed_chol,
    extract_corr=True, shape=(Nclusters, 2)
)

The transformation into the Cholesky factor and covariance matrix, and the computation of correlations and standard deviations, would be done under the hood. The trace would contain the standard deviations and correlations, as users are generally interested in those rather than in the Cholesky factor per se.
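For reference, the under-the-hood extraction amounts to this (a plain NumPy sketch with made-up numbers: given a lower Cholesky factor L of the covariance, recover the standard deviations and the correlation matrix that would land in the trace):

```python
import numpy as np

# made-up lower Cholesky factor of a 2x2 covariance matrix
L = np.array([[1.0, 0.0],
              [0.6, 0.8]])
cov = L @ L.T                      # covariance = L L^T
sds = np.sqrt(np.diag(cov))        # standard deviations
corr = cov / np.outer(sds, sds)    # correlation matrix
assert np.allclose(np.diag(corr), 1.0)
```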

Of course, these are high-level proposals, as I don’t have your knowledge of the code base and of the design choices, but I hope this is a start :smiley: