Discrete choice with correlated random parameters

Dear all,

After years of working with and teaching Gibbs samplers in the context of discrete choice (i.e. multinomial logit/probit), I’ve decided to explore the world of PyMC, and from what I’ve seen so far I’m very impressed by its versatility.

I have a mixed logit (or hierarchical Bayes) model running with two random parameters, specified as follows:
# RP 1:
mu_asc = pm.Normal("mu_asc", 0, 10)
sigma_asc = pm.InverseGamma("sigma_asc", alpha=3, beta=0.5)
asc_alt1 = pm.Normal("asc_alt1", mu=mu_asc, sigma=pt.sqrt(sigma_asc), dims="individuals")

# RP 2:
mu_tc = pm.Normal("mu_tc", 0, 10)
sigma_tc = pm.InverseGamma("sigma_tc", alpha=3, beta=0.5)
beta_tc = pm.Normal("beta_tc", mu=mu_tc, sigma=pt.sqrt(sigma_tc), dims="individuals")

My intended extension is to estimate the correlation matrix between these two random parameters. Normally I’d use an inverse-Wishart prior, but that is not supported with NUTS samplers. The LKJ setup from the examples (correlation across alternatives) looks suitable, but I’m lost on two aspects of the implementation:

  1. Dimensions: I’ve got dims "individuals" and "random_vars", so there are 2 random parameters for each individual. What are the relevant dims to include in pm.MvNormal(), and subsequently when referring to it in pm.Model?
    rps = pm.MvNormal("rp", mu=mu_rp, chol=chol_rp, dims="random_vars")
    u1 = rps[person_indx][0] + beta_tt * database["tt1"] - pt.exp(rps[person_indx][1]) * database["tc1"]

  2. I find the labelling of pm.LKJCholeskyCov() slightly confusing. How do you properly set up the prior for the 2x2 Cholesky matrix chol_rp referred to above in pm.MvNormal?

chol_rp, corr, stds = pm.LKJCholeskyCov(
    "chol_rp", n=2, eta=2.0, sd_dist=pm.HalfNormal.dist(10)
)

Many thanks for your feedback,
Thijs

The dimension of length 2 (random_vars) should be the last (rightmost) dimension of your MvNormal, so something like dims=("individuals", "random_vars") to get (n, 2) draws where the correlation structure acts on the last dimension. More details in Distribution Dimensionality — PyMC 5.23.0 documentation.
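For concreteness, a minimal sketch of how those dims line up (the coords sizes are just placeholders):

import pymc as pm

# Minimal sketch (placeholder coords): the trailing "random_vars" dimension is
# the one the 2x2 correlation acts on; "individuals" is the batch dimension.
coords = {"individuals": range(100), "random_vars": ["asc_alt1", "beta_tc"]}

with pm.Model(coords=coords) as sketch:
    mu_rp = pm.Normal("mu_rp", 0, 10, dims="random_vars")
    chol_rp, corr, stds = pm.LKJCholeskyCov(
        "chol_rp", n=2, eta=2.0, sd_dist=pm.HalfNormal.dist(10)
    )
    # shape (n_individuals, 2): one correlated pair of random parameters per individual
    rps = pm.MvNormal("rp", mu=mu_rp, chol=chol_rp, dims=("individuals", "random_vars"))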

The LKJ prior looks correct, what are you unsure about?

Thanks for the quick reply.

Re LKJ, why is there the need to start with "chol, corr, stds =" and not just the one term intended for use?

The model now compiles and replicates my Gibbs sampler (GS) results:

import pandas as pd
import pymc as pm
import pytensor.tensor as pt

N = database.shape[0]
observed = pd.Categorical(database["choice"]).codes
person_indx, uniques = pd.factorize(database["ID"])
coords = {
    "alts_probs": ["1", "2"],
    "random_vars": ["asc_alt1", "beta_tc"],
    "individuals": uniques,
    "obs": range(N),
}

with pm.Model(coords=coords) as model_1:
    
    # Priors for Fixed parameters
    beta_tt = pm.Normal("beta_tt", 0, 10)    
    beta_hw = pm.Normal("beta_hw", 0, 10)
    beta_ch = pm.Normal("beta_ch", 0, 10)

    # Hierarchical priors on the random parameters
    mu_rp = pm.Normal("mu_rp", 0, 10, dims="random_vars")
    chol_rp, corr, stds = pm.LKJCholeskyCov(
        "chol_rp", n=2, eta=2.0, sd_dist=pm.HalfNormal.dist(10)
    )
    rps = pm.MvNormal("rp", mu=mu_rp, chol=chol_rp, dims=("individuals", "random_vars"))
 
    ## Construct Utility matrix and Pivot
    u1 = rps[person_indx,0]     +  beta_tt * database["tt1"] - pt.exp(rps[person_indx,1]) * database["tc1"] + beta_hw * database["hw1"] + beta_ch * database["ch1"]
    u2 =                           beta_tt * database["tt2"] - pt.exp(rps[person_indx,1]) * database["tc2"] + beta_hw * database["hw2"] + beta_ch * database["ch2"]
       
    s = pm.math.stack([u1, u2]).T

    ## Apply Softmax Transform
    p_ = pm.Deterministic("p", pm.math.softmax(s, axis=1), dims=("obs", "alts_probs"))

    ## Likelihood
    choice_obs = pm.Categorical("y_cat", p=p_, observed=observed, dims="obs")

    idata_m1 = pm.sample(tune=4000, draws=4000)    

pm.model_to_graphviz(model_1)

People like to see them, I guess. There are keyword arguments to disable them (compute_corr=False), but then you need to call pm.math.expand_packed_triangular manually to get the Cholesky factor.
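A hedged sketch of what that looks like, assuming the compute_corr keyword and pm.math.expand_packed_triangular (both available in recent PyMC versions):

import pymc as pm

with pm.Model() as sketch:
    # With compute_corr=False, LKJCholeskyCov returns only the packed
    # lower-triangular Cholesky values rather than the (chol, corr, stds) tuple.
    packed_chol = pm.LKJCholeskyCov(
        "chol_rp", n=2, eta=2.0, sd_dist=pm.HalfNormal.dist(10), compute_corr=False
    )
    # Expand the packed values into the full 2x2 lower-triangular matrix yourself.
    chol_rp = pm.math.expand_packed_triangular(2, packed_chol, lower=True)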

Now that you’ve escaped Gibbs, you can have more flexible priors—you’re not required to use conjugate priors on variance like the inverse gamma. I find it easier to put priors directly on the scales (standard deviation for normals) because they have the same units as the mean and the observations.
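For example, a sketch of one of the random parameters above with an illustrative half-normal prior on the scale in place of the inverse gamma on the variance:

import pymc as pm

coords = {"individuals": range(100)}  # placeholder size

with pm.Model(coords=coords) as sketch:
    mu_asc = pm.Normal("mu_asc", 0, 10)
    # Prior directly on the standard deviation (same units as mu_asc and the data),
    # instead of InverseGamma on the variance followed by a square root.
    sigma_asc = pm.HalfNormal("sigma_asc", 10)
    asc_alt1 = pm.Normal("asc_alt1", mu=mu_asc, sigma=sigma_asc, dims="individuals")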

You can also put your own covariance matrix together with an LKJ prior on the correlations (as its concentration parameter eta increases, the prior concentrates around unit correlation matrices) and whatever prior you want on the scales by multiplying after you’re done—I don’t know how flexible the sd_dist argument is.

In 2D, you have a lot of options. You can model two scales \sigma_1, \sigma_2 > 0 and a correlation \rho \in (-1, 1) and create the covariance matrix \begin{bmatrix} \sigma_1^2 & \sigma_1 \sigma_2 \rho \\ \sigma_1 \sigma_2 \rho & \sigma_2^2 \end{bmatrix}. This lets you put whatever priors you want on \sigma and \rho as long as they respect the constraints. The LKJ corresponds roughly to taking \phi \sim \textrm{beta}(a, a) for a \gg 1 and then setting \rho = 2 \cdot \phi - 1 \in (-1, 1). But in the general case you can use a \textrm{beta}(a, b) prior if you want asymmetry, you can impose additional constraints (like the correlation must be positive), etc.
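A sketch of that hand-rolled 2x2 version in PyMC (the specific priors and the 100-individual shape are just illustrative choices):

import pymc as pm
import pytensor.tensor as pt

with pm.Model() as sketch:
    # Scales with whatever priors you like.
    sigma = pm.HalfNormal("sigma", 10, shape=2)
    # Symmetric beta on (0, 1), mapped to a correlation on (-1, 1);
    # an asymmetric Beta(a, b), or a constraint to positive rho, works the same way.
    phi = pm.Beta("phi", alpha=4.0, beta=4.0)
    rho = pm.Deterministic("rho", 2 * phi - 1)

    # Assemble the 2x2 covariance [[s1^2, s1*s2*rho], [s1*s2*rho, s2^2]].
    cov = pt.stack([
        pt.stack([sigma[0] ** 2, sigma[0] * sigma[1] * rho]),
        pt.stack([sigma[0] * sigma[1] * rho, sigma[1] ** 2]),
    ])
    rp = pm.MvNormal("rp", mu=pt.zeros(2), cov=cov, shape=(100, 2))  # 100 individuals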


That’s gonna go on my quote wall


Ha! Fully appreciate that Bob, and I will dig deeper over the summer. Here, I stayed in the conjugate world for replicability of my ‘trusted’ Gibbs results.

On a related note, I’m doing some portfolio choice models where Gibbs makes the application scalable to larger choice sets by augmenting the utilities of the components of the portfolios instead of working with the 2^J potential combinations! There’s still life in the old beast :slight_smile:

What led you to trust the Gibbs results?

There are some highly specialized problems in which Gibbs can be custom tuned with blocking to be faster than HMC. But the fundamental problem is that it scales very poorly in dimension and with correlated parameters.

I’m not exactly sure what you mean, but this sounds like a standard random effects approach and is something you can do more efficiently with HMC than with Gibbs. The problem with Gibbs is that it scales very poorly in dimension and very poorly in the face of correlated parameters that cannot be blocked and sampled jointly. For example, one cannot sample from a 100 dimensional highly correlated normal with Gibbs—it’ll take forever to make progress (see the plot in the original Hoffman and Gelman NUTS paper, for example, or try it yourself).
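If you want to take the "try it yourself" suggestion literally, here is a rough sketch (the dimension, correlation, and step choices are illustrative); a true single-site Gibbs sampler would need to be hand-coded, but even a plain random-walk Metropolis step shows the same collapse in effective sample size relative to NUTS on a highly correlated target:

import arviz as az
import numpy as np
import pymc as pm

# 100-dimensional equicorrelated normal with correlation 0.99.
D, rho = 100, 0.99
cov = rho * np.ones((D, D)) + (1 - rho) * np.eye(D)

with pm.Model() as corr_normal:
    x = pm.MvNormal("x", mu=np.zeros(D), cov=cov)

with corr_normal:
    idata_nuts = pm.sample(draws=1000, tune=1000)                      # NUTS
    idata_rw = pm.sample(draws=1000, tune=1000, step=pm.Metropolis())  # random-walk MH

# Compare the worst-case effective sample sizes across the 100 coordinates.
print(az.ess(idata_nuts).x.values.min(), az.ess(idata_rw).x.values.min())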

When we started building Stan, we first pulled the lid off JAGS and sped that up considerably (it’s very inefficiently implemented on top of R’s C++ library) until we realized that no matter how fast we made Gibbs, it just wouldn’t sample even moderately hard problems in higher dimensions if the correlations were too high.

To rephrase: the Gibbs sampler aligns with results from MLE and EM, combined with enough experience to notice when the sampler goes off the rails in cases where there is strong correlation across the blocks.

In the portfolio models we ask people to allocate a government budget B to a range of projects varying in cost. This leaves three sets of projects: M, those in the portfolio; K, those not chosen but still affordable (true zeros); and Z, those not chosen and not affordable given the remaining budget. Each project has a random utility, and using the KKT conditions we get a ranking of utility per unit of budget that determines the likelihood; see Dekker et al. (2024), case 1, for details:

V_{m}+\epsilon_{m}-\ln(p_{m}) > V_{0}+\epsilon_{0} > V_{k}+\epsilon_{k}-\ln(p_{k}), \quad \forall m, k, \quad \text{and}
V_{m}+\epsilon_{m}-\ln(p_{m}) > V_{z}+\epsilon_{z}-\ln(p_{z}), \quad \forall m, z.

In the above, good zero is the numeraire for the remaining budget, with a price of 1. We get a closed-form solution using extreme value distributions, but this involves set theory and enumerating the full factorial of the sets Z and K (eq. 18 and appendix A). That is where the issues start as the number of alternatives in the decision problem grows. It works, but it is slow with MLE, and the results were verified with an MH random walk.

Gibbs is a natural candidate (especially when switching to normals) because, by augmenting V_{j}+\epsilon_{j}-\ln(p_{j}), the model reduces to a SUR structure into which we could ultimately introduce correlation. Indeed, normalisation is needed, and it is probably best to just assume V_{0}+\epsilon_{0}=0. Even just augmenting with extreme value errors plus an MH random walk speeds things up and enables replication of the above results.

I’d be happy to hear how you would approach this estimation challenge using HMC without having to work with the inconvenient likelihood function presented in the referred paper.

Aligns in the sense that posterior mean estimates from the sampler correspond to the MLE? Or that the Bayesian posterior intervals correspond to the frequentist confidence intervals? Either way, if that’s your measure of success, why not just stick with frequentist methods? That’s not a rhetorical question—if you have a model you can fit with EM, it’s going to be way faster and more robust in most cases than sampling.

I’m afraid that paper’s way too detailed for me to have time to figure it out—it’s really hard jumping into an applied field and trying to understand the language they use to talk about models and the underlying conventions they follow.

If it’s just an augmented variable approach, you can do that with HMC, too. As a simple example, suppose we have something like y_n \sim \textrm{negBinomial}(\alpha, \beta). We can replace that with \lambda_n \sim \textrm{gamma}(\alpha, \beta) and y_n \sim \textrm{Poisson}(\lambda_n). In many cases, the geometry of the posterior (in terms of the conditioning of the Hessian and the stability of its eigenstructure around the posterior) is much better, despite the huge increase in dimensionality. For example, I’m able to do this for 20 replicates in a genomics experiment with 20K splice variants, or about 400K latent \lambda values, and it fits OK (not super fast, of course, but we’re talking an hour, not weeks).
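A sketch of that gamma-Poisson augmentation in PyMC (the data and hyperpriors here are placeholders, not from the genomics example):

import numpy as np
import pymc as pm

y = np.random.default_rng(1).poisson(5.0, size=200)  # stand-in data

with pm.Model() as augmented:
    alpha = pm.Exponential("alpha", 1.0)
    beta = pm.Exponential("beta", 1.0)
    # Latent rates: lambda_n ~ Gamma(alpha, beta) in the (shape, rate) parameterization.
    lam = pm.Gamma("lam", alpha=alpha, beta=beta, shape=len(y))
    pm.Poisson("y", mu=lam, observed=y)

with pm.Model() as marginalized:
    alpha = pm.Exponential("alpha", 1.0)
    beta = pm.Exponential("beta", 1.0)
    # Same marginal likelihood for y: negative binomial with n = alpha and
    # p = beta / (1 + beta) under PyMC's (n, p) parameterization.
    pm.NegativeBinomial("y", n=alpha, p=beta / (1.0 + beta), observed=y)

The augmented version trades a much larger latent dimension for a posterior whose local geometry the sampler often handles better, which is the point being made above.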

Is “MH random walk” a Metropolis-Hastings Markov chain with accept/reject? If MH is enough for your problem, why not stick with that? Is it too slow? Will it not scale in dimension (MH or Gibbs rarely scale well in dimension unless there’s a lot of conditional independence in the model)? Is it too hard to generalize?


Love to see it! Glad to see people using PyMC for discrete choice

Lots to digest here @bob-carpenter and no expectation at all for you or anyone else to work their way through the paper (more than welcome of course if interested ;-)).

Ultimately, the problem is similar to the multinomial probit model, which has received some attention on the Stan forums. In the MNP case the ordering of the utilities for the different alternatives determines the choice (select the one with the maximum utility). In our case, being in or out of the portfolio is determined by relative attractiveness (i.e. add projects with high utility and drop those with low utility, while keeping the overall portfolio cost within the budget constraint). From the Stan discussions I now understand why HMC struggles with this.

Nevertheless, this has been helpful, and I have some ideas for simplifying the model structure and then combining it with the hierarchical LKJ correlation structure, which should produce (close to) identical results. I’ll open a new thread when I get stuck implementing this with pm.CustomDist().