Simple Hierarchical Model with Huge Data

I’ve got a dataset with about 4,000,000 observations, and about 22,000 entities. I’ve also got 5 categorical predictors. I’m trying to get the conversion rate (CVR) for each entity for any possible combination of the predictors.

When I sample 100 entities and fit the model with ADVI, it takes about 7 minutes. When I try to run it on all the entities, the progress bar indicates it will take several hours (which I could probably live with), but it stalls after several minutes.

I’m trying to find potential solutions to this problem and have come up with a few:
- Minibatching: I’ve tried this, and it helps a lot, but it doesn’t completely solve the problem. When fitting on a large dataset, it still hangs somewhere before it’s done (a rough sketch of what I mean is after the model code below).
- Fitting on a smaller split of the data that doesn’t hang, and then using the parameter estimates as priors for training on the next split. One of the problems here is that I can’t figure out how to carry what I’ve learned about the Cholesky decomposition from one model fit to the next.

What would you do in my situation to tackle this problem?

import pymc as pm

with pm.Model(coords=coords) as model:

    ## A
    chol_A, Rho_A, sigma_ent_A = pm.LKJCholeskyCov(
        "chol_cov_A", n=coords["A"].shape[0], eta=2, sd_dist=pm.Exponential.dist(1.0), compute_corr=True,
    )
    b_A = pm.Normal("b_A", 0, 10, dims="A")  # prior for average slope
    b_ent_A = pm.MvNormal(
        "b_ent_A", mu=b_A, chol=chol_A, dims=("entity_id", "A")
    )
    A_effect = b_ent_A[entity_ids, A_ids]


    ## B
    chol_B, Rho_B, sigma_ent_B = pm.LKJCholeskyCov(
        "chol_cov_B", n=coords["B"].shape[0], eta=2, sd_dist=pm.Exponential.dist(1.0), compute_corr=True,
    )
    b_B = pm.Normal("b_B", 0, 10, dims="B")  # prior for average slope
    b_ent_B = pm.MvNormal(
        "b_ent_B", mu=b_B, chol=chol_B, dims=("entity_id", "B")
    )
    B_effect = b_ent_B[entity_ids, B_ids]

    ## C
    ...

    ## D
    ...

    ## E
    ...

    ## Const
    sigma_const = pm.Exponential("sigma_const", 1.0)
    b_const = pm.Normal("b_const", 0, 10)
    b_ent_const = pm.Normal("b_ent_const", mu=b_const, sigma=sigma_const, dims="entity_id")
    const_effect = b_ent_const[entity_ids]

    cvr = pm.Deterministic(
        "cvr",
        pm.math.invlogit(A_effect + B_effect + C_effect + D_effect + E_effect + const_effect),
    )

    transactions = pm.Binomial(
        "transactions", p=cvr, n=data.clicks.values, observed=data.transactions.values
    )

# (this model block sits inside a function that ends with "return model")
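
For context, by minibatching I mean something like the following (a rough sketch, not my exact code; the batch size and ADVI step count are placeholders, and it assumes the newer pm.Minibatch that slices several arrays jointly). All observation-level arrays get sliced together, and total_size rescales the minibatch likelihood to the full dataset:

entity_mb, A_mb, B_mb, clicks_mb, trans_mb = pm.Minibatch(
    entity_ids, A_ids, B_ids, data.clicks.values, data.transactions.values,
    batch_size=2_000,
)

with pm.Model(coords=coords) as mb_model:
    ...  # same priors as above, but index them with entity_mb, A_mb, B_mb, ...
    transactions = pm.Binomial(
        "transactions", p=cvr, n=clicks_mb, observed=trans_mb,
        total_size=len(data),  # rescale the minibatch likelihood to the full data
    )
    approx = pm.fit(100_000, method="advi")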

Additive categorical predictors make the model nonidentifiable in the likelihood (e.g., you can add a constant to A_effect and subtract it from B_effect and get the same result). A prior can soft-identify the model, but it’ll be slow if that’s the only identification. The traditional solution for maximum likelihood is to pin one effect in each group to zero. An alternative is a sum-to-zero constraint (define N - 1 “free” parameters and set the Nth to the negative sum of the N - 1 free parameters), which lets you apply a symmetric prior or penalty. You’d have to do that for each group of effects for each entity, which is a lot of parameters to estimate.
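
In PyMC, that manual N − 1 construction could look roughly like this for a single predictor group (an illustrative sketch, not code from this thread; pt is pytensor.tensor and coords follows the model above):

import pymc as pm
import pytensor.tensor as pt

with pm.Model(coords=coords) as m:
    # N - 1 free parameters for the "A" categories
    b_A_free = pm.Normal("b_A_free", 0, 1, shape=coords["A"].shape[0] - 1)
    # The last effect is the negative sum of the others, so the group sums to zero
    b_A = pm.Deterministic(
        "b_A",
        pt.concatenate([b_A_free, -b_A_free.sum(keepdims=True)]),
        dims="A",
    )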

To add to what @bob-carpenter mentioned: if you want to go the zero-sum route, there’s a ZeroSumNormal that imposes the constraint, with the nice benefit that it doesn’t treat any category as special a priori: pymc.ZeroSumNormal — PyMC dev documentation
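
For the per-entity effects in the model above, a minimal sketch could look like this (assuming a recent PyMC where ZeroSumNormal accepts dims and treats the rightmost dim as the zero-sum axis by default); note it drops the learned LKJ correlations between categories:

with pm.Model(coords=coords) as m:
    b_A = pm.Normal("b_A", 0, 10, dims="A")  # average effect per category
    # Per-entity deviations, constrained to sum to zero over the "A" axis
    b_ent_A = pm.ZeroSumNormal("b_ent_A", sigma=1.0, dims=("entity_id", "A"))
    A_effect = b_A[A_ids] + b_ent_A[entity_ids, A_ids]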

It’s cool you can do that in PyMC through a distribution with a constrained type. We can’t do that in Stan because we don’t use the distributions to inform the constraints on parameters. Instead, you have to define a sum-to-zero parameter and then just give it a normal distribution.

P.S. I couldn’t follow the free-floating I and J in the linked doc, though I take it σ is just the underlying normal scale, so the marginal scale of the sum-to-zero distribution will be lower due to the constraint.

There’s a bit more description in a comment here: Add ZeroSumNormal distribution by kylejcaron · Pull Request #1751 · pyro-ppl/numpyro · GitHub

I’m not sure about the marginal sigma. Probably @aseyboldt can answer that.

Seems to be in fact the sigma of the underlying normal, not the constrained one:

import pymc as pm

x = pm.ZeroSumNormal.dist(sigma=3.0, shape=(10,))
x_draws = pm.draw(x, draws=10_000)
x_draws.std(axis=-1).mean()  # 2.7686699287506453

I wonder if parameterizing in terms of the constrained one would be more intuitive?


P.S. I couldn’t follow the free-floating I and J in the linked doc, though I take it σ is just the underlying normal scale, so the marginal scale of the sum-to-zero distribution will be lower due to the constraint.

The docs aren’t the best…
J is actually defined there: it is the matrix of all ones, and I is the identity matrix.
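
Assuming the covariance in the docs is \sigma^2 (I - \tfrac{1}{n} J), each component then has marginal variance

\operatorname{Var}(x_i) = \sigma^2 \left(1 - \tfrac{1}{n}\right)

so with \sigma = 3 and n = 10 the marginal sd is 3 \sqrt{0.9} \approx 2.85. The \approx 2.77 printed above is the mean of the within-draw sample sd, which sits slightly below that (Jensen’s inequality).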

I wonder if parameterizing in terms of the constrained one would be more intuitive?

We had a discussion about that at some point, and we went for the parametrization with the sigma of the unconstrained normal, mostly because it gives a nice decomposition of a normal into the sample mean and the difference from that mean:

x_i \sim ZSN(0, 1), \quad \mu \sim N(0, 1/N) \quad \Rightarrow \quad x_i + \mu \sim N(0, 1)

And similarly for more dimensions.
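
A quick numerical check of that decomposition (reading the 1/N in N(0, 1/N) as a variance):

import numpy as np
import pymc as pm

n, n_draws = 10, 50_000
x = pm.draw(pm.ZeroSumNormal.dist(sigma=1.0, shape=(n,)), draws=n_draws)  # (n_draws, n)
mu = np.random.normal(0.0, np.sqrt(1.0 / n), size=(n_draws, 1))
print((x + mu).std(axis=0))  # each component ends up with sd ≈ 1.0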
