Simple Hierarchical Model with Huge Data

I’ve got a dataset with about 4,000,000 observations, and about 22,000 entities. I’ve also got 5 categorical predictors. I’m trying to get the conversion rate (CVR) for each entity for any possible combination of the predictors.

When I sample 100 entities and fit the model with ADVI, it takes about 7 minutes. When I try to run it on all the entities, the progress bar indicates it will take several hours (which I could probably live with), but it stalls after several minutes.

I’m trying to find potential solutions to this problem and have come up with a few:
- Minibatching: I’ve tried this, and it helps a lot, but it doesn’t completely solve the problem. When fitting on a large dataset, it still hangs somewhere before it’s done (a rough sketch of what I mean is after the model code below).
- Fitting on a smaller split of the data that doesn’t hang, and then using the parameter estimates as priors for training on the next split. One of the problems here is that I can’t figure out how to carry what I’ve learned about the Cholesky decomposition from one model fit to the next.

What would you do in my situation to tackle this problem?

import pymc as pm

with pm.Model(coords=coords) as model:

    ## A
    chol_A, Rho_A, sigma_ent_A = pm.LKJCholeskyCov(
        "chol_cov_A", n=coords["A"].shape[0], eta=2, sd_dist=pm.Exponential.dist(1.0), compute_corr=True,
    )
    b_A = pm.Normal("b_A", 0, 10, dims="A")  # prior for average slope
    b_ent_A = pm.MvNormal(
        "b_ent_A", mu=b_A, chol=chol_A, dims=("entity_id", "A")
    )
    A_effect = b_ent_A[entity_ids, A_ids]


    ## B
    chol_B, Rho_B, sigma_ent_B = pm.LKJCholeskyCov(
        "chol_cov_B", n=coords["B"].shape[0], eta=2, sd_dist=pm.Exponential.dist(1.0), compute_corr=True,
    )
    b_B = pm.Normal("b_B", 0, 10, dims="B")  # prior for average slope
    b_ent_B = pm.MvNormal(
        "b_ent_B", mu=b_B, chol=chol_B, dims=("entity_id", "B")
    )
    B_effect = b_ent_B[entity_ids, B_ids]

    ## C
    ...

    ## D
    ...

    ## E
    ...

    ## Const
    sigma_const = pm.Exponential("sigma_const", 1.0)
    b_const = pm.Normal("b_const", 0, 10)
    b_ent_const = pm.Normal("b_ent_const", mu=b_const, sigma=sigma_const, dims="entity_id")
    const_effect = b_ent_const[entity_ids]

    cvr = pm.Deterministic(
        "cvr",
        pm.math.invlogit(A_effect + B_effect + C_effect + D_effect + E_effect + const_effect),
    )

    transactions = pm.Binomial(
        "transactions", p=cvr, n=data.clicks.values, observed=data.transactions.values
    )

# (this model block sits inside a function that ends with "return model")
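
For context, by minibatching I mean something like the following (a rough sketch, not my exact code; the batch size and ADVI step count are placeholders, and it assumes the newer pm.Minibatch that slices several arrays jointly). All observation-level arrays get sliced together, and total_size rescales the minibatch likelihood to the full dataset:

entity_mb, A_mb, B_mb, clicks_mb, trans_mb = pm.Minibatch(
    entity_ids, A_ids, B_ids, data.clicks.values, data.transactions.values,
    batch_size=2_000,
)

with pm.Model(coords=coords) as mb_model:
    ...  # same priors as above, but index them with entity_mb, A_mb, B_mb, ...
    transactions = pm.Binomial(
        "transactions", p=cvr, n=clicks_mb, observed=trans_mb,
        total_size=len(data),  # rescale the minibatch likelihood to the full data
    )
    approx = pm.fit(100_000, method="advi")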

Additive categorical predictors make the model nonidentifiable in the likelihood (e.g., you can add a constant to A_effect and subtract it from B_effect and get the same result). A prior can soft-identify the model, but it’ll be slow if that’s the only identification. The traditional solution for maximum likelihood is to pin one effect in each group to zero. An alternative is a sum-to-zero constraint (define N - 1 “free” parameters and set the Nth to the negative sum of the N - 1 free parameters), which lets you apply a symmetric prior or penalty. You’d have to do that for each group of effects for each entity, which is a lot of parameters to estimate.
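
In PyMC, that manual N − 1 construction could look roughly like this for a single predictor group (an illustrative sketch, not code from this thread; pt is pytensor.tensor and coords follows the model above):

import pymc as pm
import pytensor.tensor as pt

with pm.Model(coords=coords) as m:
    # N - 1 free parameters for the "A" categories
    b_A_free = pm.Normal("b_A_free", 0, 1, shape=coords["A"].shape[0] - 1)
    # The last effect is the negative sum of the others, so the group sums to zero
    b_A = pm.Deterministic(
        "b_A",
        pt.concatenate([b_A_free, -b_A_free.sum(keepdims=True)]),
        dims="A",
    )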

To add to what @bob-carpenter mentioned: if you want to go the zero-sum route, there’s a ZeroSumNormal that imposes the constraint, with the nice benefit that it doesn’t treat any category as special a priori: pymc.ZeroSumNormal — PyMC dev documentation
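
For the per-entity effects in the model above, a minimal sketch could look like this (assuming a recent PyMC where ZeroSumNormal accepts dims and treats the rightmost dim as the zero-sum axis by default); note it drops the learned LKJ correlations between categories:

with pm.Model(coords=coords) as m:
    b_A = pm.Normal("b_A", 0, 10, dims="A")  # average effect per category
    # Per-entity deviations, constrained to sum to zero over the "A" axis
    b_ent_A = pm.ZeroSumNormal("b_ent_A", sigma=1.0, dims=("entity_id", "A"))
    A_effect = b_A[A_ids] + b_ent_A[entity_ids, A_ids]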

It’s cool you can do that in PyMC through a distribution with a constrained type. We can’t do that in Stan because we don’t use the distributions to inform the constraints on parameters. Instead, you have to define a sum-to-zero parameter and then just give it a normal distribution.

P.S. I couldn’t follow the free-floating I and J in the linked doc, though I take it σ is just the underlying normal scale, so the marginal scale of the sum-to-zero distribution will be lower due to the constraint.

There’s a bit more description in a comment here: Add ZeroSumNormal distribution by kylejcaron · Pull Request #1751 · pyro-ppl/numpyro · GitHub

I’m not sure about the marginal sigma. Probably @aseyboldt can answer that.

Seems to be in fact the sigma of the underlying normal, not the constrained one:

import pymc as pm

x = pm.ZeroSumNormal.dist(sigma=3.0, shape=(10,))
x_draws = pm.draw(x, draws=10_000)
x_draws.std(axis=-1).mean()  # 2.7686699287506453

I wonder if parameterizing in terms of the constrained one would be more intuitive?


P.S. I couldn’t follow the free-floating I and J in the linked doc, though I take it σ is just the underlying normal scale, so the marginal scale of the sum-to-zero distribution will be lower due to the constraint.

The docs aren’t the best…
J is actually defined there: it is the matrix of all ones, and I is the identity matrix.
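
Assuming the covariance in the docs is \sigma^2 (I - \tfrac{1}{n} J), each component then has marginal variance

\operatorname{Var}(x_i) = \sigma^2 \left(1 - \tfrac{1}{n}\right)

so with \sigma = 3 and n = 10 the marginal sd is 3 \sqrt{0.9} \approx 2.85. The \approx 2.77 printed above is the mean of the within-draw sample sd, which sits slightly below that (Jensen’s inequality).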

I wonder if parameterizing in terms of the constrained one would be more intuitive?

We had a discussion about that at some point, and we went for the parametrization with the sigma of the unconstrained normal, mostly because it gives a nice decomposition of a normal into the sample mean and the difference from that mean:

x_i \sim ZSN(0, 1), \quad \mu \sim N(0, 1/N) \quad \Rightarrow \quad x_i + \mu \sim N(0, 1)

And similarly for more dimensions.
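
A quick numerical check of that decomposition (reading the 1/N in N(0, 1/N) as a variance):

import numpy as np
import pymc as pm

n, n_draws = 10, 50_000
x = pm.draw(pm.ZeroSumNormal.dist(sigma=1.0, shape=(n,)), draws=n_draws)  # (n_draws, n)
mu = np.random.normal(0.0, np.sqrt(1.0 / n), size=(n_draws, 1))
print((x + mu).std(axis=0))  # each component ends up with sd ≈ 1.0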
